* [PATCH rdma-core 00/14] Revise the DMA barrier macros in ibverbs
@ 2017-02-16 19:22 Jason Gunthorpe
       [not found] ` <1487272989-8215-1-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2017-02-16 19:22 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Now that the header is private to the library we can change it.

We have never had a clear definition of what our wmb() or wc_wmb() even do,
and they do not match the macros of the same name in the kernel. This causes
problems for non-x86 arches, whose maintainers have no idea what to put in
their versions of the macros and often just use the strongest barrier
available.

This also causes problems for driver authors, who have no idea how to use
these barriers properly; there are several instances of that :(

My approach here is to introduce a selection of macros that have narrow and
clearly defined purposes. The selection is based on what the set of drivers
actually does, which turns out to be fairly narrowly defined.

Then I went through all the drivers and adjusted them, to various degrees, to
use the new macro names. In a few drivers I added more/stronger barriers.
Overall this tries hard not to break anything by weakening existing barriers.

A future project for someone would be to see whether the per-CPU asm actually
makes sense for these definitions.
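
For orientation, here is a rough sketch of how the old barriers map to the
new names across the series; the exact choice is made per call site in the
individual patches, so treat this only as a guide:

   wmb()    -> udma_to_device_barrier() or udma_ordering_write_barrier()
   rmb()    -> udma_from_device_barrier()
   wc_wmb() -> mmio_wc_start() / mmio_flush_writes()
   mb()     -> context dependent, see the per-driver patches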

https://github.com/linux-rdma/rdma-core/pull/79

Jason Gunthorpe (14):
  mlx5: Use stdatomic for the in_use barrier
  Provide new names for the CPU barriers related to DMA
  cxgb3: Update to use new udma write barriers
  cxgb4: Update to use new udma write barriers
  hns: Update to use new udma write barriers
  i40iw: Get rid of unique barrier macros
  mlx4: Update to use new udma write barriers
  mlx5: Update to use new udma write barriers
  nes: Update to use new udma write barriers
  mthca: Update to use new mmio write barriers
  ocrdma: Update to use new udma write barriers
  qedr: Update to use new udma write barriers
  vmw_pvrdma: Update to use new udma write barriers
  Remove the old barrier macros

 providers/cxgb3/cq.c             |   2 +
 providers/cxgb3/cxio_wr.h        |   2 +-
 providers/cxgb4/qp.c             |  20 +++-
 providers/cxgb4/t4.h             |  48 ++++++--
 providers/cxgb4/verbs.c          |   2 +
 providers/hns/hns_roce_u_hw_v1.c |  13 +-
 providers/i40iw/i40iw_osdep.h    |  14 ---
 providers/i40iw/i40iw_uk.c       |  26 ++--
 providers/mlx4/cq.c              |   6 +-
 providers/mlx4/qp.c              |  19 +--
 providers/mlx4/srq.c             |   2 +-
 providers/mlx5/cq.c              |   8 +-
 providers/mlx5/mlx5.h            |   7 +-
 providers/mlx5/qp.c              |  18 +--
 providers/mlx5/srq.c             |   2 +-
 providers/mthca/cq.c             |  10 +-
 providers/mthca/doorbell.h       |   2 +-
 providers/mthca/qp.c             |  20 ++--
 providers/mthca/srq.c            |   6 +-
 providers/nes/nes_uverbs.c       |  16 +--
 providers/ocrdma/ocrdma_verbs.c  |  16 ++-
 providers/qedr/qelr_verbs.c      |  32 +++--
 providers/vmw_pvrdma/cq.c        |   6 +-
 providers/vmw_pvrdma/qp.c        |   8 +-
 util/udma_barrier.h              | 250 +++++++++++++++++++++++++++------------
 25 files changed, 354 insertions(+), 201 deletions(-)

-- 
2.7.4


* [PATCH rdma-core 01/14] mlx5: Use stdatomic for the in_use barrier
       [not found] ` <1487272989-8215-1-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-02-16 19:22   ` Jason Gunthorpe
  2017-02-16 19:22   ` [PATCH rdma-core 02/14] Provide new names for the CPU barriers related to DMA Jason Gunthorpe
                     ` (13 subsequent siblings)
  14 siblings, 0 replies; 65+ messages in thread
From: Jason Gunthorpe @ 2017-02-16 19:22 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Yishai Hadas

Since this is not a DMA barrier we do not want to use wmb() here.

Replace it with a weak atomic fence.

For x86-64 this results in no change to the assembly output.

Signed-off-by: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
---
 providers/mlx5/mlx5.h | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/providers/mlx5/mlx5.h b/providers/mlx5/mlx5.h
index be52a0a75dded0..e63164c2caea7b 100644
--- a/providers/mlx5/mlx5.h
+++ b/providers/mlx5/mlx5.h
@@ -35,6 +35,7 @@
 
 #include <stddef.h>
 #include <stdio.h>
+#include <stdatomic.h>
 #include <util/compiler.h>
 
 #include <infiniband/driver.h>
@@ -683,7 +684,11 @@ static inline int mlx5_spin_lock(struct mlx5_spinlock *lock)
 		abort();
 	} else {
 		lock->in_use = 1;
-		wmb();
+		/*
+		 * This fence is not at all correct, but it increases the
+		 * chance that in_use is detected by another thread without
+		 * much runtime cost. */
+		atomic_thread_fence(memory_order_acq_rel);
 	}
 
 	return 0;
-- 
2.7.4


* [PATCH rdma-core 02/14] Provide new names for the CPU barriers related to DMA
       [not found] ` <1487272989-8215-1-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2017-02-16 19:22   ` [PATCH rdma-core 01/14] mlx5: Use stdatomic for the in_use barrier Jason Gunthorpe
@ 2017-02-16 19:22   ` Jason Gunthorpe
       [not found]     ` <1487272989-8215-3-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2017-02-16 19:22   ` [PATCH rdma-core 03/14] cxgb3: Update to use new udma write barriers Jason Gunthorpe
                     ` (12 subsequent siblings)
  14 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2017-02-16 19:22 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Broadly speaking, providers are not using the existing macros consistently,
and the macros themselves are very poorly defined.

Due to this poor definition we struggled to implement a sensible
barrier for ARM64 and just went with the strongest barriers instead.

Split wmb/wmb_wc into several cases:
 udma_to_device_barrier - Think dma_map(TO_DEVICE) in kernel terms
 udma_ordering_write_barrier - Weaker than wmb() in the kernel
 mmio_flush_writes - Special to help work with WC memory
 mmio_wc_start - Special to help work with WC memory
 mmio_ordered_writes_hack - Stand-in for the lack of an ordered writel()

rmb becomes:
 udma_from_device_barrier - Think dma_unmap(FROM_DEVICE) in kernel terms

The split forces provider authors to think more carefully about what they
are doing, and the comments provide a solid explanation of when each barrier
is actually supposed to be used and how it fits the common idioms all the
drivers seem to share.

NOTE: do not assume that the existing asm optimally implements the defined
semantics. The required semantics were derived primarily from what the
existing providers do.
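
As a rough illustration of how these compose (a sketch only, not taken from
any one provider; mmio_write() is the same placeholder doorbell write used in
the header comments below):

    wqe->addr = ...;                /* fill the WQE in CPU memory */
    wqe->flags = ...;
    udma_ordering_write_barrier();  /* WQE contents before the valid bit */
    wqe->valid = 1;
    udma_to_device_barrier();       /* all CPU writes visible before doorbell */
    mmio_write(DO_DMA_REG, wqe);    /* placeholder MMIO doorbell write */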

Signed-off-by: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
---
 util/udma_barrier.h | 107 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 107 insertions(+)

diff --git a/util/udma_barrier.h b/util/udma_barrier.h
index 57ab0f76cbe33e..f9b8291db20210 100644
--- a/util/udma_barrier.h
+++ b/util/udma_barrier.h
@@ -122,4 +122,111 @@
 
 #endif
 
+/* Barriers for DMA.
+
+   These barriers are explicitly only for use with user DMA operations. If you
+   are looking for barriers to use with cache-coherent multi-threaded
+   consistency then look in stdatomic.h. If you need both kinds of synchronization
+   for the same address then use an atomic operation followed by one
+   of these barriers.
+
+   When reasoning about these barriers there are two objects:
+     - CPU attached address space (the CPU memory could be a range of things:
+       cached/uncached/non-temporal CPU DRAM, uncached MMIO space in another
+       device, pMEM). Generally speaking the ordering is only relative
+       to the local CPU's view of the system. Eg if the local CPU
+       is not guaranteed to see a write from another CPU then it is also
+       OK for the DMA device not to see the write after the barrier.
+     - A DMA initiator on a bus. For instance a PCI-E device issuing
+       MemRd/MemWr TLPs.
+
+   The ordering guarantee is always stated between those two streams. Eg what
+   happens if a MemRd TLP is sent in via PCI-E relative to a CPU WRITE to the
+   same memory location.
+*/
+
+/* Ensure that the device's view of memory matches the CPU's view of memory.
+   This should be placed before any MMIO store that could trigger the device
+   to begin doing DMA, such as a device doorbell ring.
+
+   eg
+    *dma_buf = 1;
+    udma_to_device_barrier();
+    mmio_write(DO_DMA_REG, dma_buf);
+   Must ensure that the device sees the '1'.
+
+   This is required to fence writes created by the libibverbs user. Those
+   writes could be to any CPU-mapped memory object with any cacheability mode.
+
+   NOTE: x86 has historically used a weaker semantic for this barrier, and
+   only fenced normal stores to normal memory. libibverbs users using other
+   memory types or non-temporal stores are required to use SFENCE in their own
+   code prior to calling verbs to start a DMA.
+*/
+#define udma_to_device_barrier() wmb()
+
+/* Ensure that all ordered stores from the device are observable from the
+   CPU. This only makes sense after something that observes an ordered store
+   from the device - eg by reading a MMIO register or seeing that CPU memory is
+   updated.
+
+   This guarantees that all reads that follow the barrier see the ordered
+   stores that preceded the observation.
+
+   For instance, this would be used after testing a valid bit in a memory
+   that is a DMA target, to ensure that the following reads see the
+   data written before the MemWr TLP that set the valid bit.
+*/
+#define udma_from_device_barrier() rmb()
+
+/* Order writes to CPU memory so that a DMA device cannot view writes after
+   the barrier without also seeing all writes before the barrier. This does
+   not guarantee that any writes are visible to DMA.
+
+   This would be used in cases where a DMA buffer might have a valid bit and
+   data; this barrier is placed after writing the data but before writing the
+   valid bit to ensure the DMA device cannot observe a set valid bit with
+   unwritten data.
+
+   Compared to udma_to_device_barrier() this barrier is not required to fence
+   anything but normal stores to normal malloc memory. Usage should be:
+
+   write_wqe
+      udma_to_device_barrier();    // Get user memory ready for DMA
+      wqe->addr = ...;
+      wqe->flags = ...;
+      udma_ordering_write_barrier();  // Guarantee WQE written in order
+      wqe->valid = 1;
+*/
+#define udma_ordering_write_barrier() wmb()
+
+/* Promptly flush writes, possibly in a write buffer, to MMIO backed memory.
+   This is not required to have any effect on CPU memory. If done while
+   holding a lock then the ordering of MMIO writes across CPUs must be
+   guaranteed to follow the natural ordering implied by the lock.
+
+   This must also act as a barrier that prevents write combining, eg
+     *wc_mem = 1;
+     mmio_flush_writes();
+     *wc_mem = 2;
+   Must always produce two MemWr TLPs; the '2' cannot be combined with, and
+   thereby suppress, the '1'.
+
+   This is intended to be used in conjunction with write combining memory
+   to generate large PCI-E MemWr TLPs from the CPU.
+*/
+#define mmio_flush_writes() wc_wmb()
+
+/* Keep MMIO writes in order.
+   Currently we lack writel macros that universally guarantee MMIO
+   writes happen in order, like the kernel does. Even worse, many
+   providers haphazardly open-code writes to MMIO memory, omitting even
+   volatile.
+
+   Until this can be fixed with a proper writel macro, this barrier
+   is a stand in to indicate places where MMIO writes should be switched
+   to some future writel.
+*/
+#define mmio_ordered_writes_hack() mmio_flush_writes()
+
 #endif
-- 
2.7.4


* [PATCH rdma-core 03/14] cxgb3: Update to use new udma write barriers
       [not found] ` <1487272989-8215-1-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2017-02-16 19:22   ` [PATCH rdma-core 01/14] mlx5: Use stdatomic for the in_use barrier Jason Gunthorpe
  2017-02-16 19:22   ` [PATCH rdma-core 02/14] Provide new names for the CPU barriers related to DMA Jason Gunthorpe
@ 2017-02-16 19:22   ` Jason Gunthorpe
       [not found]     ` <1487272989-8215-4-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2017-02-16 19:22   ` [PATCH rdma-core 04/14] cxgb4: " Jason Gunthorpe
                     ` (11 subsequent siblings)
  14 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2017-02-16 19:22 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Steve Wise

Steve says the chip reads until an EOP-marked WR is found, so the only
write barrier needed is to make sure the DMA data is ready before marking
the WR in that way.

Add the missing rmb()s in the obvious spots.
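
In other words, the write side reduces to something like this (a sketch,
abbreviated from build_fw_riwrh() below):

    wqe->op_seop_flags = htonl(...);  /* WR body, including the SOP/EOP marking */
    udma_to_device_barrier();         /* user DMA data visible before the WR is armed */
    wqe->gen_tid_len = htonl(...);    /* generation bit lets the chip consume the WR */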

Signed-off-by: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
---
 providers/cxgb3/cq.c      | 2 ++
 providers/cxgb3/cxio_wr.h | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/providers/cxgb3/cq.c b/providers/cxgb3/cq.c
index a6158ce771f89f..eddcd43dc3bd8e 100644
--- a/providers/cxgb3/cq.c
+++ b/providers/cxgb3/cq.c
@@ -121,6 +121,7 @@ static inline int cxio_poll_cq(struct t3_wq *wq, struct t3_cq *cq,
 
 	*cqe_flushed = 0;
 	hw_cqe = cxio_next_cqe(cq);
+	udma_from_device_barrier();
 
 	/* 
 	 * Skip cqes not affiliated with a QP.
@@ -266,6 +267,7 @@ static int iwch_poll_cq_one(struct iwch_device *rhp, struct iwch_cq *chp,
 	int ret = 1;
 
 	hw_cqe = cxio_next_cqe(&chp->cq);
+	udma_from_device_barrier();
 
 	if (!hw_cqe)
 		return 0;
diff --git a/providers/cxgb3/cxio_wr.h b/providers/cxgb3/cxio_wr.h
index 735b64918a15c8..057f61ac3c7d2f 100644
--- a/providers/cxgb3/cxio_wr.h
+++ b/providers/cxgb3/cxio_wr.h
@@ -347,7 +347,7 @@ static inline void build_fw_riwrh(struct fw_riwrh *wqe, enum t3_wr_opcode op,
 	wqe->op_seop_flags = htonl(V_FW_RIWR_OP(op) |
 				   V_FW_RIWR_SOPEOP(M_FW_RIWR_SOPEOP) |
 				   V_FW_RIWR_FLAGS(flags));
-	mb();
+	udma_to_device_barrier();
 	wqe->gen_tid_len = htonl(V_FW_RIWR_GEN(genbit) | V_FW_RIWR_TID(tid) |
 				 V_FW_RIWR_LEN(len));
 	/* 2nd gen bit... */
-- 
2.7.4


* [PATCH rdma-core 04/14] cxgb4: Update to use new udma write barriers
       [not found] ` <1487272989-8215-1-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
                     ` (2 preceding siblings ...)
  2017-02-16 19:22   ` [PATCH rdma-core 03/14] cxgb3: Update to use new udma write barriers Jason Gunthorpe
@ 2017-02-16 19:22   ` Jason Gunthorpe
       [not found]     ` <1487272989-8215-5-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2017-02-16 19:23   ` [PATCH rdma-core 05/14] hns: " Jason Gunthorpe
                     ` (10 subsequent siblings)
  14 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2017-02-16 19:22 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Steve Wise

Based on help from Steve, the barriers here are changed to consistently
bracket WC memory writes with wc_wmb(), like other drivers do.

This allows some of the wc_wmb() calls that were not related to WC memory
to be downgraded to wmb().

The driver was probably correct (at least for x86-64) but did not
follow the idiom established by the other drivers for working with
WC memory.
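
The bracketing idiom this settles on looks roughly like the following (a
sketch, assuming a WC-mapped destination such as the on-chip SQ or the udb
register on >T4 devices):

    mmio_wc_start();                   /* order prior CPU writes vs the WC copy */
    copy_wqe_to_udb(udb_offset, wqe);  /* 64-bit stores into WC-mapped BAR memory */
    mmio_flush_writes();               /* push the combined write out promptly */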

Signed-off-by: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
---
 providers/cxgb4/qp.c    | 20 ++++++++++++++++++--
 providers/cxgb4/t4.h    | 48 +++++++++++++++++++++++++++++++++++-------------
 providers/cxgb4/verbs.c |  2 ++
 3 files changed, 55 insertions(+), 15 deletions(-)

diff --git a/providers/cxgb4/qp.c b/providers/cxgb4/qp.c
index 700fe02c77c269..45eaca45029e60 100644
--- a/providers/cxgb4/qp.c
+++ b/providers/cxgb4/qp.c
@@ -52,7 +52,12 @@ static void copy_wr_to_sq(struct t4_wq *wq, union t4_wr *wqe, u8 len16)
 	dst = (u64 *)((u8 *)wq->sq.queue + wq->sq.wq_pidx * T4_EQ_ENTRY_SIZE);
 	if (t4_sq_onchip(wq)) {
 		len16 = align(len16, 4);
-		wc_wmb();
+
+		/* In onchip mode the copy below will be made to WC memory and
+		 * could trigger DMA. In offchip mode the copy below only
+		 * queues the WQE, DMA cannot start until t4_ring_sq_db
+		 * happens */
+		mmio_wc_start();
 	}
 	while (len16) {
 		*dst++ = *src++;
@@ -62,7 +67,13 @@ static void copy_wr_to_sq(struct t4_wq *wq, union t4_wr *wqe, u8 len16)
 		if (dst == (u64 *)&wq->sq.queue[wq->sq.size])
 			dst = (u64 *)wq->sq.queue;
 		len16--;
+
+		/* NOTE len16 cannot be large enough to write to the
+		   same sq.queue memory twice in this loop */
 	}
+
+	if (t4_sq_onchip(wq))
+		mmio_flush_writes();
 }
 
 static void copy_wr_to_rq(struct t4_wq *wq, union t4_recv_wr *wqe, u8 len16)
@@ -274,7 +285,9 @@ static void ring_kernel_db(struct c4iw_qp *qhp, u32 qid, u16 idx)
 	int mask;
 	int __attribute__((unused)) ret;
 
-	wc_wmb();
+	/* FIXME: Why do we need this barrier if the kernel is going to
+	   trigger the DMA? */
+	udma_to_device_barrier();
 	if (qid == qhp->wq.sq.qid) {
 		attr.sq_psn = idx;
 		mask = IBV_QP_SQ_PSN;
@@ -385,8 +398,11 @@ int c4iw_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr,
 			      len16, wqe);
 	} else
 		ring_kernel_db(qhp, qhp->wq.sq.qid, idx);
+	/* This write is only for debugging, the value does not matter for DMA
+	 */
 	qhp->wq.sq.queue[qhp->wq.sq.size].status.host_wq_pidx = \
 			(qhp->wq.sq.wq_pidx);
+
 	pthread_spin_unlock(&qhp->lock);
 	return err;
 }
diff --git a/providers/cxgb4/t4.h b/providers/cxgb4/t4.h
index a457e2f2921727..a845a367cfbb8c 100644
--- a/providers/cxgb4/t4.h
+++ b/providers/cxgb4/t4.h
@@ -317,9 +317,12 @@ enum {
 };
 
 struct t4_sq {
+	/* queue is either host memory or WC MMIO memory if
+	 * t4_sq_onchip(). */
 	union t4_wr *queue;
 	struct t4_swsqe *sw_sq;
 	struct t4_swsqe *oldest_read;
+	/* udb is either UC or WC MMIO memory depending on device version. */
 	volatile u32 *udb;
 	size_t memsize;
 	u32 qid;
@@ -367,12 +370,6 @@ struct t4_wq {
 	u8 *db_offp;
 };
 
-static inline void t4_ma_sync(struct t4_wq *wq, int page_size)
-{
-	wc_wmb();
-	*((volatile u32 *)wq->sq.ma_sync) = 1;
-}
-
 static inline int t4_rqes_posted(struct t4_wq *wq)
 {
 	return wq->rq.in_use;
@@ -444,8 +441,11 @@ static inline void t4_sq_produce(struct t4_wq *wq, u8 len16)
 	wq->sq.wq_pidx += DIV_ROUND_UP(len16*16, T4_EQ_ENTRY_SIZE);
 	if (wq->sq.wq_pidx >= wq->sq.size * T4_SQ_NUM_SLOTS)
 		wq->sq.wq_pidx %= wq->sq.size * T4_SQ_NUM_SLOTS;
-	if (!wq->error)
+	if (!wq->error) {
+		/* This write is only for debugging, the value does not matter
+		 * for DMA */
 		wq->sq.queue[wq->sq.size].status.host_pidx = (wq->sq.pidx);
+	}
 }
 
 static inline void t4_sq_consume(struct t4_wq *wq)
@@ -457,10 +457,14 @@ static inline void t4_sq_consume(struct t4_wq *wq)
 	if (++wq->sq.cidx == wq->sq.size)
 		wq->sq.cidx = 0;
 	assert((wq->sq.cidx != wq->sq.pidx) || wq->sq.in_use == 0);
-	if (!wq->error)
+	if (!wq->error){
+		/* This write is only for debugging, the value does not matter
+		 * for DMA */
 		wq->sq.queue[wq->sq.size].status.host_cidx = wq->sq.cidx;
+	}
 }
 
+/* Copies to WC MMIO memory */
 static void copy_wqe_to_udb(volatile u32 *udb_offset, void *wqe)
 {
 	u64 *src, *dst;
@@ -482,8 +486,8 @@ extern int t5_en_wc;
 static inline void t4_ring_sq_db(struct t4_wq *wq, u16 inc, u8 t4, u8 len16,
 				 union t4_wr *wqe)
 {
-	wc_wmb();
 	if (!t4) {
+		mmio_wc_start();
 		if (t5_en_wc && inc == 1 && wq->sq.wc_reg_available) {
 			PDBG("%s: WC wq->sq.pidx = %d; len16=%d\n",
 			     __func__, wq->sq.pidx, len16);
@@ -494,30 +498,45 @@ static inline void t4_ring_sq_db(struct t4_wq *wq, u16 inc, u8 t4, u8 len16,
 			writel(QID_V(wq->sq.bar2_qid) | PIDX_T5_V(inc),
 			       wq->sq.udb);
 		}
-		wc_wmb();
+		/* udb is WC for > t4 devices */
+		mmio_flush_writes();
 		return;
 	}
+
+	udma_to_device_barrier();
 	if (ma_wr) {
 		if (t4_sq_onchip(wq)) {
 			int i;
+
+			mmio_wc_start();
 			for (i = 0; i < 16; i++)
 				*(volatile u32 *)&wq->sq.queue[wq->sq.size].flits[2+i] = i;
+			mmio_flush_writes();
 		}
 	} else {
 		if (t4_sq_onchip(wq)) {
 			int i;
+
+			mmio_wc_start();
 			for (i = 0; i < 16; i++)
+				/* FIXME: What is this supposed to be doing?
+				 * Writing to the same address multiple times
+				 * with WC memory is not guarenteed to
+				 * generate any more than one TLP. Why isn't
+				 * writing to WC memory marked volatile? */
 				*(u32 *)&wq->sq.queue[wq->sq.size].flits[2] = i;
+			mmio_flush_writes();
 		}
 	}
+	/* udb is UC for t4 devices */
 	writel(QID_V(wq->sq.qid & wq->qid_mask) | PIDX_V(inc), wq->sq.udb);
 }
 
 static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc, u8 t4, u8 len16,
 				 union t4_recv_wr *wqe)
 {
-	wc_wmb();
 	if (!t4) {
+		mmio_wc_start();
 		if (t5_en_wc && inc == 1 && wq->sq.wc_reg_available) {
 			PDBG("%s: WC wq->rq.pidx = %d; len16=%d\n",
 			     __func__, wq->rq.pidx, len16);
@@ -528,9 +547,12 @@ static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc, u8 t4, u8 len16,
 			writel(QID_V(wq->rq.bar2_qid) | PIDX_T5_V(inc),
 			       wq->rq.udb);
 		}
-		wc_wmb();
+		/* udb is WC for > t4 devices */
+		mmio_flush_writes();
 		return;
 	}
+	/* udb is UC for t4 devices */
+	udma_to_device_barrier();
 	writel(QID_V(wq->rq.qid & wq->qid_mask) | PIDX_V(inc), wq->rq.udb);
 }
 
@@ -655,7 +677,7 @@ static inline int t4_next_hw_cqe(struct t4_cq *cq, struct t4_cqe **cqe)
 		cq->error = 1;
 		assert(0);
 	} else if (t4_valid_cqe(cq, &cq->queue[cq->cidx])) {
-		rmb();
+		udma_from_device_barrier();
 		*cqe = &cq->queue[cq->cidx];
 		ret = 0;
 	} else
diff --git a/providers/cxgb4/verbs.c b/providers/cxgb4/verbs.c
index 32ed44c63d8402..e7620dc02ae0a7 100644
--- a/providers/cxgb4/verbs.c
+++ b/providers/cxgb4/verbs.c
@@ -573,6 +573,8 @@ static void reset_qp(struct c4iw_qp *qhp)
 	qhp->wq.rq.cidx = qhp->wq.rq.pidx = qhp->wq.rq.in_use = 0;
 	qhp->wq.sq.oldest_read = NULL;
 	memset(qhp->wq.sq.queue, 0, qhp->wq.sq.memsize);
+	if (t4_sq_onchip(&qhp->wq))
+		mmio_flush_writes();
 	memset(qhp->wq.rq.queue, 0, qhp->wq.rq.memsize);
 }
 
-- 
2.7.4


* [PATCH rdma-core 05/14] hns: Update to use new udma write barriers
       [not found] ` <1487272989-8215-1-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
                     ` (3 preceding siblings ...)
  2017-02-16 19:22   ` [PATCH rdma-core 04/14] cxgb4: " Jason Gunthorpe
@ 2017-02-16 19:23   ` Jason Gunthorpe
  2017-02-16 19:23   ` [PATCH rdma-core 06/14] i40iw: Get rid of unique barrier macros Jason Gunthorpe
                     ` (9 subsequent siblings)
  14 siblings, 0 replies; 65+ messages in thread
From: Jason Gunthorpe @ 2017-02-16 19:23 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Lijun Ou, Wei Hu(Xavier)

Move the barriers to directly before the doorbell MMIO write for
clarity.

Signed-off-by: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
---
 providers/hns/hns_roce_u_hw_v1.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/providers/hns/hns_roce_u_hw_v1.c b/providers/hns/hns_roce_u_hw_v1.c
index 263502d4a85610..0de94e1fabd3d9 100644
--- a/providers/hns/hns_roce_u_hw_v1.c
+++ b/providers/hns/hns_roce_u_hw_v1.c
@@ -67,6 +67,8 @@ static void hns_roce_update_rq_head(struct hns_roce_context *ctx,
 	roce_set_field(rq_db.u32_8, RQ_DB_U32_8_CMD_M, RQ_DB_U32_8_CMD_S, 1);
 	roce_set_bit(rq_db.u32_8, RQ_DB_U32_8_HW_SYNC_S, 1);
 
+	udma_to_device_barrier();
+
 	hns_roce_write64((uint32_t *)&rq_db, ctx, ROCEE_DB_OTHERS_L_0_REG);
 }
 
@@ -87,6 +89,8 @@ static void hns_roce_update_sq_head(struct hns_roce_context *ctx,
 	roce_set_field(sq_db.u32_8, SQ_DB_U32_8_QPN_M, SQ_DB_U32_8_QPN_S, qpn);
 	roce_set_bit(sq_db.u32_8, SQ_DB_U32_8_HW_SYNC, 1);
 
+	udma_to_device_barrier();
+
 	hns_roce_write64((uint32_t *)&sq_db, ctx, ROCEE_DB_SQ_L_0_REG);
 }
 
@@ -261,7 +265,7 @@ static int hns_roce_v1_poll_one(struct hns_roce_cq *cq,
 	/* Get the next cqe, CI will be added gradually */
 	++cq->cons_index;
 
-	rmb();
+	udma_from_device_barrier();
 
 	qpn = roce_get_field(cqe->cqe_byte_16, CQE_BYTE_16_LOCAL_QPN_M,
 			     CQE_BYTE_16_LOCAL_QPN_S);
@@ -408,7 +412,7 @@ static int hns_roce_u_v1_poll_cq(struct ibv_cq *ibvcq, int ne,
 		if (dev->hw_version == HNS_ROCE_HW_VER1) {
 			*cq->set_ci_db = (unsigned short)(cq->cons_index &
 					 ((cq->cq_depth << 1) - 1));
-			mb();
+			mmio_ordered_writes_hack();
 		}
 
 		hns_roce_update_cq_cons_index(ctx, cq);
@@ -581,7 +585,6 @@ out:
 	/* Set DB return */
 	if (likely(nreq)) {
 		qp->sq.head += nreq;
-		wmb();
 
 		hns_roce_update_sq_head(ctx, qp->ibv_qp.qp_num,
 				qp->port_num - 1, qp->sl,
@@ -625,7 +628,7 @@ static void __hns_roce_v1_cq_clean(struct hns_roce_cq *cq, uint32_t qpn,
 
 	if (nfreed) {
 		cq->cons_index += nfreed;
-		wmb();
+		udma_to_device_barrier();
 		hns_roce_update_cq_cons_index(ctx, cq);
 	}
 }
@@ -816,8 +819,6 @@ out:
 	if (nreq) {
 		qp->rq.head += nreq;
 
-		wmb();
-
 		hns_roce_update_rq_head(ctx, qp->ibv_qp.qp_num,
 				    qp->rq.head & ((qp->rq.wqe_cnt << 1) - 1));
 	}
-- 
2.7.4


* [PATCH rdma-core 06/14] i40iw: Get rid of unique barrier macros
       [not found] ` <1487272989-8215-1-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
                     ` (4 preceding siblings ...)
  2017-02-16 19:23   ` [PATCH rdma-core 05/14] hns: " Jason Gunthorpe
@ 2017-02-16 19:23   ` Jason Gunthorpe
       [not found]     ` <1487272989-8215-7-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2017-02-16 19:23   ` [PATCH rdma-core 07/14] mlx4: Update to use new udma write barriers Jason Gunthorpe
                     ` (8 subsequent siblings)
  14 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2017-02-16 19:23 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Tatyana Nikolova

Use our standard versions from util instead. There doesn't seem
to be anything tricky here, but the inlined versions were like our
wc_wmb() barriers, not like wmb().

There appears to be no WC memory in this driver, so despite the comments,
these barriers are really just making sure that user DMA data is flushed out.
Make them all wmb().

Guess at where the missing rmb() should be.
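
The guessed rmb() placement follows the usual poll idiom (a sketch, condensed
from i40iw_cq_poll_completion() in the hunk below):

    if (polarity != cq->polarity)   /* CQE not valid yet */
            return I40IW_ERR_QUEUE_EMPTY;
    udma_from_device_barrier();     /* read CQE contents only after seeing valid */
    q_type = (u8)RS_64(qword3, I40IW_CQ_SQ);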

Signed-off-by: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
---
 providers/i40iw/i40iw_osdep.h | 14 --------------
 providers/i40iw/i40iw_uk.c    | 26 ++++++++++++++------------
 2 files changed, 14 insertions(+), 26 deletions(-)

diff --git a/providers/i40iw/i40iw_osdep.h b/providers/i40iw/i40iw_osdep.h
index fddedf40dd8ae2..92bedd31633eb5 100644
--- a/providers/i40iw/i40iw_osdep.h
+++ b/providers/i40iw/i40iw_osdep.h
@@ -105,18 +105,4 @@ static inline void db_wr32(u32 value, u32 *wqe_word)
 #define ACQUIRE_LOCK()
 #define RELEASE_LOCK()
 
-#if defined(__i386__)
-#define i40iw_mb() mb()		/* full memory barrier */
-#define i40iw_wmb() mb()	/* write memory barrier */
-#elif defined(__x86_64__)
-#define i40iw_mb() asm volatile("mfence" ::: "memory")	 /* full memory barrier */
-#define i40iw_wmb() asm volatile("sfence" ::: "memory")	 /* write memory barrier */
-#else
-#define i40iw_mb() mb()		/* full memory barrier */
-#define i40iw_wmb() wmb()	/* write memory barrier */
-#endif
-#define i40iw_rmb() rmb()	/* read memory barrier */
-#define i40iw_smp_mb() smp_mb()		/* memory barrier */
-#define i40iw_smp_wmb() smp_wmb()	/* write memory barrier */
-#define i40iw_smp_rmb() smp_rmb()	/* read memory barrier */
 #endif				/* _I40IW_OSDEP_H_ */
diff --git a/providers/i40iw/i40iw_uk.c b/providers/i40iw/i40iw_uk.c
index d3e4fec7d8515b..b20748e9f09199 100644
--- a/providers/i40iw/i40iw_uk.c
+++ b/providers/i40iw/i40iw_uk.c
@@ -75,7 +75,7 @@ static enum i40iw_status_code i40iw_nop_1(struct i40iw_qp_uk *qp)
 	    LS_64(signaled, I40IWQPSQ_SIGCOMPL) |
 	    LS_64(qp->swqe_polarity, I40IWQPSQ_VALID) | nop_signature++;
 
-	i40iw_wmb();	/* Memory barrier to ensure data is written before valid bit is set */
+	udma_to_device_barrier();	/* Memory barrier to ensure data is written before valid bit is set */
 
 	set_64bit_val(wqe, I40IW_BYTE_24, header);
 	return 0;
@@ -91,7 +91,7 @@ void i40iw_qp_post_wr(struct i40iw_qp_uk *qp)
 	u32 hw_sq_tail;
 	u32 sw_sq_head;
 
-	i40iw_mb(); /* valid bit is written and loads completed before reading shadow */
+	udma_to_device_barrier(); /* valid bit is written and loads completed before reading shadow */
 
 	/* read the doorbell shadow area */
 	get_64bit_val(qp->shadow_area, I40IW_BYTE_0, &temp);
@@ -297,7 +297,7 @@ static enum i40iw_status_code i40iw_rdma_write(struct i40iw_qp_uk *qp,
 		byte_off += 16;
 	}
 
-	i40iw_wmb(); /* make sure WQE is populated before valid bit is set */
+	udma_to_device_barrier(); /* make sure WQE is populated before valid bit is set */
 
 	set_64bit_val(wqe, I40IW_BYTE_24, header);
 
@@ -347,7 +347,7 @@ static enum i40iw_status_code i40iw_rdma_read(struct i40iw_qp_uk *qp,
 
 	i40iw_set_fragment(wqe, I40IW_BYTE_0, &op_info->lo_addr);
 
-	i40iw_wmb(); /* make sure WQE is populated before valid bit is set */
+	udma_to_device_barrier(); /* make sure WQE is populated before valid bit is set */
 
 	set_64bit_val(wqe, I40IW_BYTE_24, header);
 	if (post_sq)
@@ -410,7 +410,7 @@ static enum i40iw_status_code i40iw_send(struct i40iw_qp_uk *qp,
 		byte_off += 16;
 	}
 
-	i40iw_wmb(); /* make sure WQE is populated before valid bit is set */
+	udma_to_device_barrier(); /* make sure WQE is populated before valid bit is set */
 
 	set_64bit_val(wqe, I40IW_BYTE_24, header);
 	if (post_sq)
@@ -478,7 +478,7 @@ static enum i40iw_status_code i40iw_inline_rdma_write(struct i40iw_qp_uk *qp,
 		memcpy(dest, src, op_info->len - I40IW_BYTE_16);
 	}
 
-	i40iw_wmb(); /* make sure WQE is populated before valid bit is set */
+	udma_to_device_barrier(); /* make sure WQE is populated before valid bit is set */
 
 	set_64bit_val(wqe, I40IW_BYTE_24, header);
 
@@ -552,7 +552,7 @@ static enum i40iw_status_code i40iw_inline_send(struct i40iw_qp_uk *qp,
 		memcpy(dest, src, op_info->len - I40IW_BYTE_16);
 	}
 
-	i40iw_wmb(); /* make sure WQE is populated before valid bit is set */
+	udma_to_device_barrier(); /* make sure WQE is populated before valid bit is set */
 
 	set_64bit_val(wqe, I40IW_BYTE_24, header);
 
@@ -601,7 +601,7 @@ static enum i40iw_status_code i40iw_stag_local_invalidate(struct i40iw_qp_uk *qp
 	    LS_64(info->signaled, I40IWQPSQ_SIGCOMPL) |
 	    LS_64(qp->swqe_polarity, I40IWQPSQ_VALID);
 
-	i40iw_wmb(); /* make sure WQE is populated before valid bit is set */
+	udma_to_device_barrier(); /* make sure WQE is populated before valid bit is set */
 
 	set_64bit_val(wqe, I40IW_BYTE_24, header);
 
@@ -650,7 +650,7 @@ static enum i40iw_status_code i40iw_mw_bind(struct i40iw_qp_uk *qp,
 	    LS_64(info->signaled, I40IWQPSQ_SIGCOMPL) |
 	    LS_64(qp->swqe_polarity, I40IWQPSQ_VALID);
 
-	i40iw_wmb(); /* make sure WQE is populated before valid bit is set */
+	udma_to_device_barrier(); /* make sure WQE is populated before valid bit is set */
 
 	set_64bit_val(wqe, I40IW_BYTE_24, header);
 
@@ -694,7 +694,7 @@ static enum i40iw_status_code i40iw_post_receive(struct i40iw_qp_uk *qp,
 		byte_off += 16;
 	}
 
-	i40iw_wmb(); /* make sure WQE is populated before valid bit is set */
+	udma_to_device_barrier(); /* make sure WQE is populated before valid bit is set */
 
 	set_64bit_val(wqe, I40IW_BYTE_24, header);
 
@@ -731,7 +731,7 @@ static void i40iw_cq_request_notification(struct i40iw_cq_uk *cq,
 
 	set_64bit_val(cq->shadow_area, I40IW_BYTE_32, temp_val);
 
-	i40iw_wmb(); /* make sure WQE is populated before valid bit is set */
+	udma_to_device_barrier(); /* make sure WQE is populated before valid bit is set */
 
 	db_wr32(cq->cq_id, cq->cqe_alloc_reg);
 }
@@ -780,6 +780,8 @@ static enum i40iw_status_code i40iw_cq_poll_completion(struct i40iw_cq_uk *cq,
 	if (polarity != cq->polarity)
 		return I40IW_ERR_QUEUE_EMPTY;
 
+	udma_from_device_barrier();
+
 	q_type = (u8)RS_64(qword3, I40IW_CQ_SQ);
 	info->error = (bool)RS_64(qword3, I40IW_CQ_ERROR);
 	info->push_dropped = (bool)RS_64(qword3, I40IWCQ_PSHDROP);
@@ -1121,7 +1123,7 @@ enum i40iw_status_code i40iw_nop(struct i40iw_qp_uk *qp,
 	    LS_64(signaled, I40IWQPSQ_SIGCOMPL) |
 	    LS_64(qp->swqe_polarity, I40IWQPSQ_VALID);
 
-	i40iw_wmb(); /* make sure WQE is populated before valid bit is set */
+	udma_to_device_barrier(); /* make sure WQE is populated before valid bit is set */
 
 	set_64bit_val(wqe, I40IW_BYTE_24, header);
 	if (post_sq)
-- 
2.7.4


* [PATCH rdma-core 07/14] mlx4: Update to use new udma write barriers
       [not found] ` <1487272989-8215-1-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
                     ` (5 preceding siblings ...)
  2017-02-16 19:23   ` [PATCH rdma-core 06/14] i40iw: Get rid of unique barrier macros Jason Gunthorpe
@ 2017-02-16 19:23   ` Jason Gunthorpe
       [not found]     ` <1487272989-8215-8-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2017-02-16 19:23   ` [PATCH rdma-core 08/14] mlx5: " Jason Gunthorpe
                     ` (7 subsequent siblings)
  14 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2017-02-16 19:23 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Yishai Hadas

The mlx4 comments are good so these translate fairly directly.

- Added a barrier at the top of mlx4_post_send; this makes the driver
  ready for a change to a stronger udma_to_device_barrier() /
  weaker udma_ordering_write_barrier(), which would make the post loop a
  bit faster. No change on x86-64.
- The wmb() directly before the BF copy is upgraded to a wc_wmb(); this
  is consistent with what mlx5 does and makes sense (a condensed sketch
  of the resulting flow follows below).
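
A condensed view of the resulting BlueFlame path (a sketch; see the qp.c
hunks below for the real code):

    udma_ordering_write_barrier();  /* WQE fields before the owner/opcode word */
    ctrl->owner_opcode = ...;
    mmio_wc_start();                /* descriptor in memory before the BF copy */
    mlx4_bf_copy(ctx->bf_page + ctx->bf_offset, (unsigned long *) ctrl,
                 align(size * 16, 64));
    mmio_flush_writes();            /* push the BlueFlame WQE out of the CPU */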

Signed-off-by: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
---
 providers/mlx4/cq.c  |  6 +++---
 providers/mlx4/qp.c  | 19 +++++++++++--------
 providers/mlx4/srq.c |  2 +-
 3 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/providers/mlx4/cq.c b/providers/mlx4/cq.c
index 6a5cf8be218892..14f8cbce6d75ed 100644
--- a/providers/mlx4/cq.c
+++ b/providers/mlx4/cq.c
@@ -222,7 +222,7 @@ static inline int mlx4_get_next_cqe(struct mlx4_cq *cq,
 	 * Make sure we read CQ entry contents after we've checked the
 	 * ownership bit.
 	 */
-	rmb();
+	udma_from_device_barrier();
 
 	*pcqe = cqe;
 
@@ -698,7 +698,7 @@ int mlx4_arm_cq(struct ibv_cq *ibvcq, int solicited)
 	 * Make sure that the doorbell record in host memory is
 	 * written before ringing the doorbell via PCI MMIO.
 	 */
-	wmb();
+	udma_to_device_barrier();
 
 	doorbell[0] = htonl(sn << 28 | cmd | cq->cqn);
 	doorbell[1] = htonl(ci);
@@ -764,7 +764,7 @@ void __mlx4_cq_clean(struct mlx4_cq *cq, uint32_t qpn, struct mlx4_srq *srq)
 		 * Make sure update of buffer contents is done before
 		 * updating consumer index.
 		 */
-		wmb();
+		udma_to_device_barrier();
 		mlx4_update_cons_index(cq);
 	}
 }
diff --git a/providers/mlx4/qp.c b/providers/mlx4/qp.c
index a607326c7c452c..77a4a34576cb69 100644
--- a/providers/mlx4/qp.c
+++ b/providers/mlx4/qp.c
@@ -204,7 +204,7 @@ static void set_data_seg(struct mlx4_wqe_data_seg *dseg, struct ibv_sge *sg)
 	 * chunk and get a valid (!= * 0xffffffff) byte count but
 	 * stale data, and end up sending the wrong data.
 	 */
-	wmb();
+	udma_ordering_write_barrier();
 
 	if (likely(sg->length))
 		dseg->byte_count = htonl(sg->length);
@@ -228,6 +228,9 @@ int mlx4_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr,
 
 	pthread_spin_lock(&qp->sq.lock);
 
+	/* Get all user DMA buffers ready to go */
+	udma_to_device_barrier();
+
 	/* XXX check that state is OK to post send */
 
 	ind = qp->sq.head;
@@ -400,7 +403,7 @@ int mlx4_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr,
 					wqe += to_copy;
 					addr += to_copy;
 					seg_len += to_copy;
-					wmb(); /* see comment below */
+					udma_ordering_write_barrier(); /* see comment below */
 					seg->byte_count = htonl(MLX4_INLINE_SEG | seg_len);
 					seg_len = 0;
 					seg = wqe;
@@ -428,7 +431,7 @@ int mlx4_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr,
 				 * data, and end up sending the wrong
 				 * data.
 				 */
-				wmb();
+				udma_ordering_write_barrier();
 				seg->byte_count = htonl(MLX4_INLINE_SEG | seg_len);
 			}
 
@@ -450,7 +453,7 @@ int mlx4_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr,
 		 * setting ownership bit (because HW can start
 		 * executing as soon as we do).
 		 */
-		wmb();
+		udma_ordering_write_barrier();
 
 		ctrl->owner_opcode = htonl(mlx4_ib_opcode[wr->opcode]) |
 			(ind & qp->sq.wqe_cnt ? htonl(1 << 31) : 0);
@@ -478,7 +481,7 @@ out:
 		 * Make sure that descriptor is written to memory
 		 * before writing to BlueFlame page.
 		 */
-		wmb();
+		mmio_wc_start();
 
 		++qp->sq.head;
 
@@ -486,7 +489,7 @@ out:
 
 		mlx4_bf_copy(ctx->bf_page + ctx->bf_offset, (unsigned long *) ctrl,
 			     align(size * 16, 64));
-		wc_wmb();
+		mmio_flush_writes();
 
 		ctx->bf_offset ^= ctx->bf_buf_size;
 
@@ -498,7 +501,7 @@ out:
 		 * Make sure that descriptors are written before
 		 * doorbell record.
 		 */
-		wmb();
+		udma_to_device_barrier();
 
 		mmio_writel((unsigned long)(ctx->uar + MLX4_SEND_DOORBELL),
 			    qp->doorbell_qpn);
@@ -566,7 +569,7 @@ out:
 		 * Make sure that descriptors are written before
 		 * doorbell record.
 		 */
-		wmb();
+		udma_to_device_barrier();
 
 		*qp->db = htonl(qp->rq.head & 0xffff);
 	}
diff --git a/providers/mlx4/srq.c b/providers/mlx4/srq.c
index 4f90efdf927209..6e4ff5663d019b 100644
--- a/providers/mlx4/srq.c
+++ b/providers/mlx4/srq.c
@@ -113,7 +113,7 @@ int mlx4_post_srq_recv(struct ibv_srq *ibsrq,
 		 * Make sure that descriptors are written before
 		 * we write doorbell record.
 		 */
-		wmb();
+		udma_to_device_barrier();
 
 		*srq->db = htonl(srq->counter);
 	}
-- 
2.7.4


* [PATCH rdma-core 08/14] mlx5: Update to use new udma write barriers
       [not found] ` <1487272989-8215-1-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
                     ` (6 preceding siblings ...)
  2017-02-16 19:23   ` [PATCH rdma-core 07/14] mlx4: Update to use new udma write barriers Jason Gunthorpe
@ 2017-02-16 19:23   ` Jason Gunthorpe
       [not found]     ` <1487272989-8215-9-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2017-02-16 19:23   ` [PATCH rdma-core 09/14] nes: " Jason Gunthorpe
                     ` (6 subsequent siblings)
  14 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2017-02-16 19:23 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Yishai Hadas

The mlx5 comments are good so these translate fairly directly.

There is one barrier in mlx5_arm_cq() that I could not explain; it became
mmio_ordered_writes_hack().
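
The part worth calling out is the flush inside the BlueFlame spinlock; a
condensed sketch (the WC copy itself is elided):

    if (bf->need_lock)
            mlx5_spin_lock(&bf->lock);
    /* ... WC copy of the WQE to the BlueFlame register ... */
    mmio_flush_writes();            /* must happen before the lock is dropped,
                                       otherwise another CPU's doorbell can be
                                       flushed first and reach the HCA out of
                                       order */
    bf->offset ^= bf->buf_size;
    if (bf->need_lock)
            mlx5_spin_unlock(&bf->lock);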

Signed-off-by: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
---
 providers/mlx5/cq.c  |  8 ++++----
 providers/mlx5/qp.c  | 18 +++++++++++-------
 providers/mlx5/srq.c |  2 +-
 3 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/providers/mlx5/cq.c b/providers/mlx5/cq.c
index 372e40bc2b6589..cc0af920c703d9 100644
--- a/providers/mlx5/cq.c
+++ b/providers/mlx5/cq.c
@@ -489,7 +489,7 @@ static inline int mlx5_get_next_cqe(struct mlx5_cq *cq,
 	 * Make sure we read CQ entry contents after we've checked the
 	 * ownership bit.
 	 */
-	rmb();
+	udma_from_device_barrier();
 
 #ifdef MLX5_DEBUG
 	{
@@ -1283,14 +1283,14 @@ int mlx5_arm_cq(struct ibv_cq *ibvcq, int solicited)
 	 * Make sure that the doorbell record in host memory is
 	 * written before ringing the doorbell via PCI MMIO.
 	 */
-	wmb();
+	udma_to_device_barrier();
 
 	doorbell[0] = htonl(sn << 28 | cmd | ci);
 	doorbell[1] = htonl(cq->cqn);
 
 	mlx5_write64(doorbell, ctx->uar[0] + MLX5_CQ_DOORBELL, &ctx->lock32);
 
-	wc_wmb();
+	mmio_ordered_writes_hack();
 
 	return 0;
 }
@@ -1395,7 +1395,7 @@ void __mlx5_cq_clean(struct mlx5_cq *cq, uint32_t rsn, struct mlx5_srq *srq)
 		 * Make sure update of buffer contents is done before
 		 * updating consumer index.
 		 */
-		wmb();
+		udma_to_device_barrier();
 		update_cons_index(cq);
 	}
 }
diff --git a/providers/mlx5/qp.c b/providers/mlx5/qp.c
index b9ae72c9827c8c..d7087d986ce79f 100644
--- a/providers/mlx5/qp.c
+++ b/providers/mlx5/qp.c
@@ -926,10 +926,13 @@ out:
 		 * Make sure that descriptors are written before
 		 * updating doorbell record and ringing the doorbell
 		 */
-		wmb();
+		udma_to_device_barrier();
 		qp->db[MLX5_SND_DBR] = htonl(qp->sq.cur_post & 0xffff);
 
-		wc_wmb();
+		/* Make sure that the doorbell write happens before the memcpy
+		 * to WC memory below */
+		mmio_wc_start();
+
 		ctx = to_mctx(ibqp->context);
 		if (bf->need_lock)
 			mlx5_spin_lock(&bf->lock);
@@ -944,15 +947,15 @@ out:
 				     &ctx->lock32);
 
 		/*
-		 * use wc_wmb() to ensure write combining buffers are flushed out
+		 * use mmio_flush_writes() to ensure write combining buffers are flushed out
 		 * of the running CPU. This must be carried inside the spinlock.
 		 * Otherwise, there is a potential race. In the race, CPU A
 		 * writes doorbell 1, which is waiting in the WC buffer. CPU B
 		 * writes doorbell 2, and it's write is flushed earlier. Since
-		 * the wc_wmb is CPU local, this will result in the HCA seeing
+		 * the mmio_flush_writes is CPU local, this will result in the HCA seeing
 		 * doorbell 2, followed by doorbell 1.
 		 */
-		wc_wmb();
+		mmio_flush_writes();
 		bf->offset ^= bf->buf_size;
 		if (bf->need_lock)
 			mlx5_spin_unlock(&bf->lock);
@@ -1119,7 +1122,7 @@ out:
 		 * Make sure that descriptors are written before
 		 * doorbell record.
 		 */
-		wmb();
+		udma_to_device_barrier();
 		*(rwq->recv_db) = htonl(rwq->rq.head & 0xffff);
 	}
 
@@ -1193,7 +1196,8 @@ out:
 		 * Make sure that descriptors are written before
 		 * doorbell record.
 		 */
-		wmb();
+		udma_to_device_barrier();
+
 		/*
 		 * For Raw Packet QP, avoid updating the doorbell record
 		 * as long as the QP isn't in RTR state, to avoid receiving
diff --git a/providers/mlx5/srq.c b/providers/mlx5/srq.c
index b6e1eaf26bbd0c..2c71730a40f875 100644
--- a/providers/mlx5/srq.c
+++ b/providers/mlx5/srq.c
@@ -137,7 +137,7 @@ int mlx5_post_srq_recv(struct ibv_srq *ibsrq,
 		 * Make sure that descriptors are written before
 		 * we write doorbell record.
 		 */
-		wmb();
+		udma_to_device_barrier();
 
 		*srq->db = htonl(srq->counter);
 	}
-- 
2.7.4


* [PATCH rdma-core 09/14] nes: Update to use new udma write barriers
       [not found] ` <1487272989-8215-1-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
                     ` (7 preceding siblings ...)
  2017-02-16 19:23   ` [PATCH rdma-core 08/14] mlx5: " Jason Gunthorpe
@ 2017-02-16 19:23   ` Jason Gunthorpe
  2017-02-16 19:23   ` [PATCH rdma-core 10/14] mthca: Update to use new mmio " Jason Gunthorpe
                     ` (5 subsequent siblings)
  14 siblings, 0 replies; 65+ messages in thread
From: Jason Gunthorpe @ 2017-02-16 19:23 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Tatyana Nikolova

This driver inexplicably uses mb() for all sorts of things; translate it to
rmb() or wmb() as appropriate, based on context and comments.

Signed-off-by: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
---
 providers/nes/nes_uverbs.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/providers/nes/nes_uverbs.c b/providers/nes/nes_uverbs.c
index 867c39e1167884..80532abdb45d75 100644
--- a/providers/nes/nes_uverbs.c
+++ b/providers/nes/nes_uverbs.c
@@ -473,7 +473,7 @@ int nes_upoll_cq(struct ibv_cq *cq, int num_entries, struct ibv_wc *entry)
 			break;
 
 		/* Make sure we read CQ entry contents *after* we've checked the valid bit. */
-		mb();
+		udma_from_device_barrier();
 
 		cqe = (volatile struct nes_hw_cqe)nesucq->cqes[head];
 
@@ -638,7 +638,7 @@ int nes_upoll_cq_no_db_read(struct ibv_cq *cq, int num_entries, struct ibv_wc *e
 			break;
 
 		/* Make sure we read CQ entry contents *after* we've checked the valid bit. */
-		mb();
+		udma_from_device_barrier();
 
 		cqe = (volatile struct nes_hw_cqe)nesucq->cqes[head];
 
@@ -1125,7 +1125,7 @@ static void nes_clean_cq(struct nes_uqp *nesuqp, struct nes_ucq *nesucq)
 
 	cq_head = nesucq->head;
 	while (le32_to_cpu(nesucq->cqes[cq_head].cqe_words[NES_CQE_OPCODE_IDX]) & NES_CQE_VALID) {
-		rmb();
+		udma_from_device_barrier();
 		lo = le32_to_cpu(nesucq->cqes[cq_head].cqe_words[NES_CQE_COMP_COMP_CTX_LOW_IDX]);
 		hi = le32_to_cpu(nesucq->cqes[cq_head].cqe_words[NES_CQE_COMP_COMP_CTX_HIGH_IDX]);
 		u64temp = (((uint64_t)hi) << 32) | ((uint64_t)lo);
@@ -1205,6 +1205,7 @@ int nes_upost_send(struct ibv_qp *ib_qp, struct ibv_send_wr *ib_wr,
 	int sge_index;
 
 	pthread_spin_lock(&nesuqp->lock);
+	udma_to_device_barrier();
 
 	head = nesuqp->sq_head;
 	while (ib_wr) {
@@ -1234,7 +1235,7 @@ int nes_upost_send(struct ibv_qp *ib_qp, struct ibv_send_wr *ib_wr,
 		u64temp = (uint64_t)((uintptr_t)nesuqp);
 		wqe->wqe_words[NES_IWARP_SQ_WQE_COMP_CTX_LOW_IDX] = cpu_to_le32((uint32_t)u64temp);
 		wqe->wqe_words[NES_IWARP_SQ_WQE_COMP_CTX_HIGH_IDX] = cpu_to_le32((uint32_t)(u64temp>>32));
-		mb();
+		udma_ordering_write_barrier();
 		wqe->wqe_words[NES_IWARP_SQ_WQE_COMP_CTX_LOW_IDX] |= cpu_to_le32(head);
 
 		switch (ib_wr->opcode) {
@@ -1360,7 +1361,7 @@ int nes_upost_send(struct ibv_qp *ib_qp, struct ibv_send_wr *ib_wr,
 	}
 
 	nesuqp->sq_head = head;
-	mb();
+	udma_to_device_barrier();
 	while (wqe_count) {
 		counter = (wqe_count<(uint32_t)255) ? wqe_count : 255;
 		wqe_count -= counter;
@@ -1400,6 +1401,7 @@ int nes_upost_recv(struct ibv_qp *ib_qp, struct ibv_recv_wr *ib_wr,
 	}
 
 	pthread_spin_lock(&nesuqp->lock);
+	udma_to_device_barrier();
 
 	head = nesuqp->rq_head;
 	while (ib_wr) {
@@ -1427,7 +1429,7 @@ int nes_upost_recv(struct ibv_qp *ib_qp, struct ibv_recv_wr *ib_wr,
 				cpu_to_le32((uint32_t)u64temp);
 		wqe->wqe_words[NES_IWARP_RQ_WQE_COMP_CTX_HIGH_IDX] =
 				cpu_to_le32((uint32_t)(u64temp >> 32));
-		mb();
+		udma_ordering_write_barrier();
 		wqe->wqe_words[NES_IWARP_RQ_WQE_COMP_CTX_LOW_IDX] |= cpu_to_le32(head);
 
 		total_payload_length = 0;
@@ -1452,7 +1454,7 @@ int nes_upost_recv(struct ibv_qp *ib_qp, struct ibv_recv_wr *ib_wr,
 	}
 
 	nesuqp->rq_head = head;
-	mb();
+	udma_to_device_barrier();
 	while (wqe_count) {
 		counter = (wqe_count<(uint32_t)255) ? wqe_count : 255;
 		wqe_count -= counter;
-- 
2.7.4


* [PATCH rdma-core 10/14] mthca: Update to use new mmio write barriers
       [not found] ` <1487272989-8215-1-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
                     ` (8 preceding siblings ...)
  2017-02-16 19:23   ` [PATCH rdma-core 09/14] nes: " Jason Gunthorpe
@ 2017-02-16 19:23   ` Jason Gunthorpe
  2017-02-16 19:23   ` [PATCH rdma-core 11/14] ocrdma: Update to use new udma " Jason Gunthorpe
                     ` (4 subsequent siblings)
  14 siblings, 0 replies; 65+ messages in thread
From: Jason Gunthorpe @ 2017-02-16 19:23 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Vladimir Sokolovsky

- The barrier after set_ci_db is upgraded to wc_wmb()
- The barrier in mthca_write_db_rec is switched to wc_wmb()

I am guessing that these two locations are trying to strongly order writes,
but I suspect barriers are missing in this driver; it doesn't really make too
much sense.
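
The ordering being preserved in the arbel send path is (a sketch, copied down
from the diff below):

    udma_to_device_barrier();       /* descriptors before the doorbell record */
    *qp->sq.db = htonl(qp->sq.head & 0xffff);
    mmio_ordered_writes_hack();     /* doorbell record before the MMIO doorbell */
    mthca_write64(doorbell, to_mctx(ibqp->context), MTHCA_SEND_DOORBELL);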

Signed-off-by: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
---
 providers/mthca/cq.c       | 10 +++++-----
 providers/mthca/doorbell.h |  2 +-
 providers/mthca/qp.c       | 20 +++++++++++---------
 providers/mthca/srq.c      |  6 +++---
 4 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/providers/mthca/cq.c b/providers/mthca/cq.c
index aa08e065f2757b..f41b3750f37746 100644
--- a/providers/mthca/cq.c
+++ b/providers/mthca/cq.c
@@ -152,7 +152,7 @@ static inline void update_cons_index(struct mthca_cq *cq, int incr)
 
 	if (mthca_is_memfree(cq->ibv_cq.context)) {
 		*cq->set_ci_db = htonl(cq->cons_index);
-		wmb();
+		mmio_ordered_writes_hack();
 	} else {
 		doorbell[0] = htonl(MTHCA_TAVOR_CQ_DB_INC_CI | cq->cqn);
 		doorbell[1] = htonl(incr - 1);
@@ -310,7 +310,7 @@ static inline int mthca_poll_one(struct mthca_cq *cq,
 	 * Make sure we read CQ entry contents after we've checked the
 	 * ownership bit.
 	 */
-	rmb();
+	udma_from_device_barrier();
 
 	qpn = ntohl(cqe->my_qpn);
 
@@ -472,7 +472,7 @@ int mthca_poll_cq(struct ibv_cq *ibcq, int ne, struct ibv_wc *wc)
 	}
 
 	if (freed) {
-		wmb();
+		udma_to_device_barrier();
 		update_cons_index(cq, freed);
 	}
 
@@ -516,7 +516,7 @@ int mthca_arbel_arm_cq(struct ibv_cq *ibvcq, int solicited)
 	 * Make sure that the doorbell record in host memory is
 	 * written before ringing the doorbell via PCI MMIO.
 	 */
-	wmb();
+	udma_to_device_barrier();
 
 	doorbell[0] = htonl((sn << 28)                       |
 			    (solicited ?
@@ -582,7 +582,7 @@ void __mthca_cq_clean(struct mthca_cq *cq, uint32_t qpn, struct mthca_srq *srq)
 	if (nfreed) {
 		for (i = 0; i < nfreed; ++i)
 			set_cqe_hw(get_cqe(cq, (cq->cons_index + i) & cq->ibv_cq.cqe));
-		wmb();
+		udma_to_device_barrier();
 		cq->cons_index += nfreed;
 		update_cons_index(cq, nfreed);
 	}
diff --git a/providers/mthca/doorbell.h b/providers/mthca/doorbell.h
index a3aa42a9c8ba27..0f5a67e0ac7a63 100644
--- a/providers/mthca/doorbell.h
+++ b/providers/mthca/doorbell.h
@@ -98,7 +98,7 @@ static inline void mthca_write64(uint32_t val[2], struct mthca_context *ctx, int
 static inline void mthca_write_db_rec(uint32_t val[2], uint32_t *db)
 {
 	*(volatile uint32_t *) db       = val[0];
-	mb();
+	mmio_ordered_writes_hack();
 	*(volatile uint32_t *) (db + 1) = val[1];
 }
 
diff --git a/providers/mthca/qp.c b/providers/mthca/qp.c
index d221bb19bfa67c..c6a22372c7d809 100644
--- a/providers/mthca/qp.c
+++ b/providers/mthca/qp.c
@@ -112,6 +112,7 @@ int mthca_tavor_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr,
 	uint32_t uninitialized_var(op0);
 
 	pthread_spin_lock(&qp->sq.lock);
+	udma_to_device_barrier();
 
 	ind = qp->sq.next_ind;
 
@@ -287,7 +288,7 @@ int mthca_tavor_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr,
 		/*
 		 * Make sure that nda_op is written before setting ee_nds.
 		 */
-		wmb();
+		udma_ordering_write_barrier();
 		((struct mthca_next_seg *) prev_wqe)->ee_nds =
 			htonl((size0 ? 0 : MTHCA_NEXT_DBD) | size |
 			((wr->send_flags & IBV_SEND_FENCE) ?
@@ -313,6 +314,7 @@ out:
 				     qp->send_wqe_offset) | f0 | op0);
 		doorbell[1] = htonl((ibqp->qp_num << 8) | size0);
 
+		udma_to_device_barrier();
 		mthca_write64(doorbell, to_mctx(ibqp->context), MTHCA_SEND_DOORBELL);
 	}
 
@@ -400,7 +402,7 @@ int mthca_tavor_post_recv(struct ibv_qp *ibqp, struct ibv_recv_wr *wr,
 			 * Make sure that descriptors are written
 			 * before doorbell is rung.
 			 */
-			wmb();
+			udma_to_device_barrier();
 
 			mthca_write64(doorbell, to_mctx(ibqp->context), MTHCA_RECV_DOORBELL);
 
@@ -419,7 +421,7 @@ out:
 		 * Make sure that descriptors are written before
 		 * doorbell is rung.
 		 */
-		wmb();
+		udma_to_device_barrier();
 
 		mthca_write64(doorbell, to_mctx(ibqp->context), MTHCA_RECV_DOORBELL);
 	}
@@ -466,14 +468,14 @@ int mthca_arbel_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr,
 			 * Make sure that descriptors are written before
 			 * doorbell record.
 			 */
-			wmb();
+			udma_to_device_barrier();
 			*qp->sq.db = htonl(qp->sq.head & 0xffff);
 
 			/*
 			 * Make sure doorbell record is written before we
 			 * write MMIO send doorbell.
 			 */
-			wmb();
+			mmio_ordered_writes_hack();
 			mthca_write64(doorbell, to_mctx(ibqp->context), MTHCA_SEND_DOORBELL);
 
 			size0 = 0;
@@ -643,7 +645,7 @@ int mthca_arbel_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr,
 			htonl(((ind << qp->sq.wqe_shift) +
 			       qp->send_wqe_offset) |
 			      mthca_opcode[wr->opcode]);
-		wmb();
+		udma_ordering_write_barrier();
 		((struct mthca_next_seg *) prev_wqe)->ee_nds =
 			htonl(MTHCA_NEXT_DBD | size |
 			      ((wr->send_flags & IBV_SEND_FENCE) ?
@@ -674,14 +676,14 @@ out:
 		 * Make sure that descriptors are written before
 		 * doorbell record.
 		 */
-		wmb();
+		udma_to_device_barrier();
 		*qp->sq.db = htonl(qp->sq.head & 0xffff);
 
 		/*
 		 * Make sure doorbell record is written before we
 		 * write MMIO send doorbell.
 		 */
-		wmb();
+		mmio_ordered_writes_hack();
 		mthca_write64(doorbell, to_mctx(ibqp->context), MTHCA_SEND_DOORBELL);
 	}
 
@@ -754,7 +756,7 @@ out:
 		 * Make sure that descriptors are written before
 		 * doorbell record.
 		 */
-		wmb();
+		udma_to_device_barrier();
 		*qp->rq.db = htonl(qp->rq.head & 0xffff);
 	}
 
diff --git a/providers/mthca/srq.c b/providers/mthca/srq.c
index 66ac924a720c84..95b79020fc6a62 100644
--- a/providers/mthca/srq.c
+++ b/providers/mthca/srq.c
@@ -152,7 +152,7 @@ int mthca_tavor_post_srq_recv(struct ibv_srq *ibsrq,
 			 * Make sure that descriptors are written
 			 * before doorbell is rung.
 			 */
-			wmb();
+			udma_to_device_barrier();
 
 			mthca_write64(doorbell, to_mctx(ibsrq->context), MTHCA_RECV_DOORBELL);
 
@@ -168,7 +168,7 @@ int mthca_tavor_post_srq_recv(struct ibv_srq *ibsrq,
 		 * Make sure that descriptors are written before
 		 * doorbell is rung.
 		 */
-		wmb();
+		udma_to_device_barrier();
 
 		mthca_write64(doorbell, to_mctx(ibsrq->context), MTHCA_RECV_DOORBELL);
 	}
@@ -240,7 +240,7 @@ int mthca_arbel_post_srq_recv(struct ibv_srq *ibsrq,
 		 * Make sure that descriptors are written before
 		 * we write doorbell record.
 		 */
-		wmb();
+		udma_ordering_write_barrier();
 		*srq->db = htonl(srq->counter);
 	}
 
-- 
2.7.4


* [PATCH rdma-core 11/14] ocrdma: Update to use new udma write barriers
       [not found] ` <1487272989-8215-1-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
                     ` (9 preceding siblings ...)
  2017-02-16 19:23   ` [PATCH rdma-core 10/14] mthca: Update to use new mmio " Jason Gunthorpe
@ 2017-02-16 19:23   ` Jason Gunthorpe
       [not found]     ` <1487272989-8215-12-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2017-02-16 19:23   ` [PATCH rdma-core 12/14] qedr: " Jason Gunthorpe
                     ` (3 subsequent siblings)
  14 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2017-02-16 19:23 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Devesh Sharma

Move the barriers closer to the actual action being protected eg
put udma_to_device_barrier in ocrdma_ring_*.

Add a wc_wmb() barrier before starting WC writes for consistency
with other drivers.

Signed-off-by: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
---
 providers/ocrdma/ocrdma_verbs.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/providers/ocrdma/ocrdma_verbs.c b/providers/ocrdma/ocrdma_verbs.c
index 7fc841a194127d..3725d63a9b88f3 100644
--- a/providers/ocrdma/ocrdma_verbs.c
+++ b/providers/ocrdma/ocrdma_verbs.c
@@ -1111,18 +1111,24 @@ int ocrdma_destroy_qp(struct ibv_qp *ibqp)
 static void ocrdma_ring_sq_db(struct ocrdma_qp *qp)
 {
 	uint32_t db_val = ocrdma_cpu_to_le((qp->sq.dbid | (1 << 16)));
+
+	udma_to_device_barrier();
 	*(uint32_t *) (((uint8_t *) qp->db_sq_va)) = db_val;
 }
 
 static void ocrdma_ring_rq_db(struct ocrdma_qp *qp)
 {
 	uint32_t db_val = ocrdma_cpu_to_le((qp->rq.dbid | (1 << qp->db_shift)));
+
+	udma_to_device_barrier();
 	*(uint32_t *) ((uint8_t *) qp->db_rq_va) = db_val;
 }
 
 static void ocrdma_ring_srq_db(struct ocrdma_srq *srq)
 {
 	uint32_t db_val = ocrdma_cpu_to_le(srq->rq.dbid | (1 << srq->db_shift));
+
+	udma_to_device_barrier();
 	*(uint32_t *) (srq->db_va) = db_val;
 }
 
@@ -1141,6 +1147,7 @@ static void ocrdma_ring_cq_db(struct ocrdma_cq *cq, uint32_t armed,
 		val |= (1 << OCRDMA_DB_CQ_SOLICIT_SHIFT);
 	val |= (num_cqe << OCRDMA_DB_CQ_NUM_POPPED_SHIFT);
 
+	udma_to_device_barrier();
 	*(uint32_t *) ((uint8_t *) (cq->db_va) + OCRDMA_DB_CQ_OFFSET) =
 	    ocrdma_cpu_to_le(val);
 }
@@ -1322,6 +1329,9 @@ static void ocrdma_build_dpp_wqe(void *va, struct ocrdma_hdr_wqe *wqe,
 {
 	uint32_t pyld_len = (wqe->cw >> OCRDMA_WQE_SIZE_SHIFT) * 2;
 	uint32_t i = 0;
+
+	mmio_wc_start();
+
 	/* convert WQE header to LE format */
 	for (; i < hdr_len; i++)
 		*((uint32_t *) va + i) =
@@ -1329,7 +1339,8 @@ static void ocrdma_build_dpp_wqe(void *va, struct ocrdma_hdr_wqe *wqe,
 	/* Convertion of data is done in HW */
 	for (; i < pyld_len; i++)
 		*((uint32_t *) va + i) = (*((uint32_t *) wqe + i));
-	wc_wmb();
+
+	mmio_flush_writes();
 }
 
 static void ocrdma_post_dpp_wqe(struct ocrdma_qp *qp,
@@ -1439,7 +1450,6 @@ int ocrdma_post_send(struct ibv_qp *ib_qp, struct ibv_send_wr *wr,
 				      OCRDMA_WQE_SIZE_MASK) *
 				      OCRDMA_WQE_STRIDE);
 
-		wmb();
 		ocrdma_ring_sq_db(qp);
 
 		/* update pointer, counter for next wr */
@@ -1501,7 +1511,6 @@ int ocrdma_post_recv(struct ibv_qp *ibqp, struct ibv_recv_wr *wr,
 		rqe = ocrdma_hwq_head(&qp->rq);
 		ocrdma_build_rqe(rqe, wr, 0);
 		qp->rqe_wr_id_tbl[qp->rq.head] = wr->wr_id;
-		wmb();
 		ocrdma_ring_rq_db(qp);
 
 		/* update pointer, counter for next wr */
@@ -2082,7 +2091,6 @@ int ocrdma_post_srq_recv(struct ibv_srq *ibsrq, struct ibv_recv_wr *wr,
 		ocrdma_build_rqe(rqe, wr, tag);
 		srq->rqe_wr_id_tbl[tag] = wr->wr_id;
 
-		wmb();
 		ocrdma_ring_srq_db(srq);
 
 		/* update pointer, counter for next wr */
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH rdma-core 12/14] qedr: Update to use new udma write barriers
       [not found] ` <1487272989-8215-1-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
                     ` (10 preceding siblings ...)
  2017-02-16 19:23   ` [PATCH rdma-core 11/14] ocrdma: Update to use new udma " Jason Gunthorpe
@ 2017-02-16 19:23   ` Jason Gunthorpe
       [not found]     ` <1487272989-8215-13-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2017-02-16 19:23   ` [PATCH rdma-core 13/14] vmw_pvrdma: " Jason Gunthorpe
                     ` (2 subsequent siblings)
  14 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2017-02-16 19:23 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Ram Amrani, Ariel Elior

qedr uses WC memory for its '.db' mmap, so all writes to it have
to be wrapped in the WC barriers. This upgrades the leading
wmb to a wc_wmb() for consistency with other drivers.

Signed-off-by: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
---
 providers/qedr/qelr_verbs.c | 32 +++++++++++++++-----------------
 1 file changed, 15 insertions(+), 17 deletions(-)

diff --git a/providers/qedr/qelr_verbs.c b/providers/qedr/qelr_verbs.c
index 95cd429e1b9b47..c8a0db2c9c1cfd 100644
--- a/providers/qedr/qelr_verbs.c
+++ b/providers/qedr/qelr_verbs.c
@@ -672,9 +672,9 @@ static int qelr_update_qp_state(struct qelr_qp *qp,
 			/* Update doorbell (in case post_recv was done before
 			 * move to RTR)
 			 */
-			wmb();
+			mmio_wc_start();
 			writel(qp->rq.db_data.raw, qp->rq.db);
-			wc_wmb();
+			mmio_flush_writes();
 			break;
 		case QELR_QPS_ERR:
 			break;
@@ -1096,7 +1096,7 @@ static void doorbell_edpm_qp(struct qelr_qp *qp)
 	if (!qp->edpm.is_edpm)
 		return;
 
-	wmb();
+	mmio_wc_start();
 
 	qp->edpm.msg.data.icid = qp->sq.db_data.data.icid;
 	qp->edpm.msg.data.prod_val = qp->sq.db_data.data.value;
@@ -1116,15 +1116,16 @@ static void doorbell_edpm_qp(struct qelr_qp *qp)
 		       sizeof(uint64_t));
 
 		bytes += sizeof(uint64_t);
-		/* Need to place a barrier after every 64 bytes */
+		/* Since we rewrite the buffer every 64 bytes we need to flush
+		   it here, otherwise the CPU could optimize alway the
+		   duplicate stores. */
 		if (bytes == 64) {
-			wc_wmb();
+			mmio_flush_writes();
 			bytes = 0;
 		}
 		offset++;
 	}
-
-	wc_wmb();
+	mmio_flush_writes();
 }
 
 int qelr_post_send(struct ibv_qp *ib_qp, struct ibv_send_wr *wr,
@@ -1363,11 +1364,9 @@ int qelr_post_send(struct ibv_qp *ib_qp, struct ibv_send_wr *wr,
 	}
 
 	if (!qp->edpm.is_edpm) {
-		wmb();
-
+		mmio_wc_start();
 		writel(qp->sq.db_data.raw, qp->sq.db);
-
-		wc_wmb();
+		mmio_flush_writes();
 	}
 
 	pthread_spin_unlock(&qp->q_lock);
@@ -1446,14 +1445,13 @@ int qelr_post_recv(struct ibv_qp *ibqp, struct ibv_recv_wr *wr,
 
 		qelr_inc_sw_prod_u16(&qp->rq);
 
-		wmb();
+		mmio_wc_start();
 
 		db_val = le16toh(qp->rq.db_data.data.value) + 1;
 		qp->rq.db_data.data.value = htole16(db_val);
 
 		writel(qp->rq.db_data.raw, qp->rq.db);
-
-		wc_wmb();
+		mmio_flush_writes();
 
 		wr = wr->next;
 	}
@@ -1795,12 +1793,12 @@ static int qelr_poll_cq_resp(struct qelr_qp *qp, struct qelr_cq *cq,
 
 static void doorbell_cq(struct qelr_cq *cq, uint32_t cons, uint8_t flags)
 {
-	wmb();
+	mmio_wc_start();
 	cq->db.data.agg_flags = flags;
 	cq->db.data.value = htole32(cons);
 
 	writeq(cq->db.raw, cq->db_addr);
-	wc_wmb();
+	mmio_flush_writes();
 }
 
 int qelr_poll_cq(struct ibv_cq *ibcq, int num_entries, struct ibv_wc *wc)
@@ -1816,7 +1814,7 @@ int qelr_poll_cq(struct ibv_cq *ibcq, int num_entries, struct ibv_wc *wc)
 		struct qelr_qp *qp;
 
 		/* prevent speculative reads of any field of CQE */
-		rmb();
+		udma_from_device_barrier();
 
 		qp = cqe_get_qp(cqe);
 		if (!qp) {
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH rdma-core 13/14] vmw_pvrdma: Update to use new udma write barriers
       [not found] ` <1487272989-8215-1-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
                     ` (11 preceding siblings ...)
  2017-02-16 19:23   ` [PATCH rdma-core 12/14] qedr: " Jason Gunthorpe
@ 2017-02-16 19:23   ` Jason Gunthorpe
       [not found]     ` <1487272989-8215-14-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2017-02-16 19:23   ` [PATCH rdma-core 14/14] Remove the old barrier macros Jason Gunthorpe
  2017-02-28 16:00   ` [PATCH rdma-core 00/14] Revise the DMA barrier macros in ibverbs Doug Ledford
  14 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2017-02-16 19:23 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Adit Ranadive

For some reason write barriers were placed after the writes, move
them before.

Signed-off-by: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
---
 providers/vmw_pvrdma/cq.c | 6 +++---
 providers/vmw_pvrdma/qp.c | 8 ++++----
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/providers/vmw_pvrdma/cq.c b/providers/vmw_pvrdma/cq.c
index f24d80742678bd..701f0522f7b0dd 100644
--- a/providers/vmw_pvrdma/cq.c
+++ b/providers/vmw_pvrdma/cq.c
@@ -109,7 +109,7 @@ retry:
 	if (!cqe)
 		return CQ_EMPTY;
 
-	rmb();
+	udma_from_device_barrier();
 
 	if (ctx->qp_tbl[cqe->qp & 0xFFFF])
 		*cur_qp = (struct pvrdma_qp *)ctx->qp_tbl[cqe->qp & 0xFFFF];
@@ -184,11 +184,11 @@ void pvrdma_cq_clean_int(struct pvrdma_cq *cq, uint32_t qpn)
 			if (tail < 0)
 				tail = cq->cqe_cnt - 1;
 			curr_cqe = get_cqe(cq, curr);
-			rmb();
+			udma_from_device_barrier();
 			if ((curr_cqe->qp & 0xFFFF) != qpn) {
 				if (curr != tail) {
 					cqe = get_cqe(cq, tail);
-					rmb();
+					udma_from_device_barrier();
 					*cqe = *curr_cqe;
 				}
 				tail--;
diff --git a/providers/vmw_pvrdma/qp.c b/providers/vmw_pvrdma/qp.c
index d2e2189fda6de4..116063ee07c83b 100644
--- a/providers/vmw_pvrdma/qp.c
+++ b/providers/vmw_pvrdma/qp.c
@@ -404,11 +404,10 @@ int pvrdma_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr,
 			sge++;
 		}
 
+		udma_to_device_barrier();
 		pvrdma_idx_ring_inc(&(qp->sq.ring_state->prod_tail),
 				    qp->sq.wqe_cnt);
 
-		wmb();
-
 		qp->sq.wrid[ind] = wr->wr_id;
 		++ind;
 		if (ind >= qp->sq.wqe_cnt)
@@ -416,11 +415,12 @@ int pvrdma_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr,
 	}
 
 out:
-	if (nreq)
+	if (nreq) {
+		udma_to_device_barrier();
 		pvrdma_write_uar_qp(ctx->uar,
 				    PVRDMA_UAR_QP_SEND | ibqp->qp_num);
+	}
 
-	wmb();
 	pthread_spin_unlock(&qp->sq.lock);
 
 	return ret;
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH rdma-core 14/14] Remove the old barrier macros
       [not found] ` <1487272989-8215-1-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
                     ` (12 preceding siblings ...)
  2017-02-16 19:23   ` [PATCH rdma-core 13/14] vmw_pvrdma: " Jason Gunthorpe
@ 2017-02-16 19:23   ` Jason Gunthorpe
       [not found]     ` <1487272989-8215-15-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2017-02-28 16:00   ` [PATCH rdma-core 00/14] Revise the DMA barrier macros in ibverbs Doug Ledford
  14 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2017-02-16 19:23 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Inline the required assembly directly into the new names. No change to
the asm.

Signed-off-by: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
---
 util/udma_barrier.h | 195 +++++++++++++++++++++++++---------------------------
 1 file changed, 94 insertions(+), 101 deletions(-)

diff --git a/util/udma_barrier.h b/util/udma_barrier.h
index f9b8291db20210..cc2718f33fc3a2 100644
--- a/util/udma_barrier.h
+++ b/util/udma_barrier.h
@@ -33,95 +33,6 @@
 #ifndef __UTIL_UDMA_BARRIER_H
 #define __UTIL_UDMA_BARRIER_H
 
-/*
- * Architecture-specific defines.  Currently, an architecture is
- * required to implement the following operations:
- *
- * mb() - memory barrier.  No loads or stores may be reordered across
- *     this macro by either the compiler or the CPU.
- * rmb() - read memory barrier.  No loads may be reordered across this
- *     macro by either the compiler or the CPU.
- * wmb() - write memory barrier.  No stores may be reordered across
- *     this macro by either the compiler or the CPU.
- * wc_wmb() - flush write combine buffers.  No write-combined writes
- *     will be reordered across this macro by either the compiler or
- *     the CPU.
- */
-
-#if defined(__i386__)
-
-#define mb()	 asm volatile("lock; addl $0,0(%%esp) " ::: "memory")
-#define rmb()	 mb()
-#define wmb()	 asm volatile("" ::: "memory")
-#define wc_wmb() mb()
-
-#elif defined(__x86_64__)
-
-/*
- * Only use lfence for mb() and rmb() because we don't care about
- * ordering against non-temporal stores (for now at least).
- */
-#define mb()	 asm volatile("lfence" ::: "memory")
-#define rmb()	 mb()
-#define wmb()	 asm volatile("" ::: "memory")
-#define wc_wmb() asm volatile("sfence" ::: "memory")
-
-#elif defined(__PPC64__)
-
-#define mb()	 asm volatile("sync" ::: "memory")
-#define rmb()	 asm volatile("lwsync" ::: "memory")
-#define wmb()	 mb()
-#define wc_wmb() wmb()
-
-#elif defined(__ia64__)
-
-#define mb()	 asm volatile("mf" ::: "memory")
-#define rmb()	 mb()
-#define wmb()	 mb()
-#define wc_wmb() asm volatile("fwb" ::: "memory")
-
-#elif defined(__PPC__)
-
-#define mb()	 asm volatile("sync" ::: "memory")
-#define rmb()	 mb()
-#define wmb()	 mb()
-#define wc_wmb() wmb()
-
-#elif defined(__sparc_v9__)
-
-#define mb()	 asm volatile("membar #LoadLoad | #LoadStore | #StoreStore | #StoreLoad" ::: "memory")
-#define rmb()	 asm volatile("membar #LoadLoad" ::: "memory")
-#define wmb()	 asm volatile("membar #StoreStore" ::: "memory")
-#define wc_wmb() wmb()
-
-#elif defined(__sparc__)
-
-#define mb()	 asm volatile("" ::: "memory")
-#define rmb()	 mb()
-#define wmb()	 mb()
-#define wc_wmb() wmb()
-
-#elif defined(__s390x__)
-
-#define mb()	{ asm volatile("" : : : "memory"); }	/* for s390x */
-#define rmb()	mb()					/* for s390x */
-#define wmb()	mb()					/* for s390x */
-#define wc_wmb() wmb()					/* for s390x */
-
-#elif defined(__aarch64__)
-
-/* Perhaps dmb would be sufficient? Let us be conservative for now. */
-#define mb()	{ asm volatile("dsb sy" ::: "memory"); }
-#define rmb()	{ asm volatile("dsb ld" ::: "memory"); }
-#define wmb()	{ asm volatile("dsb st" ::: "memory"); }
-#define wc_wmb() wmb()
-
-#else
-
-#error No architecture specific memory barrier defines found!
-
-#endif
-
 /* Barriers for DMA.
 
    These barriers are expliclty only for use with user DMA operations. If you
@@ -143,6 +54,10 @@
    The ordering guarentee is always stated between those two streams. Eg what
    happens if a MemRd TLP is sent in via PCI-E relative to a CPU WRITE to the
    same memory location.
+
+   The providers have a very regular and predictable use of these barriers;
+   to make things very clear each narrow use is given a name, and the proper
+   name should be used in the provider as a form of documentation.
 */
 
 /* Ensure that the device's view of memory matches the CPU's view of memory.
@@ -163,7 +78,25 @@
    memory types or non-temporal stores are required to use SFENCE in their own
    code prior to calling verbs to start a DMA.
 */
-#define udma_to_device_barrier() wmb()
+#if defined(__i386__)
+#define udma_to_device_barrier() asm volatile("" ::: "memory")
+#elif defined(__x86_64__)
+#define udma_to_device_barrier() asm volatile("" ::: "memory")
+#elif defined(__PPC64__)
+#define udma_to_device_barrier() asm volatile("sync" ::: "memory")
+#elif defined(__PPC__)
+#define udma_to_device_barrier() asm volatile("sync" ::: "memory")
+#elif defined(__ia64__)
+#define udma_to_device_barrier() asm volatile("mf" ::: "memory")
+#elif defined(__sparc_v9__)
+#define udma_to_device_barrier() asm volatile("membar #StoreStore" ::: "memory")
+#elif defined(__aarch64__)
+#define udma_to_device_barrier() asm volatile("dsb st" ::: "memory")
+#elif defined(__sparc__) || defined(__s390x__)
+#define udma_to_device_barrier() asm volatile("" ::: "memory")
+#else
+#error No architecture specific memory barrier defines found!
+#endif
 
 /* Ensure that all ordered stores from the device are observable from the
    CPU. This only makes sense after something that observes an ordered store
@@ -177,7 +110,25 @@
    that is a DMA target, to ensure that the following reads see the
    data written before the MemWr TLP that set the valid bit.
 */
-#define udma_from_device_barrier() rmb()
+#if defined(__i386__)
+#define udma_from_device_barrier() asm volatile("lock; addl $0,0(%%esp) " ::: "memory")
+#elif defined(__x86_64__)
+#define udma_from_device_barrier() asm volatile("lfence" ::: "memory")
+#elif defined(__PPC64__)
+#define udma_from_device_barrier() asm volatile("lwsync" ::: "memory")
+#elif defined(__PPC__)
+#define udma_from_device_barrier() asm volatile("sync" ::: "memory")
+#elif defined(__ia64__)
+#define udma_from_device_barrier() asm volatile("mf" ::: "memory")
+#elif defined(__sparc_v9__)
+#define udma_from_device_barrier() asm volatile("membar #LoadLoad" ::: "memory")
+#elif defined(__aarch64__)
+#define udma_from_device_barrier() asm volatile("dsb ld" ::: "memory");
+#elif defined(__sparc__) || defined(__s390x__)
+#define udma_from_device_barrier() asm volatile("" ::: "memory")
+#else
+#error No architecture specific memory barrier defines found!
+#endif
 
 /* Order writes to CPU memory so that a DMA device cannot view writes after
    the barrier without also seeing all writes before the barrier. This does
@@ -198,24 +149,66 @@
       udma_ordering_write_barrier();  // Guarantee WQE written in order
       wqe->valid = 1;
 */
-#define udma_ordering_write_barrier() wmb()
+#define udma_ordering_write_barrier() udma_to_device_barrier()
 
-/* Promptly flush writes, possibly in a write buffer, to MMIO backed memory.
-   This is not required to have any effect on CPU memory. If done while
-   holding a lock then the ordering of MMIO writes across CPUs must be
-   guarenteed to follow the natural ordering implied by the lock.
+/* Promptly flush writes to MMIO Write Combining memory.
+   This should be used after a write to WC memory. This is both a barrier
+   and a hint to the CPU to flush any buffers to reduce latency to TLP
+   generation.
+
+   This is not required to have any effect on CPU memory.
+
+   If done while holding a lock then the ordering of MMIO writes across CPUs
+   must be guaranteed to follow the natural ordering implied by the lock.
 
    This must also act as a barrier that prevents write combining, eg
      *wc_mem = 1;
      mmio_flush_writes();
      *wc_mem = 2;
-   Must always produce two MemWr TLPs, the '2' cannot be combined with and
-   supress the '1'.
+   Must always produce two MemWr TLPs, '1' and '2'. Without the barrier
+   the CPU is allowed to produce a single TLP '2'.
+
+   Note that there is no ordering guarantee for writes to WC memory without
+   barriers.
+
+   This is intended to be used in conjunction with WC memory to generate large
+   PCI-E MemWr TLPs from the CPU.
+*/
+#if defined(__i386__)
+#define mmio_flush_writes() asm volatile("lock; addl $0,0(%%esp) " ::: "memory")
+#elif defined(__x86_64__)
+#define mmio_flush_writes() asm volatile("sfence" ::: "memory")
+#elif defined(__PPC64__)
+#define mmio_flush_writes() asm volatile("sync" ::: "memory")
+#elif defined(__PPC__)
+#define mmio_flush_writes() asm volatile("sync" ::: "memory")
+#elif defined(__ia64__)
+#define mmio_flush_writes() asm volatile("fwb" ::: "memory")
+#elif defined(__sparc_v9__)
+#define mmio_flush_writes() asm volatile("membar #StoreStore" ::: "memory")
+#elif defined(__aarch64__)
+#define mmio_flush_writes() asm volatile("dsb st" ::: "memory");
+#elif defined(__sparc__) || defined(__s390x__)
+#define mmio_flush_writes() asm volatile("" ::: "memory")
+#else
+#error No architecture specific memory barrier defines found!
+#endif
+
+/* Prevent WC writes from being re-ordered relative to other MMIO
+   writes. This should be used before a write to WC memory.
+
+   This must act as a barrier to prevent write re-ordering from different
+   memory types:
+     *mmio_mem = 1;
+     mmio_flush_writes();
+     *wc_mem = 2;
+   Must always produce a TLP '1' followed by '2'.
+
+   This barrier implies udma_to_device_barrier()
 
-   This is intended to be used in conjunction with write combining memory
-   to generate large PCI-E MemWr TLPs from the CPU.
+   This is intended to be used in conjunction with WC memory to generate large
+   PCI-E MemWr TLPs from the CPU.
 */
-#define mmio_flush_writes() wc_wmb()
+#define mmio_wc_start() mmio_flush_writes()
 
 /* Keep MMIO writes in order.
    Currently we lack writel macros that universally guarentee MMIO
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* RE: [PATCH rdma-core 03/14] cxgb3: Update to use new udma write barriers
       [not found]     ` <1487272989-8215-4-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-02-16 21:20       ` Steve Wise
  2017-02-16 21:45         ` Jason Gunthorpe
  0 siblings, 1 reply; 65+ messages in thread
From: Steve Wise @ 2017-02-16 21:20 UTC (permalink / raw)
  To: 'Jason Gunthorpe', linux-rdma-u79uwXL29TY76Z2rM5mHXA

> Steve says the chip reads until an EOP marked WR is found, so the only
> write barrier is to make sure DMA is ready before setting the WR in
> that way.
> 
> Add missing rmb()s in the obvious spot.
> 
> Signed-off-by: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
> ---
>  providers/cxgb3/cq.c      | 2 ++
>  providers/cxgb3/cxio_wr.h | 2 +-
>  2 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/providers/cxgb3/cq.c b/providers/cxgb3/cq.c
> index a6158ce771f89f..eddcd43dc3bd8e 100644
> --- a/providers/cxgb3/cq.c
> +++ b/providers/cxgb3/cq.c
> @@ -121,6 +121,7 @@ static inline int cxio_poll_cq(struct t3_wq *wq, struct t3_cq *cq,
> 
>  	*cqe_flushed = 0;
>  	hw_cqe = cxio_next_cqe(cq);
> +	udma_from_device_barrier();
> 
>  	/*
>  	 * Skip cqes not affiliated with a QP.
> @@ -266,6 +267,7 @@ static int iwch_poll_cq_one(struct iwch_device *rhp, struct iwch_cq *chp,
>  	int ret = 1;
> 
>  	hw_cqe = cxio_next_cqe(&chp->cq);
> +	udma_from_device_barrier();
> 
>  	if (!hw_cqe)
>  		return 0;

Hey Jason, is it possible the omission on these was never detected because the
memory for cq (and sq and rq) queues is allocated in the kernel by
dma_alloc_coherent(), and mapped to the process's address space?  I'm wondering
how cxgb3 made it 10+ years with this bug...

Steve.



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 03/14] cxgb3: Update to use new udma write barriers
  2017-02-16 21:20       ` Steve Wise
@ 2017-02-16 21:45         ` Jason Gunthorpe
       [not found]           ` <20170216214527.GA13616-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2017-02-16 21:45 UTC (permalink / raw)
  To: Steve Wise; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Thu, Feb 16, 2017 at 03:20:56PM -0600, Steve Wise wrote:

> Hey Jason, is it possible the omission on these was never detected because the
> memory for cq (and sq and rq) queues is allocated in the kernel by
> dma_alloc_coherent(), and mapped to the process's address space?

If the pgprot in userspace is UC then the odds of having a problem are
much lower (but IIRC dma_alloc_coherent does not do that on
x86?).

But DMA coherent memory explicitly doesn't save you from requiring
barriers and it is still playing with fire as the compiler doesn't
know the memory is UC and can re-order loads improperly.

AFAIK, any arch that requires something special for dma_coherent
mappings is already broken for libibverbs in user space - as we do not
have any cache flushing support. So it sort of makes sense to use it
in the kernel, but if it produces anything other than cached memory
things will go terribly wrong for that arch when using libibverbs.

I suspect the primary reason is cxgb3 simply got lucky and the
compiler (that was tested) did not do anything bad, or place any
dependent loads too close to the valid bit load.
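
As a rough sketch of the pattern at risk (hypothetical struct and field
names, not the real cxgb3 CQE layout; assumes <stdint.h> and the new
util/udma_barrier.h):

        struct fake_cqe {
                uint32_t payload;       /* written by the device first */
                uint32_t valid;         /* written by the device last */
        };

        static int poll_one(volatile struct fake_cqe *cqe, uint32_t *out)
        {
                if (!cqe->valid)
                        return 0;
                /* Without this the compiler (or a weakly ordered CPU) is
                   free to load 'payload' before 'valid' and hand back
                   stale data. */
                udma_from_device_barrier();
                *out = cqe->payload;
                return 1;
        }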

x86 is fairly forgiving.. And quite possibly if you test today with
gcc 6 on ARM64 it might be broken?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: [PATCH rdma-core 03/14] cxgb3: Update to use new udma write barriers
       [not found]           ` <20170216214527.GA13616-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-02-16 22:01             ` Steve Wise
  0 siblings, 0 replies; 65+ messages in thread
From: Steve Wise @ 2017-02-16 22:01 UTC (permalink / raw)
  To: 'Jason Gunthorpe'; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

> 
> On Thu, Feb 16, 2017 at 03:20:56PM -0600, Steve Wise wrote:
> 
> > Hey Jason, is it possible the omission on these was never detected because the
> > memory for cq (and sq and rq) queues is allocated in the kernel by
> > dma_alloc_coherent(), and mapped to the process's address space?
> 
> If the pgprot in userspace is UC then the odds of having a problem are
> much lower (but IIRC dma_alloc_coherent does not do that on
> x86?).
> 
> But DMA coherent memory explicitly doesn't save you from requiring
> barriers and it is still playing with fire as the compiler doesn't
> know the memory is UC and can re-order loads improperly.
> 
> AFAIK, any arch that requires something special for dma_coherent
> mappings is already broken for libibverbs in user space - as we do not
> have any cache flushing support. So it sort of makes sense to use it
> in the kernel, but if it produces anything other than cached memory
> things will go terribly wrong for that arch when using libibverbs.
> 
> I suspect the primary reason is cxgb3 simply got lucky and the
> compiler (that was tested) did not do anything bad, or place any
> dependent loads too closely to the valid bit load.
> 
> x86 is fairly forgiving.. And quite possibly if you test today with
> gcc 6 on ARM64 it might be broken?

Ok thanks for the explanation!

Reviewed-by: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: [PATCH rdma-core 02/14] Provide new names for the CPU barriers related to DMA
       [not found]     ` <1487272989-8215-3-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-02-16 22:07       ` Steve Wise
  2017-02-17 16:37         ` Jason Gunthorpe
  0 siblings, 1 reply; 65+ messages in thread
From: Steve Wise @ 2017-02-16 22:07 UTC (permalink / raw)
  To: 'Jason Gunthorpe', linux-rdma-u79uwXL29TY76Z2rM5mHXA

> 
> Broadly speaking, providers are not using the existing macros
> consistently and the existing macros are very poorly defined.
> 
> Due to this poor definition we struggled to implement a sensible
> barrier for ARM64 and just went with the strongest barriers instead.
> 
> Split wmb/wc_wmb into several cases:
>  udma_to_device_barrier - Think dma_map(TO_DEVICE) in kernel terms
>  udma_ordering_write_barrier - Weaker than wmb() in the kernel
>  mmio_flush_writes - Special to help work with WC memory
>  mmio_wc_start - Special to help work with WC memory

I think you left out the mmio_wc_start() implementation?


>  mmio_ordered_writes_hack - Stand in for the lack of an ordered writel()
> 
> rmb becomes:
>  udma_from_device_barrier - Think dma_unmap(FROM_DEVICE) in kernel terms
> 
> The split forces provider authors to think about what they are doing more
> carefully and the comments provide a solid explanation for when the barrier
> is actually supposed to be used and when to use it with the common idioms
> all drivers seem to have.
> 
> NOTE: do not assume that the existing asm optimally implements the defined
> semantics. The required semantics were derived primarily from what the
> existing providers do.
> 
> Signed-off-by: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
> ---
>  util/udma_barrier.h | 107 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 107 insertions(+)
> 
> diff --git a/util/udma_barrier.h b/util/udma_barrier.h
> index 57ab0f76cbe33e..f9b8291db20210 100644
> --- a/util/udma_barrier.h
> +++ b/util/udma_barrier.h
> @@ -122,4 +122,111 @@
> 
>  #endif
> 
> +/* Barriers for DMA.
> +
> +   These barriers are expliclty only for use with user DMA operations. If you
> +   are looking for barriers to use with cache-coherent multi-threaded
> +   consitency then look in stdatomic.h. If you need both kinds of synchronicity
> +   for the same address then use an atomic operation followed by one
> +   of these barriers.
> +
> +   When reasoning about these barriers there are two objects:
> +     - CPU attached address space (the CPU memory could be a range of things:
> +       cached/uncached/non-temporal CPU DRAM, uncached MMIO space in another
> +       device, pMEM). Generally speaking the ordering is only relative
> +       to the local CPU's view of the system. Eg if the local CPU
> +       is not guarenteed to see a write from another CPU then it is also
> +       OK for the DMA device to also not see the write after the barrier.
> +     - A DMA initiator on a bus. For instance a PCI-E device issuing
> +       MemRd/MemWr TLPs.
> +
> +   The ordering guarentee is always stated between those two streams. Eg what
> +   happens if a MemRd TLP is sent in via PCI-E relative to a CPU WRITE to the
> +   same memory location.
> +*/
> +
> +/* Ensure that the device's view of memory matches the CPU's view of memory.
> +   This should be placed before any MMIO store that could trigger the device
> +   to begin doing DMA, such as a device doorbell ring.
> +
> +   eg
> +    *dma_buf = 1;
> +    udma_to_device_barrier();
> +    mmio_write(DO_DMA_REG, dma_buf);
> +   Must ensure that the device sees the '1'.
> +
> +   This is required to fence writes created by the libibverbs user. Those
> +   writes could be to any CPU mapped memory object with any cachability mode.
> +
> +   NOTE: x86 has historically used a weaker semantic for this barrier, and
> +   only fenced normal stores to normal memory. libibverbs users using other
> +   memory types or non-temporal stores are required to use SFENCE in their own
> +   code prior to calling verbs to start a DMA.
> +*/
> +#define udma_to_device_barrier() wmb()
> +
> +/* Ensure that all ordered stores from the device are observable from the
> +   CPU. This only makes sense after something that observes an ordered store
> +   from the device - eg by reading a MMIO register or seeing that CPU memory
is
> +   updated.
> +
> +   This guarentees that all reads that follow the barrier see the ordered
> +   stores that preceded the observation.
> +
> +   For instance, this would be used after testing a valid bit in a memory
> +   that is a DMA target, to ensure that the following reads see the
> +   data written before the MemWr TLP that set the valid bit.
> +*/
> +#define udma_from_device_barrier() rmb()
> +
> +/* Order writes to CPU memory so that a DMA device cannot view writes after
> +   the barrier without also seeing all writes before the barrier. This does
> +   not guarentee any writes are visible to DMA.
> +
> +   This would be used in cases where a DMA buffer might have a valid bit and
> +   data, this barrier is placed after writing the data but before writing the
> +   valid bit to ensure the DMA device cannot observe a set valid bit with
> +   unwritten data.
> +
> +   Compared to udma_to_device_barrier() this barrier is not required to fence
> +   anything but normal stores to normal malloc memory. Usage should be:
> +
> +   write_wqe
> +      udma_to_device_barrier();    // Get user memory ready for DMA
> +      wqe->addr = ...;
> +      wqe->flags = ...;
> +      udma_ordering_write_barrier();  // Guarantee WQE written in order
> +      wqe->valid = 1;
> +*/
> +#define udma_ordering_write_barrier() wmb()
> +
> +/* Promptly flush writes, possibly in a write buffer, to MMIO backed memory.
> +   This is not required to have any effect on CPU memory. If done while
> +   holding a lock then the ordering of MMIO writes across CPUs must be
> +   guarenteed to follow the natural ordering implied by the lock.
> +
> +   This must also act as a barrier that prevents write combining, eg
> +     *wc_mem = 1;
> +     mmio_flush_writes();
> +     *wc_mem = 2;
> +   Must always produce two MemWr TLPs, the '2' cannot be combined with and
> +   supress the '1'.
> +
> +   This is intended to be used in conjunction with write combining memory
> +   to generate large PCI-E MemWr TLPs from the CPU.
> +*/
> +#define mmio_flush_writes() wc_wmb()
> +
> +/* Keep MMIO writes in order.
> +   Currently we lack writel macros that universally guarentee MMIO
> +   writes happen in order, like the kernel does. Even worse many
> +   providers haphazardly open code writes to MMIO memory omitting even
> +   volatile.
> +
> +   Until this can be fixed with a proper writel macro, this barrier
> +   is a stand in to indicate places where MMIO writes should be switched
> +   to some future writel.
> +*/
> +#define mmio_ordered_writes_hack() mmio_flush_writes()
> +
>  #endif
> --
> 2.7.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 02/14] Provide new names for the CPU barriers related to DMA
  2017-02-16 22:07       ` Steve Wise
@ 2017-02-17 16:37         ` Jason Gunthorpe
  0 siblings, 0 replies; 65+ messages in thread
From: Jason Gunthorpe @ 2017-02-17 16:37 UTC (permalink / raw)
  To: Steve Wise; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Thu, Feb 16, 2017 at 04:07:54PM -0600, Steve Wise wrote:
> > 
> > Broadly speaking, providers are not using the existing macros
> > consistently and the existing macros are very poorly defined.
> > 
> > Due to this poor definition we struggled to implement a sensible
> > barrier for ARM64 and just went with the strongest barriers instead.
> > 
> > Split wmb/wc_wmb into several cases:
> >  udma_to_device_barrier - Think dma_map(TO_DEVICE) in kernel terms
> >  udma_ordering_write_barrier - Weaker than wmb() in the kernel
> >  mmio_flush_writes - Special to help work with WC memory
> >  mmio_wc_start - Special to help work with WC memory
> 
> I think you left out the mmio_wc_start() implementation?

Oops, that hunk ended up in  patch 14. I've fixed it thanks

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 13/14] vmw_pvrdma: Update to use new udma write barriers
       [not found]     ` <1487272989-8215-14-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-02-17 18:05       ` Adit Ranadive
  0 siblings, 0 replies; 65+ messages in thread
From: Adit Ranadive @ 2017-02-17 18:05 UTC (permalink / raw)
  To: Jason Gunthorpe, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Thu, Feb 16, 2017 at 12:23:08AM -0700, Jason Gunthorpe wrote:
> For some reason write barriers were placed after the writes, move
> them before.
> 
> Signed-off-by: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
> ---
>  providers/vmw_pvrdma/cq.c | 6 +++---
>  providers/vmw_pvrdma/qp.c | 8 ++++----
>  2 files changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/providers/vmw_pvrdma/cq.c b/providers/vmw_pvrdma/cq.c
> index f24d80742678bd..701f0522f7b0dd 100644
> --- a/providers/vmw_pvrdma/cq.c
> +++ b/providers/vmw_pvrdma/cq.c
> @@ -109,7 +109,7 @@ retry:
>  	if (!cqe)
>  		return CQ_EMPTY;
>  
> -	rmb();
> +	udma_from_device_barrier();
>  
>  	if (ctx->qp_tbl[cqe->qp & 0xFFFF])
>  		*cur_qp = (struct pvrdma_qp *)ctx->qp_tbl[cqe->qp & 0xFFFF];
> @@ -184,11 +184,11 @@ void pvrdma_cq_clean_int(struct pvrdma_cq *cq, uint32_t qpn)
>  			if (tail < 0)
>  				tail = cq->cqe_cnt - 1;
>  			curr_cqe = get_cqe(cq, curr);
> -			rmb();
> +			udma_from_device_barrier();
>  			if ((curr_cqe->qp & 0xFFFF) != qpn) {
>  				if (curr != tail) {
>  					cqe = get_cqe(cq, tail);
> -					rmb();
> +					udma_from_device_barrier();
>  					*cqe = *curr_cqe;
>  				}
>  				tail--;
> diff --git a/providers/vmw_pvrdma/qp.c b/providers/vmw_pvrdma/qp.c
> index d2e2189fda6de4..116063ee07c83b 100644
> --- a/providers/vmw_pvrdma/qp.c
> +++ b/providers/vmw_pvrdma/qp.c
> @@ -404,11 +404,10 @@ int pvrdma_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr,
>  			sge++;
>  		}
>  
> +		udma_to_device_barrier();
>  		pvrdma_idx_ring_inc(&(qp->sq.ring_state->prod_tail),
>  				    qp->sq.wqe_cnt);
>  
> -		wmb();
> -
>  		qp->sq.wrid[ind] = wr->wr_id;
>  		++ind;
>  		if (ind >= qp->sq.wqe_cnt)
> @@ -416,11 +415,12 @@ int pvrdma_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr,
>  	}
>  
>  out:
> -	if (nreq)
> +	if (nreq) {
> +		udma_to_device_barrier();
>  		pvrdma_write_uar_qp(ctx->uar,
>  				    PVRDMA_UAR_QP_SEND | ibqp->qp_num);
> +	}
>  
> -	wmb();
>  	pthread_spin_unlock(&qp->sq.lock);
>  
>  	return ret;
> 

Thanks! Not sure how we missed that barrier. I guess nothing bad happened.

Acked-by: Adit Ranadive <aditr-pghWNbHTmq7QT0dZR+AlfA@public.gmane.org>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: [PATCH rdma-core 04/14] cxgb4: Update to use new udma write barriers
       [not found]     ` <1487272989-8215-5-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-02-17 20:16       ` Steve Wise
  0 siblings, 0 replies; 65+ messages in thread
From: Steve Wise @ 2017-02-17 20:16 UTC (permalink / raw)
  To: 'Jason Gunthorpe', linux-rdma-u79uwXL29TY76Z2rM5mHXA

> Based on help from Steve the barriers here are changed to consistently
> bracket WC memory writes with wc_wmb() like other drivers do.
> 
> This allows some of the wc_wmb() calls that were not related to WC
> memory be downgraded to wmb().
> 
> The driver was probably correct (at least for x86-64) but did not
> follow the idiom established by the other drivers for working with
> WC memory.
> 
> Signed-off-by: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>

I'm not going to address any of the FIXME issues yet, but:

Reviewed-by: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 11/14] ocrdma: Update to use new udma write barriers
       [not found]     ` <1487272989-8215-12-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-02-18 16:21       ` Devesh Sharma
  0 siblings, 0 replies; 65+ messages in thread
From: Devesh Sharma @ 2017-02-18 16:21 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: linux-rdma

Looks good!

Acked-By: Devesh Sharma <devesh.sharma-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>

On Fri, Feb 17, 2017 at 12:53 AM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> Move the barriers closer to the actual action being protected eg
> put udma_to_device_barrier in ocrdma_ring_*.
>
> Add a wc_wmb() barrier before starting WC writes for consistency
> with other drivers.
>
> Signed-off-by: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
> ---
>  providers/ocrdma/ocrdma_verbs.c | 16 ++++++++++++----
>  1 file changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/providers/ocrdma/ocrdma_verbs.c b/providers/ocrdma/ocrdma_verbs.c
> index 7fc841a194127d..3725d63a9b88f3 100644
> --- a/providers/ocrdma/ocrdma_verbs.c
> +++ b/providers/ocrdma/ocrdma_verbs.c
> @@ -1111,18 +1111,24 @@ int ocrdma_destroy_qp(struct ibv_qp *ibqp)
>  static void ocrdma_ring_sq_db(struct ocrdma_qp *qp)
>  {
>         uint32_t db_val = ocrdma_cpu_to_le((qp->sq.dbid | (1 << 16)));
> +
> +       udma_to_device_barrier();
>         *(uint32_t *) (((uint8_t *) qp->db_sq_va)) = db_val;
>  }
>
>  static void ocrdma_ring_rq_db(struct ocrdma_qp *qp)
>  {
>         uint32_t db_val = ocrdma_cpu_to_le((qp->rq.dbid | (1 << qp->db_shift)));
> +
> +       udma_to_device_barrier();
>         *(uint32_t *) ((uint8_t *) qp->db_rq_va) = db_val;
>  }
>
>  static void ocrdma_ring_srq_db(struct ocrdma_srq *srq)
>  {
>         uint32_t db_val = ocrdma_cpu_to_le(srq->rq.dbid | (1 << srq->db_shift));
> +
> +       udma_to_device_barrier();
>         *(uint32_t *) (srq->db_va) = db_val;
>  }
>
> @@ -1141,6 +1147,7 @@ static void ocrdma_ring_cq_db(struct ocrdma_cq *cq, uint32_t armed,
>                 val |= (1 << OCRDMA_DB_CQ_SOLICIT_SHIFT);
>         val |= (num_cqe << OCRDMA_DB_CQ_NUM_POPPED_SHIFT);
>
> +       udma_to_device_barrier();
>         *(uint32_t *) ((uint8_t *) (cq->db_va) + OCRDMA_DB_CQ_OFFSET) =
>             ocrdma_cpu_to_le(val);
>  }
> @@ -1322,6 +1329,9 @@ static void ocrdma_build_dpp_wqe(void *va, struct ocrdma_hdr_wqe *wqe,
>  {
>         uint32_t pyld_len = (wqe->cw >> OCRDMA_WQE_SIZE_SHIFT) * 2;
>         uint32_t i = 0;
> +
> +       mmio_wc_start();
> +
>         /* convert WQE header to LE format */
>         for (; i < hdr_len; i++)
>                 *((uint32_t *) va + i) =
> @@ -1329,7 +1339,8 @@ static void ocrdma_build_dpp_wqe(void *va, struct ocrdma_hdr_wqe *wqe,
>         /* Convertion of data is done in HW */
>         for (; i < pyld_len; i++)
>                 *((uint32_t *) va + i) = (*((uint32_t *) wqe + i));
> -       wc_wmb();
> +
> +       mmio_flush_writes();
>  }
>
>  static void ocrdma_post_dpp_wqe(struct ocrdma_qp *qp,
> @@ -1439,7 +1450,6 @@ int ocrdma_post_send(struct ibv_qp *ib_qp, struct ibv_send_wr *wr,
>                                       OCRDMA_WQE_SIZE_MASK) *
>                                       OCRDMA_WQE_STRIDE);
>
> -               wmb();
>                 ocrdma_ring_sq_db(qp);
>
>                 /* update pointer, counter for next wr */
> @@ -1501,7 +1511,6 @@ int ocrdma_post_recv(struct ibv_qp *ibqp, struct ibv_recv_wr *wr,
>                 rqe = ocrdma_hwq_head(&qp->rq);
>                 ocrdma_build_rqe(rqe, wr, 0);
>                 qp->rqe_wr_id_tbl[qp->rq.head] = wr->wr_id;
> -               wmb();
>                 ocrdma_ring_rq_db(qp);
>
>                 /* update pointer, counter for next wr */
> @@ -2082,7 +2091,6 @@ int ocrdma_post_srq_recv(struct ibv_srq *ibsrq, struct ibv_recv_wr *wr,
>                 ocrdma_build_rqe(rqe, wr, tag);
>                 srq->rqe_wr_id_tbl[tag] = wr->wr_id;
>
> -               wmb();
>                 ocrdma_ring_srq_db(srq);
>
>                 /* update pointer, counter for next wr */
> --
> 2.7.4
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 07/14] mlx4: Update to use new udma write barriers
       [not found]     ` <1487272989-8215-8-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-02-20 17:46       ` Yishai Hadas
       [not found]         ` <206559e5-0488-f6d5-c4ec-bf560e0c3ba6-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Yishai Hadas @ 2017-02-20 17:46 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yishai Hadas, Matan Barak,
	Majd Dibbiny

On 2/16/2017 9:23 PM, Jason Gunthorpe wrote:
> The mlx4 comments are good so these translate fairly directly.
>
> - Added barrier at the top of mlx4_post_send, this makes the driver
>   ready for a change to a stronger udma_to_device_barrier /
> >   weaker udma_ordering_write_barrier() which would make the post loop a bit
>   faster. No change on x86-64
> - The wmb() directly before the BF copy is upgraded to a wc_wmb(),
>   this is consistent with what mlx5 does and makes sense.
>
> Signed-off-by: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
> ---
>  providers/mlx4/cq.c  |  6 +++---
>  providers/mlx4/qp.c  | 19 +++++++++++--------
>  providers/mlx4/srq.c |  2 +-
>  3 files changed, 15 insertions(+), 12 deletions(-)
>
> diff --git a/providers/mlx4/cq.c b/providers/mlx4/cq.c
> index 6a5cf8be218892..14f8cbce6d75ed 100644
> --- a/providers/mlx4/cq.c
> +++ b/providers/mlx4/cq.c
> @@ -222,7 +222,7 @@ static inline int mlx4_get_next_cqe(struct mlx4_cq *cq,
>  	 * Make sure we read CQ entry contents after we've checked the
>  	 * ownership bit.
>  	 */
> -	rmb();
> +	udma_from_device_barrier();
>
>  	*pcqe = cqe;
>
> @@ -698,7 +698,7 @@ int mlx4_arm_cq(struct ibv_cq *ibvcq, int solicited)
>  	 * Make sure that the doorbell record in host memory is
>  	 * written before ringing the doorbell via PCI MMIO.
>  	 */
> -	wmb();
> +	udma_to_device_barrier();
>
>  	doorbell[0] = htonl(sn << 28 | cmd | cq->cqn);
>  	doorbell[1] = htonl(ci);
> @@ -764,7 +764,7 @@ void __mlx4_cq_clean(struct mlx4_cq *cq, uint32_t qpn, struct mlx4_srq *srq)
>  		 * Make sure update of buffer contents is done before
>  		 * updating consumer index.
>  		 */
> -		wmb();
> +		udma_to_device_barrier();
>  		mlx4_update_cons_index(cq);
>  	}
>  }
> diff --git a/providers/mlx4/qp.c b/providers/mlx4/qp.c
> index a607326c7c452c..77a4a34576cb69 100644
> --- a/providers/mlx4/qp.c
> +++ b/providers/mlx4/qp.c
> @@ -204,7 +204,7 @@ static void set_data_seg(struct mlx4_wqe_data_seg *dseg, struct ibv_sge *sg)
>  	 * chunk and get a valid (!= * 0xffffffff) byte count but
>  	 * stale data, and end up sending the wrong data.
>  	 */
> -	wmb();
> +	udma_ordering_write_barrier();
>
>  	if (likely(sg->length))
>  		dseg->byte_count = htonl(sg->length);
> @@ -228,6 +228,9 @@ int mlx4_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr,
>
>  	pthread_spin_lock(&qp->sq.lock);
>
> +	/* Get all user DMA buffers ready to go */
> +	udma_to_device_barrier();
> +
>  	/* XXX check that state is OK to post send */

It's not clear why we need an extra barrier here. What is the future
optimization that you are pointing to?
We prefer not to add any new instructions, at least on some archs, without
a clear justification.

>  	ind = qp->sq.head;
> @@ -400,7 +403,7 @@ int mlx4_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr,
>  					wqe += to_copy;
>  					addr += to_copy;
>  					seg_len += to_copy;
> -					wmb(); /* see comment below */
> +					udma_ordering_write_barrier(); /* see comment below */
>  					seg->byte_count = htonl(MLX4_INLINE_SEG | seg_len);
>  					seg_len = 0;
>  					seg = wqe;
> @@ -428,7 +431,7 @@ int mlx4_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr,
>  				 * data, and end up sending the wrong
>  				 * data.
>  				 */
> -				wmb();
> +				udma_ordering_write_barrier();
>  				seg->byte_count = htonl(MLX4_INLINE_SEG | seg_len);
>  			}
>
> @@ -450,7 +453,7 @@ int mlx4_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr,
>  		 * setting ownership bit (because HW can start
>  		 * executing as soon as we do).
>  		 */
> -		wmb();
> +		udma_ordering_write_barrier();
>
>  		ctrl->owner_opcode = htonl(mlx4_ib_opcode[wr->opcode]) |
>  			(ind & qp->sq.wqe_cnt ? htonl(1 << 31) : 0);
> @@ -478,7 +481,7 @@ out:
>  		 * Make sure that descriptor is written to memory
>  		 * before writing to BlueFlame page.
>  		 */
> -		wmb();
> +		mmio_wc_start();

Why make this change, which affects at least x86_64? The data was
previously written to host memory, so we expect that wmb is enough
here. See the above comment, which explicitly points this out.
>
>  		++qp->sq.head;
>
> @@ -486,7 +489,7 @@ out:
>
>  		mlx4_bf_copy(ctx->bf_page + ctx->bf_offset, (unsigned long *) ctrl,
>  			     align(size * 16, 64));
> -		wc_wmb();
> +		mmio_flush_writes();
>
>  		ctx->bf_offset ^= ctx->bf_buf_size;
>
> @@ -498,7 +501,7 @@ out:
>  		 * Make sure that descriptors are written before
>  		 * doorbell record.
>  		 */
> -		wmb();
> +		udma_to_device_barrier();
>
>  		mmio_writel((unsigned long)(ctx->uar + MLX4_SEND_DOORBELL),
>  			    qp->doorbell_qpn);
> @@ -566,7 +569,7 @@ out:
>  		 * Make sure that descriptors are written before
>  		 * doorbell record.
>  		 */
> -		wmb();
> +		udma_to_device_barrier();
>
>  		*qp->db = htonl(qp->rq.head & 0xffff);
>  	}
> diff --git a/providers/mlx4/srq.c b/providers/mlx4/srq.c
> index 4f90efdf927209..6e4ff5663d019b 100644
> --- a/providers/mlx4/srq.c
> +++ b/providers/mlx4/srq.c
> @@ -113,7 +113,7 @@ int mlx4_post_srq_recv(struct ibv_srq *ibsrq,
>  		 * Make sure that descriptors are written before
>  		 * we write doorbell record.
>  		 */
> -		wmb();
> +		udma_to_device_barrier();
>
>  		*srq->db = htonl(srq->counter);
>  	}
>

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 07/14] mlx4: Update to use new udma write barriers
       [not found]         ` <206559e5-0488-f6d5-c4ec-bf560e0c3ba6-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2017-02-21 18:14           ` Jason Gunthorpe
       [not found]             ` <20170221181407.GA13138-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2017-02-21 18:14 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yishai Hadas, Matan Barak,
	Majd Dibbiny

On Mon, Feb 20, 2017 at 07:46:02PM +0200, Yishai Hadas wrote:

> > 	pthread_spin_lock(&qp->sq.lock);
> >
> >+	/* Get all user DMA buffers ready to go */
> >+	udma_to_device_barrier();
> >+
> > 	/* XXX check that state is OK to post send */
> 
> Not clear why do we need here an extra barrier ? what is the future
> optimization that you pointed on ?

Writes to different memory types are not guaranteed to be strongly
ordered. This is narrowly true on x86-64 too, apparently.

The purpose of udma_to_device_barrier is to serialize all memory
types.

This allows the follow on code to use the weaker
'udma_ordering_write_barrier' when it knows it is working exclusively
with cached memory.

Eg in an ideal world on x86 udma_to_device_barrier should be SFENCE
and 'udma_ordering_write_barrier' should be compiler_barrier()

Since the barriers have been so ill defined and wonky on x86, other
arches use much stronger barriers than they actually need for each
case. Eg ARM64 can switch to much weaker barriers.

Since there is no cost on x86-64 to this barrier I would like to leave
it here as it lets us actually optimize the ARM and other barriers. If
you take it out then 'udma_ordering_write_barrier' is forced to be the
strongest barrier on all arches.
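
A minimal sketch of the split I'm describing (simplified WQE layout, not
the actual mlx4 structures; assumes the new util/udma_barrier.h):

        struct fake_wqe {
                uint32_t addr;
                uint32_t flags;
                uint32_t owner;         /* HW claims the WQE once this flips */
        };

        static void post_one(struct fake_wqe *wqe, uint32_t addr, uint32_t flags)
        {
                /* One strong barrier up front serializes every prior store,
                   of any memory type, against the DMA the doorbell will
                   eventually start. */
                udma_to_device_barrier();

                wqe->addr = addr;
                wqe->flags = flags;
                /* From here on only normal cached stores are involved, so the
                   weaker barrier (ideally just a compiler barrier on x86-64)
                   is enough to keep the owner bit written last. */
                udma_ordering_write_barrier();
                wqe->owner = 1;
        }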

> >@@ -478,7 +481,7 @@ out:
> > 		 * Make sure that descriptor is written to memory
> > 		 * before writing to BlueFlame page.
> > 		 */
> >-		wmb();
> >+		mmio_wc_start();
> 
> Why to make this change which affects at least X86_64 ? the data was
> previously written to the host memory so we expect that wmb is
> enough here.

Same as above, writes to different memory types are not strongly
ordered and wmb() is not the right barrier to serialize writes to
DMA'able ctrl with the WC memcpy.

If this is left wrong then other arches are again required to
adopt the strongest barrier for everything, which hurts them.

Even on x86, it is very questionable to not have the SFENCE in that
spot. AFAIK it is not defined to be strongly ordered.

mlx5 has the SFENCE here, for instance.
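
The bracketing idiom this series standardizes on, sketched with
placeholder names ('bf_reg' standing in for the write-combining
BlueFlame mapping, not the real mlx4 code):

        static void bf_copy(volatile uint64_t *bf_reg, const uint64_t *ctrl,
                            unsigned int qwords)
        {
                unsigned int i;

                /* Order the cached-memory descriptor stores ahead of the WC
                   copy; on x86-64 this is the SFENCE that mlx5 already issues
                   here. */
                mmio_wc_start();

                for (i = 0; i != qwords; i++)
                        bf_reg[i] = ctrl[i];    /* the WC copy itself */

                /* Close the WC buffer so the MemWr TLP goes out promptly. */
                mmio_flush_writes();
        }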

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: [PATCH rdma-core 14/14] Remove the old barrier macros
       [not found]     ` <1487272989-8215-15-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-02-23 13:33       ` Amrani, Ram
       [not found]         ` <SN1PR07MB22070A48ACD50848267A5AD8F8530-mikhvbZlbf8TSoR2DauN2+FPX92sqiQdvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Amrani, Ram @ 2017-02-23 13:33 UTC (permalink / raw)
  To: Jason Gunthorpe, linux-rdma-u79uwXL29TY76Z2rM5mHXA

>  /* Ensure that the device's view of memory matches the CPU's view of memory.
> @@ -163,7 +78,25 @@
>     memory types or non-temporal stores are required to use SFENCE in their own
>     code prior to calling verbs to start a DMA.
>  */
> -#define udma_to_device_barrier() wmb()
> +#if defined(__i386__)
> +#define udma_to_device_barrier() asm volatile("" ::: "memory")
> +#elif defined(__x86_64__)
> +#define udma_to_device_barrier() asm volatile("" ::: "memory")
> +#elif defined(__PPC64__)
> +#define udma_to_device_barrier() asm volatile("sync" ::: "memory")
> +#elif defined(__PPC__)
> +#define udma_to_device_barrier() asm volatile("sync" ::: "memory")
> +#elif defined(__ia64__)
> +#define udma_to_device_barrier() asm volatile("mf" ::: "memory")
> +#elif defined(__sparc_v9__)
> +#define udma_to_device_barrier() asm volatile("membar #StoreStore" ::: "memory")
> +#elif defined(__aarch64__)
> +#define wmb() asm volatile("dsb st" ::: "memory");
> +#elif defined(__sparc__) || defined(__s390x__)
> +#define udma_to_device_barrier() asm volatile("" ::: "memory")
> +#else
> +#error No architecture specific memory barrier defines found!
> +#endif

In the kernel wmb() translates, for x86_64, into 'sfence'.
In user-space, however wmb, and now its "successor", udma_to_device_barrier,
translate to volatile("" ::: "memory")

What is the reasoning behind this?
Why aren't the kernel and user space implementations the same?

Thanks,
Ram

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: [PATCH rdma-core 12/14] qedr: Update to use new udma write barriers
       [not found]     ` <1487272989-8215-13-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-02-23 13:49       ` Amrani, Ram
       [not found]         ` <SN1PR07MB2207DE206738E6DD8511CEA1F8530-mikhvbZlbf8TSoR2DauN2+FPX92sqiQdvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Amrani, Ram @ 2017-02-23 13:49 UTC (permalink / raw)
  To: Jason Gunthorpe, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: Elior, Ariel, Kalderon, Michal

Thanks, Jason, for this work.

Per this change, I don't see any need in using mmio_wc_start() in qedr.
In libqedr each buffer is flushed immediately after it is filled with
data so there is no need to do that again before filling it in the next
iteration.
wmb() is invoked to make sure that the CPU doesn't issue the doorbell before
the data was flushed (that's why I expected its implementation to be as the
kernel's i.e. sfence. See e-mail relating to 14/14).
Hence please replace each mmio_wc_start() in the code below with
udma_to_device_barrier().

See another tiny comment inside.

Thanks,
Ram


> qedr uses WC memory for its '.db' mmap, so all writes to it have
> to be wrapped in the WC barriers. This upgrades the leading
> wmb to a wc_wmb() for consistency with other drivers.
> 
> Signed-off-by: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
> ---
>  providers/qedr/qelr_verbs.c | 32 +++++++++++++++-----------------
>  1 file changed, 15 insertions(+), 17 deletions(-)
> 
> diff --git a/providers/qedr/qelr_verbs.c b/providers/qedr/qelr_verbs.c
> index 95cd429e1b9b47..c8a0db2c9c1cfd 100644
> --- a/providers/qedr/qelr_verbs.c
> +++ b/providers/qedr/qelr_verbs.c
> @@ -672,9 +672,9 @@ static int qelr_update_qp_state(struct qelr_qp *qp,
>  			/* Update doorbell (in case post_recv was done before
>  			 * move to RTR)
>  			 */
> -			wmb();
> +			mmio_wc_start();
>  			writel(qp->rq.db_data.raw, qp->rq.db);
> -			wc_wmb();
> +			mmio_flush_writes();
>  			break;
>  		case QELR_QPS_ERR:
>  			break;
> @@ -1096,7 +1096,7 @@ static void doorbell_edpm_qp(struct qelr_qp *qp)
>  	if (!qp->edpm.is_edpm)
>  		return;
> 
> -	wmb();
> +	mmio_wc_start();
> 
>  	qp->edpm.msg.data.icid = qp->sq.db_data.data.icid;
>  	qp->edpm.msg.data.prod_val = qp->sq.db_data.data.value;
> @@ -1116,15 +1116,16 @@ static void doorbell_edpm_qp(struct qelr_qp *qp)
>  		       sizeof(uint64_t));
> 
>  		bytes += sizeof(uint64_t);
> -		/* Need to place a barrier after every 64 bytes */
> +		/* Since we rewrite the buffer every 64 bytes we need to flush
> +		   it here, otherwise the CPU could optimize alway the

alway --> away

> +		   duplicate stores. */
>  		if (bytes == 64) {
> -			wc_wmb();
> +			mmio_flush_writes();
>  			bytes = 0;
>  		}
>  		offset++;
>  	}
> -
> -	wc_wmb();
> +	mmio_flush_writes();
>  }
> 
>  int qelr_post_send(struct ibv_qp *ib_qp, struct ibv_send_wr *wr,
> @@ -1363,11 +1364,9 @@ int qelr_post_send(struct ibv_qp *ib_qp, struct ibv_send_wr *wr,
>  	}
> 
>  	if (!qp->edpm.is_edpm) {
> -		wmb();
> -
> +		mmio_wc_start();
>  		writel(qp->sq.db_data.raw, qp->sq.db);
> -
> -		wc_wmb();
> +		mmio_flush_writes();
>  	}
> 
>  	pthread_spin_unlock(&qp->q_lock);
> @@ -1446,14 +1445,13 @@ int qelr_post_recv(struct ibv_qp *ibqp, struct ibv_recv_wr *wr,
> 
>  		qelr_inc_sw_prod_u16(&qp->rq);
> 
> -		wmb();
> +		mmio_wc_start();
> 
>  		db_val = le16toh(qp->rq.db_data.data.value) + 1;
>  		qp->rq.db_data.data.value = htole16(db_val);
> 
>  		writel(qp->rq.db_data.raw, qp->rq.db);
> -
> -		wc_wmb();
> +		mmio_flush_writes();
> 
>  		wr = wr->next;
>  	}
> @@ -1795,12 +1793,12 @@ static int qelr_poll_cq_resp(struct qelr_qp *qp, struct qelr_cq *cq,
> 
>  static void doorbell_cq(struct qelr_cq *cq, uint32_t cons, uint8_t flags)
>  {
> -	wmb();
> +	mmio_wc_start();
>  	cq->db.data.agg_flags = flags;
>  	cq->db.data.value = htole32(cons);
> 
>  	writeq(cq->db.raw, cq->db_addr);
> -	wc_wmb();
> +	mmio_flush_writes();
>  }
> 
>  int qelr_poll_cq(struct ibv_cq *ibcq, int num_entries, struct ibv_wc *wc)
> @@ -1816,7 +1814,7 @@ int qelr_poll_cq(struct ibv_cq *ibcq, int num_entries, struct ibv_wc *wc)
>  		struct qelr_qp *qp;
> 
>  		/* prevent speculative reads of any field of CQE */
> -		rmb();
> +		udma_from_device_barrier();
> 
>  		qp = cqe_get_qp(cqe);
>  		if (!qp) {
> --
> 2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 14/14] Remove the old barrier macros
       [not found]         ` <SN1PR07MB22070A48ACD50848267A5AD8F8530-mikhvbZlbf8TSoR2DauN2+FPX92sqiQdvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-02-23 16:59           ` Jason Gunthorpe
  0 siblings, 0 replies; 65+ messages in thread
From: Jason Gunthorpe @ 2017-02-23 16:59 UTC (permalink / raw)
  To: Amrani, Ram; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Thu, Feb 23, 2017 at 01:33:33PM +0000, Amrani, Ram wrote:
> >  /* Ensure that the device's view of memory matches the CPU's view of memory.
> > @@ -163,7 +78,25 @@
> >     memory types or non-temporal stores are required to use SFENCE in their own
> >     code prior to calling verbs to start a DMA.
> >  */
> > -#define udma_to_device_barrier() wmb()
> > +#if defined(__i386__)
> > +#define udma_to_device_barrier() asm volatile("" ::: "memory")
> > +#elif defined(__x86_64__)
> > +#define udma_to_device_barrier() asm volatile("" ::: "memory")
> > +#elif defined(__PPC64__)
> > +#define udma_to_device_barrier() asm volatile("sync" ::: "memory")
> > +#elif defined(__PPC__)
> > +#define udma_to_device_barrier() asm volatile("sync" ::: "memory")
> > +#elif defined(__ia64__)
> > +#define udma_to_device_barrier() asm volatile("mf" ::: "memory")
> > +#elif defined(__sparc_v9__)
> > +#define udma_to_device_barrier() asm volatile("membar #StoreStore" ::: "memory")
> > +#elif defined(__aarch64__)
> > +#define udma_to_device_barrier() asm volatile("dsb st" ::: "memory")
> > +#elif defined(__sparc__) || defined(__s390x__)
> > +#define udma_to_device_barrier() asm volatile("" ::: "memory")
> > +#else
> > +#error No architecture specific memory barrier defines found!
> > +#endif
> 
> In the kernel wmb() translates, for x86_64, into 'sfence'.

Yes. Keep in mind the kernel wmb is doing something different, it is
basically defined as the strongest possible barrier that does SMP and
DMA strong ordering.

Based on this historical barrier in verbs, the belief was apparently
that on x86 DMA observes stores strictly in program order for cacheable
memory. I have no idea if that is actually true or not..

> In user-space, however wmb, and now its "successor", udma_to_device_barrier,
> translate to volatile("" ::: "memory")

Yes :(

> What is the reasoning behind this?
> Why aren't the kernel and user space implementations the same?

I don't know. It is something that doesn't really make sense. Because
it is a weaker barrier, user code using certain SSE stuff is going to
have to use SFENCE to be correct.

Arguably we should change it to be SFENCE.. Allowing that is why
I included the weaker udma_ordering_write_barrier..

I put this remark in the comments on udma_to_device_barrier:

   NOTE: x86 has historically used a weaker semantic for this barrier, and
   only fenced normal stores to normal memory. libibverbs users using other
   memory types or non-temporal stores are required to use SFENCE in their own
   code prior to calling verbs to start a DMA.
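
For a libibverbs user that remark means something like the following
sketch (the buffer fill and the verbs call are placeholders, the point is
only the explicit _mm_sfence() after the non-temporal stores):

  #include <stddef.h>
  #include <emmintrin.h>          /* _mm_stream_si32 */
  #include <xmmintrin.h>          /* _mm_sfence */

  static void fill_payload_nt(int *buf, size_t nwords)
  {
          size_t i;

          /* non-temporal stores bypass the cache and are weakly ordered */
          for (i = 0; i < nwords; i++)
                  _mm_stream_si32(&buf[i], 0x5a5a5a5a);

          /* under the historical x86 semantic the application, not verbs,
             must fence these before starting DMA, e.g. before calling
             ibv_post_send() on a WR that points at buf */
          _mm_sfence();
  }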

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 12/14] qedr: Update to use new udma write barriers
       [not found]         ` <SN1PR07MB2207DE206738E6DD8511CEA1F8530-mikhvbZlbf8TSoR2DauN2+FPX92sqiQdvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-02-23 17:30           ` Jason Gunthorpe
       [not found]             ` <20170223173047.GC6688-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2017-02-23 17:30 UTC (permalink / raw)
  To: Amrani, Ram
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Elior, Ariel, Kalderon, Michal

On Thu, Feb 23, 2017 at 01:49:47PM +0000, Amrani, Ram wrote:
> Thanks, Jason, for this work.
> 
> Per this change, I don't see any need in using mmio_wc_start() in
> qedr.

The required pattern when dealing with WC memory is this:

 mmio_wc_start();
 writel_to_wc(..);
 mmio_flush_writes();

On x86-64 this translates to

 sfence
 mov ..
 sfence

As far as I know, on x86-64 both sfences are required to ensure that
all stores are visible to the device in strict program order. x86-64
does not strongly order accesses to WC relative to other memory types
without SFENCE.

mmio_wc_start() is defined to be a superset of the functionality of
udma_to_device_barrier() - and it is given a unique name to make it
very clear where to use it: directly before writing to WC memory.
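
If it helps, the x86-64 definitions consistent with the above would look
roughly like this (a sketch only, the authoritative versions live in
util/udma_barrier.h):

  /* order stores to cachable memory vs DMA: compiler barrier on x86-64 */
  #define udma_to_device_barrier()  asm volatile("" ::: "memory")

  /* flush the write-combining buffers out to the device */
  #define mmio_flush_writes()       asm volatile("sfence" ::: "memory")

  /* superset of udma_to_device_barrier(), placed directly before WC stores */
  #define mmio_wc_start()           mmio_flush_writes()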

> wmb() is invoked to make sure that the CPU doesn't issue the
> doorbell before the data was flushed (that's why I expected its
> implementation to be as the kernel's i.e. sfence. See e-mail

Exactly. mmio_wc_start() is defined as the barrier to do that.  It
specifically guarantees that all writes to C and UC memory are ordered
before upcoming writes to WC memory.

It is SFENCE on x86-64, which as you say, is what you expected in the
first place, so this fixes a bug :\..

FWIW the

#define mmio_wc_start() mmio_flush_writes()

Is just a stand in that happens to get a safe instruction on every
arch.

Each of the macros could have a unique implementation if the arch
requires it; however, in most cases I would expect that
udma_to_device_barrier(), mmio_flush_writes() and mmio_wc_start() are
the same instruction.

The use of different names is purely descriptive, to aid in
understanding what the barriers do and when to use them.

Does that address your concern?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: [PATCH rdma-core 12/14] qedr: Update to use new udma write barriers
       [not found]             ` <20170223173047.GC6688-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-02-24 10:01               ` Amrani, Ram
  0 siblings, 0 replies; 65+ messages in thread
From: Amrani, Ram @ 2017-02-24 10:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Elior, Ariel, Kalderon, Michal

 
> Does that address your concern?
> 

I understand your reasoning.
I'll consult internally on the subject and see how we adopt the patch.
Thanks for the thorough explanation.

Ram

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 08/14] mlx5: Update to use new udma write barriers
       [not found]     ` <1487272989-8215-9-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-02-27 10:56       ` Yishai Hadas
       [not found]         ` <d5921636-1911-5588-8c59-620066bca01a-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Yishai Hadas @ 2017-02-27 10:56 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yishai Hadas, Majd Dibbiny

On 2/16/2017 9:23 PM, Jason Gunthorpe wrote:
> The mlx5 comments are good so these translate fairly directly.
>
> There is one barrier in mlx5_arm_cq that I could not explain, it became
> mmio_ordered_writes_hack()
>
> Signed-off-by: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
> ---

>  #ifdef MLX5_DEBUG
>  	{
> @@ -1283,14 +1283,14 @@ int mlx5_arm_cq(struct ibv_cq *ibvcq, int solicited)
>  	 * Make sure that the doorbell record in host memory is
>  	 * written before ringing the doorbell via PCI MMIO.
>  	 */
> -	wmb();
> +	udma_to_device_barrier();
>
>  	doorbell[0] = htonl(sn << 28 | cmd | ci);
>  	doorbell[1] = htonl(cq->cqn);
>
>  	mlx5_write64(doorbell, ctx->uar[0] + MLX5_CQ_DOORBELL, &ctx->lock32);
>
> -	wc_wmb();
> +	mmio_ordered_writes_hack();

We expect the new "mmio_flush_writes()" macro to be used here, instead of 
the "hack" one above. This barrier forces the data to be flushed 
immediately to the device so that the CQ will be armed with no delay.

>  	return 0;
>  }

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 08/14] mlx5: Update to use new udma write barriers
       [not found]         ` <d5921636-1911-5588-8c59-620066bca01a-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2017-02-27 18:00           ` Jason Gunthorpe
       [not found]             ` <20170227180009.GL5891-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2017-02-27 18:00 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yishai Hadas, Majd Dibbiny

On Mon, Feb 27, 2017 at 12:56:33PM +0200, Yishai Hadas wrote:
> On 2/16/2017 9:23 PM, Jason Gunthorpe wrote:
> >The mlx5 comments are good so these translate fairly directly.
> >
> >There is one barrier in mlx5_arm_cq that I could not explain, it became
> >mmio_ordered_writes_hack()
> >
> >Signed-off-by: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
> 
> > #ifdef MLX5_DEBUG
> > 	{
> >@@ -1283,14 +1283,14 @@ int mlx5_arm_cq(struct ibv_cq *ibvcq, int solicited)
> > 	 * Make sure that the doorbell record in host memory is
> > 	 * written before ringing the doorbell via PCI MMIO.
> > 	 */
> >-	wmb();
> >+	udma_to_device_barrier();
> >
> > 	doorbell[0] = htonl(sn << 28 | cmd | ci);
> > 	doorbell[1] = htonl(cq->cqn);
> >
> > 	mlx5_write64(doorbell, ctx->uar[0] + MLX5_CQ_DOORBELL, &ctx->lock32);
> >
> >-	wc_wmb();
> >+	mmio_ordered_writes_hack();
> 
> We expect the new "mmio_flush_writes()" macro to be used here, instead of
> the "hack" one above. This barrier forces the data to be flushed immediately
> to the device so that the CQ will be armed with no delay.

Hmm.....

Is it even possible to 'speed up' writes to UC memory? (uar is
UC, right?)

Be aware of the trade-off: a barrier may stall the CPU until the UC
writes progress far enough, but that stall is pointless if the barrier
doesn't also 'speed up' the write.

Also, the usual implementation of mlx5_write64 includes that spinlock,
which already has a serializing atomic in it - so it is doubtful that
the wc_wmb() actually ever did anything.

Do you have any hard information one way or another?

IMHO, if there is a way to speed up UC writes then it should have its
own macro, e.g. mmio_flush_uc_writes(), and it should probably be called
within the mlx5_write64 implementation before releasing the spinlock.
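
A sketch of that shape (mmio_flush_uc_writes() is hypothetical, and the
lock helper names are placeholders):

  static inline void mlx5_write64_uc(uint32_t val[2], void *dest,
                                     struct mlx5_spinlock *lock)
  {
          mlx5_spin_lock(lock);
          *(volatile uint32_t *)dest = val[0];
          *(volatile uint32_t *)((char *)dest + 4) = val[1];
          mmio_flush_uc_writes();   /* hypothetical: push the UC writes along */
          mlx5_spin_unlock(lock);
  }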

But, AFAIK, there is no way to do that on x86-64...

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 00/14] Revise the DMA barrier macros in ibverbs
       [not found] ` <1487272989-8215-1-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
                     ` (13 preceding siblings ...)
  2017-02-16 19:23   ` [PATCH rdma-core 14/14] Remove the old barrier macros Jason Gunthorpe
@ 2017-02-28 16:00   ` Doug Ledford
       [not found]     ` <1488297611.86943.215.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  14 siblings, 1 reply; 65+ messages in thread
From: Doug Ledford @ 2017-02-28 16:00 UTC (permalink / raw)
  To: Jason Gunthorpe, linux-rdma-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 3635 bytes --]

On Thu, 2017-02-16 at 12:22 -0700, Jason Gunthorpe wrote:
> Now that the header is private to the library we can change it.
> 
> We have never had a clear definition for what our wmb() or wc_wmb()
> even do,
> and they don't match the same macros in the kernel. This causes
> problems for
> non x86 arches as they have no idea what to put in their versions of
> the macros
> and often just put the strongest thing.
> 
> This also causes problems for the driver authors who have no idea how
> to use
> these barriers properly, there are several instances of that :(
> 
> My approach here is to introduce a selection of macros that have
> narrow and
> clearly defined purposes. The selection is based on things the set of
> drivers
> actually do, which turns out to be fairly narrowly defined.
> 
> Then I went through all the drivers and adjusted them to various
> degrees to
> use the new macro names. In a few drivers I added more/stronger
> barriers.
> Overall this tries hard not to break anything by weaking existing
> barriers.
> 
> A future project for someone would be to see if the CPU ASM makes
> sense..
> 
> https://github.com/linux-rdma/rdma-core/pull/79
> 
> Jason Gunthorpe (14):
>   mlx5: Use stdatomic for the in_use barrier
>   Provide new names for the CPU barriers related to DMA
>   cxgb3: Update to use new udma write barriers
>   cxgb4: Update to use new udma write barriers
>   hns: Update to use new udma write barriers
>   i40iw: Get rid of unique barrier macros
>   mlx4: Update to use new udma write barriers
>   mlx5: Update to use new udma write barriers
>   nes: Update to use new udma write barriers
>   mthca: Update to use new mmio write barriers
>   ocrdma: Update to use new udma write barriers
>   qedr: Update to use new udma write barriers
>   vmw_pvrdma: Update to use new udma write barriers
>   Remove the old barrier macros
> 
>  providers/cxgb3/cq.c             |   2 +
>  providers/cxgb3/cxio_wr.h        |   2 +-
>  providers/cxgb4/qp.c             |  20 +++-
>  providers/cxgb4/t4.h             |  48 ++++++--
>  providers/cxgb4/verbs.c          |   2 +
>  providers/hns/hns_roce_u_hw_v1.c |  13 +-
>  providers/i40iw/i40iw_osdep.h    |  14 ---
>  providers/i40iw/i40iw_uk.c       |  26 ++--
>  providers/mlx4/cq.c              |   6 +-
>  providers/mlx4/qp.c              |  19 +--
>  providers/mlx4/srq.c             |   2 +-
>  providers/mlx5/cq.c              |   8 +-
>  providers/mlx5/mlx5.h            |   7 +-
>  providers/mlx5/qp.c              |  18 +--
>  providers/mlx5/srq.c             |   2 +-
>  providers/mthca/cq.c             |  10 +-
>  providers/mthca/doorbell.h       |   2 +-
>  providers/mthca/qp.c             |  20 ++--
>  providers/mthca/srq.c            |   6 +-
>  providers/nes/nes_uverbs.c       |  16 +--
>  providers/ocrdma/ocrdma_verbs.c  |  16 ++-
>  providers/qedr/qelr_verbs.c      |  32 +++--
>  providers/vmw_pvrdma/cq.c        |   6 +-
>  providers/vmw_pvrdma/qp.c        |   8 +-
>  util/udma_barrier.h              | 250 +++++++++++++++++++++++++++
> ------------
>  25 files changed, 354 insertions(+), 201 deletions(-)

Thanks, series merged.

-- 
Doug Ledford <dledford@redhat.com>
    GPG KeyID: B826A3330E572FDD
   
Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 08/14] mlx5: Update to use new udma write barriers
       [not found]             ` <20170227180009.GL5891-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-02-28 16:02               ` Yishai Hadas
       [not found]                 ` <2969cce4-8b51-8fcf-f099-2b42a6d40a9c-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Yishai Hadas @ 2017-02-28 16:02 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yishai Hadas, Majd Dibbiny,
	Doug Ledford

On 2/27/2017 8:00 PM, Jason Gunthorpe wrote:
> On Mon, Feb 27, 2017 at 12:56:33PM +0200, Yishai Hadas wrote:
>> On 2/16/2017 9:23 PM, Jason Gunthorpe wrote:
>>> The mlx5 comments are good so these translate fairly directly.
>>>
>>> There is one barrier in mlx5_arm_cq that I could not explain, it became
>>> mmio_ordered_writes_hack()
>>>
>>> Signed-off-by: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
>>
>>> #ifdef MLX5_DEBUG
>>> 	{
>>> @@ -1283,14 +1283,14 @@ int mlx5_arm_cq(struct ibv_cq *ibvcq, int solicited)
>>> 	 * Make sure that the doorbell record in host memory is
>>> 	 * written before ringing the doorbell via PCI MMIO.
>>> 	 */
>>> -	wmb();
>>> +	udma_to_device_barrier();
>>>
>>> 	doorbell[0] = htonl(sn << 28 | cmd | ci);
>>> 	doorbell[1] = htonl(cq->cqn);
>>>
>>> 	mlx5_write64(doorbell, ctx->uar[0] + MLX5_CQ_DOORBELL, &ctx->lock32);
>>>
>>> -	wc_wmb();
>>> +	mmio_ordered_writes_hack();
>>
>> We expect to use here the "mmio_flush_writes()" new macro, instead of the
>> above "hack" one. This barrier enforces the data to be flushed immediately
>> to the device so that the CQ will be armed with no delay.
>
> Hmm.....
>
> Is it even possible to 'speed up' writes to UC memory? (uar is
> UC, right?)
>

No, the UAR is mapped write-combining; that's why we need
mmio_flush_writes() here, to make sure that the device will see the data
with no delay.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 00/14] Revise the DMA barrier macros in ibverbs
       [not found]     ` <1488297611.86943.215.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-02-28 16:38       ` Majd Dibbiny
       [not found]         ` <C6384D48-FC47-4046-8025-462E1CB02A34-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Majd Dibbiny @ 2017-02-28 16:38 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Jason Gunthorpe, linux-rdma-u79uwXL29TY76Z2rM5mHXA


> On Feb 28, 2017, at 6:00 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> 
>> On Thu, 2017-02-16 at 12:22 -0700, Jason Gunthorpe wrote:
>> Now that the header is private to the library we can change it.
>> 
>> We have never had a clear definition for what our wmb() or wc_wmb()
>> even do,
>> and they don't match the same macros in the kernel. This causes
>> problems for
>> non x86 arches as they have no idea what to put in their versions of
>> the macros
>> and often just put the strongest thing.
>> 
>> This also causes problems for the driver authors who have no idea how
>> to use
>> these barriers properly, there are several instances of that :(
>> 
>> My approach here is to introduce a selection of macros that have
>> narrow and
>> clearly defined purposes. The selection is based on things the set of
>> drivers
>> actually do, which turns out to be fairly narrowly defined.
>> 
>> Then I went through all the drivers and adjusted them to various
>> degrees to
>> use the new macro names. In a few drivers I added more/stronger
>> barriers.
>> Overall this tries hard not to break anything by weaking existing
>> barriers.
>> 
>> A future project for someone would be to see if the CPU ASM makes
>> sense..
>> 
>> https://github.com/linux-rdma/rdma-core/pull/79
>> 
>> Jason Gunthorpe (14):
>>   mlx5: Use stdatomic for the in_use barrier
>>   Provide new names for the CPU barriers related to DMA
>>   cxgb3: Update to use new udma write barriers
>>   cxgb4: Update to use new udma write barriers
>>   hns: Update to use new udma write barriers
>>   i40iw: Get rid of unique barrier macros
>>   mlx4: Update to use new udma write barriers
>>   mlx5: Update to use new udma write barriers
>>   nes: Update to use new udma write barriers
>>   mthca: Update to use new mmio write barriers
>>   ocrdma: Update to use new udma write barriers
>>   qedr: Update to use new udma write barriers
>>   vmw_pvrdma: Update to use new udma write barriers
>>   Remove the old barrier macros
>> 
>>  providers/cxgb3/cq.c             |   2 +
>>  providers/cxgb3/cxio_wr.h        |   2 +-
>>  providers/cxgb4/qp.c             |  20 +++-
>>  providers/cxgb4/t4.h             |  48 ++++++--
>>  providers/cxgb4/verbs.c          |   2 +
>>  providers/hns/hns_roce_u_hw_v1.c |  13 +-
>>  providers/i40iw/i40iw_osdep.h    |  14 ---
>>  providers/i40iw/i40iw_uk.c       |  26 ++--
>>  providers/mlx4/cq.c              |   6 +-
>>  providers/mlx4/qp.c              |  19 +--
>>  providers/mlx4/srq.c             |   2 +-
>>  providers/mlx5/cq.c              |   8 +-
>>  providers/mlx5/mlx5.h            |   7 +-
>>  providers/mlx5/qp.c              |  18 +--
>>  providers/mlx5/srq.c             |   2 +-
>>  providers/mthca/cq.c             |  10 +-
>>  providers/mthca/doorbell.h       |   2 +-
>>  providers/mthca/qp.c             |  20 ++--
>>  providers/mthca/srq.c            |   6 +-
>>  providers/nes/nes_uverbs.c       |  16 +--
>>  providers/ocrdma/ocrdma_verbs.c  |  16 ++-
>>  providers/qedr/qelr_verbs.c      |  32 +++--
>>  providers/vmw_pvrdma/cq.c        |   6 +-
>>  providers/vmw_pvrdma/qp.c        |   8 +-
>>  util/udma_barrier.h              | 250 +++++++++++++++++++++++++++
>> ------------
>>  25 files changed, 354 insertions(+), 201 deletions(-)
> 
> Thanks, series merged.
Hi Doug,

This is a very sensitive series that we haven't Acked yet.
It might have performance and functional/correctness implications.
Also, as you can see, we still have open discussions with Jason.

We would really appreciate a heads-up before merging topics that have ongoing discussions, so that we get a chance to Ack.

Thanks
> 
> -- 
> Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>     GPG KeyID: B826A3330E572FDD
>    
> Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 08/14] mlx5: Update to use new udma write barriers
       [not found]                 ` <2969cce4-8b51-8fcf-f099-2b42a6d40a9c-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2017-02-28 17:06                   ` Jason Gunthorpe
       [not found]                     ` <20170228170658.GA17995-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2017-02-28 17:06 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yishai Hadas, Majd Dibbiny,
	Doug Ledford

On Tue, Feb 28, 2017 at 06:02:51PM +0200, Yishai Hadas wrote:
> On 2/27/2017 8:00 PM, Jason Gunthorpe wrote:
> >On Mon, Feb 27, 2017 at 12:56:33PM +0200, Yishai Hadas wrote:
> >>On 2/16/2017 9:23 PM, Jason Gunthorpe wrote:
> >>>The mlx5 comments are good so these translate fairly directly.
> >>>
> >>>There is one barrier in mlx5_arm_cq that I could not explain, it became
> >>>mmio_ordered_writes_hack()
> >>>
> >>>Signed-off-by: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
> >>
> >>>#ifdef MLX5_DEBUG
> >>>	{
> >>>@@ -1283,14 +1283,14 @@ int mlx5_arm_cq(struct ibv_cq *ibvcq, int solicited)
> >>>	 * Make sure that the doorbell record in host memory is
> >>>	 * written before ringing the doorbell via PCI MMIO.
> >>>	 */
> >>>-	wmb();
> >>>+	udma_to_device_barrier();
> >>>
> >>>	doorbell[0] = htonl(sn << 28 | cmd | ci);
> >>>	doorbell[1] = htonl(cq->cqn);
> >>>
> >>>	mlx5_write64(doorbell, ctx->uar[0] + MLX5_CQ_DOORBELL, &ctx->lock32);
> >>>
> >>>-	wc_wmb();
> >>>+	mmio_ordered_writes_hack();
> >>
> >>We expect the new "mmio_flush_writes()" macro to be used here, instead of
> >>the "hack" one above. This barrier forces the data to be flushed immediately
> >>to the device so that the CQ will be armed with no delay.
> >
> >Hmm.....
> >
> >Is it even possible to 'speed up' writes to UC memory? (uar is
> >UC, right?)
> >
> 
> No, the UAR is mapped write-combining; that's why we need
> mmio_flush_writes() here, to make sure that the device will see the data
> with no delay.

Okay, my mistake, then it must be this, which increases the leading
barrier to a wc_wmb().. See the discussion with Ram:

http://marc.info/?l=linux-rdma&m=148787108423193&w=2

However, I question the locking here. It made sense if uar was a UC
region, but as WC it looks wrong to me.

For the 32 bit case, the flush needs to be done while holding the
spinlock or a concurrent caller can corrupt the 64 bit word.

For the 64 bit case, there is no locking at all, so delivery of the
doorbell write is not guaranteed if there are concurrent callers.
(The comment in _mlx5_post_send applies here too)

Alternatively, if there is only one possible caller and cq->cqn is
constant, then things are OK but the 32 bit spinlock isn't required at
all.

diff --git a/providers/mlx5/cq.c b/providers/mlx5/cq.c
index cc0af920c703d9..1ce2cf2dd0bbd4 100644
--- a/providers/mlx5/cq.c
+++ b/providers/mlx5/cq.c
@@ -1281,16 +1281,16 @@ int mlx5_arm_cq(struct ibv_cq *ibvcq, int solicited)
 
 	/*
 	 * Make sure that the doorbell record in host memory is
-	 * written before ringing the doorbell via PCI MMIO.
+	 * written before ringing the doorbell via PCI WC MMIO.
 	 */
-	udma_to_device_barrier();
+	mmio_wc_start();
 
 	doorbell[0] = htonl(sn << 28 | cmd | ci);
 	doorbell[1] = htonl(cq->cqn);
 
 	mlx5_write64(doorbell, ctx->uar[0] + MLX5_CQ_DOORBELL, &ctx->lock32);
 
-	mmio_ordered_writes_hack();
+	mmio_flush_writes();
 
 	return 0;
 }

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 00/14] Revise the DMA barrier macros in ibverbs
       [not found]         ` <C6384D48-FC47-4046-8025-462E1CB02A34-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2017-02-28 17:47           ` Doug Ledford
  0 siblings, 0 replies; 65+ messages in thread
From: Doug Ledford @ 2017-02-28 17:47 UTC (permalink / raw)
  To: Majd Dibbiny; +Cc: Jason Gunthorpe, linux-rdma-u79uwXL29TY76Z2rM5mHXA


[-- Attachment #1.1: Type: text/plain, Size: 4272 bytes --]

On 2/28/2017 11:38 AM, Majd Dibbiny wrote:
> 
>> On Feb 28, 2017, at 6:00 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>>
>>> On Thu, 2017-02-16 at 12:22 -0700, Jason Gunthorpe wrote:
>>> Now that the header is private to the library we can change it.
>>>
>>> We have never had a clear definition for what our wmb() or wc_wmb()
>>> even do,
>>> and they don't match the same macros in the kernel. This causes
>>> problems for
>>> non x86 arches as they have no idea what to put in their versions of
>>> the macros
>>> and often just put the strongest thing.
>>>
>>> This also causes problems for the driver authors who have no idea how
>>> to use
>>> these barriers properly, there are several instances of that :(
>>>
>>> My approach here is to introduce a selection of macros that have
>>> narrow and
>>> clearly defined purposes. The selection is based on things the set of
>>> drivers
>>> actually do, which turns out to be fairly narrowly defined.
>>>
>>> Then I went through all the drivers and adjusted them to various
>>> degrees to
>>> use the new macro names. In a few drivers I added more/stronger
>>> barriers.
>>> Overall this tries hard not to break anything by weaking existing
>>> barriers.
>>>
>>> A future project for someone would be to see if the CPU ASM makes
>>> sense..
>>>
>>> https://github.com/linux-rdma/rdma-core/pull/79
>>>
>>> Jason Gunthorpe (14):
>>>   mlx5: Use stdatomic for the in_use barrier
>>>   Provide new names for the CPU barriers related to DMA
>>>   cxgb3: Update to use new udma write barriers
>>>   cxgb4: Update to use new udma write barriers
>>>   hns: Update to use new udma write barriers
>>>   i40iw: Get rid of unique barrier macros
>>>   mlx4: Update to use new udma write barriers
>>>   mlx5: Update to use new udma write barriers
>>>   nes: Update to use new udma write barriers
>>>   mthca: Update to use new mmio write barriers
>>>   ocrdma: Update to use new udma write barriers
>>>   qedr: Update to use new udma write barriers
>>>   vmw_pvrdma: Update to use new udma write barriers
>>>   Remove the old barrier macros
>>>
>>>  providers/cxgb3/cq.c             |   2 +
>>>  providers/cxgb3/cxio_wr.h        |   2 +-
>>>  providers/cxgb4/qp.c             |  20 +++-
>>>  providers/cxgb4/t4.h             |  48 ++++++--
>>>  providers/cxgb4/verbs.c          |   2 +
>>>  providers/hns/hns_roce_u_hw_v1.c |  13 +-
>>>  providers/i40iw/i40iw_osdep.h    |  14 ---
>>>  providers/i40iw/i40iw_uk.c       |  26 ++--
>>>  providers/mlx4/cq.c              |   6 +-
>>>  providers/mlx4/qp.c              |  19 +--
>>>  providers/mlx4/srq.c             |   2 +-
>>>  providers/mlx5/cq.c              |   8 +-
>>>  providers/mlx5/mlx5.h            |   7 +-
>>>  providers/mlx5/qp.c              |  18 +--
>>>  providers/mlx5/srq.c             |   2 +-
>>>  providers/mthca/cq.c             |  10 +-
>>>  providers/mthca/doorbell.h       |   2 +-
>>>  providers/mthca/qp.c             |  20 ++--
>>>  providers/mthca/srq.c            |   6 +-
>>>  providers/nes/nes_uverbs.c       |  16 +--
>>>  providers/ocrdma/ocrdma_verbs.c  |  16 ++-
>>>  providers/qedr/qelr_verbs.c      |  32 +++--
>>>  providers/vmw_pvrdma/cq.c        |   6 +-
>>>  providers/vmw_pvrdma/qp.c        |   8 +-
>>>  util/udma_barrier.h              | 250 +++++++++++++++++++++++++++
>>> ------------
>>>  25 files changed, 354 insertions(+), 201 deletions(-)
>>
>> Thanks, series merged.
> Hi Doug,
> 
> This is a very sensitive series that we haven't Acked yet.
> It might have performance and functional/correctness implications.
> Also, as you can see we still have open discussions with Jason.
> 
> We would really appreciate a heads up before merging topics that have on going discussions and give us a chance to Ack.

Yes, I'm seeing that.  I had thought it was more settled than it
evidently is.  But I haven't merged anything else yet, so any follow-on
patches can be taken before anything else and produce a nice sequential
patch flow.


-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
    GPG Key ID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 06/14] i40iw: Get rid of unique barrier macros
       [not found]     ` <1487272989-8215-7-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-03-01 17:29       ` Shiraz Saleem
       [not found]         ` <20170301172920.GA11340-GOXS9JX10wfOxmVO0tvppfooFf0ArEBIu+b9c/7xato@public.gmane.org>
  2017-03-06 18:18       ` Shiraz Saleem
  1 sibling, 1 reply; 65+ messages in thread
From: Shiraz Saleem @ 2017-03-01 17:29 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Nikolova, Tatyana E

On Thu, Feb 16, 2017 at 12:23:01PM -0700, Jason Gunthorpe wrote:
> Use our standard versions from util instead. There doesn't seem
> to be anything tricky here, but the inlined versions were like our
> wc_wmb() barriers, not the wmb().
> 
> There appears to be no WC memory in this driver, so despite the comments,
> these barriers are also making sure that user DMA data is flushed out. Make
> them all wmb()
> 
> Guess at where the missing rmb() should be.
> 
> Signed-off-by: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
> ---
>  providers/i40iw/i40iw_osdep.h | 14 --------------
>  providers/i40iw/i40iw_uk.c    | 26 ++++++++++++++------------
>  2 files changed, 14 insertions(+), 26 deletions(-)
> 
> diff --git a/providers/i40iw/i40iw_osdep.h b/providers/i40iw/i40iw_osdep.h
> index fddedf40dd8ae2..92bedd31633eb5 100644
> --- a/providers/i40iw/i40iw_osdep.h
> +++ b/providers/i40iw/i40iw_osdep.h
> @@ -105,18 +105,4 @@ static inline void db_wr32(u32 value, u32 *wqe_word)
>  #define ACQUIRE_LOCK()
>  #define RELEASE_LOCK()
>  
> -#if defined(__i386__)
> -#define i40iw_mb() mb()		/* full memory barrier */
> -#define i40iw_wmb() mb()	/* write memory barrier */
> -#elif defined(__x86_64__)
> -#define i40iw_mb() asm volatile("mfence" ::: "memory")	 /* full memory barrier */
> -#define i40iw_wmb() asm volatile("sfence" ::: "memory")	 /* write memory barrier */
> -#else
> -#define i40iw_mb() mb()		/* full memory barrier */
> -#define i40iw_wmb() wmb()	/* write memory barrier */
> -#endif
> -#define i40iw_rmb() rmb()	/* read memory barrier */
> -#define i40iw_smp_mb() smp_mb()		/* memory barrier */
> -#define i40iw_smp_wmb() smp_wmb()	/* write memory barrier */
> -#define i40iw_smp_rmb() smp_rmb()	/* read memory barrier */
>  #endif				/* _I40IW_OSDEP_H_ */
> diff --git a/providers/i40iw/i40iw_uk.c b/providers/i40iw/i40iw_uk.c
> index d3e4fec7d8515b..b20748e9f09199 100644
> --- a/providers/i40iw/i40iw_uk.c
> +++ b/providers/i40iw/i40iw_uk.c
> @@ -75,7 +75,7 @@ static enum i40iw_status_code i40iw_nop_1(struct i40iw_qp_uk *qp)
>  	    LS_64(signaled, I40IWQPSQ_SIGCOMPL) |
>  	    LS_64(qp->swqe_polarity, I40IWQPSQ_VALID) | nop_signature++;
>  
> -	i40iw_wmb();	/* Memory barrier to ensure data is written before valid bit is set */
> +	udma_to_device_barrier();	/* Memory barrier to ensure data is written before valid bit is set */
>  
>  	set_64bit_val(wqe, I40IW_BYTE_24, header);
>  	return 0;
> @@ -91,7 +91,7 @@ void i40iw_qp_post_wr(struct i40iw_qp_uk *qp)
>  	u32 hw_sq_tail;
>  	u32 sw_sq_head;
>  
> -	i40iw_mb(); /* valid bit is written and loads completed before reading shadow */
> +	udma_to_device_barrier(); /* valid bit is written and loads completed before reading shadow */

The constraint here is that the writes to SQ WQE must be globally visible to the 
PCIe device before the read of the shadow area. For x86, since loads can be reordered 
with older stores to a different location, we need some sort of a storeload barrier to 
enforce the constraint and hence the mfence was chosen. The udma_to_device_barrier() 
equates to just a compiler barrier on x86 and isn't sufficient for this purpose.

>  
>  	/* read the doorbell shadow area */
>  	get_64bit_val(qp->shadow_area, I40IW_BYTE_0, &temp);
> @@ -297,7 +297,7 @@ static enum i40iw_status_code i40iw_rdma_write(struct i40iw_qp_uk *qp,
>  		byte_off += 16;
>  	}
>  
> -	i40iw_wmb(); /* make sure WQE is populated before valid bit is set */
> +	udma_to_device_barrier(); /* make sure WQE is populated before valid bit is set */
>  
>  	set_64bit_val(wqe, I40IW_BYTE_24, header);
>  
> @@ -347,7 +347,7 @@ static enum i40iw_status_code i40iw_rdma_read(struct i40iw_qp_uk *qp,
>  
>  	i40iw_set_fragment(wqe, I40IW_BYTE_0, &op_info->lo_addr);
>  
> -	i40iw_wmb(); /* make sure WQE is populated before valid bit is set */
> +	udma_to_device_barrier(); /* make sure WQE is populated before valid bit is set */
>  
>  	set_64bit_val(wqe, I40IW_BYTE_24, header);
>  	if (post_sq)
> @@ -410,7 +410,7 @@ static enum i40iw_status_code i40iw_send(struct i40iw_qp_uk *qp,
>  		byte_off += 16;
>  	}
>  
> -	i40iw_wmb(); /* make sure WQE is populated before valid bit is set */
> +	udma_to_device_barrier(); /* make sure WQE is populated before valid bit is set */
>  
>  	set_64bit_val(wqe, I40IW_BYTE_24, header);
>  	if (post_sq)
> @@ -478,7 +478,7 @@ static enum i40iw_status_code i40iw_inline_rdma_write(struct i40iw_qp_uk *qp,
>  		memcpy(dest, src, op_info->len - I40IW_BYTE_16);
>  	}
>  
> -	i40iw_wmb(); /* make sure WQE is populated before valid bit is set */
> +	udma_to_device_barrier(); /* make sure WQE is populated before valid bit is set */
>  
>  	set_64bit_val(wqe, I40IW_BYTE_24, header);
>  
> @@ -552,7 +552,7 @@ static enum i40iw_status_code i40iw_inline_send(struct i40iw_qp_uk *qp,
>  		memcpy(dest, src, op_info->len - I40IW_BYTE_16);
>  	}
>  
> -	i40iw_wmb(); /* make sure WQE is populated before valid bit is set */
> +	udma_to_device_barrier(); /* make sure WQE is populated before valid bit is set */
>  
>  	set_64bit_val(wqe, I40IW_BYTE_24, header);
>  
> @@ -601,7 +601,7 @@ static enum i40iw_status_code i40iw_stag_local_invalidate(struct i40iw_qp_uk *qp
>  	    LS_64(info->signaled, I40IWQPSQ_SIGCOMPL) |
>  	    LS_64(qp->swqe_polarity, I40IWQPSQ_VALID);
>  
> -	i40iw_wmb(); /* make sure WQE is populated before valid bit is set */
> +	udma_to_device_barrier(); /* make sure WQE is populated before valid bit is set */
>  
>  	set_64bit_val(wqe, I40IW_BYTE_24, header);
>  
> @@ -650,7 +650,7 @@ static enum i40iw_status_code i40iw_mw_bind(struct i40iw_qp_uk *qp,
>  	    LS_64(info->signaled, I40IWQPSQ_SIGCOMPL) |
>  	    LS_64(qp->swqe_polarity, I40IWQPSQ_VALID);
>  
> -	i40iw_wmb(); /* make sure WQE is populated before valid bit is set */
> +	udma_to_device_barrier(); /* make sure WQE is populated before valid bit is set */
>  
>  	set_64bit_val(wqe, I40IW_BYTE_24, header);
>  
> @@ -694,7 +694,7 @@ static enum i40iw_status_code i40iw_post_receive(struct i40iw_qp_uk *qp,
>  		byte_off += 16;
>  	}
>  
> -	i40iw_wmb(); /* make sure WQE is populated before valid bit is set */
> +	udma_to_device_barrier(); /* make sure WQE is populated before valid bit is set */
>  
>  	set_64bit_val(wqe, I40IW_BYTE_24, header);
>  
> @@ -731,7 +731,7 @@ static void i40iw_cq_request_notification(struct i40iw_cq_uk *cq,
>  
>  	set_64bit_val(cq->shadow_area, I40IW_BYTE_32, temp_val);
>  
> -	i40iw_wmb(); /* make sure WQE is populated before valid bit is set */
> +	udma_to_device_barrier(); /* make sure WQE is populated before valid bit is set */
>  
>  	db_wr32(cq->cq_id, cq->cqe_alloc_reg);
>  }
> @@ -780,6 +780,8 @@ static enum i40iw_status_code i40iw_cq_poll_completion(struct i40iw_cq_uk *cq,
>  	if (polarity != cq->polarity)
>  		return I40IW_ERR_QUEUE_EMPTY;
>  
> +	udma_from_device_barrier();
> +
>  	q_type = (u8)RS_64(qword3, I40IW_CQ_SQ);
>  	info->error = (bool)RS_64(qword3, I40IW_CQ_ERROR);
>  	info->push_dropped = (bool)RS_64(qword3, I40IWCQ_PSHDROP);
> @@ -1121,7 +1123,7 @@ enum i40iw_status_code i40iw_nop(struct i40iw_qp_uk *qp,
>  	    LS_64(signaled, I40IWQPSQ_SIGCOMPL) |
>  	    LS_64(qp->swqe_polarity, I40IWQPSQ_VALID);
>  
> -	i40iw_wmb(); /* make sure WQE is populated before valid bit is set */
> +	udma_to_device_barrier(); /* make sure WQE is populated before valid bit is set */
>  
>  	set_64bit_val(wqe, I40IW_BYTE_24, header);
>  	if (post_sq)
> -- 
> 2.7.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 06/14] i40iw: Get rid of unique barrier macros
       [not found]         ` <20170301172920.GA11340-GOXS9JX10wfOxmVO0tvppfooFf0ArEBIu+b9c/7xato@public.gmane.org>
@ 2017-03-01 17:55           ` Jason Gunthorpe
       [not found]             ` <20170301175521.GB14791-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2017-03-01 17:55 UTC (permalink / raw)
  To: Shiraz Saleem; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Nikolova, Tatyana E

On Wed, Mar 01, 2017 at 11:29:20AM -0600, Shiraz Saleem wrote:

> The constraint here is that the writes to SQ WQE must be globally
> visible to the PCIe device before the read of the shadow area.

I'm struggling to understand how this can make any sense..

It looks like shadow_area is in system memory:

providers/i40iw/i40iw_uverbs.c: info.cq_base = memalign(I40IW_HW_PAGE_SIZE, totalsize);
providers/i40iw/i40iw_uverbs.c: info.shadow_area = (u64 *)((u8 *)info.cq_base + (cq_pages << 12));
providers/i40iw/i40iw_uk.c:     qp->shadow_area = info->shadow_area;

Is there DMA occurring to shadow_area?

Is this trying to implement CPU atomics in shadow_area for SMP?

I'm struggling to understand what a store/load barrier should even do
when talking about DMA.

> For x86, since loads can be reordered with older stores to a
> different location, we need some sort of a storeload barrier to
> enforce the constraint and hence the mfence was chosen. The
> udma_to_device_barrier() equates to just a compiler barrier on x86
> and isn't sufficient for this purpose.

We've never had MFENCE in our set of standard barriers. If it really
is a needed operation for DMA then we will have to add a new barrier
macro for it.

However - I can't explain what that macro should be doing relative to
DMA, so I need more help.. Can you diagram a ladder chart to show the
issue?

This is the sequence at the CPU we are looking at:

udma_to_device_barrier(); /* make sure WQE is populated before valid bit is set */
set_64bit_val(wqe, I40IW_BYTE_24, header);
udma_to_device_barrier();
get_64bit_val(qp->shadow_area, I40IW_BYTE_0, &temp);

What is the 2nd actor doing? Is it DMA from the PCI-E device? Is it
another SMP core?

What can go wrong if it executes like this?

get_64bit_val(qp->shadow_area, I40IW_BYTE_0, &temp);
udma_to_device_barrier(); /* make sure WQE is populated before valid bit is set */
set_64bit_val(wqe, I40IW_BYTE_24, header);
udma_to_device_barrier();

Is this the only barrier you are worried about?
Are the other changes OK?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 06/14] i40iw: Get rid of unique barrier macros
       [not found]             ` <20170301175521.GB14791-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-03-01 22:14               ` Shiraz Saleem
       [not found]                 ` <20170301221420.GA18548-GOXS9JX10wfOxmVO0tvppfooFf0ArEBIu+b9c/7xato@public.gmane.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Shiraz Saleem @ 2017-03-01 22:14 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Nikolova, Tatyana E

On Wed, Mar 01, 2017 at 10:55:21AM -0700, Jason Gunthorpe wrote:
> On Wed, Mar 01, 2017 at 11:29:20AM -0600, Shiraz Saleem wrote:
> 
> 
> Is there DMA occurring to shadow_area?

The shadow area contains status variables which are read by SW and 
updated by PCI device.

> 
> Is this trying to implement CPU atomics in shadow_area for SMP?

No.

> What can go wrong if it executes like this?
> 
> get_64bit_val(qp->shadow_area, I40IW_BYTE_0, &temp);
> udma_to_device_barrier(); /* make sure WQE is populated before valid bit is set */
> set_64bit_val(wqe, I40IW_BYTE_24, header);
> udma_to_device_barrier();

We need strict ordering that ensures write of the WQE completes before 
read of the shadow area. This ensures the value read from the shadow can 
be used to determine if a DB ring is needed. If the shadow area is read first, 
the algorithm, in certain cases, would not ring the DB when it should and the 
HW may go idle with work requests posted.
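
Condensed, the sequence being protected (taken from the i40iw_qp_post_wr()
flow in the patch, with the doorbell decision paraphrased in comments) is:

  set_64bit_val(wqe, I40IW_BYTE_24, header);   /* valid bit: WQE handed to HW */
  /* full store-load fence required here (mfence on x86-64) */
  get_64bit_val(qp->shadow_area, I40IW_BYTE_0, &temp);
  /* compare the tail reported in the shadow area with the SW head and
     ring the doorbell only if HW has not already passed this WQE */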

> Is this the only barrier you are worried about?
> Are the other changes OK?

This is of most concern. The other changes look ok. Being reviewed.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 06/14] i40iw: Get rid of unique barrier macros
       [not found]                 ` <20170301221420.GA18548-GOXS9JX10wfOxmVO0tvppfooFf0ArEBIu+b9c/7xato@public.gmane.org>
@ 2017-03-01 23:05                   ` Jason Gunthorpe
       [not found]                     ` <20170301230506.GB2820-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2017-03-01 23:05 UTC (permalink / raw)
  To: Shiraz Saleem; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Nikolova, Tatyana E

On Wed, Mar 01, 2017 at 04:14:20PM -0600, Shiraz Saleem wrote:
> > Is there DMA occurring to shadow_area?
> 
> The shadow area contains status variables which are read by SW and 
> updated by PCI device.

So the device is DMA'ing to it, and the driver is reading DMA memory..

> > What can go wrong if it executes like this?
> > 
> > get_64bit_val(qp->shadow_area, I40IW_BYTE_0, &temp);
> > udma_to_device_barrier(); /* make sure WQE is populated before valid bit is set */
> > set_64bit_val(wqe, I40IW_BYTE_24, header);
> > udma_to_device_barrier();
> 
> We need strict ordering that ensures write of the WQE completes before 
> read of the shadow area.

> This ensures the value read from the shadow can be used to determine
> if a DB ring is needed. If the shadow area is read first, the
> algorithm, in certain cases, would not ring the DB when it should
> and the HW may go idle with work requests posted.

This still is not making a lot of sense to me.. I really need to see a
ladder diagram to understand your case.

Here is an example, I think what you are saying is: The HW could have
fetched valid = 0 and stopped the queue and the driver needs to
doorbell it to wake it up again. However, the driver optimizes away
the doorbell rings in certain cases based on reading a DMA result.

So here is a possible ladder diagram:

CPU                         DMA DEVICE
                            Issue READ#1 of valid bit
 Respond to READ#1
 SFENCE
 set_valid_bit
 MFENCE
 read_tail
                            Receive READ#1 response with valid bit unset
			    Issue DMA WRITE to shadow_area indicating STOPPED
 DMA WRITE arrives

And the version where the DMA is seen:

CPU                         DMA DEVICE
                            Issue READ#1 of valid bit
 SFENCE
 Respond to READ#1
 set_valid_bit
 MFENCE
                            Receive READ#1 response with valid bit unset
			    Issue DMA WRITE to shadow_area indicating STOPPED
 DMA WRITE arrives
 read_tail

These diagrams attempt to show that the DMA device reads the valid bit
then DMA's back to the shadow_area depending on what it read.

Given the semantics of MFENCE, both are possible, so I still don't
understand what the MFENCE is supposed to help with.

I get the feeling this approach requires MFENCE to do something it
doesn't...

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 08/14] mlx5: Update to use new udma write barriers
       [not found]                     ` <20170228170658.GA17995-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-03-02  9:34                       ` Yishai Hadas
       [not found]                         ` <24bf0e37-e032-0862-c5b9-b5a40fcfb343-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Yishai Hadas @ 2017-03-02  9:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yishai Hadas, Majd Dibbiny,
	Doug Ledford

On 2/28/2017 7:06 PM, Jason Gunthorpe wrote:
> Okay, my mistake, then it must be this, which increases the leading
> barrier to a wc_wmb().. See the discussion with Ram:
>

> diff --git a/providers/mlx5/cq.c b/providers/mlx5/cq.c
> index cc0af920c703d9..1ce2cf2dd0bbd4 100644
> --- a/providers/mlx5/cq.c
> +++ b/providers/mlx5/cq.c
> @@ -1281,16 +1281,16 @@ int mlx5_arm_cq(struct ibv_cq *ibvcq, int solicited)
>
>  	/*
>  	 * Make sure that the doorbell record in host memory is
> -	 * written before ringing the doorbell via PCI MMIO.
> +	 * written before ringing the doorbell via PCI WC MMIO.
>  	 */
> -	udma_to_device_barrier();
> +	mmio_wc_start();
>
>  	doorbell[0] = htonl(sn << 28 | cmd | ci);
>  	doorbell[1] = htonl(cq->cqn);
>
>  	mlx5_write64(doorbell, ctx->uar[0] + MLX5_CQ_DOORBELL, &ctx->lock32);
>
> -	mmio_ordered_writes_hack();
> +	mmio_flush_writes();
>
>  	return 0;
>  }
>

The above seems fine, can you please send a formal patch to fix the 
original series, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 08/14] mlx5: Update to use new udma write barriers
       [not found]                         ` <24bf0e37-e032-0862-c5b9-b5a40fcfb343-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2017-03-02 17:12                           ` Jason Gunthorpe
       [not found]                             ` <20170302171210.GA8595-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2017-03-02 17:12 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yishai Hadas, Majd Dibbiny,
	Doug Ledford

On Thu, Mar 02, 2017 at 11:34:56AM +0200, Yishai Hadas wrote:
> The above seems fine, can you please send a formal patch to fix the original
> series, thanks.

Done,

https://github.com/linux-rdma/rdma-core/pull/89

Are you happy with the other things now as well?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 06/14] i40iw: Get rid of unique barrier macros
       [not found]                     ` <20170301230506.GB2820-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-03-03 21:45                       ` Shiraz Saleem
       [not found]                         ` <20170303214514.GA12996-GOXS9JX10wfOxmVO0tvppfooFf0ArEBIu+b9c/7xato@public.gmane.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Shiraz Saleem @ 2017-03-03 21:45 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Nikolova, Tatyana E

On Wed, Mar 01, 2017 at 04:05:06PM -0700, Jason Gunthorpe wrote:
> On Wed, Mar 01, 2017 at 04:14:20PM -0600, Shiraz Saleem wrote:
> > > Is there DMA occurring to shadow_area?
> > 
> > The shadow area contains status variables which are read by SW and 
> > updated by PCI device.
> 
> So the device is DMA'ing to it, and the driver is reading DMA memory..
> 
> > > What can go wrong if it executes like this?
> > > 
> > > get_64bit_val(qp->shadow_area, I40IW_BYTE_0, &temp);
> > > udma_to_device_barrier(); /* make sure WQE is populated before valid bit is set */
> > > set_64bit_val(wqe, I40IW_BYTE_24, header);
> > > udma_to_device_barrier();
> > 
> > We need strict ordering that ensures write of the WQE completes before 
> > read of the shadow area.
> 
> > This ensures the value read from the shadow can be used to determine
> > if a DB ring is needed. If the shadow area is read first, the
> > algorithm, in certain cases, would not ring the DB when it should
> > and the HW may go idle with work requests posted.
> 
> This still is not making a lot of sense to me.. I really need to see a
> ladder diagram to understand your case.
> 
> Here is an example, I think what you are saying is: The HW could have
> fetched valid = 0 and stopped the queue and the driver needs to
> doorbell it to wake it up again. However, the driver optimizes away
> the doorbell rings in certain cases based on reading a DMA result.
> 
> So here is a possible ladder diagram:
> 
> CPU                         DMA DEVICE
>                             Issue READ#1 of valid bit
>  Respond to READ#1
>  SFENCE
>  set_valid_bit
>  MEFENCE
>  read_tail
>                             Receive READ#1 response with valid bit unset
> 			    Issue DMA WRITE to shadow_area indicating STOPPED
>  DMA WRITE arrives
> 
> And the version where the DMA is seen:
> 
> CPU                         DMA DEVICE
>                             Issue READ#1 of valid bit
>  SFENCE
>  Respond to READ#1
>  set_valid_bit
>  MEFENCE
>                             Receive READ#1 response with valid bit unset
> 			    Issue DMA WRITE to shadow_area indicating STOPPED
>  DMA WRITE arrives
>  read_tail
> 
> These diagrams attempt to show that the DMA device reads the valid bit
> then DMA's back to the shadow_area depending on what it read.


This is not quite how our DB logic works. There are additional HW steps and nuances 
in the flow. Unfortunately, to explain this, we need to provide details of our internal 
HW flow for the DB logic. We are unable to do so at this time.

The ordering is a HW requirement, i.e. the write of the WQE valid bit __must__ 
precede the read of the shadow tail.
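
As a minimal sketch of the required order, using the accessors quoted above 
(illustrative only, not our actual flow; the explicit mfence assumes x86):

	set_64bit_val(wqe, I40IW_BYTE_24, header);            /* store: set the WQE valid bit */
	asm volatile("mfence" ::: "memory");                  /* full store->load fence */
	get_64bit_val(qp->shadow_area, I40IW_BYTE_0, &temp);  /* load: read the shadow tail */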

> 
> I get the feeling this approach requires MFENCE to do something it
> doesn't...

Mfence guarantees that load won't be reordered before the store, and thus 
we are using it.

We understand this is a unique requirement specific to our design but it is necessary.

The rest of the changes look ok.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 06/14] i40iw: Get rid of unique barrier macros
       [not found]                         ` <20170303214514.GA12996-GOXS9JX10wfOxmVO0tvppfooFf0ArEBIu+b9c/7xato@public.gmane.org>
@ 2017-03-03 22:22                           ` Jason Gunthorpe
       [not found]                             ` <20170303222244.GA678-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2017-03-03 22:22 UTC (permalink / raw)
  To: Shiraz Saleem; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Nikolova, Tatyana E

On Fri, Mar 03, 2017 at 03:45:14PM -0600, Shiraz Saleem wrote:

> This is not quite how our DB logic works. There are additional HW
> steps and nuances in the flow. Unfortunately, to explain this, we
> need to provide details of our internal HW flow for the DB logic. We
> are unable to do so at this time.

Well, it is very problematic to help you define what a cross-arch
barrier should do if you can't explain what you need to have happen
relative to PCI-E.

> > I get the feeling this approach requires MFENCE to do something it
> > doesn't...
> 
> Mfence guarantees that load won't be reordered before the store, and
> thus we are using it.

If that is all then the driver can use LFENCE and the
udma_from_device_barrier() .. Is that OK?
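
For reference, the x86-64 definition I have in mind is the one already in
util/udma_barrier.h:

   #define udma_from_device_barrier() asm volatile("lfence" ::: "memory")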

But fundamentally, PCI is fairly lax about what it permits the root
complex to do, and to what degree it requires strong ordering within
the root complex itself for PCI issued LOAD/STORES.

It is hard to understand how the order of CPU operations matters when
the PCI operations to different cache lines can be re-ordered inside
the root complex.

An approach may work on some x86-64 systems but be unreliable on other
arches, or even on unusual x86 (eg SGI's x86 NUMA systems have
conformant, but lax, ordering rules)

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 08/14] mlx5: Update to use new udma write barriers
       [not found]                             ` <20170302171210.GA8595-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-03-06 14:19                               ` Yishai Hadas
  0 siblings, 0 replies; 65+ messages in thread
From: Yishai Hadas @ 2017-03-06 14:19 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yishai Hadas, Majd Dibbiny,
	Doug Ledford

On 3/2/2017 7:12 PM, Jason Gunthorpe wrote:
> Are you happy with the other things now as well?

No, it looks like the barriers patch for mlx4 introduced a performance 
degradation which we don't accept; we will answer with the exact details 
shortly.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 07/14] mlx4: Update to use new udma write barriers
       [not found]             ` <20170221181407.GA13138-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-03-06 14:57               ` Yishai Hadas
       [not found]                 ` <45d2b7da-9ad6-6b37-d0b2-00f7807966b4-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Yishai Hadas @ 2017-03-06 14:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yishai Hadas, Matan Barak,
	Majd Dibbiny, Doug Ledford

On 2/21/2017 8:14 PM, Jason Gunthorpe wrote:
> On Mon, Feb 20, 2017 at 07:46:02PM +0200, Yishai Hadas wrote:
>
>>> 	pthread_spin_lock(&qp->sq.lock);
>>>
>>> +	/* Get all user DMA buffers ready to go */
>>> +	udma_to_device_barrier();
>>> +
>>> 	/* XXX check that state is OK to post send */
>>
>> Not clear why do we need here an extra barrier ? what is the future
>> optimization that you pointed on ?
>
> Writes to different memory types are not guarenteed to be strongly
> ordered. This is narrowly true on x86-64 too apparently.
>
> The purpose of udma_to_device_barrier is to serialize all memory
> types.
>
> This allows the follow on code to use the weaker
> 'udma_ordering_write_barrier' when it knows it is working exclusively
> with cached memory.
>
> Eg in an ideal world on x86 udma_to_device_barrier should be SFENCE
> and 'udma_ordering_write_barrier' should be compiler_barrier()
>
> Since the barriers have been so ill defined and wonky on x86 other
> arches use much stronger barriers than they actually need for each
> cases. Eg ARM64 can switch to much weaker barriers.
>
> Since there is no cost on x86-64 to this barrier I would like to leave
> it here as it lets us actually optimize the ARM and other barriers. If
> you take it out then 'udma_ordering_write_barrier' is forced to be the
> strongest barrier on all arches.

Till we make the further optimizations, we suspect a performance 
degradation on other ARCH(s) rather than X86, as this patch introduces an 
extra barrier which wasn't there before (i.e. udma_to_device_barrier).
Usually such a change should come together with its follow-up improvement 
patch to justify it. The impact is still something that we should measure, 
but basically any new code added to the data path must be justified from a 
performance point of view.

>>> @@ -478,7 +481,7 @@ out:
>>> 		 * Make sure that descriptor is written to memory
>>> 		 * before writing to BlueFlame page.
>>> 		 */
>>> -		wmb();
>>> +		mmio_wc_start();
>>
>> Why to make this change which affects at least X86_64 ? the data was
>> previously written to the host memory so we expect that wmb is
>> enough here.
>
> Same as above, writes to different memory types are not strongly
> ordered and wmb() is not the right barrier to serialize writes to
> DMA'able ctrl with the WC memcpy.
>
> If this is left wrong then again other arches are again required to
> adopt the strongest barrier for everything which hurts them.
>
> Even on x86, it is very questionable to not have the SFENCE in that
> spot. AFAIK it is not defined to be strongly ordered.
>
> mlx5 has the SFENCE here, for instance.

We ran some performance testing with the above change; initial results 
point to a degradation of about 3% in the message rate on the above 
BlueFlame path on X86, which is something that we should prevent.

Based on the analysis below it looks as if the change to use 'mmio_wc_start()', 
which is mapped to SFENCE on X86, is redundant.

Details:
There is a call to 'pthread_spin_lock()' in between the WB (write-back 
host memory) writes and the WC writes.

pthread_spin_lock() is implemented [1] with the x86 XCHG atomic instruction. 
According to the Intel® 64 and IA-32 Architectures manual [2], such an 
operation acts as a barrier that makes sure previous memory writes have 
completed. As such we expect to see here some other macro which considers 
whether the lock is a barrier based on the ARCH, instead of taking the 
SFENCE barrier in all cases.
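
A rough sketch of the x86 path behind [1] (the names here are illustrative,
not the actual glibc source):

	/* atomic_exchange_acq() compiles to a LOCKed XCHG on x86, which per [2]
	 * serializes the earlier write-back stores before the lock is taken. */
	static inline void spin_lock_sketch(volatile int *lock)
	{
		while (__atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE) != 0)
			;	/* spin until the previous owner releases it */
	}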



[1] http://code.metager.de/source/xref/gnu/glibc/nptl/pthread_spin_lock.c
Call atomic_exchange_acq() which is defined in Line 115 at:
http://code.metager.de/source/xref/gnu/glibc/sysdeps/x86_64/atomic-machine.h


[2] Intel® 64 and IA-32 Architectures Software Developer’s Manual
At: 
https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf

8.1.2.2 Software Controlled Bus Locking

On page 8-3 Vol. 3A  : “The LOCK prefix is automatically assumed for 
XCHG instruction.”

On page 8-4 Vol. 3A: “Locked instructions can be used to synchronize 
data written by one processor and read by another processor.
For the P6 family processors, locked operations serialize all 
outstanding load and store operations (that is, wait for
them to complete). This rule is also true for the Pentium 4 and Intel 
Xeon processors, with one exception. Load
operations that reference weakly ordered memory types (such as the WC 
memory type) may not be serialized.

8.3 SERIALIZING INSTRUCTIONS

On page 8-16 Vol. 3A: “Locking operations typically operate like I/O 
operations in that they wait for all previous instructions to complete 
and for all buffered writes to drain to memory”

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 07/14] mlx4: Update to use new udma write barriers
       [not found]                 ` <45d2b7da-9ad6-6b37-d0b2-00f7807966b4-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2017-03-06 17:31                   ` Jason Gunthorpe
       [not found]                     ` <20170306173139.GA11805-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2017-03-06 17:31 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yishai Hadas, Matan Barak,
	Majd Dibbiny, Doug Ledford

On Mon, Mar 06, 2017 at 04:57:40PM +0200, Yishai Hadas wrote:

> >Since there is no cost on x86-64 to this barrier I would like to leave
> >it here as it lets us actually optimize the ARM and other barriers. If
> >you take it out then 'udma_ordering_write_barrier' is forced to be the
> >strongest barrier on all arches.
> 
> Till we make the further optimizations, we suspect a performance degradation
> in other ARCH(s) rather than X86, as this patch introduce an extra barrier
> which wasn't before (i.e udma_to_device_barrier).

Yes, possibly.

The only other option I see is to change those couple of call sites in
mlx4 to be udma_to_device_barrier() - which looses the information
they are actually doing something different.

Honestly, I think if someone cares about the other arches they will
see a net win if the proper weak barrier is implemented for
udma_ordering_write_barrier

> >Even on x86, it is very questionable to not have the SFENCE in that
> >spot. AFAIK it is not defined to be strongly ordered.
> >
> >mlx5 has the SFENCE here, for instance.
> 
> We made some performance testing with the above change, initial results
> point on degradation of about 3% in the message rate in the above BlueFlame
> path in X86, this is something that we should prevent.
> 
> Based on below analysis it looks as the change to use 'mmio_wc_start()'
> which is mapped to SFENCE in X86 is redundant.

Okay, I think your analysis makes sense, and extending it broadly
means there is a fair amount of over-barriering on x86 in various
places.

For now, there are several places where this WC spinlock pattern is
used, so let us pull it out. Here is an example; there are still other
places in the mlx drivers.

What do you think about this approach? Notice that it allows us to see
that mlx5 should be optimized to elide the leading SFENCE as
well. This should speed up mlx5 compared to the original.

diff --git a/providers/mlx4/qp.c b/providers/mlx4/qp.c
index 77a4a34576cb69..32f0b3fe78fe7c 100644
--- a/providers/mlx4/qp.c
+++ b/providers/mlx4/qp.c
@@ -481,15 +481,14 @@ out:
 		 * Make sure that descriptor is written to memory
 		 * before writing to BlueFlame page.
 		 */
-		mmio_wc_start();
+		mmio_wc_spinlock(&ctx->bf_lock);
 
 		++qp->sq.head;
 
-		pthread_spin_lock(&ctx->bf_lock);
-
 		mlx4_bf_copy(ctx->bf_page + ctx->bf_offset, (unsigned long *) ctrl,
 			     align(size * 16, 64));
-		mmio_flush_writes();
+
+		mmio_wc_spinunlock(&ctx->bf_lock);
 
 		ctx->bf_offset ^= ctx->bf_buf_size;
 
diff --git a/providers/mlx5/qp.c b/providers/mlx5/qp.c
index d7087d986ce79f..6187b85219dacc 100644
--- a/providers/mlx5/qp.c
+++ b/providers/mlx5/qp.c
@@ -931,11 +931,11 @@ out:
 
 		/* Make sure that the doorbell write happens before the memcpy
 		 * to WC memory below */
-		mmio_wc_start();
-
 		ctx = to_mctx(ibqp->context);
-		if (bf->need_lock)
-			mlx5_spin_lock(&bf->lock);
+		if (bf->need_lock && !mlx5_single_threaded)
+			mmio_wc_spinlock(&bf->lock->lock);
+		else
+			mmio_wc_start();
 
 		if (!ctx->shut_up_bf && nreq == 1 && bf->uuarn &&
 		    (inl || ctx->prefer_bf) && size > 1 &&
@@ -955,10 +955,11 @@ out:
 		 * the mmio_flush_writes is CPU local, this will result in the HCA seeing
 		 * doorbell 2, followed by doorbell 1.
 		 */
-		mmio_flush_writes();
 		bf->offset ^= bf->buf_size;
-		if (bf->need_lock)
-			mlx5_spin_unlock(&bf->lock);
+		if (bf->need_lock && !mlx5_single_threaded)
+			mmio_wc_spinunlock(&bf->lock->lock);
+		else
+			mmio_flush_writes();
 	}
 
 	mlx5_spin_unlock(&qp->sq.lock);
diff --git a/util/udma_barrier.h b/util/udma_barrier.h
index 9e73148af8d5b6..ea4904d28f6a48 100644
--- a/util/udma_barrier.h
+++ b/util/udma_barrier.h
@@ -33,6 +33,8 @@
 #ifndef __UTIL_UDMA_BARRIER_H
 #define __UTIL_UDMA_BARRIER_H
 
+#include <pthread.h>
+
 /* Barriers for DMA.
 
    These barriers are expliclty only for use with user DMA operations. If you
@@ -222,4 +224,26 @@
 */
 #define mmio_ordered_writes_hack() mmio_flush_writes()
 
+/* Higher Level primitives */
+
+/* Do mmio_wc_start and grab a spinlock */
+static inline void mmio_wc_spinlock(pthread_spinlock_t *lock)
+{
+	pthread_spin_lock(lock);
+#if !defined(__i386__) && !defined(__x86_64__)
+	/* For x86 the serialization within the spin lock is enough to
+	 * strongly order WC and other memory types. */
+	mmio_wc_start();
+#endif
+}
+
+static inline void mmio_wc_spinunlock(pthread_spinlock_t *lock)
+{
+	/* On x86 the lock is enough for strong ordering, but the SFENCE
+	 * encourages the WC buffers to flush out more quickly (Yishai:
+	 * confirm?) */
+	mmio_flush_writes();
+	pthread_spin_unlock(lock);
+}
+
 #endif
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 06/14] i40iw: Get rid of unique barrier macros
       [not found]     ` <1487272989-8215-7-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2017-03-01 17:29       ` Shiraz Saleem
@ 2017-03-06 18:18       ` Shiraz Saleem
       [not found]         ` <20170306181808.GA34252-GOXS9JX10wfOxmVO0tvppfooFf0ArEBIu+b9c/7xato@public.gmane.org>
  1 sibling, 1 reply; 65+ messages in thread
From: Shiraz Saleem @ 2017-03-06 18:18 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Nikolova, Tatyana E

On Thu, Feb 16, 2017 at 12:23:01PM -0700, Jason Gunthorpe wrote:
> Use our standard versions from util instead. There doesn't seem
> to be anything tricky here, but the inlined versions were like our
> wc_wmb() barriers, not the wmb().
> 
> There appears to be no WC memory in this driver, so despite the comments,
> these barriers are also making sure that user DMA data is flushed out. Make
> them all wmb()
> 
> Guess at where the missing rmb() should be.
> 
> Signed-off-by: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
> ---


> @@ -780,6 +780,8 @@ static enum i40iw_status_code i40iw_cq_poll_completion(struct i40iw_cq_uk *cq,
>  	if (polarity != cq->polarity)
>  		return I40IW_ERR_QUEUE_EMPTY;
>  
> +	udma_from_device_barrier();
> +

What is the need for the barrier here?

>  	q_type = (u8)RS_64(qword3, I40IW_CQ_SQ);
>  	info->error = (bool)RS_64(qword3, I40IW_CQ_ERROR);
>  	info->push_dropped = (bool)RS_64(qword3, I40IWCQ_PSHDROP);
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 06/14] i40iw: Get rid of unique barrier macros
       [not found]         ` <20170306181808.GA34252-GOXS9JX10wfOxmVO0tvppfooFf0ArEBIu+b9c/7xato@public.gmane.org>
@ 2017-03-06 19:07           ` Jason Gunthorpe
       [not found]             ` <20170306190751.GA30663-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2017-03-06 19:07 UTC (permalink / raw)
  To: Shiraz Saleem; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Nikolova, Tatyana E

On Mon, Mar 06, 2017 at 12:18:09PM -0600, Shiraz Saleem wrote:
> On Thu, Feb 16, 2017 at 12:23:01PM -0700, Jason Gunthorpe wrote:
> > Use our standard versions from util instead. There doesn't seem
> > to be anything tricky here, but the inlined versions were like our
> > wc_wmb() barriers, not the wmb().
> > 
> > There appears to be no WC memory in this driver, so despite the comments,
> > these barriers are also making sure that user DMA data is flushed out. Make
> > them all wmb()
> > 
> > Guess at where the missing rmb() should be.
> > 
> > Signed-off-by: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
> 
> 
> > @@ -780,6 +780,8 @@ static enum i40iw_status_code i40iw_cq_poll_completion(struct i40iw_cq_uk *cq,
> >  	if (polarity != cq->polarity)
> >  		return I40IW_ERR_QUEUE_EMPTY;
> >  
> > +	udma_from_device_barrier();
> > +
> 
> What is the need for the barrier here?

Every driver doing DMA needs to have a rmb(); I guessed this is
approximately the right place to put it for i40iw.

Within the PCI producer/consumer model the rmb needs to be placed
after the CPU observes a DMA write that proves other DMA writes have
occurred.

It serializes the CPU with the write stream from PCI and prevents the
compiler/CPU from re-ordering dependent loads to be before the proving
load.
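
A minimal sketch of the pattern I mean (the names are illustrative, not the
actual i40iw structures):

	if (cqe_polarity == cq->polarity) {	/* proving load: the DMA'd CQE is valid */
		udma_from_device_barrier();	/* rmb(): dependent loads cannot move above this */
		qword3 = cqe->qword3;		/* now safe to read the rest of the CQE */
	}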

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 06/14] i40iw: Get rid of unique barrier macros
       [not found]                             ` <20170303222244.GA678-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-03-06 19:16                               ` Shiraz Saleem
       [not found]                                 ` <20170306191631.GB34252-GOXS9JX10wfOxmVO0tvppfooFf0ArEBIu+b9c/7xato@public.gmane.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Shiraz Saleem @ 2017-03-06 19:16 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Nikolova, Tatyana E

On Fri, Mar 03, 2017 at 03:22:44PM -0700, Jason Gunthorpe wrote:
> On Fri, Mar 03, 2017 at 03:45:14PM -0600, Shiraz Saleem wrote:
> 
> > This is not quite how our DB logic works. There are additional HW
> > steps and nuances in the flow. Unfortunately, to explain this, we
> > need to provide details of our internal HW flow for the DB logic. We
> > are unable to do so at this time.
> 
> Well, it is very problematic to help you define what a cross-arch
> barrier should do if you can't explain what you need to have happen
> relative to PCI-E.
> 

Unfortunately, we can help with this only at the point when this information 
is made public. If you must have an explanation for all barriers defined in 
utils, an option here is to leave this barrier in i40iw and migrate it to 
utils when documentation is available. 

> > Mfence guarantees that load won't be reordered before the store, and
> > thus we are using it.
> 
> If that is all then the driver can use LFENCE and the
> udma_from_device_barrier() .. Is that OK?
>

The write of the valid WQE needs to be globally visible before the read of the tail. 
LFENCE does not guarantee this. MFENCE does.

https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf

LFENCE (Vol. 3A 8-16)
"Serializes all load (read) operations that occurred prior to the LFENCE instruction 
in the program instruction stream, but does not affect store operations"

LFENCE (Vol. 2A 3-529)
"An LFENCE that follows an instruction that stores to memory might complete before 
the data being stored have become globally visible. Instructions following an LFENCE 
may be fetched from memory before the LFENCE, but they will not execute until the LFENCE 
completes"	

MFENCE (Vol. 2B 4-22)
"This serializing operation guarantees that every load and store instruction that precedes 
the MFENCE instruction in program order becomes globally visible before any load or store 
instruction that follows the MFENCE instruction"



 
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 06/14] i40iw: Get rid of unique barrier macros
       [not found]                                 ` <20170306191631.GB34252-GOXS9JX10wfOxmVO0tvppfooFf0ArEBIu+b9c/7xato@public.gmane.org>
@ 2017-03-06 19:40                                   ` Jason Gunthorpe
       [not found]                                     ` <20170306194052.GB31672-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2017-03-06 19:40 UTC (permalink / raw)
  To: Shiraz Saleem; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Nikolova, Tatyana E

On Mon, Mar 06, 2017 at 01:16:31PM -0600, Shiraz Saleem wrote:
> On Fri, Mar 03, 2017 at 03:22:44PM -0700, Jason Gunthorpe wrote:
> > On Fri, Mar 03, 2017 at 03:45:14PM -0600, Shiraz Saleem wrote:
> > 
> > > This is not quite how our DB logic works. There are additional HW
> > > steps and nuances in the flow. Unfortunately, to explain this, we
> > > need to provide details of our internal HW flow for the DB logic. We
> > > are unable to do so at this time.
> > 
> > Well, it is very problematic to help you define what a cross-arch
> > barrier should do if you can't explain what you need to have happen
> > relative to PCI-E.
> 
> Unfortunately, we can help with this only at the point when this information 
> is made public. If you must have an explanation for all barriers defined in 
> utils, an option here is to leave this barrier in i40iw and migrate it to 
> utils when documentation is available. 

Well, it is impossible to document what other arches are expected to
do if you can't define what you need.

Talking about the CPU alone does not define the interaction required
with PCI.

The reason we have these special barriers and do not just use C11's
atomic_thread_fence is specifically because some arches make a small
distinction on ordering relative to PCI and ordering relative to other
CPUs.
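
PPC64 is the clearest example: the existing udma_to_device_barrier() there is
the heavyweight hwsync, while a plain CPU-to-CPU release fence would typically
compile to the lighter lwsync (definitions as in the current udma_barrier.h):

	#define udma_to_device_barrier() asm volatile("sync" ::: "memory")	/* orders vs PCI too */
	atomic_thread_fence(memory_order_release);				/* lwsync: CPU-to-CPU only */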

> > > Mfence guarantees that load won't be reordered before the store, and
> > > thus we are using it.
> > 
> > If that is all then the driver can use LFENCE and the
> > udma_from_device_barrier() .. Is that OK?
> 
> The write valid WQE needs to be globally visible before read tail. LFENCE does not 
> guarantee this. MFENCE does.

I was thinking

SFENCE
LFENCE

So, okay, here are two more choices.

1) Use a C11 barrier:

 atomic_thread_fence(memory_order_seq_cst);

This produces what you want on x86-64:

0000000000000590 <i40iw_qp_post_wr>:
     590:       0f ae f0                mfence 
     593:       48 8b 47 28             mov    0x28(%rdi),%rax
     597:       8b 57 40                mov    0x40(%rdi),%edx

x86-32 does:

00000600 <i40iw_qp_post_wr>:
     600:       53                      push   %ebx
     601:       8b 44 24 08             mov    0x8(%esp),%eax
     605:       f0 83 0c 24 00          lock orl $0x0,(%esp)

Which is basically the same as the "lock; addl $0,0(%%esp)" the old
macros used.

Take your chances on other arches.

2) Explicitly optimize x86 and have other arches skip the
   shadow optimization

Here is a patch that does #2; I'm guessing about the implementation.

What do you think?

diff --git a/providers/i40iw/i40iw_uk.c b/providers/i40iw/i40iw_uk.c
index b20748e9f09199..e61bb049686cc5 100644
--- a/providers/i40iw/i40iw_uk.c
+++ b/providers/i40iw/i40iw_uk.c
@@ -33,6 +33,7 @@
 *******************************************************************************/
 
 #include <stdint.h>
+#include <stdatomic.h>
 
 #include "i40iw_osdep.h"
 #include "i40iw_status.h"
@@ -85,13 +86,21 @@ static enum i40iw_status_code i40iw_nop_1(struct i40iw_qp_uk *qp)
  * i40iw_qp_post_wr - post wr to hrdware
  * @qp: hw qp ptr
  */
+#if defined(__x86_64__) || defined(__i386__)
 void i40iw_qp_post_wr(struct i40iw_qp_uk *qp)
 {
 	u64 temp;
 	u32 hw_sq_tail;
 	u32 sw_sq_head;
 
-	udma_to_device_barrier(); /* valid bit is written and loads completed before reading shadow */
+	/* valid bit is written and loads completed before reading shadow
+	 *
+	 * Whatever is happening here does not match our common macros for
+	 * producer/consumer DMA and may not be portable, however on x86-64
+	 * the required barrier is MFENCE, get a 'portable' version via C11
+	 * atomic.
+	 */
+	atomic_thread_fence(memory_order_seq_cst);
 
 	/* read the doorbell shadow area */
 	get_64bit_val(qp->shadow_area, I40IW_BYTE_0, &temp);
@@ -114,6 +123,15 @@ void i40iw_qp_post_wr(struct i40iw_qp_uk *qp)
 
 	qp->initial_ring.head = qp->sq_ring.head;
 }
+#else
+void i40iw_qp_post_wr(struct i40iw_qp_uk *qp)
+{
+	/* We do not know how to do the shadow area optimization on this arch,
+	 * disable it. */
+	db_wr32(qp->qp_id, qp->wqe_alloc_reg);
+	qp->initial_ring.head = qp->sq_ring.head;
+}
+#endif
 
 /**
  * i40iw_qp_ring_push_db -  ring qp doorbell
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 07/14] mlx4: Update to use new udma write barriers
       [not found]                     ` <20170306173139.GA11805-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-03-07 16:44                       ` Yishai Hadas
       [not found]                         ` <55bcc87e-b059-65df-8079-100120865ffb-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Yishai Hadas @ 2017-03-07 16:44 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yishai Hadas, Matan Barak,
	Majd Dibbiny, Doug Ledford

On 3/6/2017 7:31 PM, Jason Gunthorpe wrote:
> On Mon, Mar 06, 2017 at 04:57:40PM +0200, Yishai Hadas wrote:
>
>>> Since there is no cost on x86-64 to this barrier I would like to leave
>>> it here as it lets us actually optimize the ARM and other barriers. If
>>> you take it out then 'udma_ordering_write_barrier' is forced to be the
>>> strongest barrier on all arches.
>>
>> Till we make the further optimizations, we suspect a performance degradation
>> in other ARCH(s) rather than X86, as this patch introduce an extra barrier
>> which wasn't before (i.e udma_to_device_barrier).
>
> Yes, possibly.
>
> The only other option I see is to change those couple of call sites in
> mlx4 to be udma_to_device_barrier() - which looses the information
> they are actually doing something different.
>
> Honestly, I think if someone cares about the other arches they will
> see a net win if the proper weak barrier is implemented for
> udma_ordering_write_barrier

We can't allow any temporary degradation while relying on some future 
improvements; it must come together and be justified by some performance 
testing.

I'll send a patch that drops the leading udma_to_device_barrier() 
and replaces udma_ordering_write_barrier() with 
udma_to_device_barrier(); this will be done as part of the other change 
that is expected here, see below.

>>> Even on x86, it is very questionable to not have the SFENCE in that
>>> spot. AFAIK it is not defined to be strongly ordered.
>>>
>>> mlx5 has the SFENCE here, for instance.
>>
>> We made some performance testing with the above change, initial results
>> point on degradation of about 3% in the message rate in the above BlueFlame
>> path in X86, this is something that we should prevent.
>>
>> Based on below analysis it looks as the change to use 'mmio_wc_start()'
>> which is mapped to SFENCE in X86 is redundant.
>
> Okay, I think your analysis makes sense, and extending it broadly
> means there is a fair amount of over-barriering on x86 in various
> places.
>

Correct, that should be the way to go.

> What do you think about this approach? Notice that it allows us to see
> that mlx5 should be optimized to elide the leading SFENCE as
> well. This should speed up mlx5 compared to the original.

The patch below makes sense; however, it needs to be fixed in a few points, 
see below. I'll fix it and take it in-house to our regression and 
performance systems; once approved I will send it upstream.

> diff --git a/providers/mlx4/qp.c b/providers/mlx4/qp.c
> index 77a4a34576cb69..32f0b3fe78fe7c 100644
> --- a/providers/mlx4/qp.c
> +++ b/providers/mlx4/qp.c
> @@ -481,15 +481,14 @@ out:
>  		 * Make sure that descriptor is written to memory
>  		 * before writing to BlueFlame page.
>  		 */
> -		mmio_wc_start();
> +		mmio_wc_spinlock(&ctx->bf_lock);
>
>  		++qp->sq.head;

This originally wasn't under the BF spinlock; it looks as if it should stay 
out of that lock.

> -		pthread_spin_lock(&ctx->bf_lock);
> -
>  		mlx4_bf_copy(ctx->bf_page + ctx->bf_offset, (unsigned long *) ctrl,
>  			     align(size * 16, 64));
> -		mmio_flush_writes();
> +
> +		mmio_wc_spinunlock(&ctx->bf_lock);

We should still be under the spinlock here, see the note below; we expect 
only a mmio_flush_writes() here, so this macro is not needed at all.
>
>  		ctx->bf_offset ^= ctx->bf_buf_size;

You missed the next line here, which does a second pthread_spin_unlock(); in 
addition, this line needs to be under the bf_lock.
>
> diff --git a/providers/mlx5/qp.c b/providers/mlx5/qp.c
> index d7087d986ce79f..6187b85219dacc 100644
> --- a/providers/mlx5/qp.c
> +++ b/providers/mlx5/qp.c
> @@ -931,11 +931,11 @@ out:
>
>  		/* Make sure that the doorbell write happens before the memcpy
>  		 * to WC memory below */
> -		mmio_wc_start();
> -
>  		ctx = to_mctx(ibqp->context);
> -		if (bf->need_lock)
> -			mlx5_spin_lock(&bf->lock);
> +		if (bf->need_lock && !mlx5_single_threaded)
> +			mmio_wc_spinlock(&bf->lock->lock);

Should be &bf->lock.lock to compile, here and below.

> +		else
> +			mmio_wc_start();
>
>  		if (!ctx->shut_up_bf && nreq == 1 && bf->uuarn &&
>  		    (inl || ctx->prefer_bf) && size > 1 &&
> @@ -955,10 +955,11 @@ out:
>  		 * the mmio_flush_writes is CPU local, this will result in the HCA seeing
>  		 * doorbell 2, followed by doorbell 1.
>  		 */
> -		mmio_flush_writes();
>  		bf->offset ^= bf->buf_size;
> -		if (bf->need_lock)
> -			mlx5_spin_unlock(&bf->lock);
> +		if (bf->need_lock && !mlx5_single_threaded)
> +			mmio_wc_spinunlock(&bf->lock->lock);
> +		else
> +			mmio_flush_writes();
>  	}
>
>  	mlx5_spin_unlock(&qp->sq.lock);
> diff --git a/util/udma_barrier.h b/util/udma_barrier.h
> index 9e73148af8d5b6..ea4904d28f6a48 100644
> --- a/util/udma_barrier.h
> +++ b/util/udma_barrier.h
> @@ -33,6 +33,8 @@
>  #ifndef __UTIL_UDMA_BARRIER_H
>  #define __UTIL_UDMA_BARRIER_H
>
> +#include <pthread.h>
> +
>  /* Barriers for DMA.
>
>     These barriers are expliclty only for use with user DMA operations. If you
> @@ -222,4 +224,26 @@
>  */
>  #define mmio_ordered_writes_hack() mmio_flush_writes()
>
> +/* Higher Level primitives */
> +
> +/* Do mmio_wc_start and grab a spinlock */
> +static inline void mmio_wc_spinlock(pthread_spinlock_t *lock)
> +{
> +	pthread_spin_lock(lock);
> +#if !defined(__i386__) && !defined(__x86_64__)
> +	/* For x86 the serialization within the spin lock is enough to
> +	 * strongly order WC and other memory types. */
> +	mmio_wc_start();
> +#endif
> +}
> +
> +static inline void mmio_wc_spinunlock(pthread_spinlock_t *lock)
> +{
> +	/* On x86 the lock is enough for strong ordering, but the SFENCE
> +	 * encourages the WC buffers to flush out more quickly (Yishai:
> +	 * confirm?) */

This macro can't do both and should be dropped, see above.

> +	mmio_flush_writes();
> +	pthread_spin_unlock(lock);
> +}
> +
>  #endif
>

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 07/14] mlx4: Update to use new udma write barriers
       [not found]                         ` <55bcc87e-b059-65df-8079-100120865ffb-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2017-03-07 19:18                           ` Jason Gunthorpe
       [not found]                             ` <20170307191824.GD2228-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2017-03-07 19:18 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yishai Hadas, Matan Barak,
	Majd Dibbiny, Doug Ledford

On Tue, Mar 07, 2017 at 06:44:55PM +0200, Yishai Hadas wrote:

> >Honestly, I think if someone cares about the other arches they will
> >see a net win if the proper weak barrier is implemented for
> >udma_ordering_write_barrier
> 
> We can't allow any temporary degradation and rely on some future
> improvements, it must come together and be justified by some performance
> testing.

Well, I haven't sent any changes to the barrier macros until we get
everyone happy with them, but they exist; I've included a few lines
in the patch below.

This *probably* makes ppc faster if you leave the mlx4 stuff as-is,
as it replaces some hwsync with lwsync.

Notice that the use of atomic_thread_fence also produces better
compiler output with new compilers, since it has a much weaker impact on
the memory model than asm ("" ::: "memory"). E.g. for mlx4_post_send
this results in less stack traffic and 15 fewer bytes in the function
for x86-64.

If that shows a performance win, let us keep it as is?

> I'll send some patch that will drop the leading udma_to_device_barrier() and
> replace udma_ordering_write_barrier() to be udma_to_device_barrier(), this
> will be done as part of the other change that is expected here, see below.

Sure, but let's have two patches so we can revert it.

> The below patch makes sense, however, need to be fixed in few points, see
> below. I'll fix it and take it in-house to our regression and performance
> systems, once approved will sent it upstream.

Okay, thanks for working on this!

> >-		pthread_spin_lock(&ctx->bf_lock);
> >-
> > 		mlx4_bf_copy(ctx->bf_page + ctx->bf_offset, (unsigned long *) ctrl,
> > 			     align(size * 16, 64));
> >-		mmio_flush_writes();
> >+
> >+		mmio_wc_spinunlock(&ctx->bf_lock);
> 
> We still should be under the spinlock, see below note, we expect here only a
> mmio_flush_writes() so this macro is not needed at all.

Right sorry, I just made a quick sketch to see if you like it.

> >+static inline void mmio_wc_spinunlock(pthread_spinlock_t *lock)
> >+{
> >+	/* On x86 the lock is enough for strong ordering, but the SFENCE
> >+	 * encourages the WC buffers to flush out more quickly (Yishai:
> >+	 * confirm?) */
> 
> This macro can't do both and should be dropped, see above.

I don't understand this comment? Why can't it do both? The patch below
shows the corrected version, perhaps it is clearer.

The intended guarantee of the wc_spinlock critical region is that:

   mmio_wc_spinlock();
   *wc_mem = 1;
   mmio_wc_spinunlock();

   mmio_wc_spinlock();
   *wc_mem = 2;
   mmio_wc_spinunlock();

Must *always* generate two TLPs, and *must* generate the visible TLPs in the CPU
order of acquiring the spinlock - even if the two critical sections
run concurrently on different CPUs.

This is needed to consistently address the risk identified in mlx5:

		/*
		 * use mmio_flush_writes() to ensure write combining buffers are flushed out
		 * of the running CPU. This must be carried inside the spinlock.
		 * Otherwise, there is a potential race. In the race, CPU A
		 * writes doorbell 1, which is waiting in the WC buffer. CPU B
		 * writes doorbell 2, and it's write is flushed earlier. Since
		 * the mmio_flush_writes is CPU local, this will result in the HCA seeing
		 * doorbell 2, followed by doorbell 1.
		 */

We cannot provide this invariant without also providing
mmio_wc_spinunlock().

If for some reason that doesn't work for you then we should not use
the approach of wrapping the spinlock.

Anyhow, here is the patch that summarizes everything in this email:

diff --git a/providers/mlx4/qp.c b/providers/mlx4/qp.c
index 77a4a34576cb69..a22fca7c6f1360 100644
--- a/providers/mlx4/qp.c
+++ b/providers/mlx4/qp.c
@@ -477,23 +477,20 @@ out:
 		ctrl->owner_opcode |= htonl((qp->sq.head & 0xffff) << 8);
 
 		ctrl->bf_qpn |= qp->doorbell_qpn;
+		++qp->sq.head;
+
 		/*
 		 * Make sure that descriptor is written to memory
 		 * before writing to BlueFlame page.
 		 */
-		mmio_wc_start();
-
-		++qp->sq.head;
-
-		pthread_spin_lock(&ctx->bf_lock);
+		mmio_wc_spinlock(&ctx->bf_lock);
 
 		mlx4_bf_copy(ctx->bf_page + ctx->bf_offset, (unsigned long *) ctrl,
 			     align(size * 16, 64));
-		mmio_flush_writes();
 
 		ctx->bf_offset ^= ctx->bf_buf_size;
 
-		pthread_spin_unlock(&ctx->bf_lock);
+		mmio_wc_spinunlock(&ctx->bf_lock);
 	} else if (nreq) {
 		qp->sq.head += nreq;
 
diff --git a/providers/mlx5/qp.c b/providers/mlx5/qp.c
index d7087d986ce79f..0f1ec0ef2b094b 100644
--- a/providers/mlx5/qp.c
+++ b/providers/mlx5/qp.c
@@ -931,11 +931,11 @@ out:
 
 		/* Make sure that the doorbell write happens before the memcpy
 		 * to WC memory below */
-		mmio_wc_start();
-
 		ctx = to_mctx(ibqp->context);
-		if (bf->need_lock)
-			mlx5_spin_lock(&bf->lock);
+		if (bf->need_lock && !mlx5_single_threaded)
+			mmio_wc_spinlock(&bf->lock.lock);
+		else
+			mmio_wc_start();
 
 		if (!ctx->shut_up_bf && nreq == 1 && bf->uuarn &&
 		    (inl || ctx->prefer_bf) && size > 1 &&
@@ -955,10 +955,11 @@ out:
 		 * the mmio_flush_writes is CPU local, this will result in the HCA seeing
 		 * doorbell 2, followed by doorbell 1.
 		 */
-		mmio_flush_writes();
 		bf->offset ^= bf->buf_size;
-		if (bf->need_lock)
-			mlx5_spin_unlock(&bf->lock);
+		if (bf->need_lock && !mlx5_single_threaded)
+			mmio_wc_spinunlock(&bf->lock.lock);
+		else
+			mmio_flush_writes();
 	}
 
 	mlx5_spin_unlock(&qp->sq.lock);
diff --git a/util/udma_barrier.h b/util/udma_barrier.h
index 9e73148af8d5b6..db4ff0c6c25376 100644
--- a/util/udma_barrier.h
+++ b/util/udma_barrier.h
@@ -33,6 +33,9 @@
 #ifndef __UTIL_UDMA_BARRIER_H
 #define __UTIL_UDMA_BARRIER_H
 
+#include <pthread.h>
+#include <stdatomic.h>
+
 /* Barriers for DMA.
 
    These barriers are expliclty only for use with user DMA operations. If you
@@ -78,10 +81,8 @@
    memory types or non-temporal stores are required to use SFENCE in their own
    code prior to calling verbs to start a DMA.
 */
-#if defined(__i386__)
-#define udma_to_device_barrier() asm volatile("" ::: "memory")
-#elif defined(__x86_64__)
-#define udma_to_device_barrier() asm volatile("" ::: "memory")
+#if defined(__i386__) || defined(__x86_64__)
+#define udma_to_device_barrier() atomic_thread_fence(memory_order_release)
 #elif defined(__PPC64__)
 #define udma_to_device_barrier() asm volatile("sync" ::: "memory")
 #elif defined(__PPC__)
@@ -115,7 +116,7 @@
 #elif defined(__x86_64__)
 #define udma_from_device_barrier() asm volatile("lfence" ::: "memory")
 #elif defined(__PPC64__)
-#define udma_from_device_barrier() asm volatile("lwsync" ::: "memory")
+#define udma_from_device_barrier() atomic_thread_fence(memory_order_acquire)
 #elif defined(__PPC__)
 #define udma_from_device_barrier() asm volatile("sync" ::: "memory")
 #elif defined(__ia64__)
@@ -149,7 +150,11 @@
       udma_ordering_write_barrier();  // Guarantee WQE written in order
       wqe->valid = 1;
 */
+#if defined(__i386__) || defined(__x86_64__) || defined(__PPC64__) || defined(__PPC__)
+#define udma_ordering_write_barrier() atomic_thread_fence(memory_order_release)
+#else
 #define udma_ordering_write_barrier() udma_to_device_barrier()
+#endif
 
 /* Promptly flush writes to MMIO Write Cominbing memory.
    This should be used after a write to WC memory. This is both a barrier
@@ -222,4 +227,37 @@
 */
 #define mmio_ordered_writes_hack() mmio_flush_writes()
 
+/* Write Combining Spinlock primitive
+
+   Any access to a multi-value WC region must ensure that multiple cpus do not
+   write to the same values concurrently, these macros make that
+   straightforward and efficient if the choosen exclusion is a spinlock.
+
+   The spinlock guarantees that the WC writes issued within the critical
+   section are made visible as TLP to the device. The TLP must seen by the
+   device strictly in the order that the spinlocks are acquired, and combining
+   WC writes between different sections is not permitted.
+
+   Use of these macros allow the fencing inside the spinlock to be combined
+   with the fencing required for DMA.
+ */
+static inline void mmio_wc_spinlock(pthread_spinlock_t *lock)
+{
+	pthread_spin_lock(lock);
+#if !defined(__i386__) && !defined(__x86_64__)
+	/* For x86 the serialization within the spin lock is enough to
+	 * strongly order WC and other memory types. */
+	mmio_wc_start();
+#endif
+}
+
+static inline void mmio_wc_spinunlock(pthread_spinlock_t *lock)
+{
+	/* On x86 the lock is enough for strong ordering, but the SFENCE
+	 * encourages the WC buffers to flush out more quickly (Yishai:
+	 * confirm?) */
+	mmio_flush_writes();
+	pthread_spin_unlock(lock);
+}
+
 #endif
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 06/14] i40iw: Get rid of unique barrier macros
       [not found]                                     ` <20170306194052.GB31672-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-03-07 22:46                                       ` Shiraz Saleem
       [not found]                                         ` <20170307224622.GA45028-GOXS9JX10wfOxmVO0tvppfooFf0ArEBIu+b9c/7xato@public.gmane.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Shiraz Saleem @ 2017-03-07 22:46 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Nikolova, Tatyana E

On Mon, Mar 06, 2017 at 12:40:52PM -0700, Jason Gunthorpe wrote:
> On Mon, Mar 06, 2017 at 01:16:31PM -0600, Shiraz Saleem wrote:
> > On Fri, Mar 03, 2017 at 03:22:44PM -0700, Jason Gunthorpe wrote:
> > > On Fri, Mar 03, 2017 at 03:45:14PM -0600, Shiraz Saleem wrote:
> > > 
> > > > This is not quite how our DB logic works. There are additional HW
> > > > steps and nuances in the flow. Unfortunately, to explain this, we
> > > > need to provide details of our internal HW flow for the DB logic. We
> > > > are unable to do so at this time.
> > > 
> > > Well, it is very problematic to help you define what a cross-arch
> > > barrier should do if you can't explain what you need to have happen
> > > relative to PCI-E.
> > 
> > Unfortunately, we can help with this only at the point when this information 
> > is made public. If you must have an explanation for all barriers defined in 
> > utils, an option here is to leave this barrier in i40iw and migrate it to 
> > utils when documentation is available. 
> 
> Well, it is impossible to document what other arches are expected to
> do if you can't define what you need.
> 
> Talking about the CPU alone does not define the interaction required
> with PCI.
> 
> The reason we have these special barriers and do not just use C11's
> atomic_thread_fence is specifically because some arches make a small
> distinction on ordering relative to PCI and ordering relative to other
> CPUs.
> 
> > > > Mfence guarantees that load won't be reordered before the store, and
> > > > thus we are using it.
> > > 
> > > If that is all then the driver can use LFENCE and the
> > > udma_from_device_barrier() .. Is that OK?
> > 
> > The write valid WQE needs to be globally visible before read tail. LFENCE does not 
> > guarantee this. MFENCE does.
> 
> I was thinking
> 
> SFENCE
> LFENCE
> 
> So, okay, here are two more choices.
> 
> 1) Use a C11 barrier:
> 
>  atomic_thread_fence(memory_order_seq_cst);
> 
> This produces what you want on x86-64:
> 
> 0000000000000590 <i40iw_qp_post_wr>:
>      590:       0f ae f0                mfence 
>      593:       48 8b 47 28             mov    0x28(%rdi),%rax
>      597:       8b 57 40                mov    0x40(%rdi),%edx
> 
> x86-32 does:
> 
> 00000600 <i40iw_qp_post_wr>:
>      600:       53                      push   %ebx
>      601:       8b 44 24 08             mov    0x8(%esp),%eax
>      605:       f0 83 0c 24 00          lock orl $0x0,(%esp)
> 
> Which is basically the same as the "lock; addl $0,0(%%esp)" the old
> macros used.
> 
> Take your chances on other arches.
> 
> 2) Explicitly optimize x86 and have other arches skip the
>    shadow optimization
> 
> Here is a patch that does #2, I'm guessing about the implementation..
> 
> What do you think?

Is __this__ C11 barrier a compiler barrier as well?
#1, using atomic_thread_fence(memory_order_seq_cst) for all archs, is preferred.
  
 
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 06/14] i40iw: Get rid of unique barrier macros
       [not found]                                         ` <20170307224622.GA45028-GOXS9JX10wfOxmVO0tvppfooFf0ArEBIu+b9c/7xato@public.gmane.org>
@ 2017-03-07 22:50                                           ` Jason Gunthorpe
       [not found]                                             ` <20170307225027.GA20858-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2017-03-07 22:50 UTC (permalink / raw)
  To: Shiraz Saleem; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Nikolova, Tatyana E

On Tue, Mar 07, 2017 at 04:46:22PM -0600, Shiraz Saleem wrote:

> Is __this__ C11 barrier a compiler barrier as well?

Yes, of course.

> #1 is preferred using atomic_thread_fence(memory_order_seq_cst) for all 
> archs. 

Okay..

I will send a patch to fix this then?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 06/14] i40iw: Get rid of unique barrier macros
       [not found]                                             ` <20170307225027.GA20858-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-03-07 23:01                                               ` Shiraz Saleem
       [not found]                                                 ` <20170307230121.GA52428-GOXS9JX10wfOxmVO0tvppfooFf0ArEBIu+b9c/7xato@public.gmane.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Shiraz Saleem @ 2017-03-07 23:01 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Nikolova, Tatyana E

On Tue, Mar 07, 2017 at 03:50:27PM -0700, Jason Gunthorpe wrote:
> On Tue, Mar 07, 2017 at 04:46:22PM -0600, Shiraz Saleem wrote:
> 
> > Is __this__ C11 barrier a compiler barrier as well?
> 
> Yes, of course.
> 
> > #1 is preferred using atomic_thread_fence(memory_order_seq_cst) for all 
> > archs. 
> 
> Okay..
> 
> I will send a patch to fix this then?
> 

One more question - C11 atomics are only supported on GCC 4.9 and later? Users
would need to move to newer compilers. Yes?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 06/14] i40iw: Get rid of unique barrier macros
       [not found]                                                 ` <20170307230121.GA52428-GOXS9JX10wfOxmVO0tvppfooFf0ArEBIu+b9c/7xato@public.gmane.org>
@ 2017-03-07 23:11                                                   ` Jason Gunthorpe
       [not found]                                                     ` <20170307231145.GB20858-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2017-03-07 23:11 UTC (permalink / raw)
  To: Shiraz Saleem; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Nikolova, Tatyana E

On Tue, Mar 07, 2017 at 05:01:21PM -0600, Shiraz Saleem wrote:

> One more question - C11 atomics are only supported on latest GCC
> 4.9? User would need to move newer compilers. Yes?

No, we have an emulation layer for this.

Older compilers will translate it to the gcc built-in
__sync_synchronize(), which is still MFENCE.
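
Roughly, the compat shim does something like this (a sketch, not the exact
header we ship):

	#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L && \
	    !defined(__STDC_NO_ATOMICS__)
	#include <stdatomic.h>
	#else
	/* conservative fallback: every fence becomes a full barrier (MFENCE on x86) */
	#define atomic_thread_fence(order) __sync_synchronize()
	#endif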

I tested that this works as expected on RH 6's built in compiler.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 06/14] i40iw: Get rid of unique barrier macros
       [not found]             ` <20170306190751.GA30663-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-03-07 23:16               ` Shiraz Saleem
  0 siblings, 0 replies; 65+ messages in thread
From: Shiraz Saleem @ 2017-03-07 23:16 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Nikolova, Tatyana E

On Mon, Mar 06, 2017 at 12:07:51PM -0700, Jason Gunthorpe wrote:
> On Mon, Mar 06, 2017 at 12:18:09PM -0600, Shiraz Saleem wrote:
> > On Thu, Feb 16, 2017 at 12:23:01PM -0700, Jason Gunthorpe wrote:
> > > Use our standard versions from util instead. There doesn't seem
> > > to be anything tricky here, but the inlined versions were like our
> > > wc_wmb() barriers, not the wmb().
> > > 
> > > There appears to be no WC memory in this driver, so despite the comments,
> > > these barriers are also making sure that user DMA data is flushed out. Make
> > > them all wmb()
> > > 
> > > Guess at where the missing rmb() should be.
> > > 
> > > Signed-off-by: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
> > 
> > 
> > > @@ -780,6 +780,8 @@ static enum i40iw_status_code i40iw_cq_poll_completion(struct i40iw_cq_uk *cq,
> > >  	if (polarity != cq->polarity)
> > >  		return I40IW_ERR_QUEUE_EMPTY;
> > >  
> > > +	udma_from_device_barrier();
> > > +
> > 
> > What is the need for the barrier here?
> 
> Every driver doing DMA needs to have a rmb(), I guessed this is
> approximately the right place to put it for i10iw.
> 
> Within the PCI producer/consumer model the rmb needs to be placed
> after the CPU observes a DMA write that proves other DMA writes have
> occured.
> 
> It serializes the CPU with the write stream from PCI and prevents the
> compiler/CPU from re-ordering dependent loads to be before the proving
> load.
> 

OK. This change makes sense. ACK.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 06/14] i40iw: Get rid of unique barrier macros
       [not found]                                                     ` <20170307231145.GB20858-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-03-07 23:23                                                       ` Shiraz Saleem
  0 siblings, 0 replies; 65+ messages in thread
From: Shiraz Saleem @ 2017-03-07 23:23 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Nikolova, Tatyana E

On Tue, Mar 07, 2017 at 04:11:45PM -0700, Jason Gunthorpe wrote:
> On Tue, Mar 07, 2017 at 05:01:21PM -0600, Shiraz Saleem wrote:
> 
> > One more question - C11 atomics are only supported on latest GCC
> > 4.9? User would need to move newer compilers. Yes?
> 
> No, we have an emulation layer for this.
> 
> Older compilers will translate to the gcc built in
> __sync_synchronize() which is still MFENCE.
> 
> I tested that this works as expected on RH 6's built in compiler.
>

OK, great. Then let's proceed with a revised patch including the C11 barrier.
Thank you, Jason!

 
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH rdma-core 07/14] mlx4: Update to use new udma write barriers
       [not found]                             ` <20170307191824.GD2228-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-03-08 21:27                               ` Yishai Hadas
       [not found]                                 ` <6571cf34-63b9-7b83-ddb0-9279e7e20fa9-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Yishai Hadas @ 2017-03-08 21:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yishai Hadas, Matan Barak,
	Majd Dibbiny, Doug Ledford

On 3/7/2017 9:18 PM, Jason Gunthorpe wrote:
>>> +static inline void mmio_wc_spinunlock(pthread_spinlock_t *lock)
>>> +{
>>> +	/* On x86 the lock is enough for strong ordering, but the SFENCE
>>> +	 * encourages the WC buffers to flush out more quickly (Yishai:
>>> +	 * confirm?) */
>>
>> This macro can't do both and should be dropped, see above.
>
> I don't understand this comment? Why can't it do both?

This macro does the flush and then immediately does the spin_unlock, without 
letting any command be executed between them; logically there is no 
reason for such a limitation.

Because of that, any command that needs the lock must be done before the 
flush, which delays the hardware from seeing the BF data immediately.

Specifically, in both mlx4 and mlx5 the original code flushed just 
after the BF copy, then toggled the bf_offset under the spinlock (i.e. 
ctx->bf_offset ^= ctx->bf_buf_size) and finally unlocked.

That's why this macro is bad; I'll drop it and preserve the 
original logic with the other new macros.
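
I.e. for mlx4 I would like to keep roughly this shape (a sketch of what I
plan to send, not the final patch):

	mmio_wc_spinlock(&ctx->bf_lock);
	mlx4_bf_copy(ctx->bf_page + ctx->bf_offset, (unsigned long *) ctrl,
		     align(size * 16, 64));
	mmio_flush_writes();			/* let the HCA see the BF data right away */
	ctx->bf_offset ^= ctx->bf_buf_size;	/* still under the lock */
	pthread_spin_unlock(&ctx->bf_lock);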

A similar logical issue might exist with the mmio_wc_spinlock macro, in case 
a driver wants to issue some command after taking the lock but before the 
start_mmio; currently there is no such use case, so I'm fine with keeping it.

> diff --git a/providers/mlx4/qp.c b/providers/mlx4/qp.c
> index 77a4a34576cb69..a22fca7c6f1360 100644
> --- a/providers/mlx4/qp.c
> +++ b/providers/mlx4/qp.c
> @@ -477,23 +477,20 @@ out:
>  		ctrl->owner_opcode |= htonl((qp->sq.head & 0xffff) << 8);
>
>  		ctrl->bf_qpn |= qp->doorbell_qpn;
> +		++qp->sq.head;
> +

This is a change compared to the original code, where there was a wmb() 
before that line to prevent the compiler from reordering, so that 
ctrl->owner_opcode gets the correct value of sq.head.
I will add udma_to_device_barrier() here, which is logically expected to do 
the same.

>  		/*
>  		 * Make sure that descriptor is written to memory
>  		 * before writing to BlueFlame page.
>  		 */
> -		mmio_wc_start();
> -
> -		++qp->sq.head;
> -
> -		pthread_spin_lock(&ctx->bf_lock);
> +		mmio_wc_spinlock(&ctx->bf_lock);
>
>  		mlx4_bf_copy(ctx->bf_page + ctx->bf_offset, (unsigned long *) ctrl,
>  			     align(size * 16, 64));
> -		mmio_flush_writes();
>
>  		ctx->bf_offset ^= ctx->bf_buf_size;
>
> -		pthread_spin_unlock(&ctx->bf_lock);
> +		mmio_wc_spinunlock(&ctx->bf_lock);
>  	} else if (nreq) {
>  		qp->sq.head += nreq;
>
> diff --git a/providers/mlx5/qp.c b/providers/mlx5/qp.c
> index d7087d986ce79f..0f1ec0ef2b094b 100644
> --- a/providers/mlx5/qp.c
> +++ b/providers/mlx5/qp.c
> @@ -931,11 +931,11 @@ out:
>
>  		/* Make sure that the doorbell write happens before the memcpy
>  		 * to WC memory below */
> -		mmio_wc_start();
> -
>  		ctx = to_mctx(ibqp->context);
> -		if (bf->need_lock)
> -			mlx5_spin_lock(&bf->lock);
> +		if (bf->need_lock && !mlx5_single_threaded)

We can consider a pre-patch that encapsulates the mlx5_single_threaded flag 
into bf->need_lock; it would avoid the extra if statement that this new code 
introduces in the data path.

> +			mmio_wc_spinlock(&bf->lock.lock);
> +		else
> +			mmio_wc_start();
>
>  		if (!ctx->shut_up_bf && nreq == 1 && bf->uuarn &&
>  		    (inl || ctx->prefer_bf) && size > 1 &&
> @@ -955,10 +955,11 @@ out:
>  		 * the mmio_flush_writes is CPU local, this will result in the HCA seeing
>  		 * doorbell 2, followed by doorbell 1.
>  		 */
> -		mmio_flush_writes();
>  		bf->offset ^= bf->buf_size;
> -		if (bf->need_lock)
> -			mlx5_spin_unlock(&bf->lock);
> +		if (bf->need_lock && !mlx5_single_threaded)

Same as above.
> +			mmio_wc_spinunlock(&bf->lock.lock);
> +		else
> +			mmio_flush_writes();
>  	}
>
> +static inline void mmio_wc_spinunlock(pthread_spinlock_t *lock)
> +{
> +	/* On x86 the lock is enough for strong ordering, but the SFENCE
> +	 * encourages the WC buffers to flush out more quickly (Yishai:
> +	 * confirm?) */
> +	mmio_flush_writes();
> +	pthread_spin_unlock(lock);

See above comment on that.
> +}
> +
>  #endif
>


* Re: [PATCH rdma-core 07/14] mlx4: Update to use new udma write barriers
       [not found]                                 ` <6571cf34-63b9-7b83-ddb0-9279e7e20fa9-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2017-03-08 21:56                                   ` Jason Gunthorpe
       [not found]                                     ` <20170308215609.GB4109-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2017-03-08 21:56 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yishai Hadas, Matan Barak,
	Majd Dibbiny, Doug Ledford

On Wed, Mar 08, 2017 at 11:27:51PM +0200, Yishai Hadas wrote:

> As a result, any command that needs the lock must be done before the flush,
> which delays the hardware from seeing the BF data immediately.

The counterpoint is that the unlock macro can combine the WC flushing
barrier with the spinlock atomics, reducing the amount of global
fencing. If you remove the macro you remove that optimization.
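
For reference, the combined form would be roughly (a sketch; the x86 branch
assumes the unlocking store is enough to promptly push out the WC buffer,
which still needs to be confirmed):

	static inline void mmio_wc_spinunlock(pthread_spinlock_t *lock)
	{
	#if !defined(__i386__) && !defined(__x86_64__)
		mmio_flush_writes();
	#endif
		/* on x86, rely on the serialization of the unlocking store */
		pthread_spin_unlock(lock);
	}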

Why not do this:

-               mlx4_bf_copy(ctx->bf_page + ctx->bf_offset, (unsigned long *) ctrl,
-                            align(size * 16, 64));
-
+               tmp_bf_offset = ctx->bf_offset;
                ctx->bf_offset ^= ctx->bf_buf_size;
+               mlx4_bf_copy(ctx->bf_page + tmp_bf_offset, (unsigned long *) ctrl,
+                            align(size * 16, 64));
 
Which lets the load/store/xor run concurrently with filling of the write
buffer.

> A similar logical issue might exist with the mmio_wc_spinlock macro, in case
> a driver wants to issue some command after taking the lock but before the
> start_mmio,

I think you've missed the point. The idea is to combine the required
barrier for DMA with the required barrier for the spinlock. We can't
do that if we allow code to run between those steps.

> >diff --git a/providers/mlx4/qp.c b/providers/mlx4/qp.c
> >index 77a4a34576cb69..a22fca7c6f1360 100644
> >+++ b/providers/mlx4/qp.c
> >@@ -477,23 +477,20 @@ out:
> > 		ctrl->owner_opcode |= htonl((qp->sq.head & 0xffff) << 8);
> >
> > 		ctrl->bf_qpn |= qp->doorbell_qpn;
> >+		++qp->sq.head;
> >+
> 
> This is a change comparing the original code where was wmb() before that
> line to enforce the compiler not to change the order so that
> ctrl->owner_opcode will get the correct value of sq.head.

My reading was that the placement is not relevant; all access to
sq.head appears to be protected by pthread_spin_lock(&qp->sq.lock).

And sq.head is not DMA memory.

So ctrl->owner_opcode will get the correct value in both cases.

Just a few lines down we see:

		qp->sq.head += nreq;
		udma_to_device_barrier();

So either there is a bug, or the order doesn't matter.

> I will add udma_to_device_barrier() here, which is logically expected to do
> the same.

These new barriers are *only* for DMA memory. If it is the case that
qp->sq.head is being accessed concurrently without locking then it
*must* be converted to use stdatomic, which provides the necessary
arch-specific fencing.
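
i.e. something along these lines, a sketch only (as noted above, sq.head
looks lock-protected today, so this would only apply if lockless access were
actually the case; the struct and function names are made up):

	#include <stdatomic.h>
	#include <stdint.h>

	struct sq_state {
		_Atomic uint32_t head;
	};

	static void sq_publish(struct sq_state *sq, uint32_t next)
	{
		/* release: prior WQE stores become visible before the new head */
		atomic_store_explicit(&sq->head, next, memory_order_release);
	}

	static uint32_t sq_observe(struct sq_state *sq)
	{
		/* acquire: loads after this see everything published with head */
		return atomic_load_explicit(&sq->head, memory_order_acquire);
	}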

> > 		ctx = to_mctx(ibqp->context);
> >-		if (bf->need_lock)
> >-			mlx5_spin_lock(&bf->lock);
> >+		if (bf->need_lock && !mlx5_single_threaded)
> 
> We can consider a pre-patch that encapsulates the mlx5_single_threaded flag
> into bf->need_lock; it would avoid the extra if statement that this new code
> introduces in the data path.

Yes, I was thinking the same

> >+	/* On x86 the lock is enough for strong ordering, but the SFENCE
> >+	 * encourages the WC buffers to flush out more quickly (Yishai:
> >+	 * confirm?) */
> >+	mmio_flush_writes();
> >+	pthread_spin_unlock(lock);
> 
> See above comment on that.

The question here is whether the atomic store inside pthread_spin_unlock is
enough to promptly flush the WC buffer. If so, we should drop the
SFENCE for x86 and this will be faster overall.

It is the same reasoning you presented as to why we are able to drop
the SFENCE on the spin lock acquire.

Jason

* Re: [PATCH rdma-core 07/14] mlx4: Update to use new udma write barriers
       [not found]                                     ` <20170308215609.GB4109-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-03-09 15:42                                       ` Yishai Hadas
       [not found]                                         ` <4dcf0cea-3652-0df2-9d98-74e258e6170a-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Yishai Hadas @ 2017-03-09 15:42 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yishai Hadas, Matan Barak,
	Majd Dibbiny, Doug Ledford

On 3/8/2017 11:56 PM, Jason Gunthorpe wrote:
> On Wed, Mar 08, 2017 at 11:27:51PM +0200, Yishai Hadas wrote:
>
>> As a result, any command that needs the lock must be done before the
>> flush, which delays the hardware from seeing the BF data immediately.
>
> The counterpoint is that the unlock macro can combine the WC flushing
> barrier with the spinlock atomics, reducing the amount of global
> fencing. If you remove the macro you remove that optimization.

The optimization is done as part of mmio_wc_spinlock() for x86; this 
macro is still used.

>
> Why not do this:
>
> -               mlx4_bf_copy(ctx->bf_page + ctx->bf_offset, (unsigned long *) ctrl,
> -                            align(size * 16, 64));
> -
> +               tmp_bf_offset = ctx->bf_offset;
>                 ctx->bf_offset ^= ctx->bf_buf_size;

The above two statements still delay the write to the NIC compared to the 
original code, where it was done in a single statement after mlx4_bf_copy().

> +               mlx4_bf_copy(ctx->bf_page + tmp_bf_offset, (unsigned long *) ctrl,
> +                            align(size * 16, 64));
>

The candidate mlx4 code will be as follows; similar logic will apply in mlx5.

@@ -477,22 +474,18 @@ out:
                 ctrl->owner_opcode |= htonl((qp->sq.head & 0xffff) << 8);

                 ctrl->bf_qpn |= qp->doorbell_qpn;
+               ++qp->sq.head;
                 /*
                  * Make sure that descriptor is written to memory
                  * before writing to BlueFlame page.
                  */
-               mmio_wc_start();
-
-               ++qp->sq.head;
-
-               pthread_spin_lock(&ctx->bf_lock);
+               mmio_wc_spinlock(&ctx->bf_lock);

                 mlx4_bf_copy(ctx->bf_page + ctx->bf_offset,
                              (unsigned long *) ctrl, align(size * 16, 64));

                 mmio_flush_writes();
                 ctx->bf_offset ^= ctx->bf_buf_size;
                 pthread_spin_unlock(&ctx->bf_lock);
         } else if (nreq) {
                 qp->sq.head += nreq;

diff --git a/util/udma_barrier.h b/util/udma_barrier.h
index 9e73148..ec14dd3 100644
--- a/util/udma_barrier.h
+++ b/util/udma_barrier.h
@@ -33,6 +33,8 @@
  #ifndef __UTIL_UDMA_BARRIER_H
  #define __UTIL_UDMA_BARRIER_H

+#include <pthread.h>
+
  /* Barriers for DMA.

   These barriers are explicitly only for use with user DMA operations. If you
@@ -222,4 +224,17 @@
  */
  #define mmio_ordered_writes_hack() mmio_flush_writes()

+/* Higher Level primitives */
+
+/* Do mmio_wc_start and grab a spinlock */
+static inline void mmio_wc_spinlock(pthread_spinlock_t *lock)
+{
+       pthread_spin_lock(lock);
+#if !defined(__i386__) && !defined(__x86_64__)
+       /* For x86 the serialization within the spin lock is enough to
+        * strongly order WC and other memory types. */
+       mmio_wc_start();
+#endif
+}
+

* Re: [PATCH rdma-core 07/14] mlx4: Update to use new udma write barriers
       [not found]                                         ` <4dcf0cea-3652-0df2-9d98-74e258e6170a-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2017-03-09 17:03                                           ` Jason Gunthorpe
       [not found]                                             ` <20170309170320.GA12694-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2017-03-09 17:03 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yishai Hadas, Matan Barak,
	Majd Dibbiny, Doug Ledford

On Thu, Mar 09, 2017 at 05:42:19PM +0200, Yishai Hadas wrote:

> >The counter point is that the unlock macro can combine the WC flushing
> >barrier with the spinlock atomics, reducing the amount of global
> >fencing. If you remove the macro your remove that optimization.
> 
> The optimization is done as part of mmio_wc_spinlock() for x86; this macro
> is still used.

I'm talking about optimizing the unlock too.

On x86 the unlocking store itself probably flushes the WC buffers, just
like for the lock, so avoiding the unlocking SFENCE entirely will increase
throughput further, but with a bit more latency until the flush.

> >Why not do this:
> >
> >-               mlx4_bf_copy(ctx->bf_page + ctx->bf_offset, (unsigned long *) ctrl,
> >-                            align(size * 16, 64));
> >-
> >+               tmp_bf_offset = ctx->bf_offset;
> >                ctx->bf_offset ^= ctx->bf_buf_size;
> 
> The above two statements still delay the write to the NIC compared to the
> original code, where it was done in a single statement after mlx4_bf_copy().

It is that simple: look at the assembly; my version eliminates an extra
load, for instance. So we get one 'cycle' better on throughput and
one cycle worse on latency to the SFENCE.

But.. this routine was obviously never written to optimize latency to
the SFENCE, e.g. why isn't '++qp->sq.head;' pushed to after the SFENCE if
that is so important? Again, pushing it after would improve
latency but hurt throughput.

.. and if you are so concerned about latency to SFENCE you should be
super excited about the barrier changes I sent you, as those will
improve that by a few cycles also.

I honestly think you are trying far too much to pointlessly preserve
the exact original code...

If you want to wreck the API like this, I would rather do it supported
by actual numbers. Add some TSC measurements in this area and see what
the different scenarios cost.
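
Even something as crude as this would do (a sketch; post_fn is just a
stand-in for the doorbell/BF-copy path being measured):

	#include <stdio.h>
	#include <x86intrin.h>

	static void measure(void (*post_fn)(void))
	{
		unsigned long long t0 = __rdtsc();
		post_fn();	/* the code path under test */
		unsigned long long t1 = __rdtsc();
		printf("cycles: %llu\n", t1 - t0);
	}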

IMHO this stuff is already hard enough; having a simple-to-use,
symmetric API is more valuable.

> +/* Higher Level primitives */

If you are going to send this as a patch, please include my updated
comment.

> +/* Do mmio_wc_start and grab a spinlock */
> +static inline void mmio_wc_spinlock(pthread_spinlock_t *lock)
> +{
> +       pthread_spin_lock(lock);
> +#if !defined(__i386__) && !defined(__x86_64__)
> +       /* For x86 the serialization within the spin lock is enough to
> +        * strongly order WC and other memory types. */
> +       mmio_wc_start();
> +#endif

I would like to see the unlock inline still present in the header, for
clarity to the reader about what the expected pattern is, and a comment in
mlx4/mlx5 indicating that they are not using the unlock macro directly in
order to reduce latency to the flush.

Jason

* Re: [PATCH rdma-core 07/14] mlx4: Update to use new udma write barriers
       [not found]                                             ` <20170309170320.GA12694-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-03-13 15:17                                               ` Yishai Hadas
  0 siblings, 0 replies; 65+ messages in thread
From: Yishai Hadas @ 2017-03-13 15:17 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yishai Hadas, Matan Barak,
	Majd Dibbiny, Doug Ledford

On 3/9/2017 7:03 PM, Jason Gunthorpe wrote:
> I honestly think you are trying far too much to pointlessly preserve
> the exact original code...

At this stage we would like to fix the degradation that was introduced 
in the previous barrier series. Further improvements will come as 
incremental patches after the required performance testing is done.

> If you are going to send this as a patch, please include my updated
> comment.

I put a comment in mlx4/5 noting that this code is latency oriented 
and flushes immediately.

>> +/* Do mmio_wc_start and grab a spinlock */
>> +static inline void mmio_wc_spinlock(pthread_spinlock_t *lock)
>> +{
>> +       pthread_spin_lock(lock);
>> +#if !defined(__i386__) && !defined(__x86_64__)
>> +       /* For x86 the serialization within the spin lock is enough to
>> +        * strongly order WC and other memory types. */
>> +       mmio_wc_start();
>> +#endif
>
> I would like to see the unlock inline still present in the header, for
> clarity to the reader about what the expected pattern is, and a comment in
> mlx4/mlx5 indicating that they are not using the unlock macro directly in
> order to reduce latency to the flush.

For now this macro is not used by mlx4 & mlx5, as noted before.

In addition, some work is still needed to verify whether the unlock is 
fully equivalent to flushing the data immediately, as you also wondered.
The lock macro is used for ordering, and as such can be relied on based on 
the Intel docs that were referenced.

If some provider needs this optimization, it can be added after 
finalizing the above work.


end of thread, other threads:[~2017-03-13 15:17 UTC | newest]

Thread overview: 65+ messages (download: mbox.gz / follow: Atom feed)
2017-02-16 19:22 [PATCH rdma-core 00/14] Revise the DMA barrier macros in ibverbs Jason Gunthorpe
     [not found] ` <1487272989-8215-1-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-02-16 19:22   ` [PATCH rdma-core 01/14] mlx5: Use stdatomic for the in_use barrier Jason Gunthorpe
2017-02-16 19:22   ` [PATCH rdma-core 02/14] Provide new names for the CPU barriers related to DMA Jason Gunthorpe
     [not found]     ` <1487272989-8215-3-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-02-16 22:07       ` Steve Wise
2017-02-17 16:37         ` Jason Gunthorpe
2017-02-16 19:22   ` [PATCH rdma-core 03/14] cxgb3: Update to use new udma write barriers Jason Gunthorpe
     [not found]     ` <1487272989-8215-4-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-02-16 21:20       ` Steve Wise
2017-02-16 21:45         ` Jason Gunthorpe
     [not found]           ` <20170216214527.GA13616-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-02-16 22:01             ` Steve Wise
2017-02-16 19:22   ` [PATCH rdma-core 04/14] cxgb4: " Jason Gunthorpe
     [not found]     ` <1487272989-8215-5-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-02-17 20:16       ` Steve Wise
2017-02-16 19:23   ` [PATCH rdma-core 05/14] hns: " Jason Gunthorpe
2017-02-16 19:23   ` [PATCH rdma-core 06/14] i40iw: Get rid of unique barrier macros Jason Gunthorpe
     [not found]     ` <1487272989-8215-7-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-03-01 17:29       ` Shiraz Saleem
     [not found]         ` <20170301172920.GA11340-GOXS9JX10wfOxmVO0tvppfooFf0ArEBIu+b9c/7xato@public.gmane.org>
2017-03-01 17:55           ` Jason Gunthorpe
     [not found]             ` <20170301175521.GB14791-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-03-01 22:14               ` Shiraz Saleem
     [not found]                 ` <20170301221420.GA18548-GOXS9JX10wfOxmVO0tvppfooFf0ArEBIu+b9c/7xato@public.gmane.org>
2017-03-01 23:05                   ` Jason Gunthorpe
     [not found]                     ` <20170301230506.GB2820-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-03-03 21:45                       ` Shiraz Saleem
     [not found]                         ` <20170303214514.GA12996-GOXS9JX10wfOxmVO0tvppfooFf0ArEBIu+b9c/7xato@public.gmane.org>
2017-03-03 22:22                           ` Jason Gunthorpe
     [not found]                             ` <20170303222244.GA678-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-03-06 19:16                               ` Shiraz Saleem
     [not found]                                 ` <20170306191631.GB34252-GOXS9JX10wfOxmVO0tvppfooFf0ArEBIu+b9c/7xato@public.gmane.org>
2017-03-06 19:40                                   ` Jason Gunthorpe
     [not found]                                     ` <20170306194052.GB31672-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-03-07 22:46                                       ` Shiraz Saleem
     [not found]                                         ` <20170307224622.GA45028-GOXS9JX10wfOxmVO0tvppfooFf0ArEBIu+b9c/7xato@public.gmane.org>
2017-03-07 22:50                                           ` Jason Gunthorpe
     [not found]                                             ` <20170307225027.GA20858-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-03-07 23:01                                               ` Shiraz Saleem
     [not found]                                                 ` <20170307230121.GA52428-GOXS9JX10wfOxmVO0tvppfooFf0ArEBIu+b9c/7xato@public.gmane.org>
2017-03-07 23:11                                                   ` Jason Gunthorpe
     [not found]                                                     ` <20170307231145.GB20858-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-03-07 23:23                                                       ` Shiraz Saleem
2017-03-06 18:18       ` Shiraz Saleem
     [not found]         ` <20170306181808.GA34252-GOXS9JX10wfOxmVO0tvppfooFf0ArEBIu+b9c/7xato@public.gmane.org>
2017-03-06 19:07           ` Jason Gunthorpe
     [not found]             ` <20170306190751.GA30663-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-03-07 23:16               ` Shiraz Saleem
2017-02-16 19:23   ` [PATCH rdma-core 07/14] mlx4: Update to use new udma write barriers Jason Gunthorpe
     [not found]     ` <1487272989-8215-8-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-02-20 17:46       ` Yishai Hadas
     [not found]         ` <206559e5-0488-f6d5-c4ec-bf560e0c3ba6-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2017-02-21 18:14           ` Jason Gunthorpe
     [not found]             ` <20170221181407.GA13138-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-03-06 14:57               ` Yishai Hadas
     [not found]                 ` <45d2b7da-9ad6-6b37-d0b2-00f7807966b4-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2017-03-06 17:31                   ` Jason Gunthorpe
     [not found]                     ` <20170306173139.GA11805-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-03-07 16:44                       ` Yishai Hadas
     [not found]                         ` <55bcc87e-b059-65df-8079-100120865ffb-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2017-03-07 19:18                           ` Jason Gunthorpe
     [not found]                             ` <20170307191824.GD2228-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-03-08 21:27                               ` Yishai Hadas
     [not found]                                 ` <6571cf34-63b9-7b83-ddb0-9279e7e20fa9-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2017-03-08 21:56                                   ` Jason Gunthorpe
     [not found]                                     ` <20170308215609.GB4109-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-03-09 15:42                                       ` Yishai Hadas
     [not found]                                         ` <4dcf0cea-3652-0df2-9d98-74e258e6170a-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2017-03-09 17:03                                           ` Jason Gunthorpe
     [not found]                                             ` <20170309170320.GA12694-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-03-13 15:17                                               ` Yishai Hadas
2017-02-16 19:23   ` [PATCH rdma-core 08/14] mlx5: " Jason Gunthorpe
     [not found]     ` <1487272989-8215-9-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-02-27 10:56       ` Yishai Hadas
     [not found]         ` <d5921636-1911-5588-8c59-620066bca01a-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2017-02-27 18:00           ` Jason Gunthorpe
     [not found]             ` <20170227180009.GL5891-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-02-28 16:02               ` Yishai Hadas
     [not found]                 ` <2969cce4-8b51-8fcf-f099-2b42a6d40a9c-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2017-02-28 17:06                   ` Jason Gunthorpe
     [not found]                     ` <20170228170658.GA17995-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-03-02  9:34                       ` Yishai Hadas
     [not found]                         ` <24bf0e37-e032-0862-c5b9-b5a40fcfb343-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2017-03-02 17:12                           ` Jason Gunthorpe
     [not found]                             ` <20170302171210.GA8595-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-03-06 14:19                               ` Yishai Hadas
2017-02-16 19:23   ` [PATCH rdma-core 09/14] nes: " Jason Gunthorpe
2017-02-16 19:23   ` [PATCH rdma-core 10/14] mthca: Update to use new mmio " Jason Gunthorpe
2017-02-16 19:23   ` [PATCH rdma-core 11/14] ocrdma: Update to use new udma " Jason Gunthorpe
     [not found]     ` <1487272989-8215-12-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-02-18 16:21       ` Devesh Sharma
2017-02-16 19:23   ` [PATCH rdma-core 12/14] qedr: " Jason Gunthorpe
     [not found]     ` <1487272989-8215-13-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-02-23 13:49       ` Amrani, Ram
     [not found]         ` <SN1PR07MB2207DE206738E6DD8511CEA1F8530-mikhvbZlbf8TSoR2DauN2+FPX92sqiQdvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-02-23 17:30           ` Jason Gunthorpe
     [not found]             ` <20170223173047.GC6688-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-02-24 10:01               ` Amrani, Ram
2017-02-16 19:23   ` [PATCH rdma-core 13/14] vmw_pvrdma: " Jason Gunthorpe
     [not found]     ` <1487272989-8215-14-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-02-17 18:05       ` Adit Ranadive
2017-02-16 19:23   ` [PATCH rdma-core 14/14] Remove the old barrier macros Jason Gunthorpe
     [not found]     ` <1487272989-8215-15-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-02-23 13:33       ` Amrani, Ram
     [not found]         ` <SN1PR07MB22070A48ACD50848267A5AD8F8530-mikhvbZlbf8TSoR2DauN2+FPX92sqiQdvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-02-23 16:59           ` Jason Gunthorpe
2017-02-28 16:00   ` [PATCH rdma-core 00/14] Revise the DMA barrier macros in ibverbs Doug Ledford
     [not found]     ` <1488297611.86943.215.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-02-28 16:38       ` Majd Dibbiny
     [not found]         ` <C6384D48-FC47-4046-8025-462E1CB02A34-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2017-02-28 17:47           ` Doug Ledford
