All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH rdma-core 0/5] Common MMIO accessors for rdma-core
@ 2017-04-13 22:38 Jason Gunthorpe
       [not found] ` <1492123127-6266-1-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Jason Gunthorpe @ 2017-04-13 22:38 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA

I've been threatening to do this for a while, here is a refactoring of the
MMIO accessors into a common set of functions that unify all the weird
options accumulated in the various providers:

- Normal 64 bit accessors
- 32 bit emulation of 64 bit write
- ia32 XMM support for 64 bit write
- s390 syscall accessors

This is a big complex transformation that has lots of possibility for little
mistakes, it would be good for other people to look closely at this before we
merge it.

This is available on my github:

https://github.com/jgunthorpe/rdma-plumbing/tree/mmio

For this starting point I only converted the Mellanox drivers, because they
are all basically the same.

This fixes one bug in the mlx_read_clock function which attempted to double
read a register without telling the compiler, which was miscompiling :(

I have another series on top of this:

https://github.com/jgunthorpe/rdma-plumbing/tree/sparse4

Which does the last few things to make the Mellanox drivers sparse clean.

Jason Gunthorpe (5):
  util: Add common mmio macros
  mlx4: Use util/mmio.h
  mlx5: Use util/mmio.h
  mthca: Use util/mmio.h
  Add mmio_memcpy_x64

 CMakeLists.txt                |  16 +++
 buildlib/config.h.in          |   2 +
 buildlib/rdma_functions.cmake |   4 +-
 providers/mlx4/cq.c           |  14 ++-
 providers/mlx4/doorbell.h     |  70 -----------
 providers/mlx4/mlx4.c         |   1 -
 providers/mlx4/mlx4.h         |   1 -
 providers/mlx4/mmio.h         | 116 ------------------
 providers/mlx4/qp.c           |  10 +-
 providers/mlx4/srq.c          |   1 -
 providers/mlx4/verbs.c        |   9 +-
 providers/mlx5/cq.c           |  13 ++-
 providers/mlx5/doorbell.h     |  67 -----------
 providers/mlx5/mlx5.c         |   2 -
 providers/mlx5/mlx5.h         |   1 -
 providers/mlx5/qp.c           |  22 ++--
 providers/mlx5/srq.c          |   1 -
 providers/mlx5/verbs.c        |   9 +-
 providers/mthca/cq.c          |  38 +++---
 providers/mthca/doorbell.h    | 109 +----------------
 providers/mthca/qp.c          |  44 ++++---
 providers/mthca/srq.c         |  14 ++-
 util/CMakeLists.txt           |  20 +++-
 util/dummy.c                  |   0
 util/mmio.c                   |  83 +++++++++++++
 util/mmio.h                   | 265 ++++++++++++++++++++++++++++++++++++++++++
 26 files changed, 480 insertions(+), 452 deletions(-)
 delete mode 100644 providers/mlx4/doorbell.h
 delete mode 100644 providers/mlx4/mmio.h
 delete mode 100644 providers/mlx5/doorbell.h
 create mode 100644 util/dummy.c
 create mode 100644 util/mmio.c
 create mode 100644 util/mmio.h

-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH rdma-core 1/5] util: Add common mmio macros
       [not found] ` <1492123127-6266-1-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-04-13 22:38   ` Jason Gunthorpe
       [not found]     ` <1492123127-6266-2-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2017-04-13 22:38   ` [PATCH rdma-core 2/5] mlx4: Use util/mmio.h Jason Gunthorpe
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Jason Gunthorpe @ 2017-04-13 22:38 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA

The macros are a mashup of the Mellanox driver macros:
 - An alternate implementation for S390 is provided
 - The ia32 XMM based 64 bit store emulation is switched
   in if supported
 - The macros are 'relaxed' ordering, so no implicit barrier
   overheads
 - If UDMA is not available then MMIO macros are also switched off,
   this is because the 32 bit emulation relies on the udma barriers.

Signed-off-by: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
---
 CMakeLists.txt                |  16 ++++
 buildlib/config.h.in          |   2 +
 buildlib/rdma_functions.cmake |   4 +-
 util/CMakeLists.txt           |  20 +++-
 util/dummy.c                  |   0
 util/mmio.c                   |  83 ++++++++++++++++
 util/mmio.h                   | 214 ++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 336 insertions(+), 3 deletions(-)
 create mode 100644 util/dummy.c
 create mode 100644 util/mmio.c
 create mode 100644 util/mmio.h

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 1ff3189c9d295e..5552872e8ca707 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -181,6 +181,19 @@ set(NO_STRICT_ALIASING_FLAGS "")
 RDMA_AddOptCFlag(NO_STRICT_ALIASING_FLAGS HAVE_NO_STRICT_ALIASING
   "-fno-strict-aliasing")
 
+CHECK_C_SOURCE_COMPILES("
+ #include <unistd.h>
+
+ void entry(void);
+
+ static void do_entry(void) {}
+ void entry(void) __attribute__((ifunc(\"resolve_entry\")));
+ static void *resolve_entry(void) {return &do_entry;}
+
+ int main(int argc,const char *argv[]) { entry(); }"
+  HAVE_FUNC_ATTRIBUTE_IFUNC
+  FAIL_REGEX "warning")
+
 # The code does not do the racy fcntl if the various CLOEXEC's are not
 # supported so it really doesn't work right if this isn't available. Thus hard
 # require it.
@@ -418,6 +431,9 @@ message(STATUS "Missing Optional Items:")
 if (NOT HAVE_FUNC_ATTRIBUTE_ALWAYS_INLINE)
   message(STATUS " Compiler attribute always_inline NOT supported")
 endif()
+if (NOT HAVE_FUNC_ATTRIBUTE_IFUNC)
+  message(STATUS " Compiler attribute ifunc NOT supported")
+endif()
 if (NOT HAVE_COHERENT_DMA)
   message(STATUS " Architecture NOT able to do coherent DMA (check util/udma_barrier.h) some providers disabled!")
 endif()
diff --git a/buildlib/config.h.in b/buildlib/config.h.in
index e4e7b9ee70c857..2eb16be764630e 100644
--- a/buildlib/config.h.in
+++ b/buildlib/config.h.in
@@ -28,6 +28,8 @@
 // FIXME This has been supported in compilers forever, we should just fail to build on such old systems.
 #cmakedefine HAVE_FUNC_ATTRIBUTE_ALWAYS_INLINE 1
 
+#cmakedefine HAVE_FUNC_ATTRIBUTE_IFUNC 1
+
 #cmakedefine HAVE_WORKING_IF_H 1
 
 @SIZEOF_LONG_CODE@
diff --git a/buildlib/rdma_functions.cmake b/buildlib/rdma_functions.cmake
index 01997b51468f55..936e1b256816ef 100644
--- a/buildlib/rdma_functions.cmake
+++ b/buildlib/rdma_functions.cmake
@@ -6,8 +6,8 @@
 # Global list of pairs of (SHARED STATIC) libary target names
 set(RDMA_STATIC_LIBS "" CACHE INTERNAL "Doc" FORCE)
 
-set(COMMON_LIBS_PIC ccan_pic)
-set(COMMON_LIBS ccan)
+set(COMMON_LIBS_PIC ccan_pic rdma_util_pic)
+set(COMMON_LIBS ccan rdma_util)
 
 # Create a symlink at filename DEST
 # If the directory containing DEST does not exist then it is created
diff --git a/util/CMakeLists.txt b/util/CMakeLists.txt
index 71e33ac3baaa9b..581bf6822ec39f 100644
--- a/util/CMakeLists.txt
+++ b/util/CMakeLists.txt
@@ -1,5 +1,23 @@
 publish_internal_headers(util
   compiler.h
-  udma_barrier.h
   util.h
   )
+
+# The empty dummy.c is only needed so that cmake always has something to build
+# into the library.
+set(C_FILES dummy.c)
+
+if (HAVE_COHERENT_DMA)
+  publish_internal_headers(util
+    mmio.h
+    udma_barrier.h
+    )
+
+  set(C_FILES ${C_FILES}
+    mmio.c
+    )
+endif()
+
+add_library(rdma_util STATIC ${C_FILES})
+add_library(rdma_util_pic STATIC ${C_FILES})
+set_property(TARGET rdma_util_pic PROPERTY POSITION_INDEPENDENT_CODE TRUE)
diff --git a/util/dummy.c b/util/dummy.c
new file mode 100644
index 00000000000000..e69de29bb2d1d6
diff --git a/util/mmio.c b/util/mmio.c
new file mode 100644
index 00000000000000..757a46ce5418d8
--- /dev/null
+++ b/util/mmio.c
@@ -0,0 +1,83 @@
+/* GPLv2 or OpenIB.org BSD (MIT) See COPYING file */
+#include <util/mmio.h>
+#include <util/udma_barrier.h>
+#include <config.h>
+
+#include <pthread.h>
+#include <stdbool.h>
+
+#if SIZEOF_LONG != 8
+
+static pthread_spinlock_t mmio_spinlock;
+
+static __attribute__((constructor)) void lock_constructor(void)
+{
+	pthread_spin_init(&mmio_spinlock, PTHREAD_PROCESS_PRIVATE);
+}
+
+/* When the arch does not have a 32 bit store we provide an emulation that
+   does two stores in address ascending order while holding a global
+   spinlock. */
+static void pthread_mmio_write64_be(void *addr, __be64 val)
+{
+	__be32 first_dword = htobe32(be64toh(val) >> 32);
+	__be32 second_dword = htobe32(be64toh(val));
+
+	/* The WC spinlock, by definition, provides global ordering for all UC
+	   and WC stores within the critical region. */
+	mmio_wc_spinlock(&mmio_spinlock);
+
+	mmio_write32_be(addr, first_dword);
+	mmio_write32_be(addr + 4, second_dword);
+
+	mmio_wc_spinunlock(&mmio_spinlock);
+}
+
+#if defined(__i386__)
+#include <xmmintrin.h>
+#include <cpuid.h>
+
+/* For ia32 we have historically emitted movlps SSE instructions to do the 64
+   bit operations. */
+static void __attribute__((target("sse")))
+sse_mmio_write64_be(void *addr, __be64 val)
+{
+	__m128 tmp = {};
+	tmp = _mm_loadl_pi(tmp, (__force __m64 *)&val);
+	_mm_storel_pi((__m64 *)addr,tmp);
+}
+
+static bool have_sse(void)
+{
+	unsigned int ax,bx,cx,dx;
+
+	if (!__get_cpuid(1,&ax,&bx,&cx,&dx))
+		return false;
+	return dx & bit_SSE;
+}
+
+#endif /* defined(__i386__) */
+
+typedef void (*write64_fn_t)(void *, __be64);
+
+/* This uses the STT_GNU_IFUNC extension to have the dynamic linker select the
+   best above implementations at runtime. */
+#if HAVE_FUNC_ATTRIBUTE_IFUNC
+void mmio_write64_be(void *addr, __be64 val)
+    __attribute__((ifunc("resolve_mmio_write64_be")));
+static write64_fn_t resolve_mmio_write64_be(void);
+#else
+__asm__(".type mmio_write64_be, %gnu_indirect_function");
+write64_fn_t resolve_mmio_write64_be(void) __asm__("mmio_write64_be");
+#endif
+
+write64_fn_t resolve_mmio_write64_be(void)
+{
+#if defined(__i386__)
+	if (have_sse())
+		return &sse_mmio_write64_be;
+#endif
+	return &pthread_mmio_write64_be;
+}
+
+#endif /* SIZEOF_LONG != 8 */
diff --git a/util/mmio.h b/util/mmio.h
new file mode 100644
index 00000000000000..0b89f5fcbe000e
--- /dev/null
+++ b/util/mmio.h
@@ -0,0 +1,214 @@
+/* GPLv2 or OpenIB.org BSD (MIT) See COPYING file
+
+   These accessors always map to PCI-E TLPs in predictable ways. Translation
+   to other buses should follow similar definitions.
+
+   write32(mem, 1)
+      Produce a 4 byte MemWr TLP with bit 0 of DW byte offset 0 set
+   write32_be(mem, htobe32(1))
+      Produce a 4 byte MemWr TLP with bit 0 of DW byte offset 3 set
+   write32_le(mem, htole32(1))
+      Produce a 4 byte MemWr TLP with bit 0 of DW byte offset 0 set
+
+   For ordering these accessors are similar to the Kernel's concept of
+   writel_relaxed(). When working with UC memory the following hold:
+
+   1) Strong ordering is required when talking to the same device (eg BAR),
+      and combining is not permitted:
+
+       write32(mem, 1);
+       write32(mem + 4, 1);
+       write32(mem, 1);
+
+      Must produce three TLPs, in order.
+
+   2) Ordering ignores all pthread locking:
+
+       pthread_spin_lock(&lock);
+       write32(mem, global++);
+       pthread_spin_unlock(&lock);
+
+      When run concurrently on all CPUs the device must observe all stores,
+      but the data value will not be strictly increasing.
+
+   3) Interaction with DMA is not ordered. Explicit use of a barrier from
+      udma_barriers is required:
+
+	*dma_mem = 1;
+	udma_to_device_barrier();
+	write32(mem, GO_DMA);
+
+   4) Access out of program order (eg speculation), either by the CPU or
+      compiler is not permitted:
+
+	if (cond)
+	   read32();
+
+      Must not issue a read TLP if cond is false.
+
+   If these are used with WC memory then #1 and #4 do not apply, and all WC
+   accesses must be bracketed with mmio_wc_start() // mmio_flush_writes()
+*/
+
+#ifndef __UTIL_MMIO_H
+#define __UTIL_MMIO_H
+
+#include <linux/types.h>
+#include <stdatomic.h>
+#include <stdint.h>
+#include <endian.h>
+
+#include <config.h>
+#include <util/compiler.h>
+
+/* The first step is to define the 'raw' accessors. To make this very safe
+   with sparse we define two versions of each, a le and a be - however the
+   code is always identical.
+*/
+#ifdef __s390x__
+#include <unistd.h>
+#include <sys/syscall.h>
+
+/* s390 requires a privileged instruction to access IO memory, these syscalls
+   perform that instruction using a memory buffer copy semantic.
+*/
+static inline void s390_mmio_write(void *mmio_addr, const void *val,
+				   size_t length)
+{
+	// FIXME: Check for error and call abort?
+	syscall(__NR_s390_pci_mmio_write, mmio_addr, val, length);
+}
+
+static inline void s390_mmio_read(const void *mmio_addr, void *val,
+				  size_t length)
+{
+	// FIXME: Check for error and call abort?
+	syscall(__NR_s390_pci_mmio_read, mmio_addr, val, length);
+}
+
+#define MAKE_WRITE(_NAME_, _SZ_)                                               \
+	static inline void _NAME_##_be(void *addr, __be##_SZ_ value)           \
+	{                                                                      \
+		s390_mmio_write(addr, &value, sizeof(value));                  \
+	}                                                                      \
+	static inline void _NAME_##_le(void *addr, __le##_SZ_ value)           \
+	{                                                                      \
+		s390_mmio_write(addr, &value, sizeof(value));                  \
+	}
+#define MAKE_READ(_NAME_, _SZ_)                                                \
+	static inline __be##_SZ_ _NAME_##_be(const void *addr)                 \
+	{                                                                      \
+		__be##_SZ_ res;                                                \
+		s390_mmio_read(addr, &res, sizeof(res));                       \
+		return res;                                                    \
+	}                                                                      \
+	static inline __le##_SZ_ _NAME_##_le(const void *addr)                 \
+	{                                                                      \
+		__le##_SZ_ res;                                                \
+		s390_mmio_read(addr, &res, sizeof(res));                       \
+		return res;                                                    \
+	}
+
+static inline void mmio_read8(void *addr, uint8_t value)
+{
+	s390_mmio_write(addr, &value, sizeof(value));
+}
+
+static inline uint8_t mmio_read8(const void *addr)
+{
+	uint8_t res;
+	s390_mmio_read(addr, &res, sizeof(res));
+	return res;
+}
+
+#else /* __s390x__ */
+
+#define MAKE_WRITE(_NAME_, _SZ_)                                               \
+	static inline void _NAME_##_be(void *addr, __be##_SZ_ value)           \
+	{                                                                      \
+		atomic_store_explicit((_Atomic(uint##_SZ_##_t) *)addr,         \
+				      (__force uint##_SZ_##_t)value,           \
+				      memory_order_relaxed);                   \
+	}                                                                      \
+	static inline void _NAME_##_le(void *addr, __le##_SZ_ value)           \
+	{                                                                      \
+		atomic_store_explicit((_Atomic(uint##_SZ_##_t) *)addr,         \
+				      (__force uint##_SZ_##_t)value,           \
+				      memory_order_relaxed);                   \
+	}
+#define MAKE_READ(_NAME_, _SZ_)                                                \
+	static inline __be##_SZ_ _NAME_##_be(const void *addr)                 \
+	{                                                                      \
+		return (__force __be##_SZ_)atomic_load_explicit(               \
+		    (_Atomic(uint##_SZ_##_t) *)addr, memory_order_relaxed);    \
+	}                                                                      \
+	static inline __le##_SZ_ _NAME_##_le(const void *addr)                 \
+	{                                                                      \
+		return (__force __le##_SZ_)atomic_load_explicit(               \
+		    (_Atomic(uint##_SZ_##_t) *)addr, memory_order_relaxed);    \
+	}
+
+static inline void mmio_write8(void *addr, uint8_t value)
+{
+	atomic_store_explicit((_Atomic(uint8_t) *)addr, value,
+			      memory_order_relaxed);
+}
+static inline uint8_t mmio_read8(const void *addr)
+{
+	return atomic_load_explicit((_Atomic(uint32_t) *)addr,
+				    memory_order_relaxed);
+}
+
+#endif /* __s390x__ */
+
+MAKE_WRITE(mmio_write16, 16)
+MAKE_WRITE(mmio_write32, 32)
+
+MAKE_READ(mmio_read16, 16)
+MAKE_READ(mmio_read32, 32)
+
+#if SIZEOF_LONG == 8
+MAKE_WRITE(mmio_write64, 64)
+MAKE_READ(mmio_read64, 64)
+#else
+void mmio_write64_be(void *addr, __be64 val);
+static inline void mmio_write64_le(void *addr, __le64 val)
+{
+	mmio_write64_be(addr, (__be64 __force)val);
+}
+
+/* There is no way to do read64 atomically, rather than provide some sketchy
+   implementation we leave these functions undefined, users should not call
+   them if SIZEOF_LONG != 8, but instead implement an appropriate version. */
+__be64 mmio_read64_be(const void *addr);
+__le64 mmio_read64_le(const void *addr);
+#endif /* SIZEOF_LONG == 8 */
+
+#undef MAKE_WRITE
+#undef MAKE_READ
+
+/* Now we can define the host endian versions of the operator, this just includes
+   a call to htole. */
+#define MAKE_WRITE(_NAME_, _SZ_)                                               \
+	static inline void _NAME_(void *addr, uint##_SZ_##_t value)            \
+	{                                                                      \
+		_NAME_##_le(addr, htole##_SZ_(value));                         \
+	}
+#define MAKE_READ(_NAME_, _SZ_)                                                \
+	static inline uint##_SZ_##_t _NAME_(const void *addr)                  \
+	{                                                                      \
+		return le##_SZ_##toh(_NAME_##_le(addr));                       \
+	}
+
+MAKE_WRITE(mmio_write16, 16)
+MAKE_WRITE(mmio_write32, 32)
+MAKE_WRITE(mmio_write64, 64)
+
+MAKE_READ(mmio_read16, 16)
+MAKE_READ(mmio_read32, 32)
+MAKE_READ(mmio_read64, 64)
+
+#undef MAKE_WRITE
+#undef MAKE_READ
+
+#endif
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH rdma-core 2/5] mlx4: Use util/mmio.h
       [not found] ` <1492123127-6266-1-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2017-04-13 22:38   ` [PATCH rdma-core 1/5] util: Add common mmio macros Jason Gunthorpe
@ 2017-04-13 22:38   ` Jason Gunthorpe
  2017-04-13 22:38   ` [PATCH rdma-core 3/5] mlx5: " Jason Gunthorpe
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Jason Gunthorpe @ 2017-04-13 22:38 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Yishai Hadas

Remove now duplicated mmio accessor macros.

This fixes a bug in mlx4_read_clock where a pointer was read twice without any
sort of barrier, resulting in mis-compilation. (eg the double read of clockhi
never worked)

Signed-off-by: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
---
 providers/mlx4/cq.c       | 14 +++++----
 providers/mlx4/doorbell.h | 70 ---------------------------------------------
 providers/mlx4/mlx4.c     |  1 -
 providers/mlx4/mlx4.h     |  1 -
 providers/mlx4/mmio.h     | 73 -----------------------------------------------
 providers/mlx4/qp.c       |  7 +++--
 providers/mlx4/srq.c      |  1 -
 providers/mlx4/verbs.c    |  9 +++---
 8 files changed, 17 insertions(+), 159 deletions(-)
 delete mode 100644 providers/mlx4/doorbell.h

diff --git a/providers/mlx4/cq.c b/providers/mlx4/cq.c
index f0a47ae374a689..8ebc2aeaf941bf 100644
--- a/providers/mlx4/cq.c
+++ b/providers/mlx4/cq.c
@@ -40,10 +40,10 @@
 #include <string.h>
 
 #include <util/compiler.h>
+#include <util/mmio.h>
 #include <infiniband/opcode.h>
 
 #include "mlx4.h"
-#include "doorbell.h"
 
 enum {
 	MLX4_CQ_DOORBELL			= 0x20
@@ -682,7 +682,7 @@ void mlx4_cq_fill_pfns(struct mlx4_cq *cq, const struct ibv_cq_init_attr_ex *cq_
 int mlx4_arm_cq(struct ibv_cq *ibvcq, int solicited)
 {
 	struct mlx4_cq *cq = to_mcq(ibvcq);
-	uint32_t doorbell[2];
+	uint64_t doorbell;
 	uint32_t sn;
 	uint32_t ci;
 	uint32_t cmd;
@@ -691,6 +691,10 @@ int mlx4_arm_cq(struct ibv_cq *ibvcq, int solicited)
 	ci  = cq->cons_index & 0xffffff;
 	cmd = solicited ? MLX4_CQ_DB_REQ_NOT_SOL : MLX4_CQ_DB_REQ_NOT;
 
+	doorbell = sn << 28 | cmd | cq->cqn;
+	doorbell <<= 32;
+	doorbell |= ci;
+
 	*cq->arm_db = htobe32(sn << 28 | cmd | ci);
 
 	/*
@@ -699,10 +703,8 @@ int mlx4_arm_cq(struct ibv_cq *ibvcq, int solicited)
 	 */
 	udma_to_device_barrier();
 
-	doorbell[0] = htobe32(sn << 28 | cmd | cq->cqn);
-	doorbell[1] = htobe32(ci);
-
-	mlx4_write64(doorbell, to_mctx(ibvcq->context), MLX4_CQ_DOORBELL);
+	mmio_write64_be(to_mctx(ibvcq->context)->uar + MLX4_CQ_DOORBELL,
+			htobe64(doorbell));
 
 	return 0;
 }
diff --git a/providers/mlx4/doorbell.h b/providers/mlx4/doorbell.h
deleted file mode 100644
index 140a6158d7f2ee..00000000000000
diff --git a/providers/mlx4/mlx4.c b/providers/mlx4/mlx4.c
index 229c2670b5ed85..b798b37f9a8f36 100644
--- a/providers/mlx4/mlx4.c
+++ b/providers/mlx4/mlx4.c
@@ -219,7 +219,6 @@ static int mlx4_init_context(struct verbs_device *v_device,
 		context->bf_buf_size = 0;
 	}
 
-	pthread_spin_init(&context->uar_lock, PTHREAD_PROCESS_PRIVATE);
 	ibv_ctx->ops = mlx4_ctx_ops;
 
 	context->hca_core_clock = NULL;
diff --git a/providers/mlx4/mlx4.h b/providers/mlx4/mlx4.h
index a14245976c01bf..b4f6e864722e00 100644
--- a/providers/mlx4/mlx4.h
+++ b/providers/mlx4/mlx4.h
@@ -127,7 +127,6 @@ struct mlx4_context {
 	struct ibv_context		ibv_ctx;
 
 	void			       *uar;
-	pthread_spinlock_t		uar_lock;
 
 	void			       *bf_page;
 	int				bf_buf_size;
diff --git a/providers/mlx4/mmio.h b/providers/mlx4/mmio.h
index a1a296658fdbb0..9821e85224dcfd 100644
--- a/providers/mlx4/mmio.h
+++ b/providers/mlx4/mmio.h
@@ -7,30 +7,6 @@
 #include <sys/syscall.h>
 #ifdef __s390x__
 
-static inline long mmio_writeb(const unsigned long mmio_addr,
-			       const uint8_t val)
-{
-	return syscall(__NR_s390_pci_mmio_write, mmio_addr, &val, sizeof(val));
-}
-
-static inline long mmio_writew(const unsigned long mmio_addr,
-			       const uint16_t val)
-{
-	return syscall(__NR_s390_pci_mmio_write, mmio_addr, &val, sizeof(val));
-}
-
-static inline long mmio_writel(const unsigned long mmio_addr,
-			       const uint32_t val)
-{
-	return syscall(__NR_s390_pci_mmio_write, mmio_addr, &val, sizeof(val));
-}
-
-static inline long mmio_writeq(const unsigned long mmio_addr,
-			       const uint64_t val)
-{
-	return syscall(__NR_s390_pci_mmio_write, mmio_addr, &val, sizeof(val));
-}
-
 static inline long mmio_write(const unsigned long mmio_addr,
 			      const void *val,
 			      const size_t length)
@@ -38,33 +14,6 @@ static inline long mmio_write(const unsigned long mmio_addr,
 	return syscall(__NR_s390_pci_mmio_write, mmio_addr, val, length);
 }
 
-static inline long mmio_readb(const unsigned long mmio_addr, uint8_t *val)
-{
-	return syscall(__NR_s390_pci_mmio_read, mmio_addr, val, sizeof(*val));
-}
-
-static inline long mmio_readw(const unsigned long mmio_addr, uint16_t *val)
-{
-	return syscall(__NR_s390_pci_mmio_read, mmio_addr, val, sizeof(*val));
-}
-
-static inline long mmio_readl(const unsigned long mmio_addr, uint32_t *val)
-{
-	return syscall(__NR_s390_pci_mmio_read, mmio_addr, val, sizeof(*val));
-}
-
-static inline long mmio_readq(const unsigned long mmio_addr, uint64_t *val)
-{
-	return syscall(__NR_s390_pci_mmio_read, mmio_addr, val, sizeof(*val));
-}
-
-static inline long mmio_read(const unsigned long mmio_addr,
-			     void *val,
-			     const size_t length)
-{
-	return syscall(__NR_s390_pci_mmio_read, mmio_addr, val, length);
-}
-
 static inline void mlx4_bf_copy(unsigned long *dst,
 				unsigned long *src,
 				unsigned bytecnt)
@@ -74,28 +23,6 @@ static inline void mlx4_bf_copy(unsigned long *dst,
 
 #else
 
-#define mmio_writeb(addr, value) \
-	(*((volatile uint8_t *)addr) = value)
-#define mmio_writew(addr, value) \
-	(*((volatile uint16_t *)addr) = value)
-#define mmio_writel(addr, value) \
-	(*((volatile uint32_t *)addr) = value)
-#define mmio_writeq(addr, value) \
-	(*((volatile uint64_t *)addr) = value)
-#define mmio_write(addr, value, length) \
-	memcpy(addr, value, length)
-
-#define mmio_readb(addr, value) \
-	(value = *((volatile uint8_t *)addr))
-#define mmio_readw(addr, value) \
-	(value = *((volatile uint16_t *)addr))
-#define mmio_readl(addr, value) \
-	(value = *((volatile uint32_t *)addr))
-#define mmio_readq(addr, value) \
-	(value = *((volatile uint64_t *)addr))
-#define mmio_read(addr, value, length) \
-	memcpy(value, addr, length)
-
 /*
  * Avoid using memcpy() to copy to BlueFlame page, since memcpy()
  * implementations may use move-string-buffer assembler instructions,
diff --git a/providers/mlx4/qp.c b/providers/mlx4/qp.c
index a8eb8e295edeb2..423f59533de68d 100644
--- a/providers/mlx4/qp.c
+++ b/providers/mlx4/qp.c
@@ -38,11 +38,12 @@
 #include <pthread.h>
 #include <string.h>
 #include <errno.h>
+#include <util/mmio.h>
 #include <util/compiler.h>
 
 #include "mlx4.h"
-#include "doorbell.h"
 #include "wqe.h"
+#include "mmio.h"
 
 static const uint32_t mlx4_ib_opcode[] = {
 	[IBV_WR_SEND]			= MLX4_OPCODE_SEND,
@@ -497,8 +498,8 @@ out:
 		 */
 		udma_to_device_barrier();
 
-		mmio_writel((unsigned long)(ctx->uar + MLX4_SEND_DOORBELL),
-			    qp->doorbell_qpn);
+		mmio_write32_be(ctx->uar + MLX4_SEND_DOORBELL,
+				qp->doorbell_qpn);
 	}
 
 	if (nreq)
diff --git a/providers/mlx4/srq.c b/providers/mlx4/srq.c
index b8d25bb343da30..f30cc2e44a0e9e 100644
--- a/providers/mlx4/srq.c
+++ b/providers/mlx4/srq.c
@@ -37,7 +37,6 @@
 #include <string.h>
 
 #include "mlx4.h"
-#include "doorbell.h"
 #include "wqe.h"
 #include "mlx4-abi.h"
 
diff --git a/providers/mlx4/verbs.c b/providers/mlx4/verbs.c
index 97816a9814999d..80efd9ac4b2426 100644
--- a/providers/mlx4/verbs.c
+++ b/providers/mlx4/verbs.c
@@ -39,6 +39,8 @@
 #include <pthread.h>
 #include <errno.h>
 
+#include <util/mmio.h>
+
 #include "mlx4.h"
 #include "mlx4-abi.h"
 #include "wqe.h"
@@ -101,7 +103,6 @@ int mlx4_query_device_ex(struct ibv_context *context,
 	return 0;
 }
 
-#define READL(ptr) (*((uint32_t *)(ptr)))
 static int mlx4_read_clock(struct ibv_context *context, uint64_t *cycles)
 {
 	unsigned int clockhi, clocklo, clockhi1;
@@ -113,9 +114,9 @@ static int mlx4_read_clock(struct ibv_context *context, uint64_t *cycles)
 
 	/* Handle wraparound */
 	for (i = 0; i < 2; i++) {
-		clockhi = be32toh(READL(ctx->hca_core_clock));
-		clocklo = be32toh(READL(ctx->hca_core_clock + 4));
-		clockhi1 = be32toh(READL(ctx->hca_core_clock));
+		clockhi = be32toh(mmio_read32_be(ctx->hca_core_clock));
+		clocklo = be32toh(mmio_read32_be(ctx->hca_core_clock + 4));
+		clockhi1 = be32toh(mmio_read32_be(ctx->hca_core_clock));
 		if (clockhi == clockhi1)
 			break;
 	}
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH rdma-core 3/5] mlx5: Use util/mmio.h
       [not found] ` <1492123127-6266-1-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2017-04-13 22:38   ` [PATCH rdma-core 1/5] util: Add common mmio macros Jason Gunthorpe
  2017-04-13 22:38   ` [PATCH rdma-core 2/5] mlx4: Use util/mmio.h Jason Gunthorpe
@ 2017-04-13 22:38   ` Jason Gunthorpe
  2017-04-13 22:38   ` [PATCH rdma-core 4/5] mthca: " Jason Gunthorpe
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Jason Gunthorpe @ 2017-04-13 22:38 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Yishai Hadas

Remove now duplicated mmio accessor macros.

This fixes a bug in mlx5_read_clock where a pointer was read twice without any
sort of barrier, resulting in mis-compilation. (eg the double read of clockhi
never worked)

Signed-off-by: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
---
 providers/mlx5/cq.c       | 13 ++++-----
 providers/mlx5/doorbell.h | 67 -----------------------------------------------
 providers/mlx5/mlx5.c     |  2 --
 providers/mlx5/mlx5.h     |  1 -
 providers/mlx5/qp.c       |  5 ++--
 providers/mlx5/srq.c      |  1 -
 providers/mlx5/verbs.c    |  9 ++++---
 7 files changed, 14 insertions(+), 84 deletions(-)
 delete mode 100644 providers/mlx5/doorbell.h

diff --git a/providers/mlx5/cq.c b/providers/mlx5/cq.c
index 2e8b584b0930fd..0a1dd55896e775 100644
--- a/providers/mlx5/cq.c
+++ b/providers/mlx5/cq.c
@@ -40,11 +40,11 @@
 #include <unistd.h>
 
 #include <util/compiler.h>
+#include <util/mmio.h>
 #include <infiniband/opcode.h>
 
 #include "mlx5.h"
 #include "wqe.h"
-#include "doorbell.h"
 
 enum {
 	CQ_OK					=  0,
@@ -1285,7 +1285,7 @@ int mlx5_arm_cq(struct ibv_cq *ibvcq, int solicited)
 {
 	struct mlx5_cq *cq = to_mcq(ibvcq);
 	struct mlx5_context *ctx = to_mctx(ibvcq->context);
-	uint32_t doorbell[2];
+	uint64_t doorbell;
 	uint32_t sn;
 	uint32_t ci;
 	uint32_t cmd;
@@ -1294,6 +1294,10 @@ int mlx5_arm_cq(struct ibv_cq *ibvcq, int solicited)
 	ci  = cq->cons_index & 0xffffff;
 	cmd = solicited ? MLX5_CQ_DB_REQ_NOT_SOL : MLX5_CQ_DB_REQ_NOT;
 
+	doorbell = sn << 28 | cmd | ci;
+	doorbell <<= 32;
+	doorbell |= cq->cqn;
+
 	cq->dbrec[MLX5_CQ_ARM_DB] = htobe32(sn << 28 | cmd | ci);
 
 	/*
@@ -1302,10 +1306,7 @@ int mlx5_arm_cq(struct ibv_cq *ibvcq, int solicited)
 	 */
 	mmio_wc_start();
 
-	doorbell[0] = htobe32(sn << 28 | cmd | ci);
-	doorbell[1] = htobe32(cq->cqn);
-
-	mlx5_write64(doorbell, ctx->uar[0] + MLX5_CQ_DOORBELL, &ctx->lock32);
+	mmio_write64_be(ctx->uar[0] + MLX5_CQ_DOORBELL, htobe64(doorbell));
 
 	mmio_flush_writes();
 
diff --git a/providers/mlx5/doorbell.h b/providers/mlx5/doorbell.h
deleted file mode 100644
index 2d5ede4604d398..00000000000000
diff --git a/providers/mlx5/mlx5.c b/providers/mlx5/mlx5.c
index 30f165b0280559..88a808fb045a1a 100644
--- a/providers/mlx5/mlx5.c
+++ b/providers/mlx5/mlx5.c
@@ -891,8 +891,6 @@ static int mlx5_init_context(struct verbs_device *vdev,
 		mlx5_map_internal_clock(mdev, ctx);
 	}
 
-	mlx5_spinlock_init(&context->lock32);
-
 	context->prefer_bf = get_always_bf();
 	context->shut_up_bf = get_shut_up_bf();
 	mlx5_read_env(&vdev->device, context);
diff --git a/providers/mlx5/mlx5.h b/providers/mlx5/mlx5.h
index 0de40a809ffbee..615dea38e4fedd 100644
--- a/providers/mlx5/mlx5.h
+++ b/providers/mlx5/mlx5.h
@@ -236,7 +236,6 @@ struct mlx5_context {
 	pthread_mutex_t                 uidx_table_mutex;
 
 	void			       *uar[MLX5_MAX_UARS];
-	struct mlx5_spinlock		lock32;
 	struct mlx5_db_page	       *db_list;
 	pthread_mutex_t			db_list_mutex;
 	int				cache_line_size;
diff --git a/providers/mlx5/qp.c b/providers/mlx5/qp.c
index 1d5a2f9238cfe9..7f67a0b61b221f 100644
--- a/providers/mlx5/qp.c
+++ b/providers/mlx5/qp.c
@@ -37,10 +37,10 @@
 #include <string.h>
 #include <errno.h>
 #include <stdio.h>
+#include <util/mmio.h>
 #include <util/compiler.h>
 
 #include "mlx5.h"
-#include "doorbell.h"
 #include "wqe.h"
 
 #define MLX5_ATOMIC_SIZE 8
@@ -942,8 +942,7 @@ out:
 			mlx5_bf_copy(bf->reg + bf->offset, (unsigned long long *)ctrl,
 				     align(size * 16, 64), qp);
 		else
-			mlx5_write64((__be32 *)ctrl, bf->reg + bf->offset,
-				     &ctx->lock32);
+			mmio_write64_be(bf->reg + bf->offset, *(__be64 *)ctrl);
 
 		/*
 		 * use mmio_flush_writes() to ensure write combining buffers are flushed out
diff --git a/providers/mlx5/srq.c b/providers/mlx5/srq.c
index 202fa87aceef59..94528bba94d232 100644
--- a/providers/mlx5/srq.c
+++ b/providers/mlx5/srq.c
@@ -38,7 +38,6 @@
 #include <errno.h>
 
 #include "mlx5.h"
-#include "doorbell.h"
 #include "wqe.h"
 
 static void *get_wqe(struct mlx5_srq *srq, int n)
diff --git a/providers/mlx5/verbs.c b/providers/mlx5/verbs.c
index f0e4aabb0dbcef..4fc186e48847c7 100644
--- a/providers/mlx5/verbs.c
+++ b/providers/mlx5/verbs.c
@@ -45,6 +45,8 @@
 #include <sys/mman.h>
 
 #include <util/compiler.h>
+#include <util/mmio.h>
+
 #include "mlx5.h"
 #include "mlx5-abi.h"
 #include "wqe.h"
@@ -77,7 +79,6 @@ int mlx5_query_device(struct ibv_context *context, struct ibv_device_attr *attr)
 	return 0;
 }
 
-#define READL(ptr) (*((uint32_t *)(ptr)))
 static int mlx5_read_clock(struct ibv_context *context, uint64_t *cycles)
 {
 	unsigned int clockhi, clocklo, clockhi1;
@@ -89,9 +90,9 @@ static int mlx5_read_clock(struct ibv_context *context, uint64_t *cycles)
 
 	/* Handle wraparound */
 	for (i = 0; i < 2; i++) {
-		clockhi = be32toh(READL(ctx->hca_core_clock));
-		clocklo = be32toh(READL(ctx->hca_core_clock + 4));
-		clockhi1 = be32toh(READL(ctx->hca_core_clock));
+		clockhi = be32toh(mmio_read32_be(ctx->hca_core_clock));
+		clocklo = be32toh(mmio_read32_be(ctx->hca_core_clock + 4));
+		clockhi1 = be32toh(mmio_read32_be(ctx->hca_core_clock));
 		if (clockhi == clockhi1)
 			break;
 	}
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH rdma-core 4/5] mthca: Use util/mmio.h
       [not found] ` <1492123127-6266-1-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
                     ` (2 preceding siblings ...)
  2017-04-13 22:38   ` [PATCH rdma-core 3/5] mlx5: " Jason Gunthorpe
@ 2017-04-13 22:38   ` Jason Gunthorpe
  2017-04-13 22:38   ` [PATCH rdma-core 5/5] Add mmio_memcpy_x64 Jason Gunthorpe
  2017-04-14  7:18   ` [PATCH rdma-core 0/5] Common MMIO accessors for rdma-core Majd Dibbiny
  5 siblings, 0 replies; 16+ messages in thread
From: Jason Gunthorpe @ 2017-04-13 22:38 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Vladimir Sokolovsky

Remove now duplicated mmio accessor macros.

In this driver we keep the weird uint32_t array since there are so
many places that use it.

Signed-off-by: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
---
 providers/mthca/cq.c       |  38 +++++++---------
 providers/mthca/doorbell.h | 109 +++------------------------------------------
 providers/mthca/qp.c       |  44 ++++++++++--------
 providers/mthca/srq.c      |  14 +++---
 4 files changed, 55 insertions(+), 150 deletions(-)

diff --git a/providers/mthca/cq.c b/providers/mthca/cq.c
index 8d30e24c83a8e2..68550410f349af 100644
--- a/providers/mthca/cq.c
+++ b/providers/mthca/cq.c
@@ -154,10 +154,10 @@ static inline void update_cons_index(struct mthca_cq *cq, int incr)
 		*cq->set_ci_db = htobe32(cq->cons_index);
 		mmio_ordered_writes_hack();
 	} else {
-		doorbell[0] = htobe32(MTHCA_TAVOR_CQ_DB_INC_CI | cq->cqn);
-		doorbell[1] = htobe32(incr - 1);
+		doorbell[0] = MTHCA_TAVOR_CQ_DB_INC_CI | cq->cqn;
+		doorbell[1] = incr - 1;
 
-		mthca_write64(doorbell, to_mctx(cq->ibv_cq.context), MTHCA_CQ_DOORBELL);
+		mthca_write64(doorbell, to_mctx(cq->ibv_cq.context)->uar + MTHCA_CQ_DOORBELL);
 	}
 }
 
@@ -485,13 +485,12 @@ int mthca_tavor_arm_cq(struct ibv_cq *cq, int solicited)
 {
 	uint32_t doorbell[2];
 
-	doorbell[0] = htobe32((solicited ?
-			     MTHCA_TAVOR_CQ_DB_REQ_NOT_SOL :
-			     MTHCA_TAVOR_CQ_DB_REQ_NOT)      |
-			    to_mcq(cq)->cqn);
+	doorbell[0] = (solicited ? MTHCA_TAVOR_CQ_DB_REQ_NOT_SOL
+				 : MTHCA_TAVOR_CQ_DB_REQ_NOT) |
+		      to_mcq(cq)->cqn;
 	doorbell[1] = 0xffffffff;
 
-	mthca_write64(doorbell, to_mctx(cq->context), MTHCA_CQ_DOORBELL);
+	mthca_write64(doorbell, to_mctx(cq->context)->uar + MTHCA_CQ_DOORBELL);
 
 	return 0;
 }
@@ -501,16 +500,14 @@ int mthca_arbel_arm_cq(struct ibv_cq *ibvcq, int solicited)
 	struct mthca_cq *cq = to_mcq(ibvcq);
 	uint32_t doorbell[2];
 	uint32_t sn;
-	uint32_t ci;
 
 	sn = cq->arm_sn & 3;
-	ci = htobe32(cq->cons_index);
 
-	doorbell[0] = ci;
-	doorbell[1] = htobe32((cq->cqn << 8) | (2 << 5) | (sn << 3) |
-			    (solicited ? 1 : 2));
+	doorbell[0] = cq->cons_index;
+	doorbell[1] =
+	    (cq->cqn << 8) | (2 << 5) | (sn << 3) | (solicited ? 1 : 2);
 
-	mthca_write_db_rec(doorbell, cq->arm_db);
+	mthca_write64(doorbell, cq->arm_db);
 
 	/*
 	 * Make sure that the doorbell record in host memory is
@@ -518,14 +515,13 @@ int mthca_arbel_arm_cq(struct ibv_cq *ibvcq, int solicited)
 	 */
 	udma_to_device_barrier();
 
-	doorbell[0] = htobe32((sn << 28)                       |
-			    (solicited ?
-			     MTHCA_ARBEL_CQ_DB_REQ_NOT_SOL :
-			     MTHCA_ARBEL_CQ_DB_REQ_NOT)      |
-			    cq->cqn);
-	doorbell[1] = ci;
+	doorbell[0] = (sn << 28) | (solicited ? MTHCA_ARBEL_CQ_DB_REQ_NOT_SOL
+					      : MTHCA_ARBEL_CQ_DB_REQ_NOT) |
+		      cq->cqn;
+	doorbell[1] = cq->cons_index;
 
-	mthca_write64(doorbell, to_mctx(ibvcq->context), MTHCA_CQ_DOORBELL);
+	mthca_write64(doorbell,
+		      to_mctx(ibvcq->context)->uar + MTHCA_CQ_DOORBELL);
 
 	return 0;
 }
diff --git a/providers/mthca/doorbell.h b/providers/mthca/doorbell.h
index 32829a4d1c967e..d2411ea040d8d4 100644
--- a/providers/mthca/doorbell.h
+++ b/providers/mthca/doorbell.h
@@ -1,113 +1,14 @@
-/*
- * Copyright (c) 2004, 2005 Topspin Communications.  All rights reserved.
- *
- * This software is available to you under a choice of one of two
- * licenses.  You may choose to be licensed under the terms of the GNU
- * General Public License (GPL) Version 2, available from the file
- * COPYING in the main directory of this source tree, or the
- * OpenIB.org BSD license below:
- *
- *     Redistribution and use in source and binary forms, with or
- *     without modification, are permitted provided that the following
- *     conditions are met:
- *
- *      - Redistributions of source code must retain the above
- *        copyright notice, this list of conditions and the following
- *        disclaimer.
- *
- *      - Redistributions in binary form must reproduce the above
- *        copyright notice, this list of conditions and the following
- *        disclaimer in the documentation and/or other materials
- *        provided with the distribution.
- *
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
- * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
- * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
- * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
- * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
- * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
- * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
- * SOFTWARE.
- */
-
+/* GPLv2 or OpenIB.org BSD (MIT) See COPYING file */
 #ifndef DOORBELL_H
 #define DOORBELL_H
 
-#include <stdint.h>
-#include <pthread.h>
+#include <util/mmio.h>
 #include "mthca.h"
 
-struct mthca_context;
-
-#ifdef __i386__
-
-static inline void mthca_write64(uint32_t val[2], struct mthca_context *ctx, int offset)
+static inline void mthca_write64(uint32_t val[2], void *reg)
 {
-	/* i386 stack is aligned to 8 bytes, so this should be OK: */
-	uint8_t xmmsave[8] __attribute__((aligned(8)));
-
-	asm volatile (
-		"movlps %%xmm0,(%0); \n\t"
-		"movlps (%1),%%xmm0; \n\t"
-		"movlps %%xmm0,(%2); \n\t"
-		"movlps (%0),%%xmm0; \n\t"
-		:
-		: "r" (xmmsave), "r" (val), "r" (ctx->uar + offset)
-		: "memory" );
+	uint64_t doorbell = (((uint64_t)val[0]) << 32) | val[1];
+	mmio_write64_be(reg, htobe64(doorbell));
 }
 
-static inline void mthca_write_db_rec(uint32_t val[2], uint32_t *db)
-{
-	/* i386 stack is aligned to 8 bytes, so this should be OK: */
-	uint8_t xmmsave[8] __attribute__((aligned(8)));
-
-	asm volatile (
-		"movlps %%xmm0,(%0); \n\t"
-		"movlps (%1),%%xmm0; \n\t"
-		"movlps %%xmm0,(%2); \n\t"
-		"movlps (%0),%%xmm0; \n\t"
-		:
-		: "r" (xmmsave), "r" (val), "r" (db)
-		: "memory" );
-}
-
-#elif SIZEOF_LONG == 8
-
-#if __BYTE_ORDER == __LITTLE_ENDIAN
-#  define MTHCA_PAIR_TO_64(val) ((uint64_t) val[1] << 32 | val[0])
-#elif __BYTE_ORDER == __BIG_ENDIAN
-#  define MTHCA_PAIR_TO_64(val) ((uint64_t) val[0] << 32 | val[1])
-#else
-#  error __BYTE_ORDER not defined
 #endif
-
-static inline void mthca_write64(uint32_t val[2], struct mthca_context *ctx, int offset)
-{
-	*(volatile uint64_t *) (ctx->uar + offset) = MTHCA_PAIR_TO_64(val);
-}
-
-static inline void mthca_write_db_rec(uint32_t val[2], uint32_t *db)
-{
-	*(volatile uint64_t *) db = MTHCA_PAIR_TO_64(val);
-}
-
-#else
-
-static inline void mthca_write64(uint32_t val[2], struct mthca_context *ctx, int offset)
-{
-	pthread_spin_lock(&ctx->uar_lock);
-	*(volatile uint32_t *) (ctx->uar + offset)     = val[0];
-	*(volatile uint32_t *) (ctx->uar + offset + 4) = val[1];
-	pthread_spin_unlock(&ctx->uar_lock);
-}
-
-static inline void mthca_write_db_rec(uint32_t val[2], uint32_t *db)
-{
-	*(volatile uint32_t *) db       = val[0];
-	mmio_ordered_writes_hack();
-	*(volatile uint32_t *) (db + 1) = val[1];
-}
-
-#endif
-
-#endif /* MTHCA_H */
diff --git a/providers/mthca/qp.c b/providers/mthca/qp.c
index 52850a4a9daa8a..1907f2b82d6987 100644
--- a/providers/mthca/qp.c
+++ b/providers/mthca/qp.c
@@ -310,12 +310,14 @@ out:
 	if (nreq) {
 		uint32_t doorbell[2];
 
-		doorbell[0] = htobe32(((qp->sq.next_ind << qp->sq.wqe_shift) +
-				     qp->send_wqe_offset) | f0 | op0);
-		doorbell[1] = htobe32((ibqp->qp_num << 8) | size0);
+		doorbell[0] = ((qp->sq.next_ind << qp->sq.wqe_shift) +
+			       qp->send_wqe_offset) |
+			      f0 | op0;
+		doorbell[1] = (ibqp->qp_num << 8) | size0;
 
 		udma_to_device_barrier();
-		mthca_write64(doorbell, to_mctx(ibqp->context), MTHCA_SEND_DOORBELL);
+		mthca_write64(doorbell, to_mctx(ibqp->context)->uar +
+					    MTHCA_SEND_DOORBELL);
 	}
 
 	qp->sq.next_ind = ind;
@@ -395,8 +397,9 @@ int mthca_tavor_post_recv(struct ibv_qp *ibqp, struct ibv_recv_wr *wr,
 		if (nreq == MTHCA_TAVOR_MAX_WQES_PER_RECV_DB) {
 			nreq = 0;
 
-			doorbell[0] = htobe32((qp->rq.next_ind << qp->rq.wqe_shift) | size0);
-			doorbell[1] = htobe32(ibqp->qp_num << 8);
+			doorbell[0] =
+			    (qp->rq.next_ind << qp->rq.wqe_shift) | size0;
+			doorbell[1] = ibqp->qp_num << 8;
 
 			/*
 			 * Make sure that descriptors are written
@@ -404,7 +407,8 @@ int mthca_tavor_post_recv(struct ibv_qp *ibqp, struct ibv_recv_wr *wr,
 			 */
 			udma_to_device_barrier();
 
-			mthca_write64(doorbell, to_mctx(ibqp->context), MTHCA_RECV_DOORBELL);
+			mthca_write64(doorbell, to_mctx(ibqp->context)->uar +
+						    MTHCA_RECV_DOORBELL);
 
 			qp->rq.next_ind = ind;
 			qp->rq.head += MTHCA_TAVOR_MAX_WQES_PER_RECV_DB;
@@ -414,8 +418,8 @@ int mthca_tavor_post_recv(struct ibv_qp *ibqp, struct ibv_recv_wr *wr,
 
 out:
 	if (nreq) {
-		doorbell[0] = htobe32((qp->rq.next_ind << qp->rq.wqe_shift) | size0);
-		doorbell[1] = htobe32((ibqp->qp_num << 8) | nreq);
+		doorbell[0] = (qp->rq.next_ind << qp->rq.wqe_shift) | size0;
+		doorbell[1] = (ibqp->qp_num << 8) | nreq;
 
 		/*
 		 * Make sure that descriptors are written before
@@ -423,7 +427,8 @@ out:
 		 */
 		udma_to_device_barrier();
 
-		mthca_write64(doorbell, to_mctx(ibqp->context), MTHCA_RECV_DOORBELL);
+		mthca_write64(doorbell, to_mctx(ibqp->context)->uar +
+					    MTHCA_RECV_DOORBELL);
 	}
 
 	qp->rq.next_ind = ind;
@@ -458,9 +463,9 @@ int mthca_arbel_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr,
 		if (nreq == MTHCA_ARBEL_MAX_WQES_PER_SEND_DB) {
 			nreq = 0;
 
-			doorbell[0] = htobe32((MTHCA_ARBEL_MAX_WQES_PER_SEND_DB << 24) |
-					    ((qp->sq.head & 0xffff) << 8) | f0 | op0);
-			doorbell[1] = htobe32((ibqp->qp_num << 8) | size0);
+			doorbell[0] = (MTHCA_ARBEL_MAX_WQES_PER_SEND_DB << 24) |
+				      ((qp->sq.head & 0xffff) << 8) | f0 | op0;
+			doorbell[1] = (ibqp->qp_num << 8) | size0;
 
 			qp->sq.head += MTHCA_ARBEL_MAX_WQES_PER_SEND_DB;
 
@@ -476,7 +481,8 @@ int mthca_arbel_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr,
 			 * write MMIO send doorbell.
 			 */
 			mmio_ordered_writes_hack();
-			mthca_write64(doorbell, to_mctx(ibqp->context), MTHCA_SEND_DOORBELL);
+			mthca_write64(doorbell, to_mctx(ibqp->context)->uar +
+						    MTHCA_SEND_DOORBELL);
 
 			size0 = 0;
 		}
@@ -665,10 +671,9 @@ int mthca_arbel_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr,
 
 out:
 	if (nreq) {
-		doorbell[0] = htobe32((nreq << 24)                  |
-				    ((qp->sq.head & 0xffff) << 8) |
-				    f0 | op0);
-		doorbell[1] = htobe32((ibqp->qp_num << 8) | size0);
+		doorbell[0] =
+		    (nreq << 24) | ((qp->sq.head & 0xffff) << 8) | f0 | op0;
+		doorbell[1] = (ibqp->qp_num << 8) | size0;
 
 		qp->sq.head += nreq;
 
@@ -684,7 +689,8 @@ out:
 		 * write MMIO send doorbell.
 		 */
 		mmio_ordered_writes_hack();
-		mthca_write64(doorbell, to_mctx(ibqp->context), MTHCA_SEND_DOORBELL);
+		mthca_write64(doorbell, to_mctx(ibqp->context)->uar +
+					    MTHCA_SEND_DOORBELL);
 	}
 
 	pthread_spin_unlock(&qp->sq.lock);
diff --git a/providers/mthca/srq.c b/providers/mthca/srq.c
index 9abf95b15903f3..ad68961341b053 100644
--- a/providers/mthca/srq.c
+++ b/providers/mthca/srq.c
@@ -145,8 +145,8 @@ int mthca_tavor_post_srq_recv(struct ibv_srq *ibsrq,
 		if (++nreq == MTHCA_TAVOR_MAX_WQES_PER_RECV_DB) {
 			nreq = 0;
 
-			doorbell[0] = htobe32(first_ind << srq->wqe_shift);
-			doorbell[1] = htobe32(srq->srqn << 8);
+			doorbell[0] = first_ind << srq->wqe_shift;
+			doorbell[1] = srq->srqn << 8;
 
 			/*
 			 * Make sure that descriptors are written
@@ -154,15 +154,16 @@ int mthca_tavor_post_srq_recv(struct ibv_srq *ibsrq,
 			 */
 			udma_to_device_barrier();
 
-			mthca_write64(doorbell, to_mctx(ibsrq->context), MTHCA_RECV_DOORBELL);
+			mthca_write64(doorbell, to_mctx(ibsrq->context)->uar +
+						    MTHCA_RECV_DOORBELL);
 
 			first_ind = srq->first_free;
 		}
 	}
 
 	if (nreq) {
-		doorbell[0] = htobe32(first_ind << srq->wqe_shift);
-		doorbell[1] = htobe32((srq->srqn << 8) | nreq);
+		doorbell[0] = first_ind << srq->wqe_shift;
+		doorbell[1] = (srq->srqn << 8) | nreq;
 
 		/*
 		 * Make sure that descriptors are written before
@@ -170,7 +171,8 @@ int mthca_tavor_post_srq_recv(struct ibv_srq *ibsrq,
 		 */
 		udma_to_device_barrier();
 
-		mthca_write64(doorbell, to_mctx(ibsrq->context), MTHCA_RECV_DOORBELL);
+		mthca_write64(doorbell, to_mctx(ibsrq->context)->uar +
+					    MTHCA_RECV_DOORBELL);
 	}
 
 	pthread_spin_unlock(&srq->lock);
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH rdma-core 5/5] Add mmio_memcpy_x64
       [not found] ` <1492123127-6266-1-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
                     ` (3 preceding siblings ...)
  2017-04-13 22:38   ` [PATCH rdma-core 4/5] mthca: " Jason Gunthorpe
@ 2017-04-13 22:38   ` Jason Gunthorpe
       [not found]     ` <1492123127-6266-6-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2017-04-14  7:18   ` [PATCH rdma-core 0/5] Common MMIO accessors for rdma-core Majd Dibbiny
  5 siblings, 1 reply; 16+ messages in thread
From: Jason Gunthorpe @ 2017-04-13 22:38 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Yishai Hadas

This pattern is common in a couple of drivers, and needs the s390
syscall.

The common version properly handles 32 bit and prevents reordering of
the stores, which is the stated reason for this to exist. It is also
slightly more optimized, since it assumes a non-zero transfer length.

Signed-off-by: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
---
 providers/mlx4/mmio.h | 43 -----------------------------------------
 providers/mlx4/qp.c   |  5 ++---
 providers/mlx5/qp.c   | 17 ++++++-----------
 util/mmio.h           | 53 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 60 insertions(+), 58 deletions(-)
 delete mode 100644 providers/mlx4/mmio.h

diff --git a/providers/mlx4/mmio.h b/providers/mlx4/mmio.h
deleted file mode 100644
index 9821e85224dcfd..00000000000000
diff --git a/providers/mlx4/qp.c b/providers/mlx4/qp.c
index 423f59533de68d..e7f10b9f1524d5 100644
--- a/providers/mlx4/qp.c
+++ b/providers/mlx4/qp.c
@@ -43,7 +43,6 @@
 
 #include "mlx4.h"
 #include "wqe.h"
-#include "mmio.h"
 
 static const uint32_t mlx4_ib_opcode[] = {
 	[IBV_WR_SEND]			= MLX4_OPCODE_SEND,
@@ -481,8 +480,8 @@ out:
 		 */
 		mmio_wc_spinlock(&ctx->bf_lock);
 
-		mlx4_bf_copy(ctx->bf_page + ctx->bf_offset, (unsigned long *) ctrl,
-			     align(size * 16, 64));
+		mmio_memcpy_x64(ctx->bf_page + ctx->bf_offset, ctrl,
+				align(size * 16, 64));
 		/* Flush before toggling bf_offset to be latency oriented */
 		mmio_flush_writes();
 
diff --git a/providers/mlx5/qp.c b/providers/mlx5/qp.c
index 7f67a0b61b221f..c4789bf0d909a4 100644
--- a/providers/mlx5/qp.c
+++ b/providers/mlx5/qp.c
@@ -239,19 +239,14 @@ static void set_data_ptr_seg_atomic(struct mlx5_wqe_data_seg *dseg,
 static void mlx5_bf_copy(unsigned long long *dst, unsigned long long *src,
 			 unsigned bytecnt, struct mlx5_qp *qp)
 {
-	while (bytecnt > 0) {
-		*dst++ = *src++;
-		*dst++ = *src++;
-		*dst++ = *src++;
-		*dst++ = *src++;
-		*dst++ = *src++;
-		*dst++ = *src++;
-		*dst++ = *src++;
-		*dst++ = *src++;
-		bytecnt -= 8 * sizeof(unsigned long long);
+	do {
+		mmio_memcpy_x64(dst, src, 64);
+		bytecnt -= 64;
+		dst += 8;
+		src += 8;
 		if (unlikely(src == qp->sq.qend))
 			src = qp->sq_start;
-	}
+	} while (bytecnt > 0);
 }
 
 static uint32_t send_ieth(struct ibv_send_wr *wr)
diff --git a/util/mmio.h b/util/mmio.h
index 0b89f5fcbe000e..1d45d6d6364d4e 100644
--- a/util/mmio.h
+++ b/util/mmio.h
@@ -56,6 +56,7 @@
 #include <linux/types.h>
 #include <stdatomic.h>
 #include <stdint.h>
+#include <stddef.h>
 #include <endian.h>
 
 #include <config.h>
@@ -158,7 +159,6 @@ static inline uint8_t mmio_read8(const void *addr)
 	return atomic_load_explicit((_Atomic(uint32_t) *)addr,
 				    memory_order_relaxed);
 }
-
 #endif /* __s390x__ */
 
 MAKE_WRITE(mmio_write16, 16)
@@ -200,6 +200,57 @@ __le64 mmio_read64_le(const void *addr);
 		return le##_SZ_##toh(_NAME_##_le(addr));                       \
 	}
 
+/* This strictly guarantees the order of TLP generation for the memory copy to
+   be in ascending address order.
+*/
+#ifdef __s390x__
+static inline void mmio_memcpy_x64(void *dest, const void *src, size_t bytecnt)
+{
+	s390_mmio_write(addr, src, bytecnt);
+}
+#else
+
+/* Transfer is some multiple of 64 bytes */
+static inline void mmio_memcpy_x64(void *dest, const void *src, size_t bytecnt)
+{
+	uintptr_t *dst_p = dest;
+
+	/* Caller must guarantee:
+	    assert(bytecnt != 0);
+	    assert((bytecnt % 64) == 0);
+	    assert(((uintptr_t)dest) % __alignof__(*dst) == 0);
+	    assert(((uintptr_t)src) % __alignof__(*dst) == 0);
+	*/
+
+	/* Use the native word size for the copy */
+	if (sizeof(*dst_p) == 8) {
+		const __be64 *src_p = src;
+
+		do {
+			/* Do 64 bytes at a time */
+			mmio_write64_be(dst_p++, *src_p++);
+			mmio_write64_be(dst_p++, *src_p++);
+			mmio_write64_be(dst_p++, *src_p++);
+			mmio_write64_be(dst_p++, *src_p++);
+			mmio_write64_be(dst_p++, *src_p++);
+			mmio_write64_be(dst_p++, *src_p++);
+			mmio_write64_be(dst_p++, *src_p++);
+			mmio_write64_be(dst_p++, *src_p++);
+
+			bytecnt -= 8 * sizeof(*dst_p);
+		} while (bytecnt > 0);
+	} else if (sizeof(*dst_p) == 4) {
+		const __be32 *src_p = src;
+
+		do {
+			mmio_write32_be(dst_p++, *src_p++);
+			mmio_write32_be(dst_p++, *src_p++);
+			bytecnt -= 2 * sizeof(*dst_p);
+		} while (bytecnt > 0);
+	}
+}
+#endif
+
 MAKE_WRITE(mmio_write16, 16)
 MAKE_WRITE(mmio_write32, 32)
 MAKE_WRITE(mmio_write64, 64)
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH rdma-core 0/5] Common MMIO accessors for rdma-core
       [not found] ` <1492123127-6266-1-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
                     ` (4 preceding siblings ...)
  2017-04-13 22:38   ` [PATCH rdma-core 5/5] Add mmio_memcpy_x64 Jason Gunthorpe
@ 2017-04-14  7:18   ` Majd Dibbiny
  5 siblings, 0 replies; 16+ messages in thread
From: Majd Dibbiny @ 2017-04-14  7:18 UTC (permalink / raw)
  To: Jason Gunthorpe, Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yishai Hadas


> On Apr 14, 2017, at 1:39 AM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:
> 
> I've been threatening to do this for a while, here is a refactoring of the
> MMIO accessors into a common set of functions that unify all the weird
> options accumulated in the various providers:
> 
> - Normal 64 bit accessors
> - 32 bit emulation of 64 bit write
> - ia32 XMM support for 64 bit write
> - s390 syscall accessors
> 
> This is a big complex transformation that has lots of possibility for little
> mistakes, it would be good for other people to look closely at this before we
> merge it.
Hi Jason and Doug,

We are on holidays vacation until next Tuesday and thus we won't be able to review it until then.
Please hold on with merging it until we review/ack it.

Thanks
> 
> This is available on my github:
> 
> https://github.com/jgunthorpe/rdma-plumbing/tree/mmio
> 
> For this starting point I only converted the Mellanox drivers, because they
> are all basically the same.
> 
> This fixes one bug in the mlx_read_clock function which attempted to double
> read a register without telling the compiler, which was miscompiling :(
> 
> I have another series on top of this:
> 
> https://github.com/jgunthorpe/rdma-plumbing/tree/sparse4
> 
> Which does the last few things to make the Mellanox drivers sparse clean.
> 
> Jason Gunthorpe (5):
>  util: Add common mmio macros
>  mlx4: Use util/mmio.h
>  mlx5: Use util/mmio.h
>  mthca: Use util/mmio.h
>  Add mmio_memcpy_x64
> 
> CMakeLists.txt                |  16 +++
> buildlib/config.h.in          |   2 +
> buildlib/rdma_functions.cmake |   4 +-
> providers/mlx4/cq.c           |  14 ++-
> providers/mlx4/doorbell.h     |  70 -----------
> providers/mlx4/mlx4.c         |   1 -
> providers/mlx4/mlx4.h         |   1 -
> providers/mlx4/mmio.h         | 116 ------------------
> providers/mlx4/qp.c           |  10 +-
> providers/mlx4/srq.c          |   1 -
> providers/mlx4/verbs.c        |   9 +-
> providers/mlx5/cq.c           |  13 ++-
> providers/mlx5/doorbell.h     |  67 -----------
> providers/mlx5/mlx5.c         |   2 -
> providers/mlx5/mlx5.h         |   1 -
> providers/mlx5/qp.c           |  22 ++--
> providers/mlx5/srq.c          |   1 -
> providers/mlx5/verbs.c        |   9 +-
> providers/mthca/cq.c          |  38 +++---
> providers/mthca/doorbell.h    | 109 +----------------
> providers/mthca/qp.c          |  44 ++++---
> providers/mthca/srq.c         |  14 ++-
> util/CMakeLists.txt           |  20 +++-
> util/dummy.c                  |   0
> util/mmio.c                   |  83 +++++++++++++
> util/mmio.h                   | 265 ++++++++++++++++++++++++++++++++++++++++++
> 26 files changed, 480 insertions(+), 452 deletions(-)
> delete mode 100644 providers/mlx4/doorbell.h
> delete mode 100644 providers/mlx4/mmio.h
> delete mode 100644 providers/mlx5/doorbell.h
> create mode 100644 util/dummy.c
> create mode 100644 util/mmio.c
> create mode 100644 util/mmio.h
> 
> -- 
> 2.7.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH rdma-core 1/5] util: Add common mmio macros
       [not found]     ` <1492123127-6266-2-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-04-18 15:52       ` Yishai Hadas
       [not found]         ` <b35958cf-5e87-3149-5413-eb754ec89b4d-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Yishai Hadas @ 2017-04-18 15:52 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, Majd Dibbiny

On 4/14/2017 1:38 AM, Jason Gunthorpe wrote:
> The macros are a mashup of the Mellanox driver macros:
>  - An alternate implementation for S390 is provided
>  - The ia32 XMM based 64 bit store emulation is switched
>    in if supported
>  - The macros are 'relaxed' ordering, so no implicit barrier
>    overheads
>  - If UDMA is not available then MMIO macros are also switched off,
>    this is because the 32 bit emulation relies on the udma barriers.
>
> Signed-off-by: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
> ---
>  CMakeLists.txt                |  16 ++++
>  buildlib/config.h.in          |   2 +
>  buildlib/rdma_functions.cmake |   4 +-
>  util/CMakeLists.txt           |  20 +++-
>  util/dummy.c                  |   0
>  util/mmio.c                   |  83 ++++++++++++++++
>  util/mmio.h                   | 214 ++++++++++++++++++++++++++++++++++++++++++
>  7 files changed, 336 insertions(+), 3 deletions(-)
>  create mode 100644 util/dummy.c
>  create mode 100644 util/mmio.c
>  create mode 100644 util/mmio.h
>
> diff --git a/CMakeLists.txt b/CMakeLists.txt
> index 1ff3189c9d295e..5552872e8ca707 100644
> --- a/CMakeLists.txt
> +++ b/CMakeLists.txt
> @@ -181,6 +181,19 @@ set(NO_STRICT_ALIASING_FLAGS "")
>  RDMA_AddOptCFlag(NO_STRICT_ALIASING_FLAGS HAVE_NO_STRICT_ALIASING
>    "-fno-strict-aliasing")
>
> +CHECK_C_SOURCE_COMPILES("
> + #include <unistd.h>
> +
> + void entry(void);
> +
> + static void do_entry(void) {}
> + void entry(void) __attribute__((ifunc(\"resolve_entry\")));
> + static void *resolve_entry(void) {return &do_entry;}
> +
> + int main(int argc,const char *argv[]) { entry(); }"
> +  HAVE_FUNC_ATTRIBUTE_IFUNC
> +  FAIL_REGEX "warning")
> +
>  # The code does not do the racy fcntl if the various CLOEXEC's are not
>  # supported so it really doesn't work right if this isn't available. Thus hard
>  # require it.
> @@ -418,6 +431,9 @@ message(STATUS "Missing Optional Items:")
>  if (NOT HAVE_FUNC_ATTRIBUTE_ALWAYS_INLINE)
>    message(STATUS " Compiler attribute always_inline NOT supported")
>  endif()
> +if (NOT HAVE_FUNC_ATTRIBUTE_IFUNC)
> +  message(STATUS " Compiler attribute ifunc NOT supported")
> +endif()
>  if (NOT HAVE_COHERENT_DMA)
>    message(STATUS " Architecture NOT able to do coherent DMA (check util/udma_barrier.h) some providers disabled!")
>  endif()
> diff --git a/buildlib/config.h.in b/buildlib/config.h.in
> index e4e7b9ee70c857..2eb16be764630e 100644
> --- a/buildlib/config.h.in
> +++ b/buildlib/config.h.in
> @@ -28,6 +28,8 @@
>  // FIXME This has been supported in compilers forever, we should just fail to build on such old systems.
>  #cmakedefine HAVE_FUNC_ATTRIBUTE_ALWAYS_INLINE 1
>
> +#cmakedefine HAVE_FUNC_ATTRIBUTE_IFUNC 1
> +
>  #cmakedefine HAVE_WORKING_IF_H 1
>
>  @SIZEOF_LONG_CODE@
> diff --git a/buildlib/rdma_functions.cmake b/buildlib/rdma_functions.cmake
> index 01997b51468f55..936e1b256816ef 100644
> --- a/buildlib/rdma_functions.cmake
> +++ b/buildlib/rdma_functions.cmake
> @@ -6,8 +6,8 @@
>  # Global list of pairs of (SHARED STATIC) libary target names
>  set(RDMA_STATIC_LIBS "" CACHE INTERNAL "Doc" FORCE)
>
> -set(COMMON_LIBS_PIC ccan_pic)
> -set(COMMON_LIBS ccan)
> +set(COMMON_LIBS_PIC ccan_pic rdma_util_pic)
> +set(COMMON_LIBS ccan rdma_util)
>
>  # Create a symlink at filename DEST
>  # If the directory containing DEST does not exist then it is created
> diff --git a/util/CMakeLists.txt b/util/CMakeLists.txt
> index 71e33ac3baaa9b..581bf6822ec39f 100644
> --- a/util/CMakeLists.txt
> +++ b/util/CMakeLists.txt
> @@ -1,5 +1,23 @@
>  publish_internal_headers(util
>    compiler.h
> -  udma_barrier.h
>    util.h
>    )
> +
> +# The empty dummy.c is only needed so that cmake always has something to build
> +# into the library.
> +set(C_FILES dummy.c)
> +
> +if (HAVE_COHERENT_DMA)
> +  publish_internal_headers(util
> +    mmio.h
> +    udma_barrier.h
> +    )
> +
> +  set(C_FILES ${C_FILES}
> +    mmio.c
> +    )
> +endif()
> +
> +add_library(rdma_util STATIC ${C_FILES})
> +add_library(rdma_util_pic STATIC ${C_FILES})
> +set_property(TARGET rdma_util_pic PROPERTY POSITION_INDEPENDENT_CODE TRUE)
> diff --git a/util/dummy.c b/util/dummy.c
> new file mode 100644
> index 00000000000000..e69de29bb2d1d6
> diff --git a/util/mmio.c b/util/mmio.c
> new file mode 100644
> index 00000000000000..757a46ce5418d8
> --- /dev/null
> +++ b/util/mmio.c
> @@ -0,0 +1,83 @@
> +/* GPLv2 or OpenIB.org BSD (MIT) See COPYING file */
> +#include <util/mmio.h>
> +#include <util/udma_barrier.h>
> +#include <config.h>
> +
> +#include <pthread.h>
> +#include <stdbool.h>
> +
> +#if SIZEOF_LONG != 8
> +
> +static pthread_spinlock_t mmio_spinlock;
> +
> +static __attribute__((constructor)) void lock_constructor(void)
> +{
> +	pthread_spin_init(&mmio_spinlock, PTHREAD_PROCESS_PRIVATE);
> +}
> +
> +/* When the arch does not have a 32 bit store we provide an emulation that
You mean a 64 bit store, correct ?

> +   does two stores in address ascending order while holding a global
> +   spinlock. */
> +static void pthread_mmio_write64_be(void *addr, __be64 val)
> +{
> +	__be32 first_dword = htobe32(be64toh(val) >> 32);
> +	__be32 second_dword = htobe32(be64toh(val));
> +
> +	/* The WC spinlock, by definition, provides global ordering for all UC
> +	   and WC stores within the critical region. */
> +	mmio_wc_spinlock(&mmio_spinlock);
> +
> +	mmio_write32_be(addr, first_dword);
> +	mmio_write32_be(addr + 4, second_dword);
> +
> +	mmio_wc_spinunlock(&mmio_spinlock);
> +}
> +
> +#if defined(__i386__)
> +#include <xmmintrin.h>
> +#include <cpuid.h>
> +
> +/* For ia32 we have historically emitted movlps SSE instructions to do the 64
> +   bit operations. */
> +static void __attribute__((target("sse")))
> +sse_mmio_write64_be(void *addr, __be64 val)
> +{
> +	__m128 tmp = {};
> +	tmp = _mm_loadl_pi(tmp, (__force __m64 *)&val);
> +	_mm_storel_pi((__m64 *)addr,tmp);
> +}
> +
> +static bool have_sse(void)
> +{
> +	unsigned int ax,bx,cx,dx;
> +
> +	if (!__get_cpuid(1,&ax,&bx,&cx,&dx))
> +		return false;
> +	return dx & bit_SSE;
> +}
> +
> +#endif /* defined(__i386__) */
> +
> +typedef void (*write64_fn_t)(void *, __be64);
> +
> +/* This uses the STT_GNU_IFUNC extension to have the dynamic linker select the
> +   best above implementations at runtime. */
> +#if HAVE_FUNC_ATTRIBUTE_IFUNC
> +void mmio_write64_be(void *addr, __be64 val)
> +    __attribute__((ifunc("resolve_mmio_write64_be")));
> +static write64_fn_t resolve_mmio_write64_be(void);
> +#else
> +__asm__(".type mmio_write64_be, %gnu_indirect_function");
> +write64_fn_t resolve_mmio_write64_be(void) __asm__("mmio_write64_be");
> +#endif
> +
> +write64_fn_t resolve_mmio_write64_be(void)
> +{
> +#if defined(__i386__)
> +	if (have_sse())
> +		return &sse_mmio_write64_be;
> +#endif
> +	return &pthread_mmio_write64_be;

This will bring in 32 bit systems code that uses global spinlock 
comparing current usage where a specific spinlock is used, isn't it ?
see mlx4_write64() which is replaced by that code in downstream patches.

> +}
> +
> +#endif /* SIZEOF_LONG != 8 */
> diff --git a/util/mmio.h b/util/mmio.h
> new file mode 100644
> index 00000000000000..0b89f5fcbe000e
> --- /dev/null
> +++ b/util/mmio.h
> @@ -0,0 +1,214 @@
> +/* GPLv2 or OpenIB.org BSD (MIT) See COPYING file
> +
> +   These accessors always map to PCI-E TLPs in predictable ways. Translation
> +   to other buses should follow similar definitions.
> +
> +   write32(mem, 1)
> +      Produce a 4 byte MemWr TLP with bit 0 of DW byte offset 0 set
> +   write32_be(mem, htobe32(1))
> +      Produce a 4 byte MemWr TLP with bit 0 of DW byte offset 3 set
> +   write32_le(mem, htole32(1))
> +      Produce a 4 byte MemWr TLP with bit 0 of DW byte offset 0 set
> +
> +   For ordering these accessors are similar to the Kernel's concept of
> +   writel_relaxed(). When working with UC memory the following hold:
> +
> +   1) Strong ordering is required when talking to the same device (eg BAR),
> +      and combining is not permitted:
> +
> +       write32(mem, 1);
> +       write32(mem + 4, 1);
> +       write32(mem, 1);
> +
> +      Must produce three TLPs, in order.
> +
> +   2) Ordering ignores all pthread locking:
> +
> +       pthread_spin_lock(&lock);
> +       write32(mem, global++);
> +       pthread_spin_unlock(&lock);
> +
> +      When run concurrently on all CPUs the device must observe all stores,
> +      but the data value will not be strictly increasing.
> +
> +   3) Interaction with DMA is not ordered. Explicit use of a barrier from
> +      udma_barriers is required:
> +
> +	*dma_mem = 1;
> +	udma_to_device_barrier();
> +	write32(mem, GO_DMA);
> +
> +   4) Access out of program order (eg speculation), either by the CPU or
> +      compiler is not permitted:
> +
> +	if (cond)
> +	   read32();
> +
> +      Must not issue a read TLP if cond is false.
> +
> +   If these are used with WC memory then #1 and #4 do not apply, and all WC
> +   accesses must be bracketed with mmio_wc_start() // mmio_flush_writes()
> +*/
> +
> +#ifndef __UTIL_MMIO_H
> +#define __UTIL_MMIO_H
> +
> +#include <linux/types.h>
> +#include <stdatomic.h>
> +#include <stdint.h>
> +#include <endian.h>
> +
> +#include <config.h>
> +#include <util/compiler.h>
> +
> +/* The first step is to define the 'raw' accessors. To make this very safe
> +   with sparse we define two versions of each, a le and a be - however the
> +   code is always identical.
> +*/
> +#ifdef __s390x__
> +#include <unistd.h>
> +#include <sys/syscall.h>
> +
> +/* s390 requires a privileged instruction to access IO memory, these syscalls
> +   perform that instruction using a memory buffer copy semantic.
> +*/
> +static inline void s390_mmio_write(void *mmio_addr, const void *val,
> +				   size_t length)
> +{
> +	// FIXME: Check for error and call abort?
> +	syscall(__NR_s390_pci_mmio_write, mmio_addr, val, length);
> +}
> +
> +static inline void s390_mmio_read(const void *mmio_addr, void *val,
> +				  size_t length)
> +{
> +	// FIXME: Check for error and call abort?
> +	syscall(__NR_s390_pci_mmio_read, mmio_addr, val, length);
> +}
> +
> +#define MAKE_WRITE(_NAME_, _SZ_)                                               \
> +	static inline void _NAME_##_be(void *addr, __be##_SZ_ value)           \
> +	{                                                                      \
> +		s390_mmio_write(addr, &value, sizeof(value));                  \
> +	}                                                                      \
> +	static inline void _NAME_##_le(void *addr, __le##_SZ_ value)           \
> +	{                                                                      \
> +		s390_mmio_write(addr, &value, sizeof(value));                  \
> +	}

I believe that you introduced two different versions (i.e. le/be only 
for semantic reasons as code is really similar, is that correct ?

> +#define MAKE_READ(_NAME_, _SZ_)                                                \
> +	static inline __be##_SZ_ _NAME_##_be(const void *addr)                 \
> +	{                                                                      \
> +		__be##_SZ_ res;                                                \
> +		s390_mmio_read(addr, &res, sizeof(res));                       \
> +		return res;                                                    \
> +	}                                                                      \
> +	static inline __le##_SZ_ _NAME_##_le(const void *addr)                 \
> +	{                                                                      \
> +		__le##_SZ_ res;                                                \
> +		s390_mmio_read(addr, &res, sizeof(res));                       \
> +		return res;                                                    \
> +	}
> +
> +static inline void mmio_read8(void *addr, uint8_t value)

Did you refer here to mmio_write8 ?

> +{
> +	s390_mmio_write(addr, &value, sizeof(value));
> +}
> +
> +static inline uint8_t mmio_read8(const void *addr)

This function was already defined above.

> +{
> +	uint8_t res;
> +	s390_mmio_read(addr, &res, sizeof(res));
> +	return res;
> +}
> +
> +#else /* __s390x__ */
> +
> +#define MAKE_WRITE(_NAME_, _SZ_)                                               \
> +	static inline void _NAME_##_be(void *addr, __be##_SZ_ value)           \
> +	{                                                                      \
> +		atomic_store_explicit((_Atomic(uint##_SZ_##_t) *)addr,         \
> +				      (__force uint##_SZ_##_t)value,           \
> +				      memory_order_relaxed);                   \

Mightn't this code introduce some overhead comparing direct assignment 
which is used (i.e. mlx5_write64) and replaced by that in downstream 
patches ?

This need to be measured and verified as it's used in the data path flows.

> +	}                                                                      \
> +	static inline void _NAME_##_le(void *addr, __le##_SZ_ value)           \
> +	{                                                                      \
> +		atomic_store_explicit((_Atomic(uint##_SZ_##_t) *)addr,         \
> +				      (__force uint##_SZ_##_t)value,           \
> +				      memory_order_relaxed);                   \
> +	}
> +#define MAKE_READ(_NAME_, _SZ_)                                                \
> +	static inline __be##_SZ_ _NAME_##_be(const void *addr)                 \
> +	{                                                                      \
> +		return (__force __be##_SZ_)atomic_load_explicit(               \
> +		    (_Atomic(uint##_SZ_##_t) *)addr, memory_order_relaxed);    \
> +	}                                                                      \
> +	static inline __le##_SZ_ _NAME_##_le(const void *addr)                 \
> +	{                                                                      \
> +		return (__force __le##_SZ_)atomic_load_explicit(               \
> +		    (_Atomic(uint##_SZ_##_t) *)addr, memory_order_relaxed);    \
> +	}
> +
> +static inline void mmio_write8(void *addr, uint8_t value)
> +{
> +	atomic_store_explicit((_Atomic(uint8_t) *)addr, value,
> +			      memory_order_relaxed);
> +}
> +static inline uint8_t mmio_read8(const void *addr)
> +{
> +	return atomic_load_explicit((_Atomic(uint32_t) *)addr,
> +				    memory_order_relaxed);
> +}
> +
> +#endif /* __s390x__ */
> +
> +MAKE_WRITE(mmio_write16, 16)
> +MAKE_WRITE(mmio_write32, 32)
> +
> +MAKE_READ(mmio_read16, 16)
> +MAKE_READ(mmio_read32, 32)
> +
> +#if SIZEOF_LONG == 8
> +MAKE_WRITE(mmio_write64, 64)
> +MAKE_READ(mmio_read64, 64)
> +#else
> +void mmio_write64_be(void *addr, __be64 val);
> +static inline void mmio_write64_le(void *addr, __le64 val)
> +{
> +	mmio_write64_be(addr, (__be64 __force)val);
> +}
> +
> +/* There is no way to do read64 atomically, rather than provide some sketchy
> +   implementation we leave these functions undefined, users should not call
> +   them if SIZEOF_LONG != 8, but instead implement an appropriate version. */
> +__be64 mmio_read64_be(const void *addr);
> +__le64 mmio_read64_le(const void *addr);
> +#endif /* SIZEOF_LONG == 8 */
> +
> +#undef MAKE_WRITE
> +#undef MAKE_READ
> +
> +/* Now we can define the host endian versions of the operator, this just includes
> +   a call to htole. */
> +#define MAKE_WRITE(_NAME_, _SZ_)                                               \
> +	static inline void _NAME_(void *addr, uint##_SZ_##_t value)            \
> +	{                                                                      \
> +		_NAME_##_le(addr, htole##_SZ_(value));                         \
> +	}
> +#define MAKE_READ(_NAME_, _SZ_)                                                \
> +	static inline uint##_SZ_##_t _NAME_(const void *addr)                  \
> +	{                                                                      \
> +		return le##_SZ_##toh(_NAME_##_le(addr));                       \
> +	}
> +
> +MAKE_WRITE(mmio_write16, 16)
> +MAKE_WRITE(mmio_write32, 32)
> +MAKE_WRITE(mmio_write64, 64)
> +
> +MAKE_READ(mmio_read16, 16)
> +MAKE_READ(mmio_read32, 32)
> +MAKE_READ(mmio_read64, 64)
> +
> +#undef MAKE_WRITE
> +#undef MAKE_READ
> +
> +#endif
>

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH rdma-core 5/5] Add mmio_memcpy_x64
       [not found]     ` <1492123127-6266-6-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-04-18 16:22       ` Yishai Hadas
       [not found]         ` <413a6c23-2bfe-60b6-f179-ddcb82bb19ab-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Yishai Hadas @ 2017-04-18 16:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yishai Hadas, Majd Dibbiny

On 4/14/2017 1:38 AM, Jason Gunthorpe wrote:
> This pattern is common in a couple of drivers, and needs the s390
> syscall.
>
> The common version properly handles 32 bit and prevents reordering of
> the stores, which is the stated reason for this to exist. It is also
> slightly more optimized, since it assumes a non-zero transfer length.
>
> Signed-off-by: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
> ---
>  providers/mlx4/mmio.h | 43 -----------------------------------------
>  providers/mlx4/qp.c   |  5 ++---
>  providers/mlx5/qp.c   | 17 ++++++-----------
>  util/mmio.h           | 53 ++++++++++++++++++++++++++++++++++++++++++++++++++-
>  4 files changed, 60 insertions(+), 58 deletions(-)
>  delete mode 100644 providers/mlx4/mmio.h
>
> diff --git a/providers/mlx4/mmio.h b/providers/mlx4/mmio.h
> deleted file mode 100644
> index 9821e85224dcfd..00000000000000
> diff --git a/providers/mlx4/qp.c b/providers/mlx4/qp.c
> index 423f59533de68d..e7f10b9f1524d5 100644
> --- a/providers/mlx4/qp.c
> +++ b/providers/mlx4/qp.c
> @@ -43,7 +43,6 @@
>
>  #include "mlx4.h"
>  #include "wqe.h"
> -#include "mmio.h"
>
>  static const uint32_t mlx4_ib_opcode[] = {
>  	[IBV_WR_SEND]			= MLX4_OPCODE_SEND,
> @@ -481,8 +480,8 @@ out:
>  		 */
>  		mmio_wc_spinlock(&ctx->bf_lock);
>
> -		mlx4_bf_copy(ctx->bf_page + ctx->bf_offset, (unsigned long *) ctrl,
> -			     align(size * 16, 64));
> +		mmio_memcpy_x64(ctx->bf_page + ctx->bf_offset, ctrl,
> +				align(size * 16, 64));
>  		/* Flush before toggling bf_offset to be latency oriented */
>  		mmio_flush_writes();
>
> diff --git a/providers/mlx5/qp.c b/providers/mlx5/qp.c
> index 7f67a0b61b221f..c4789bf0d909a4 100644
> --- a/providers/mlx5/qp.c
> +++ b/providers/mlx5/qp.c
> @@ -239,19 +239,14 @@ static void set_data_ptr_seg_atomic(struct mlx5_wqe_data_seg *dseg,
>  static void mlx5_bf_copy(unsigned long long *dst, unsigned long long *src,
>  			 unsigned bytecnt, struct mlx5_qp *qp)
>  {
> -	while (bytecnt > 0) {
> -		*dst++ = *src++;
> -		*dst++ = *src++;
> -		*dst++ = *src++;
> -		*dst++ = *src++;
> -		*dst++ = *src++;
> -		*dst++ = *src++;
> -		*dst++ = *src++;
> -		*dst++ = *src++;
> -		bytecnt -= 8 * sizeof(unsigned long long);
> +	do {
> +		mmio_memcpy_x64(dst, src, 64);
> +		bytecnt -= 64;
> +		dst += 8;
> +		src += 8;

It looks like the above +=8 is wrong in 32 bit systems, agree ?

>  		if (unlikely(src == qp->sq.qend))
>  			src = qp->sq_start;
> -	}
> +	} while (bytecnt > 0);
>  }
>
>  static uint32_t send_ieth(struct ibv_send_wr *wr)
> diff --git a/util/mmio.h b/util/mmio.h
> index 0b89f5fcbe000e..1d45d6d6364d4e 100644
> --- a/util/mmio.h
> +++ b/util/mmio.h
> @@ -56,6 +56,7 @@
>  #include <linux/types.h>
>  #include <stdatomic.h>
>  #include <stdint.h>
> +#include <stddef.h>
>  #include <endian.h>
>
>  #include <config.h>
> @@ -158,7 +159,6 @@ static inline uint8_t mmio_read8(const void *addr)
>  	return atomic_load_explicit((_Atomic(uint32_t) *)addr,
>  				    memory_order_relaxed);
>  }
> -
>  #endif /* __s390x__ */
>
>  MAKE_WRITE(mmio_write16, 16)
> @@ -200,6 +200,57 @@ __le64 mmio_read64_le(const void *addr);
>  		return le##_SZ_##toh(_NAME_##_le(addr));                       \
>  	}
>
> +/* This strictly guarantees the order of TLP generation for the memory copy to
> +   be in ascending address order.
> +*/
> +#ifdef __s390x__
> +static inline void mmio_memcpy_x64(void *dest, const void *src, size_t bytecnt)
> +{
> +	s390_mmio_write(addr, src, bytecnt);
> +}
> +#else
> +
> +/* Transfer is some multiple of 64 bytes */
> +static inline void mmio_memcpy_x64(void *dest, const void *src, size_t bytecnt)
> +{
> +	uintptr_t *dst_p = dest;
> +
> +	/* Caller must guarantee:
> +	    assert(bytecnt != 0);
> +	    assert((bytecnt % 64) == 0);
> +	    assert(((uintptr_t)dest) % __alignof__(*dst) == 0);
> +	    assert(((uintptr_t)src) % __alignof__(*dst) == 0);
> +	*/
> +
> +	/* Use the native word size for the copy */
> +	if (sizeof(*dst_p) == 8) {

We expect this 'if' to be dropped at compile time to prevent performance 
penalty comparing the original code, correct ?

> +		const __be64 *src_p = src;
> +
> +		do {
> +			/* Do 64 bytes at a time */
> +			mmio_write64_be(dst_p++, *src_p++);
> +			mmio_write64_be(dst_p++, *src_p++);
> +			mmio_write64_be(dst_p++, *src_p++);
> +			mmio_write64_be(dst_p++, *src_p++);
> +			mmio_write64_be(dst_p++, *src_p++);
> +			mmio_write64_be(dst_p++, *src_p++);
> +			mmio_write64_be(dst_p++, *src_p++);
> +			mmio_write64_be(dst_p++, *src_p++);
> +
> +			bytecnt -= 8 * sizeof(*dst_p);
> +		} while (bytecnt > 0);
> +	} else if (sizeof(*dst_p) == 4) {
> +		const __be32 *src_p = src;
> +
> +		do {
> +			mmio_write32_be(dst_p++, *src_p++);
> +			mmio_write32_be(dst_p++, *src_p++);
> +			bytecnt -= 2 * sizeof(*dst_p);

Any reason not to write at least 64 bytes here before checking byte 
count and looping again ?

> +		} while (bytecnt > 0);
> +	}
> +}
> +#endif
> +
>  MAKE_WRITE(mmio_write16, 16)
>  MAKE_WRITE(mmio_write32, 32)
>  MAKE_WRITE(mmio_write64, 64)
>

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH rdma-core 1/5] util: Add common mmio macros
       [not found]         ` <b35958cf-5e87-3149-5413-eb754ec89b4d-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2017-04-18 17:28           ` Jason Gunthorpe
       [not found]             ` <20170418172852.GD7181-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Jason Gunthorpe @ 2017-04-18 17:28 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, Majd Dibbiny

On Tue, Apr 18, 2017 at 06:52:00PM +0300, Yishai Hadas wrote:

> >+/* When the arch does not have a 32 bit store we provide an emulation that
> You mean a 64 bit store, correct ?

Yep, thanks

> >+write64_fn_t resolve_mmio_write64_be(void)
> >+{
> >+#if defined(__i386__)
> >+	if (have_sse())
> >+		return &sse_mmio_write64_be;
> >+#endif
> >+	return &pthread_mmio_write64_be;
> 
> This will bring in 32 bit systems code that uses global spinlock comparing
> current usage where a specific spinlock is used, isn't it ?

Yes, that is mostly right. libutil is statically linked to each
provider, so every provider gets its own instance of the global lock.

The only case this would seem to matter is if multiple devices using
the same provider are run concurrently.

We could use a hash scheme or something to multiple the lock, but I'm
really not sure that 32 bit performance matters to anyone anymore?

> see mlx4_write64() which is replaced by that code in downstream patches.

Yes, it will be easier to migrate the entire code base if we don't
have to add more special locking. Since this brings back lockless SSE
on ia32 I feel there are very few systems that would actually hit this
lock and care about the performance of the lock ??

> >+/* The first step is to define the 'raw' accessors. To make this very safe
> >+   with sparse we define two versions of each, a le and a be - however the
> >+   code is always identical.
> >+*/

[..]

> >+#define MAKE_WRITE(_NAME_, _SZ_)                                               \
> >+	static inline void _NAME_##_be(void *addr, __be##_SZ_ value)           \
> >+	{                                                                      \
> >+		s390_mmio_write(addr, &value, sizeof(value));                  \
> >+	}                                                                      \
> >+	static inline void _NAME_##_le(void *addr, __le##_SZ_ value)           \
> >+	{                                                                      \
> >+		s390_mmio_write(addr, &value, sizeof(value));                  \
> >+	}
> 
> I believe that you introduced two different versions (i.e. le/be only for
> semantic reasons as code is really similar, is that correct ?

Yes, as the comment above says.

> >+static inline void mmio_read8(void *addr, uint8_t value)
> 
> Did you refer here to mmio_write8 ?

Yep, I can't even compile test the s390 stuff..

> >+#define MAKE_WRITE(_NAME_, _SZ_)                                               \
> >+	static inline void _NAME_##_be(void *addr, __be##_SZ_ value)           \
> >+	{                                                                      \
> >+		atomic_store_explicit((_Atomic(uint##_SZ_##_t) *)addr,         \
> >+				      (__force uint##_SZ_##_t)value,           \
> >+				      memory_order_relaxed);                   \
> 
> Mightn't this code introduce some overhead comparing direct assignment which
> is used (i.e. mlx5_write64) and replaced by that in downstream patches ?

All of the generated assembly I have inspected says this is at least
identical.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH rdma-core 1/5] util: Add common mmio macros
       [not found]             ` <20170418172852.GD7181-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-04-18 17:38               ` Leon Romanovsky
       [not found]                 ` <20170418173815.GC14088-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Leon Romanovsky @ 2017-04-18 17:38 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yishai Hadas, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, Majd Dibbiny

[-- Attachment #1: Type: text/plain, Size: 423 bytes --]

On Tue, Apr 18, 2017 at 11:28:52AM -0600, Jason Gunthorpe wrote:
> On Tue, Apr 18, 2017 at 06:52:00PM +0300, Yishai Hadas wrote:
>
> We could use a hash scheme or something to multiple the lock, but I'm
> really not sure that 32 bit performance matters to anyone anymore?

In a hard way, I learned that we have a customer who expects that his 32bit
application will continue to work, so the answer is - no, I care.

Thanks

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH rdma-core 1/5] util: Add common mmio macros
       [not found]                 ` <20170418173815.GC14088-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
@ 2017-04-18 18:17                   ` Jason Gunthorpe
       [not found]                     ` <20170418181736.GF7181-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Jason Gunthorpe @ 2017-04-18 18:17 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Yishai Hadas, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, Majd Dibbiny

On Tue, Apr 18, 2017 at 08:38:15PM +0300, Leon Romanovsky wrote:
> On Tue, Apr 18, 2017 at 11:28:52AM -0600, Jason Gunthorpe wrote:
> > On Tue, Apr 18, 2017 at 06:52:00PM +0300, Yishai Hadas wrote:
> >
> > We could use a hash scheme or something to multiple the lock, but I'm
> > really not sure that 32 bit performance matters to anyone anymore?
> 
> In a hard way, I learned that we have a customer who expects that his 32bit
> application will continue to work, so the answer is - no, I care.

Of course it continues to work.

Does your customer fit this very narrow definition:
 - Would actually upgrade to rdma-core
 - 32 bit
 - No SSE hardware (any Intel chip capable of PCI-E has SSE hardware)
 - Multiple same-provider devices with a single program touching all
   devices (single device performance is unchanged)
 - Sensitive to the performance difference of a potential spinlock
   contention / cache misplacement for ~4 instructions

It is hard to understand who cares so much about peformance but leaves
a wack on the table by running in 32 bit mode.

Is this something like x32? We can certainly improve for x32.

PPC32 could also potentially have a path like SSE..

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH rdma-core 5/5] Add mmio_memcpy_x64
       [not found]         ` <413a6c23-2bfe-60b6-f179-ddcb82bb19ab-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2017-04-18 18:27           ` Jason Gunthorpe
       [not found]             ` <20170418182703.GG7181-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Jason Gunthorpe @ 2017-04-18 18:27 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yishai Hadas, Majd Dibbiny

On Tue, Apr 18, 2017 at 07:22:07PM +0300, Yishai Hadas wrote:
> >@@ -239,19 +239,14 @@ static void set_data_ptr_seg_atomic(struct mlx5_wqe_data_seg *dseg,
> > static void mlx5_bf_copy(unsigned long long *dst, unsigned long long *src,
> > 			 unsigned bytecnt, struct mlx5_qp *qp)
> > {
> >-	while (bytecnt > 0) {
> >-		*dst++ = *src++;
> >-		*dst++ = *src++;
> >-		*dst++ = *src++;
> >-		*dst++ = *src++;
> >-		*dst++ = *src++;
> >-		*dst++ = *src++;
> >-		*dst++ = *src++;
> >-		*dst++ = *src++;
> >-		bytecnt -= 8 * sizeof(unsigned long long);
> >+	do {
> >+		mmio_memcpy_x64(dst, src, 64);
> >+		bytecnt -= 64;
> >+		dst += 8;
> >+		src += 8;
> 
> It looks like the above +=8 is wrong in 32 bit systems, agree ?

Hurm. On 32 bit systems 'unsigned long long' will still be 64 bit, so
the above is OK.

The above original is buggy on 32 bit because it is not guarenteed to
generate stores strictly in increasing address order. I think the
author's intent was to have used 'uintptr_t *'.

I will change the arguments to be 'uint64_t *' for clarity.

> >+	/* Use the native word size for the copy */
> >+	if (sizeof(*dst_p) == 8) {
> 
> We expect this 'if' to be dropped at compile time to prevent performance
> penalty comparing the original code, correct ?

Yes.

The entire mmio_memcpy_x64 expands to a bunch of movs with no branches
as the transfer size is constant as well.

The overall mlx5_bf_copy looses one branch because of the
transformation to do/while

> >+		} while (bytecnt > 0);
> >+	} else if (sizeof(*dst_p) == 4) {
> >+		const __be32 *src_p = src;
> >+
> >+		do {
> >+			mmio_write32_be(dst_p++, *src_p++);
> >+			mmio_write32_be(dst_p++, *src_p++);
> >+			bytecnt -= 2 * sizeof(*dst_p);
> 
> Any reason not to write at least 64 bytes here before checking byte count
> and looping again ?

icache size? I debated doing that, but the consensus of the existing
implementations seems to be against it..

We could do a 32 byte unwind which is probably a similar icache
footprint?

What would you like?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH rdma-core 1/5] util: Add common mmio macros
       [not found]                     ` <20170418181736.GF7181-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-04-19  5:55                       ` Leon Romanovsky
       [not found]                         ` <20170419055517.GH14088-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Leon Romanovsky @ 2017-04-19  5:55 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yishai Hadas, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, Majd Dibbiny

[-- Attachment #1: Type: text/plain, Size: 1691 bytes --]

On Tue, Apr 18, 2017 at 12:17:36PM -0600, Jason Gunthorpe wrote:
> On Tue, Apr 18, 2017 at 08:38:15PM +0300, Leon Romanovsky wrote:
> > On Tue, Apr 18, 2017 at 11:28:52AM -0600, Jason Gunthorpe wrote:
> > > On Tue, Apr 18, 2017 at 06:52:00PM +0300, Yishai Hadas wrote:
> > >
> > > We could use a hash scheme or something to multiple the lock, but I'm
> > > really not sure that 32 bit performance matters to anyone anymore?
> >
> > In a hard way, I learned that we have a customer who expects that his 32bit
> > application will continue to work, so the answer is - no, I care.
>
> Of course it continues to work.
>
> Does your customer fit this very narrow definition:
>  - Would actually upgrade to rdma-core
>  - 32 bit
>  - No SSE hardware (any Intel chip capable of PCI-E has SSE hardware)
>  - Multiple same-provider devices with a single program touching all
>    devices (single device performance is unchanged)
>  - Sensitive to the performance difference of a potential spinlock
>    contention / cache misplacement for ~4 instructions
>
> It is hard to understand who cares so much about peformance but leaves
> a wack on the table by running in 32 bit mode.

It doesn't matter if customer cares or not cares about performance. We
as a company provided satisfactory numbers to them and would like to
ensure that these numbers don't reduce over time.

I'm not looking for a performance boost in 32bit environment, but I
don't expect decrease either.

>
> Is this something like x32? We can certainly improve for x32.

From the customer ticket, the system is x86 (32 bits).

>
> PPC32 could also potentially have a path like SSE..
>
> Jason

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH rdma-core 1/5] util: Add common mmio macros
       [not found]                         ` <20170419055517.GH14088-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
@ 2017-04-19 15:33                           ` Jason Gunthorpe
  0 siblings, 0 replies; 16+ messages in thread
From: Jason Gunthorpe @ 2017-04-19 15:33 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Yishai Hadas, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, Majd Dibbiny

On Wed, Apr 19, 2017 at 08:55:17AM +0300, Leon Romanovsky wrote:

> > Is this something like x32? We can certainly improve for x32.
> 
> From the customer ticket, the system is x86 (32 bits).

Then it will use the SSE path and run faster with this patch, so no
problem.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH rdma-core 5/5] Add mmio_memcpy_x64
       [not found]             ` <20170418182703.GG7181-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-04-19 15:54               ` Yishai Hadas
  0 siblings, 0 replies; 16+ messages in thread
From: Yishai Hadas @ 2017-04-19 15:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yishai Hadas, Majd Dibbiny

On 4/18/2017 9:27 PM, Jason Gunthorpe wrote:
>>> +	} else if (sizeof(*dst_p) == 4) {
>>> +		const __be32 *src_p = src;
>>> +
>>> +		do {
>>> +			mmio_write32_be(dst_p++, *src_p++);
>>> +			mmio_write32_be(dst_p++, *src_p++);
>>> +			bytecnt -= 2 * sizeof(*dst_p);
>>
>> Any reason not to write at least 64 bytes here before checking byte count
>> and looping again ?
>
> icache size? I debated doing that, but the consensus of the existing
> implementations seems to be against it..
>
> We could do a 32 byte unwind which is probably a similar icache
> footprint?
>
> What would you like?

I'm fine with leaving the code as is following the existing implementation.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2017-04-19 15:54 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-13 22:38 [PATCH rdma-core 0/5] Common MMIO accessors for rdma-core Jason Gunthorpe
     [not found] ` <1492123127-6266-1-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-04-13 22:38   ` [PATCH rdma-core 1/5] util: Add common mmio macros Jason Gunthorpe
     [not found]     ` <1492123127-6266-2-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-04-18 15:52       ` Yishai Hadas
     [not found]         ` <b35958cf-5e87-3149-5413-eb754ec89b4d-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2017-04-18 17:28           ` Jason Gunthorpe
     [not found]             ` <20170418172852.GD7181-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-04-18 17:38               ` Leon Romanovsky
     [not found]                 ` <20170418173815.GC14088-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
2017-04-18 18:17                   ` Jason Gunthorpe
     [not found]                     ` <20170418181736.GF7181-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-04-19  5:55                       ` Leon Romanovsky
     [not found]                         ` <20170419055517.GH14088-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
2017-04-19 15:33                           ` Jason Gunthorpe
2017-04-13 22:38   ` [PATCH rdma-core 2/5] mlx4: Use util/mmio.h Jason Gunthorpe
2017-04-13 22:38   ` [PATCH rdma-core 3/5] mlx5: " Jason Gunthorpe
2017-04-13 22:38   ` [PATCH rdma-core 4/5] mthca: " Jason Gunthorpe
2017-04-13 22:38   ` [PATCH rdma-core 5/5] Add mmio_memcpy_x64 Jason Gunthorpe
     [not found]     ` <1492123127-6266-6-git-send-email-jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-04-18 16:22       ` Yishai Hadas
     [not found]         ` <413a6c23-2bfe-60b6-f179-ddcb82bb19ab-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2017-04-18 18:27           ` Jason Gunthorpe
     [not found]             ` <20170418182703.GG7181-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-04-19 15:54               ` Yishai Hadas
2017-04-14  7:18   ` [PATCH rdma-core 0/5] Common MMIO accessors for rdma-core Majd Dibbiny

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.