From: "Elliott, Robert (Server Storage)" <Elliott@hp.com>
To: Dan Williams <dan.j.williams@intel.com>,
	"axboe@kernel.dk" <axboe@kernel.dk>
Cc: "linux-nvdimm@lists.01.org" <linux-nvdimm@lists.01.org>,
	"neilb@suse.de" <neilb@suse.de>,
	"gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>,
	"Rafael J. Wysocki" <rafael.j.wysocki@intel.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Robert Moore <robert.moore@intel.com>,
	"linux-acpi@vger.kernel.org" <linux-acpi@vger.kernel.org>,
	Lv Zheng <lv.zheng@intel.com>, "hch@lst.de" <hch@lst.de>,
	"mingo@kernel.org" <mingo@kernel.org>,
	"Kani,
	Toshimitsu" <toshi.kani@hp.com>Christoph Hellwig <hch@lst.de>,
	"Boaz Harrosh (boaz@plexistor.com)" <boaz@plexistor.com>
Subject: RE: [PATCH v3 20/21] nfit-test: manufactured NFITs for interface development
Date: Mon, 25 May 2015 07:02:57 +0000	[thread overview]
Message-ID: <94D0CD8314A33A4D9D801C0FE68B40295A9217B0@G9W0745.americas.hpqcorp.net> (raw)
In-Reply-To: <20150520205800.32249.74581.stgit@dwillia2-desk3.amr.corp.intel.com>

[-- Attachment #1: Type: text/plain, Size: 667 bytes --]

> -----Original Message-----
> From: Linux-nvdimm [mailto:linux-nvdimm-bounces@lists.01.org] On Behalf
> Of Dan Williams
> Sent: Wednesday, May 20, 2015 3:58 PM
> To: axboe@kernel.dk
> Subject: [PATCH v3 20/21] nfit-test: manufactured NFITs for interface
> development
...

Attached is some experimental code to try pmem with different 
cache types (UC, WB, WC, and WT) and memcpy functions using x86 
AVX non-temporal load and store instructions.

It depends on Toshi's WT patch series:
	https://lkml.org/lkml/2015/5/13/866

If you don't have that, you can just comment out the lines related
to ioremap_wt.
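
Alternatively, a one-line stub near the top of pmem.c keeps it
building (untested sketch; it falls back to a UC mapping instead
of WT):

	#define ioremap_wt(offset, size)	ioremap_nocache(offset, size)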

---
Rob Elliott, HP Server Storage


[-- Attachment #2: 0001-pmem-cache-type --]
[-- Type: application/octet-stream, Size: 19027 bytes --]

From 18e75a7134e0130b925fffab13f41c1ffc4d9f05 Mon Sep 17 00:00:00 2001
From: Robert Elliott <elliott@hp.com>
Date: Fri, 22 May 2015 16:46:21 -0500
Subject: [PATCH] pmem cache type patch

Author: Robert Elliott <elliott@hp.com>
Date:   Tue Apr 28 19:14:53 2015 -0500

    pmem: cache_type, non-temporal memcpy experiments

    WARNING: Not for inclusion in the kernel - just for experimentation
    (for one thing, the AVX helpers below are not wrapped in
    kernel_fpu_begin()/kernel_fpu_end(), so they clobber user FPU state).

    Add modparams to select cache_type and various kinds of
    memcpy with non-temporal loads and stores.  Parameters
    are printed to the kernel serial log at module load time.

    Example usage:
    modprobe pmem pmem_cachetype=2 pmem_readscan=1 pmem_ntw=1 pmem_ntr=1

    x86 offers several non-temporal instructions:
    *  8 byte: movnti (store) from normal registers
    * 16 byte: movntdq (store) and movntdqa (load) using xmm registers (SSE)
    * 32 byte: vmovntdq and vmovntdqa using ymm registers (AVX)
    * 64 byte: vmovntdq and vmovntdqa using zmm registers (AVX512)

    The 32-byte AVX instructions are used by this patch.
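
    For reference, a minimal 8-byte movnti store (sketch; not used by
    this patch) looks like this, where val is a u64 and dst is the
    destination pointer:

    	__asm__ __volatile__ ("movnti %0, (%1)"
    		: : "r" (val), "r" (dst) : "memory");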

    Normal memcpy is used for unaligned pmem_rw_bytes accesses, so
    that path is unsafe in WB mode (writes can linger in the CPU
    caches).

    Module parameters
    =================
    pmem_cachetype=n	(default 0)
    	Select the cache type (which ioremap function to use to
    	map the NVDIMM memory)
    	0 = UC (uncacheable) - slow writes, slow reads
    	1 = WB (writeback) - fast unsafe writes, fast reads
    	2 = WC (write combining) - fast writes, slow reads
    	3 = WT (writethrough) - slow writes, fast reads

    	WB writes are safe if:
    	* non-temporal stores are exclusively used
    	* clflush instructions are added
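
    	For example (sketch, not implemented by this patch), a WB-safe
    	write path could flush explicitly after each copy:

    		memcpy(pmem->virt_addr + pmem_off, mem + off, len);
    		clflush_cache_range(pmem->virt_addr + pmem_off, len);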

    pmem_readscan=n		(default 0)
    	0 = no read scan
    	1 = read the entire memory range, looking to trigger
    	uncorrectable (UC) memory errors

    	The rate is also printed, serving as a quick performance
    	check (uses a 64 byte loop with NT loads).

    pmem_clean=n		(default 0)
    	0 = no clean
    	1 = overwrite the entire memory range, possibly
    	clearing uncorrectable (UC) memory errors (dangerous,
    	destroys all data)

    	The rate is also printed, serving as a quick performance
    	check (uses a 64 byte loop with NT stores).

    pmem_ntw=n		(default 3)
    	Use non-temporal stores when writing persistent memory

    	0 = memcpy (unsafe for WB)
    	1 = 64 byte loop with NT stores
    	2 = 128 byte loop with NT stores
    	3 = 64 byte loop with NT stores, plus use NT loads from
    	  normal memory (may be better cache usage)
    	4 = 128 byte loop with NT stores, plus use NT loads from
    	  normal memory
    	5 = __copy_from_user (existing kernel function with
    	  8 byte NT instructions)
    	6 = no write at all (nop)(dangerous)
    	7 = 64-byte loop, store only (write garbage)(dangerous)

    pmem_ntr=n		(default 3)
    	Use non-temporal loads when reading persistent memory

    	0 = memcpy
    	1 = 64 byte loop with NT loads
    	2 = 128 byte loop with NT loads
    	3 = 64 byte loop with NT loads, plus use NT stores to
    	  normal memory
    	4 = 128 byte loop with NT loads, plus use NT stores to
    	  normal memory
    	5 = memcpy
    	6 = no load at all (nop)(dangerous)
    	7 = 64-byte loop, load only (return garbage)(dangerous)

    pmem_ntw=6 pmem_ntr=6 exhibits the block layer IOPS limits.
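
    As a rough benchmark against those limits (assuming the device
    shows up as /dev/pmem0), something like:

    	fio --name=pmem --filename=/dev/pmem0 --rw=randread --bs=4k \
    		--direct=1 --ioengine=libaio --iodepth=32 --numjobs=4 \
    		--runtime=30 --time_based --group_reporting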

    Signed-off-by: Robert Elliott <elliott@hp.com>
---
 drivers/block/nd/pmem.c | 550 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 539 insertions(+), 11 deletions(-)

diff --git a/drivers/block/nd/pmem.c b/drivers/block/nd/pmem.c
index 7b5cedf1f2a4..f378ef81733f 100644
--- a/drivers/block/nd/pmem.c
+++ b/drivers/block/nd/pmem.c
@@ -26,6 +26,382 @@
 #include <linux/nd.h>
 #include "nd.h"
 
+static int pmem_cachetype;	/* default UC */
+module_param(pmem_cachetype, int, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(pmem_cachetype,
+	"Select cache attribute for pmem driver (0=UC, 1=WB 2=WC 3=WT)");
+
+static int pmem_readscan;
+module_param(pmem_readscan, int, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(pmem_readscan,
+	"Read scan pmem device upon init (trigger ECC errors)");
+
+static int pmem_clean;
+module_param(pmem_clean, int, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(pmem_clean,
+	"Clean pmem device upon init (write garbage, but cleans the ECC)");
+
+static int pmem_ntw = 3;
+module_param(pmem_ntw, int, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(pmem_ntw,
+	"Use non-temporal stores for block writes in pmem (0=memcpy, 1=64 byte NT, 2=128 byte NT, 3=64 dual NT, 4=128 dual NT, 5=copy_from_user, 6=nop, 7=64-byte NT-store only)");
+
+static int pmem_ntr = 3;
+module_param(pmem_ntr, int, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(pmem_ntr,
+	"Use non-temporal loads for block reads in pmem (0=memcpy, 1=64 byte NT, 2=128 byte NT, 3=64 dual NT, 4=128 dual NT, 5=memcpy, 6=nop, 7=64-byte NT-load only)");
+
+/* load: normal, store: non-temporal, loop: 64 bytes */
+static void memcpy_lt_snt_64(void *to, const void *from, size_t size)
+{
+	u64 bs = 64;
+	size_t i;
+
+	BUG_ON(!IS_ALIGNED(size, bs));
+	BUG_ON(!IS_ALIGNED((u64)to, bs));
+	BUG_ON(!IS_ALIGNED((u64)from, bs));
+
+	for (i = 0; i < size; i += bs) {
+		__asm__ __volatile__ (
+#if 0
+		/* 16-byte SSE instructions */
+		"movdqa (%0), %%xmm0\n"
+		"movdqa 16(%0), %%xmm1\n"
+		"movdqa 32(%0), %%xmm2\n"
+		"movdqa 48(%0), %%xmm3\n"
+		"movntdq %%xmm0, (%1)\n"
+		"movntdq %%xmm1, 16(%1)\n"
+		"movntdq %%xmm2, 32(%1)\n"
+		"movntdq %%xmm3, 48(%1)\n"
+#endif
+		/* 32-byte AVX instructions */
+		"vmovdqa (%0), %%ymm0\n"
+		"vmovdqa 32(%0), %%ymm1\n"
+		"vmovntdq %%ymm0, (%1)\n"
+		"vmovntdq %%ymm1, 32(%1)\n"
+		:
+		: "r" (from), "r" (to)
+		: "memory");
+
+		to += bs;
+		from += bs;
+	}
+
+	__asm__ __volatile__ (
+		" sfence\n" : :
+	);
+}
+
+/* load: skip, store: non-temporal, loop: 64 bytes */
+static void memcpy_lskip_snt_64(void *to, const void *from, size_t size)
+{
+	u64 bs = 64;
+	size_t i;
+
+	BUG_ON(!IS_ALIGNED(size, bs));
+	BUG_ON(!IS_ALIGNED((u64)to, bs));
+	BUG_ON(!IS_ALIGNED((u64)from, bs));
+
+	for (i = 0; i < size; i += bs) {
+		__asm__ __volatile__ (
+#if 0
+		"movntdq %%xmm0, (%1)\n"
+		"movntdq %%xmm1, 16(%1)\n"
+		"movntdq %%xmm2, 32(%1)\n"
+		"movntdq %%xmm3, 48(%1)\n"
+#endif
+		"vmovntdq %%ymm0, (%1)\n"
+		"vmovntdq %%ymm1, 32(%1)\n"
+		:
+		: "r" (from), "r" (to)
+		: "memory");
+
+		to += bs;
+		from += bs;
+	}
+
+	__asm__ __volatile__ (
+		" sfence\n" : :
+	);
+}
+
+/* load: non-temporal, store: non-temporal, loop: 64 bytes */
+static void memcpy_lnt_snt_64(void *to, const void *from, size_t size)
+{
+	u64 bs = 64;
+	size_t i;
+
+	BUG_ON(!IS_ALIGNED(size, bs));
+	BUG_ON(!IS_ALIGNED((u64)to, bs));
+	BUG_ON(!IS_ALIGNED((u64)from, bs));
+
+	for (i = 0; i < size; i += bs) {
+		__asm__ __volatile__ (
+#if 0
+		"movntdqa (%0), %%xmm0\n"
+		"movntdqa 16(%0), %%xmm1\n"
+		"movntdqa 32(%0), %%xmm2\n"
+		"movntdqa 48(%0), %%xmm3\n"
+		"movntdq %%xmm0, (%1)\n"
+		"movntdq %%xmm1, 16(%1)\n"
+		"movntdq %%xmm2, 32(%1)\n"
+		"movntdq %%xmm3, 48(%1)\n"
+#endif
+		"vmovntdqa (%0), %%ymm0\n"
+		"vmovntdqa 32(%0), %%ymm1\n"
+		"vmovntdq %%ymm0, (%1)\n"
+		"vmovntdq %%ymm1, 32(%1)\n"
+		:
+		: "r" (from), "r" (to)
+		: "memory");
+
+		to += bs;
+		from += bs;
+	}
+
+	__asm__ __volatile__ (
+		" sfence\n" : :
+	);
+}
+
+/* load: normal, store: non-temporal, loop: 128 bytes */
+static void memcpy_lt_snt_128(void *to, const void *from, size_t size)
+{
+	u64 bs = 128;
+	size_t i;
+
+	BUG_ON(!IS_ALIGNED(size, bs));
+	BUG_ON(!IS_ALIGNED((u64)to, bs));
+	BUG_ON(!IS_ALIGNED((u64)from, bs));
+
+	for (i = 0; i < size; i += bs) {
+		__asm__ __volatile__ (
+#if 0
+		/* hard to use prefetch effectively */
+		"prefetchnta 128(%0)\n"
+		"prefetchnta 192(%0)\n"
+#endif
+#if 0
+		"movdqa (%0), %%xmm0\n"
+		"movdqa 16(%0), %%xmm1\n"
+		"movdqa 32(%0), %%xmm2\n"
+		"movdqa 48(%0), %%xmm3\n"
+		"movdqa 64(%0), %%xmm4\n"
+		"movdqa 80(%0), %%xmm5\n"
+		"movdqa 96(%0), %%xmm6\n"
+		"movdqa 112(%0), %%xmm7\n"
+		"movntdq %%xmm0, (%1)\n"
+		"movntdq %%xmm1, 16(%1)\n"
+		"movntdq %%xmm2, 32(%1)\n"
+		"movntdq %%xmm3, 48(%1)\n"
+		"movntdq %%xmm4, 64(%1)\n"
+		"movntdq %%xmm5, 80(%1)\n"
+		"movntdq %%xmm6, 96(%1)\n"
+		"movntdq %%xmm7, 112(%1)\n"
+#endif
+		"vmovdqa (%0), %%ymm0\n"
+		"vmovdqa 32(%0), %%ymm1\n"
+		"vmovdqa 64(%0), %%ymm2\n"
+		"vmovdqa 96(%0), %%ymm3\n"
+		"vmovntdq %%ymm0, (%1)\n"
+		"vmovntdq %%ymm1, 32(%1)\n"
+		"vmovntdq %%ymm2, 64(%1)\n"
+		"vmovntdq %%ymm3, 96(%1)\n"
+		:
+		: "r" (from), "r" (to)
+		: "memory");
+
+		to += bs;
+		from += bs;
+	}
+
+	__asm__ __volatile__ (
+		" sfence\n" : :
+	);
+}
+
+/* load: non-temporal, store: non-temporal, loop: 128 bytes */
+static void memcpy_lnt_snt_128(void *to, const void *from, size_t size)
+{
+	u64 bs = 128;
+	size_t i;
+
+	BUG_ON(!IS_ALIGNED(size, bs));
+	BUG_ON(!IS_ALIGNED((u64)to, bs));
+	BUG_ON(!IS_ALIGNED((u64)from, bs));
+
+	for (i = 0; i < size; i += bs) {
+		__asm__ __volatile__ (
+#if 0
+		"prefetchnta 128(%0)\n"
+		"prefetchnta 192(%0)\n"
+#endif
+#if 0
+		"movntdqa (%0), %%xmm0\n"
+		"movntdqa 16(%0), %%xmm1\n"
+		"movntdqa 32(%0), %%xmm2\n"
+		"movntdqa 48(%0), %%xmm3\n"
+		"movntdqa 64(%0), %%xmm4\n"
+		"movntdqa 80(%0), %%xmm5\n"
+		"movntdqa 96(%0), %%xmm6\n"
+		"movntdqa 112(%0), %%xmm7\n"
+		"movntdq %%xmm0, (%1)\n"
+		"movntdq %%xmm1, 16(%1)\n"
+		"movntdq %%xmm2, 32(%1)\n"
+		"movntdq %%xmm3, 48(%1)\n"
+		"movntdq %%xmm4, 64(%1)\n"
+		"movntdq %%xmm5, 80(%1)\n"
+		"movntdq %%xmm6, 96(%1)\n"
+		"movntdq %%xmm7, 112(%1)\n"
+#endif
+		"vmovntdqa (%0), %%ymm0\n"
+		"vmovntdqa 32(%0), %%ymm1\n"
+		"vmovntdqa 64(%0), %%ymm2\n"
+		"vmovntdqa 96(%0), %%ymm3\n"
+		"vmovntdq %%ymm0, (%1)\n"
+		"vmovntdq %%ymm1, 32(%1)\n"
+		"vmovntdq %%ymm2, 64(%1)\n"
+		"vmovntdq %%ymm3, 96(%1)\n"
+		:
+		: "r" (from), "r" (to)
+		: "memory");
+
+		to += bs;
+		from += bs;
+	}
+
+	__asm__ __volatile__ (
+		" sfence\n" : :
+	);
+}
+
+/* load: non-temporal, store: normal, loop: 64 bytes */
+static void memcpy_lnt_st_64(void *to, const void *from, size_t size)
+{
+	u64 bs = 64;
+	size_t i;
+
+	BUG_ON(!IS_ALIGNED(size, bs));
+	BUG_ON(!IS_ALIGNED((u64)to, bs));
+	BUG_ON(!IS_ALIGNED((u64)from, bs));
+
+	for (i = 0; i < size; i += bs) {
+		__asm__ __volatile__ (
+#if 0
+		"movntdqa (%0), %%xmm0\n"
+		"movntdqa 16(%0), %%xmm1\n"
+		"movntdqa 32(%0), %%xmm2\n"
+		"movntdqa 48(%0), %%xmm3\n"
+		"movdqa %%xmm0, (%1)\n"
+		"movdqa %%xmm1, 16(%1)\n"
+		"movdqa %%xmm2, 32(%1)\n"
+		"movdqa %%xmm3, 48(%1)\n"
+#endif
+		"vmovntdqa (%0), %%ymm0\n"
+		"vmovntdqa 32(%0), %%ymm1\n"
+		"vmovdqa %%ymm0, (%1)\n"
+		"vmovdqa %%ymm1, 32(%1)\n"
+		:
+		: "r" (from), "r" (to)
+		: "memory");
+
+		to += bs;
+		from += bs;
+	}
+
+	__asm__ __volatile__ (
+		" sfence\n" : :
+	);
+}
+
+/* load: non-temporal, store: skip, loop: 64 bytes */
+static void memcpy_lnt_sskip_64(void *to, const void *from, size_t size)
+{
+	u64 bs = 64;
+	size_t i;
+
+	BUG_ON(!IS_ALIGNED(size, bs));
+	BUG_ON(!IS_ALIGNED((u64)to, bs));
+	BUG_ON(!IS_ALIGNED((u64)from, bs));
+
+	for (i = 0; i < size; i += bs) {
+		__asm__ __volatile__ (
+#if 0
+		"movntdqa (%0), %%xmm0\n"
+		"movntdqa 16(%0), %%xmm1\n"
+		"movntdqa 32(%0), %%xmm2\n"
+		"movntdqa 48(%0), %%xmm3\n"
+#endif
+		"vmovntdqa (%0), %%ymm0\n"
+		"vmovntdqa 32(%0), %%ymm1\n"
+		:
+		: "r" (from), "r" (to)
+		: "memory");
+
+		to += bs;
+		from += bs;
+	}
+
+	__asm__ __volatile__ (
+		" sfence\n" : :
+	);
+}
+
+/* load: non-temporal, store: normal, loop: 128 bytes */
+static void memcpy_lnt_st_128(void *to, const void *from, size_t size)
+{
+	u64 bs = 128;
+	size_t i;
+
+	BUG_ON(!IS_ALIGNED(size, bs));
+	BUG_ON(!IS_ALIGNED((u64)to, bs));
+	BUG_ON(!IS_ALIGNED((u64)from, bs));
+
+	for (i = 0; i < size; i += bs) {
+		__asm__ __volatile__ (
+#if 0
+		"prefetchnta 128(%0)\n"
+		"prefetchnta 192(%0)\n"
+#endif
+#if 0
+		"movntdqa (%0), %%xmm0\n"
+		"movntdqa 16(%0), %%xmm1\n"
+		"movntdqa 32(%0), %%xmm2\n"
+		"movntdqa 48(%0), %%xmm3\n"
+		"movntdqa 64(%0), %%xmm4\n"
+		"movntdqa 80(%0), %%xmm5\n"
+		"movntdqa 96(%0), %%xmm6\n"
+		"movntdqa 112(%0), %%xmm7\n"
+		"movdqa %%xmm0, (%1)\n"
+		"movdqa %%xmm1, 16(%1)\n"
+		"movdqa %%xmm2, 32(%1)\n"
+		"movdqa %%xmm3, 48(%1)\n"
+		"movdqa %%xmm4, 64(%1)\n"
+		"movdqa %%xmm5, 80(%1)\n"
+		"movdqa %%xmm6, 96(%1)\n"
+		"movdqa %%xmm7, 112(%1)\n"
+#endif
+		"vmovntdqa (%0), %%ymm0\n"
+		"vmovntdqa 32(%0), %%ymm1\n"
+		"vmovntdqa 64(%0), %%ymm2\n"
+		"vmovntdqa 96(%0), %%ymm3\n"
+		"vmovdqa %%ymm0, (%1)\n"
+		"vmovdqa %%ymm1, 32(%1)\n"
+		"vmovdqa %%ymm2, 64(%1)\n"
+		"vmovdqa %%ymm3, 96(%1)\n"
+		:
+		: "r" (from), "r" (to)
+		: "memory");
+
+		to += bs;
+		from += bs;
+	}
+
+	__asm__ __volatile__ (
+		" sfence\n" : :
+	);
+}
+
 struct pmem_device {
 	struct request_queue	*pmem_queue;
 	struct gendisk		*pmem_disk;
@@ -37,6 +413,81 @@ struct pmem_device {
 	size_t			size;
 };
 
+/* pick the type of memcpy for a read from NVDIMMs */
+static void memcpy_ntr(void *to, const void *from, size_t size)
+{
+	switch (pmem_ntr) {
+	case 1:
+		memcpy_lnt_st_64(to, from, size);
+		break;
+	case 2:
+		memcpy_lnt_st_128(to, from, size);
+		break;
+	case 3:
+		memcpy_lnt_snt_64(to, from, size);
+		break;
+	case 4:
+		memcpy_lnt_snt_128(to, from, size);
+		break;
+	case 6:
+		/* nop */
+		break;
+	case 7:
+		memcpy_lnt_sskip_64(to, from, size);
+		break;
+	default:
+		memcpy(to, from, size);
+		break;
+	}
+}
+
+/* pick the type of memcpy for a write to NVDIMMs */
+static void memcpy_ntw(void *to, const void *from, size_t size)
+{
+	int ret;
+	switch (pmem_ntw) {
+	case 1:
+		memcpy_lt_snt_64(to, from, size);
+		ret = 0;
+		break;
+	case 2:
+		memcpy_lt_snt_128(to, from, size);
+		ret = 0;
+		break;
+	case 3:
+		memcpy_lnt_snt_64(to, from, size);
+		ret = 0;
+		break;
+	case 4:
+		memcpy_lnt_snt_128(to, from, size);
+		ret = 0;
+		break;
+	case 5:
+		ret = __copy_from_user(to, from, size);
+		if (ret)
+			goto exit;
+	case 6:		/* case 5 falls through here on success */
+		/* nop */
+		ret = 0;
+		break;
+	case 7:
+		memcpy_lskip_snt_64(to, from, size);
+		ret = 0;
+		break;
+	default:
+		memcpy(to, from, size);
+		ret = 0;
+		break;
+	}
+exit:
+	/* If __copy_from_user or another memcpy variant with a return
+	 * value is used, the return value should really be propagated
+	 * upstream.  Since most memcpy interfaces assume success,
+	 * forgo this for now.
+	 */
+	return;
+}
+
 static int pmem_major;
 
 static void pmem_do_bvec(struct pmem_device *pmem, struct page *page,
@@ -47,11 +498,11 @@ static void pmem_do_bvec(struct pmem_device *pmem, struct page *page,
 	size_t pmem_off = sector << 9;
 
 	if (rw == READ) {
-		memcpy(mem + off, pmem->virt_addr + pmem_off, len);
+		memcpy_ntr(mem + off, pmem->virt_addr + pmem_off, len);
 		flush_dcache_page(page);
 	} else {
 		flush_dcache_page(page);
-		memcpy(pmem->virt_addr + pmem_off, mem + off, len);
+		memcpy_ntw(pmem->virt_addr + pmem_off, mem + off, len);
 	}
 
 	kunmap_atomic(mem);
@@ -109,10 +560,26 @@ static int pmem_rw_bytes(struct nd_io *ndio, void *buf, size_t offset,
 		return -EFAULT;
 	}
 
-	if (rw == READ)
-		memcpy(buf, pmem->virt_addr + offset, n);
-	else
-		memcpy(pmem->virt_addr + offset, buf, n);
+	/* NOTE: Plain memcpy is used for unaligned accesses, meaning
+	 * this is not safe for WB mode.
+	 *
+	 * All btt accesses come through here; many are not aligned.
+	 */
+	if (rw == READ) {
+		if (IS_ALIGNED((u64) buf, 64) &&
+		    IS_ALIGNED((u64) pmem->virt_addr + offset, 64) &&
+		    IS_ALIGNED(n, 64))
+			memcpy_ntr(buf, pmem->virt_addr + offset, n);
+		else
+			memcpy(buf, pmem->virt_addr + offset, n);
+	} else {
+		if (IS_ALIGNED((u64) buf, 64) &&
+		    IS_ALIGNED((u64) pmem->virt_addr + offset, 64) &&
+		    IS_ALIGNED(n, 64))
+			memcpy_ntw(pmem->virt_addr + offset, buf, n);
+		else
+			memcpy(pmem->virt_addr + offset, buf, n);
+	}
 
 	return 0;
 }
@@ -143,6 +610,7 @@ static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res,
 	struct pmem_device *pmem;
 	struct gendisk *disk;
 	int err;
+	u64 ts, te;
 
 	err = -ENOMEM;
 	pmem = kzalloc(sizeof(*pmem), GFP_KERNEL);
@@ -152,21 +620,78 @@ static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res,
 	pmem->phys_addr = res->start;
 	pmem->size = resource_size(res);
 
+	dev_info(dev,
+		"mapping phys=0x%llx (%lld GiB) size=0x%zx (%ld GiB)\n",
+		pmem->phys_addr, pmem->phys_addr / (1024*1024*1024),
+		pmem->size, pmem->size / (1024*1024*1024));
+
 	err = -EINVAL;
 	if (!request_mem_region(pmem->phys_addr, pmem->size, "pmem")) {
 		dev_warn(dev, "could not reserve region [0x%pa:0x%zx]\n", &pmem->phys_addr, pmem->size);
 		goto out_free_dev;
 	}
 
-	/*
-	 * Map the memory as non-cachable, as we can't write back the contents
-	 * of the CPU caches in case of a crash.
-	 */
 	err = -ENOMEM;
-	pmem->virt_addr = ioremap_nocache(pmem->phys_addr, pmem->size);
+	switch (pmem_cachetype) {
+	case 0: /* UC */
+		pmem->virt_addr = ioremap_nocache(pmem->phys_addr, pmem->size);
+		break;
+	case 1: /* WB */
+		/* WB is unsafe unless system flushes caches on power loss */
+		pmem->virt_addr = ioremap_cache(pmem->phys_addr, pmem->size);
+		break;
+	case 2: /* WC */
+		/* WC is unsafe unless system flushes buffers on power loss */
+		pmem->virt_addr = ioremap_wc(pmem->phys_addr, pmem->size);
+		break;
+	case 3: /* WT */
+	default:
+		pmem->virt_addr = ioremap_wt(pmem->phys_addr, pmem->size);
+		break;
+	}
+
+	dev_info(dev,
+		"mapped: cache_type=%d virt=0x%p phys=0x%llx (%lld GiB) size=0x%zx (%ld GiB)\n",
+		pmem_cachetype,
+		pmem->virt_addr,
+		pmem->phys_addr, pmem->phys_addr / (1024*1024*1024),
+		pmem->size, pmem->size / (1024*1024*1024));
+
 	if (!pmem->virt_addr)
 		goto out_release_region;
 
+	if (pmem_clean) {
+		/* write all of NVDIMM memory to clear any ECC errors */
+		dev_info(dev,
+			"write clean starting: virt=0x%p phys=0x%llx (%lld GiB) size=0x%zx (%ld GiB)\n",
+			pmem->virt_addr,
+			pmem->phys_addr, pmem->phys_addr / (1024*1024*1024),
+			pmem->size, pmem->size / (1024*1024*1024));
+		ts = local_clock();
+		memcpy_lskip_snt_64(pmem->virt_addr, NULL, pmem->size);
+		te = local_clock();
+		dev_info(dev,
+			"write clean complete: ct=%d in %lld GB/s\n",
+			pmem_cachetype,
+			pmem->size / (te - ts));	/* B/ns equals GB/s */
+	}
+
+	/* read all of NVDIMM memory to trigger any ECC errors now */
+	if (pmem_readscan) {
+		dev_info(dev,
+			"read scan starting: virt=0x%p phys=0x%llx (%lld GiB) size=0x%zx (%ld GiB)\n",
+			pmem->virt_addr,
+			pmem->phys_addr, pmem->phys_addr / (1024*1024*1024),
+			pmem->size, pmem->size / (1024*1024*1024));
+		ts = local_clock();
+		memcpy_lnt_sskip_64(NULL, pmem->virt_addr, pmem->size);
+		te = local_clock();
+		dev_info(dev,
+			"read scan complete: ct=%d in %lld GB/s\n",
+			pmem_cachetype,
+			pmem->size / (te - ts));	/* B/ns equals GB/s */
+	}
+
 	pmem->pmem_queue = blk_alloc_queue(GFP_KERNEL);
 	if (!pmem->pmem_queue)
 		goto out_unmap;
@@ -276,6 +801,9 @@ static int __init pmem_init(void)
 {
 	int error;
 
+	pr_info("pmem loading with pmem_readscan=%d pmem_clean=%d pmem_cachetype=%d pmem_ntw=%d pmem_ntr=%d\n",
+		pmem_readscan, pmem_clean, pmem_cachetype, pmem_ntw, pmem_ntr);
+
 	pmem_major = register_blkdev(0, "pmem");
 	if (pmem_major < 0)
 		return pmem_major;
-- 
1.8.3.1

