* [patch 0/4] dm-writecache patches
@ 2018-05-19  5:25 Mikulas Patocka
  2018-05-19  5:25 ` [patch 1/4] x86: optimize memcpy_flushcache Mikulas Patocka
                   ` (3 more replies)
  0 siblings, 4 replies; 108+ messages in thread
From: Mikulas Patocka @ 2018-05-19  5:25 UTC (permalink / raw)
  To: Mikulas Patocka, Mike Snitzer, Dan Williams; +Cc: dm-devel

Hi

Here I'm sending the dm-writecache patches.

The first patch optimizes x86 memcpy_flushcache for small constant sizes.
It increases dm-writecache throughput by about 2%. It should already be in
Dan Williams' tree.

The second patch exports __prepare_to_swait and __finish_swait.

The third patch is dm-writecache that is already in Mike's tree.

The fourth patch converts it to use the new API. The pmem_* API is at the
beginning of the file dm-writecache.c; it may eventually be moved to system
include files.

Mikulas

* [patch 1/4] x86: optimize memcpy_flushcache
  2018-05-19  5:25 [patch 0/4] dm-writecache patches Mikulas Patocka
@ 2018-05-19  5:25 ` Mikulas Patocka
  2018-05-19 14:21   ` Dan Williams
  2018-05-19  5:25 ` [patch 2/4] swait: export the symbols __prepare_to_swait and __finish_swait Mikulas Patocka
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 108+ messages in thread
From: Mikulas Patocka @ 2018-05-19  5:25 UTC (permalink / raw)
  To: Mikulas Patocka, Mike Snitzer, Dan Williams; +Cc: dm-devel

[-- Attachment #1: memcpy_flushcache-optimization.patch --]
[-- Type: text/plain, Size: 2651 bytes --]

I use memcpy_flushcache in my persistent memory driver for metadata
updates, and it turns out that the overhead of memcpy_flushcache causes a
2% performance degradation compared to the "movnti" instruction explicitly
coded using inline assembler.

This patch recognizes memcpy_flushcache calls with a short constant length
and turns them into inline assembler, so that I don't have to use inline
assembler in the driver.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 arch/x86/include/asm/string_64.h |   20 +++++++++++++++++++-
 arch/x86/lib/usercopy_64.c       |    4 ++--
 2 files changed, 21 insertions(+), 3 deletions(-)
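
For illustration, here is a minimal sketch of the kind of call site this
helps (the function and field names are made up for the example, they are
not taken from the driver):

	/* hypothetical pmem metadata update in a driver */
	static void update_seq_count(u64 *pmem_field, u64 value)
	{
		/*
		 * sizeof(value) is a compile-time constant 8, so with this
		 * patch the call compiles to a single movntiq store instead
		 * of a call to __memcpy_flushcache().
		 */
		memcpy_flushcache(pmem_field, &value, sizeof(value));
	}

Note that movnti is a non-temporal store that bypasses the cache, so the
data still has to be fenced with wmb()/sfence before it may be considered
committed; the dm-writecache target does exactly that (see
persistent_memory_commit_flushed() in patch 3).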

Index: linux-2.6/arch/x86/include/asm/string_64.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/string_64.h	2018-05-18 21:21:15.000000000 +0200
+++ linux-2.6/arch/x86/include/asm/string_64.h	2018-05-18 21:21:15.000000000 +0200
@@ -147,7 +147,25 @@ memcpy_mcsafe(void *dst, const void *src
 
 #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
 #define __HAVE_ARCH_MEMCPY_FLUSHCACHE 1
-void memcpy_flushcache(void *dst, const void *src, size_t cnt);
+void __memcpy_flushcache(void *dst, const void *src, size_t cnt);
+static __always_inline void memcpy_flushcache(void *dst, const void *src, size_t cnt)
+{
+	if (__builtin_constant_p(cnt)) {
+		switch (cnt) {
+			case 4:
+				asm ("movntil %1, %0" : "=m"(*(u32 *)dst) : "r"(*(u32 *)src));
+				return;
+			case 8:
+				asm ("movntiq %1, %0" : "=m"(*(u64 *)dst) : "r"(*(u64 *)src));
+				return;
+			case 16:
+				asm ("movntiq %1, %0" : "=m"(*(u64 *)dst) : "r"(*(u64 *)src));
+				asm ("movntiq %1, %0" : "=m"(*(u64 *)(dst + 8)) : "r"(*(u64 *)(src + 8)));
+				return;
+		}
+	}
+	__memcpy_flushcache(dst, src, cnt);
+}
 #endif
 
 #endif /* __KERNEL__ */
Index: linux-2.6/arch/x86/lib/usercopy_64.c
===================================================================
--- linux-2.6.orig/arch/x86/lib/usercopy_64.c	2018-05-18 21:21:15.000000000 +0200
+++ linux-2.6/arch/x86/lib/usercopy_64.c	2018-05-18 22:09:49.000000000 +0200
@@ -133,7 +133,7 @@ long __copy_user_flushcache(void *dst, c
 	return rc;
 }
 
-void memcpy_flushcache(void *_dst, const void *_src, size_t size)
+void __memcpy_flushcache(void *_dst, const void *_src, size_t size)
 {
 	unsigned long dest = (unsigned long) _dst;
 	unsigned long source = (unsigned long) _src;
@@ -196,7 +196,7 @@ void memcpy_flushcache(void *_dst, const
 		clean_cache_range((void *) dest, size);
 	}
 }
-EXPORT_SYMBOL_GPL(memcpy_flushcache);
+EXPORT_SYMBOL_GPL(__memcpy_flushcache);
 
 void memcpy_page_flushcache(char *to, struct page *page, size_t offset,
 		size_t len)

* [patch 2/4] swait: export the symbols __prepare_to_swait and __finish_swait
  2018-05-19  5:25 [patch 0/4] dm-writecache patches Mikulas Patocka
  2018-05-19  5:25 ` [patch 1/4] x86: optimize memcpy_flushcache Mikulas Patocka
@ 2018-05-19  5:25 ` Mikulas Patocka
  2018-05-22  6:34   ` Christoph Hellwig
  2018-05-19  5:25 ` [patch 3/4] dm-writecache Mikulas Patocka
  2018-05-19  5:25 ` [patch 4/4] dm-writecache: use new API for flushing Mikulas Patocka
  3 siblings, 1 reply; 108+ messages in thread
From: Mikulas Patocka @ 2018-05-19  5:25 UTC (permalink / raw)
  To: Mikulas Patocka, Mike Snitzer, Dan Williams; +Cc: dm-devel

[-- Attachment #1: export-__finish_swait-__prepare_to_swait.patch --]
[-- Type: text/plain, Size: 1265 bytes --]

In order to reduce locking overhead, I use the spinlock in
swait_queue_head to protect not only the wait queue, but also the list of
events. Consequently, I need to use the unlocked functions
__prepare_to_swait and __finish_swait. These functions are declared in
include/linux/swait.h, but they are not exported, so they are not usable
from kernel modules.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 kernel/sched/swait.c |    2 ++
 1 file changed, 2 insertions(+)
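
For context, a minimal sketch of the pattern these exports enable (the
queue and list names are illustrative): the caller holds the
swait_queue_head's raw spinlock itself, because that lock also protects
its private event list, so it cannot call prepare_to_swait() and
finish_swait(), which take the same lock internally and would deadlock:

	DECLARE_SWAITQUEUE(wait);

	raw_spin_lock_irq(&q->lock);
	while (list_empty(&my_event_list)) {
		set_current_state(TASK_INTERRUPTIBLE);
		__prepare_to_swait(q, &wait);	/* q->lock already held */
		raw_spin_unlock_irq(&q->lock);
		schedule();
		raw_spin_lock_irq(&q->lock);
		__finish_swait(q, &wait);	/* q->lock already held */
	}
	raw_spin_unlock_irq(&q->lock);

The same pattern can be seen in writecache_endio_thread() in patch 3.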

Index: linux-2.6/kernel/sched/swait.c
===================================================================
--- linux-2.6.orig/kernel/sched/swait.c	2018-04-16 21:10:05.000000000 +0200
+++ linux-2.6/kernel/sched/swait.c	2018-04-16 21:10:05.000000000 +0200
@@ -75,6 +75,7 @@ void __prepare_to_swait(struct swait_que
 	if (list_empty(&wait->task_list))
 		list_add(&wait->task_list, &q->task_list);
 }
+EXPORT_SYMBOL(__prepare_to_swait);
 
 void prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait, int state)
 {
@@ -104,6 +105,7 @@ void __finish_swait(struct swait_queue_h
 	if (!list_empty(&wait->task_list))
 		list_del_init(&wait->task_list);
 }
+EXPORT_SYMBOL(__finish_swait);
 
 void finish_swait(struct swait_queue_head *q, struct swait_queue *wait)
 {

* [patch 3/4] dm-writecache
  2018-05-19  5:25 [patch 0/4] dm-writecache patches Mikulas Patocka
  2018-05-19  5:25 ` [patch 1/4] x86: optimize memcpy_flushcache Mikulas Patocka
  2018-05-19  5:25 ` [patch 2/4] swait: export the symbols __prepare_to_swait and __finish_swait Mikulas Patocka
@ 2018-05-19  5:25 ` Mikulas Patocka
  2018-05-22  6:37   ` Christoph Hellwig
  2018-05-19  5:25 ` [patch 4/4] dm-writecache: use new API for flushing Mikulas Patocka
  3 siblings, 1 reply; 108+ messages in thread
From: Mikulas Patocka @ 2018-05-19  5:25 UTC (permalink / raw)
  To: Mikulas Patocka, Mike Snitzer, Dan Williams; +Cc: dm-devel

[-- Attachment #1: dm-writecache.patch --]
[-- Type: text/plain, Size: 67950 bytes --]

The dm-writecache target: it caches writes on persistent memory or SSD and
writes them back to the origin device in the background.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 Documentation/device-mapper/writecache.txt |   68 
 drivers/md/Kconfig                         |   11 
 drivers/md/Makefile                        |    1 
 drivers/md/dm-writecache.c                 | 2414 +++++++++++++++++++++++++++++
 4 files changed, 2494 insertions(+)
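
For anyone who wants to test it: assuming the constructor arguments
described in the writecache.txt added by this patch (type p/s, origin
device, cache device, block size, number of optional arguments), a table
line for a pmem-backed cache would look something like this:

	dmsetup create wc --table "0 `blockdev --getsz /dev/origin` writecache p /dev/origin /dev/pmem0 4096 0"

The target caches writes only; reads that miss the cache are remapped
directly to the origin device (see writecache_map() below).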

Index: linux-2.6/drivers/md/Kconfig
===================================================================
--- linux-2.6.orig/drivers/md/Kconfig	2018-05-15 07:09:32.000000000 +0200
+++ linux-2.6/drivers/md/Kconfig	2018-05-15 07:09:32.000000000 +0200
@@ -334,6 +334,17 @@ config DM_CACHE_SMQ
          of less memory utilization, improved performance and increased
          adaptability in the face of changing workloads.
 
+config DM_WRITECACHE
+	tristate "Writecache target"
+	depends on BLK_DEV_DM
+	---help---
+	   The writecache target caches writes on persistent memory or SSD.
+	   It is intended for databases or other programs that need extremely
+	   low commit latency.
+
+	   The writecache target doesn't cache reads because reads are supposed
+	   to be cached in standard RAM.
+
 config DM_ERA
        tristate "Era target (EXPERIMENTAL)"
        depends on BLK_DEV_DM
Index: linux-2.6/drivers/md/Makefile
===================================================================
--- linux-2.6.orig/drivers/md/Makefile	2018-05-15 07:09:32.000000000 +0200
+++ linux-2.6/drivers/md/Makefile	2018-05-15 07:09:32.000000000 +0200
@@ -67,6 +67,7 @@ obj-$(CONFIG_DM_ERA)		+= dm-era.o
 obj-$(CONFIG_DM_LOG_WRITES)	+= dm-log-writes.o
 obj-$(CONFIG_DM_INTEGRITY)	+= dm-integrity.o
 obj-$(CONFIG_DM_ZONED)		+= dm-zoned.o
+obj-$(CONFIG_DM_WRITECACHE)	+= dm-writecache.o
 
 ifeq ($(CONFIG_DM_UEVENT),y)
 dm-mod-objs			+= dm-uevent.o
Index: linux-2.6/drivers/md/dm-writecache.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/drivers/md/dm-writecache.c	2018-05-17 02:46:44.000000000 +0200
@@ -0,0 +1,2414 @@
+/*
+ * Copyright (C) 2018 Red Hat. All rights reserved.
+ *
+ * This file is released under the GPL.
+ */
+
+#include <linux/device-mapper.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/vmalloc.h>
+#include <linux/kthread.h>
+#include <linux/swait.h>
+#include <linux/dm-io.h>
+#include <linux/dm-kcopyd.h>
+#include <linux/dax.h>
+#include <linux/pfn_t.h>
+
+#define DM_MSG_PREFIX "writecache"
+
+#define HIGH_WATERMARK			50
+#define LOW_WATERMARK			45
+#define MAX_WRITEBACK_JOBS		0
+#define ENDIO_LATENCY			16
+#define WRITEBACK_LATENCY		64
+#define AUTOCOMMIT_BLOCKS_SSD		65536
+#define AUTOCOMMIT_BLOCKS_PMEM		64
+#define AUTOCOMMIT_MSEC			1000
+
+/*
+ * If the architecture doesn't support persistent memory, we can use this driver
+ * in SSD-only mode.
+ */
+#ifndef CONFIG_ARCH_HAS_PMEM_API
+#define DM_WRITECACHE_ONLY_SSD
+#endif
+
+//#define WC_MEASURE_LATENCY
+
+#define BITMAP_GRANULARITY	65536
+#if BITMAP_GRANULARITY < PAGE_SIZE
+#undef BITMAP_GRANULARITY
+#define BITMAP_GRANULARITY	PAGE_SIZE
+#endif
+
+/*
+ * On X86, non-temporal stores are more efficient than cache flushing.
+ * On ARM64, cache flushing is more efficient.
+ */
+#if defined(CONFIG_X86_64)
+#define EAGER_DATA_FLUSH
+#define NT_STORE(dest, src)				\
+do {							\
+	typeof(src) val = (src);			\
+	memcpy_flushcache(&(dest), &val, sizeof(src));	\
+} while (0)
+#else
+#define NT_STORE(dest, src)	WRITE_ONCE(dest, src)
+#endif
+
+#if defined(__HAVE_ARCH_MEMCPY_MCSAFE) && !defined(DM_WRITECACHE_ONLY_SSD)
+#define DM_WRITECACHE_HANDLE_HARDWARE_ERRORS
+#endif
+
+#define MEMORY_SUPERBLOCK_MAGIC		0x23489321
+#define MEMORY_SUPERBLOCK_VERSION	1
+
+struct wc_memory_entry {
+	__le64 original_sector;
+	__le64 seq_count;
+};
+
+struct wc_memory_superblock {
+	union {
+		struct {
+			__le32 magic;
+			__le32 version;
+			__le32 block_size;
+			__le32 pad;
+			__le64 n_blocks;
+			__le64 seq_count;
+		};
+		__le64 padding[8];
+	};
+	struct wc_memory_entry entries[0];
+};
+
+struct wc_entry {
+	struct rb_node rb_node;
+	struct list_head lru;
+	unsigned short wc_list_contiguous;
+	bool write_in_progress
+#if BITS_PER_LONG == 64
+		:1
+#endif
+	;
+	unsigned long index
+#if BITS_PER_LONG == 64
+		:47
+#endif
+	;
+#ifdef DM_WRITECACHE_HANDLE_HARDWARE_ERRORS
+	uint64_t original_sector;
+	uint64_t seq_count;
+#endif
+};
+
+#ifndef DM_WRITECACHE_ONLY_SSD
+#define WC_MODE_PMEM(wc)			((wc)->pmem_mode)
+#define WC_MODE_FUA(wc)				((wc)->writeback_fua)
+#else
+#define WC_MODE_PMEM(wc)			false
+#define WC_MODE_FUA(wc)				false
+#endif
+#define WC_MODE_SORT_FREELIST(wc)		(!WC_MODE_PMEM(wc))
+
+struct dm_writecache {
+#ifndef DM_WRITECACHE_ONLY_SSD
+	bool pmem_mode;
+	bool writeback_fua;
+#endif
+	struct mutex lock;
+	struct rb_root tree;
+	struct list_head lru;
+	union {
+		struct list_head freelist;
+		struct {
+			struct rb_root freetree;
+			struct wc_entry *current_free;
+		};
+	};
+	size_t freelist_size;
+	size_t writeback_size;
+	unsigned uncommitted_blocks;
+	unsigned autocommit_blocks;
+	unsigned max_writeback_jobs;
+	size_t freelist_high_watermark;
+	size_t freelist_low_watermark;
+	struct timer_list autocommit_timer;
+	unsigned long autocommit_jiffies;
+	struct swait_queue_head freelist_wait;
+
+	struct dm_target *ti;
+	struct dm_dev *dev;
+	struct dm_dev *ssd_dev;
+	void *memory_map;
+	uint64_t memory_map_size;
+	size_t metadata_sectors;
+	void *block_start;
+	struct wc_entry *entries;
+	unsigned block_size;
+	unsigned char block_size_bits;
+	size_t n_blocks;
+	uint64_t seq_count;
+	int error;
+
+	bool overwrote_committed;
+	bool memory_vmapped;
+
+	atomic_t bio_in_progress[2];
+	struct swait_queue_head bio_in_progress_wait[2];
+
+	struct dm_io_client *dm_io;
+
+	unsigned writeback_all;
+	struct workqueue_struct *writeback_wq;
+	struct work_struct writeback_work;
+	struct work_struct flush_work;
+
+	struct swait_queue_head endio_thread_wait;
+	struct list_head endio_list;
+	struct task_struct *endio_thread;
+
+	struct task_struct *flush_thread;
+	struct bio *flush_bio;
+	struct completion flush_completion;
+
+	struct bio_set *bio_set;
+	mempool_t *copy_pool;
+
+	struct dm_kcopyd_client *dm_kcopyd;
+	unsigned long *dirty_bitmap;
+	unsigned dirty_bitmap_size;
+
+	bool high_wm_percent_set;
+	bool low_wm_percent_set;
+	bool max_writeback_jobs_set;
+	bool autocommit_blocks_set;
+	bool autocommit_time_set;
+	bool writeback_fua_set;
+	bool flush_on_suspend;
+
+#ifdef WC_MEASURE_LATENCY
+	ktime_t lock_acquired_time;
+	ktime_t max_lock_held;
+	ktime_t max_lock_wait;
+	ktime_t max_freelist_wait;
+	ktime_t measure_latency_time;
+	ktime_t max_measure_latency;
+#endif
+};
+
+#define WB_LIST_INLINE		16
+
+struct writeback_struct {
+	struct list_head endio_entry;
+	struct dm_writecache *wc;
+	struct wc_entry **wc_list;
+	unsigned wc_list_n;
+	unsigned page_offset;
+	struct page *page;
+	struct wc_entry *wc_list_inline[WB_LIST_INLINE];
+	struct bio bio;
+};
+
+struct copy_struct {
+	struct list_head endio_entry;
+	struct dm_writecache *wc;
+	struct wc_entry *e;
+	unsigned n_entries;
+	int error;
+};
+
+DECLARE_DM_KCOPYD_THROTTLE_WITH_MODULE_PARM(dm_writecache_throttle,
+					    "A percentage of time allocated for data copying");
+
+static inline void measure_latency_start(struct dm_writecache *wc)
+{
+#ifdef WC_MEASURE_LATENCY
+	wc->measure_latency_time = ktime_get();
+#endif
+}
+
+static inline void measure_latency_end(struct dm_writecache *wc, unsigned long n)
+{
+#ifdef WC_MEASURE_LATENCY
+	ktime_t now = ktime_get();
+	if (now - wc->measure_latency_time > wc->max_measure_latency) {
+		wc->max_measure_latency = now - wc->measure_latency_time;
+		printk(KERN_DEBUG "dm-writecache: measured latency %lld.%03lldus, %lu steps\n",
+		       wc->max_measure_latency / 1000, wc->max_measure_latency % 1000, n);
+	}
+#endif
+}
+
+static void __wc_lock(struct dm_writecache *wc, int line)
+{
+#ifdef WC_MEASURE_LATENCY
+	ktime_t before, after;
+	before = ktime_get();
+#endif
+	mutex_lock(&wc->lock);
+#ifdef WC_MEASURE_LATENCY
+	after = ktime_get();
+	if (unlikely(after - before > wc->max_lock_wait)) {
+		wc->max_lock_wait = after - before;
+		printk(KERN_DEBUG "dm-writecache: waiting for lock for %lld.%03lldus at %d\n",
+		       wc->max_lock_wait / 1000, wc->max_lock_wait % 1000, line);
+		after = ktime_get();
+	}
+	wc->lock_acquired_time = after;
+#endif
+}
+#define wc_lock(wc)	__wc_lock(wc, __LINE__)
+
+static void __wc_unlock(struct dm_writecache *wc, int line)
+{
+#ifdef WC_MEASURE_LATENCY
+	ktime_t now = ktime_get();
+	if (now - wc->lock_acquired_time > wc->max_lock_held) {
+		wc->max_lock_held = now - wc->lock_acquired_time;
+		printk(KERN_DEBUG "dm-writecache: lock held for %lld.%03lldus at %d\n",
+		       wc->max_lock_held / 1000, wc->max_lock_held % 1000, line);
+	}
+#endif
+	mutex_unlock(&wc->lock);
+}
+#define wc_unlock(wc)	__wc_unlock(wc, __LINE__)
+
+#define wc_unlock_long(wc)	mutex_unlock(&wc->lock)
+
+static int persistent_memory_claim(struct dm_writecache *wc)
+{
+	int r;
+	loff_t s;
+	long p, da;
+	pfn_t pfn;
+	int id;
+	struct page **pages;
+
+	wc->memory_vmapped = false;
+
+	if (!wc->ssd_dev->dax_dev) {
+		r = -EOPNOTSUPP;
+		goto err1;
+	}
+	s = wc->memory_map_size;
+	p = s >> PAGE_SHIFT;
+	if (!p) {
+		r = -EINVAL;
+		goto err1;
+	}
+	if (p != s >> PAGE_SHIFT) {
+		r = -EOVERFLOW;
+		goto err1;
+	}
+
+	id = dax_read_lock();
+
+	da = dax_direct_access(wc->ssd_dev->dax_dev, 0, p, &wc->memory_map, &pfn);
+	if (da < 0) {
+		wc->memory_map = NULL;
+		r = da;
+		goto err2;
+	}
+	if (!pfn_t_has_page(pfn)) {
+		wc->memory_map = NULL;
+		r = -EOPNOTSUPP;
+		goto err2;
+	}
+#ifdef WC_MEASURE_LATENCY
+	printk(KERN_DEBUG "dm-writecache: device %s, pfn %016llx\n",
+	       wc->ssd_dev->name, pfn.val);
+#endif
+	if (da != p) {
+		long i;
+		wc->memory_map = NULL;
+		pages = kvmalloc(p * sizeof(struct page *), GFP_KERNEL);
+		if (!pages) {
+			r = -ENOMEM;
+			goto err2;
+		}
+		i = 0;
+		do {
+			long daa;
+			void *dummy_addr;
+			daa = dax_direct_access(wc->ssd_dev->dax_dev, i, p - i,
+						&dummy_addr, &pfn);
+			if (daa <= 0) {
+				r = daa ? daa : -EINVAL;
+				goto err3;
+			}
+			if (!pfn_t_has_page(pfn)) {
+				r = -EOPNOTSUPP;
+				goto err3;
+			}
+			while (daa-- && i < p) {
+				pages[i++] = pfn_t_to_page(pfn);
+				pfn.val++;
+			}
+		} while (i < p);
+		wc->memory_map = vmap(pages, p, VM_MAP, PAGE_KERNEL);
+		if (!wc->memory_map) {
+			r = -ENOMEM;
+			goto err3;
+		}
+		kvfree(pages);
+		wc->memory_vmapped = true;
+	}
+
+	dax_read_unlock(id);
+
+	return 0;
+
+err3:
+	kvfree(pages);
+err2:
+	dax_read_unlock(id);
+err1:
+	return r;
+}
+
+static void persistent_memory_release(struct dm_writecache *wc)
+{
+	if (wc->memory_vmapped)
+		vunmap(wc->memory_map);
+}
+
+static struct page *persistent_memory_page(void *addr)
+{
+	if (is_vmalloc_addr(addr))
+		return vmalloc_to_page(addr);
+	else
+		return virt_to_page(addr);
+}
+
+static unsigned persistent_memory_page_offset(void *addr)
+{
+	return (unsigned long)addr & (PAGE_SIZE - 1);
+}
+
+static void persistent_memory_flush_cache(void *ptr, size_t size)
+{
+	if (is_vmalloc_addr(ptr))
+		flush_kernel_vmap_range(ptr, size);
+}
+
+static void persistent_memory_invalidate_cache(void *ptr, size_t size)
+{
+	if (is_vmalloc_addr(ptr))
+		invalidate_kernel_vmap_range(ptr, size);
+}
+
+static void persistent_memory_flush(struct dm_writecache *wc, void *ptr, size_t size)
+{
+#ifndef EAGER_DATA_FLUSH
+	dax_flush(wc->ssd_dev->dax_dev, ptr, size);
+#endif
+}
+
+static void persistent_memory_commit_flushed(void)
+{
+#ifdef EAGER_DATA_FLUSH
+	/* needed since memcpy_flushcache is used instead of dax_flush */
+	wmb();
+#endif
+}
+
+static struct wc_memory_superblock *sb(struct dm_writecache *wc)
+{
+	return wc->memory_map;
+}
+
+static struct wc_memory_entry *memory_entry(struct dm_writecache *wc, struct wc_entry *e)
+{
+	if (is_power_of_2(sizeof(struct wc_entry)) && 0)
+		return &sb(wc)->entries[e - wc->entries];
+	else
+		return &sb(wc)->entries[e->index];
+}
+
+static void *memory_data(struct dm_writecache *wc, struct wc_entry *e)
+{
+	return (char *)wc->block_start + (e->index << wc->block_size_bits);
+}
+
+static sector_t cache_sector(struct dm_writecache *wc, struct wc_entry *e)
+{
+	return wc->metadata_sectors +
+		((sector_t)e->index << (wc->block_size_bits - SECTOR_SHIFT));
+}
+
+static uint64_t read_original_sector(struct dm_writecache *wc, struct wc_entry *e)
+{
+#ifdef DM_WRITECACHE_HANDLE_HARDWARE_ERRORS
+	return e->original_sector;
+#else
+	return le64_to_cpu(memory_entry(wc, e)->original_sector);
+#endif
+}
+
+static uint64_t read_seq_count(struct dm_writecache *wc, struct wc_entry *e)
+{
+#ifdef DM_WRITECACHE_HANDLE_HARDWARE_ERRORS
+	return e->seq_count;
+#else
+	return le64_to_cpu(memory_entry(wc, e)->seq_count);
+#endif
+}
+
+static void clear_seq_count(struct dm_writecache *wc, struct wc_entry *e)
+{
+#ifdef DM_WRITECACHE_HANDLE_HARDWARE_ERRORS
+	e->seq_count = -1;
+#endif
+	NT_STORE(memory_entry(wc, e)->seq_count, cpu_to_le64(-1));
+}
+
+static void write_original_sector_seq_count(struct dm_writecache *wc, struct wc_entry *e,
+					    uint64_t original_sector, uint64_t seq_count)
+{
+	struct wc_memory_entry *me_p, me;
+#ifdef DM_WRITECACHE_HANDLE_HARDWARE_ERRORS
+	e->original_sector = original_sector;
+	e->seq_count = seq_count;
+#endif
+	me_p = memory_entry(wc, e);
+	me.original_sector = cpu_to_le64(original_sector);
+	me.seq_count = cpu_to_le64(seq_count);
+	NT_STORE(*me_p, me);
+}
+
+#define writecache_error(wc, err, msg, arg...)				\
+do {									\
+	if (!cmpxchg(&(wc)->error, 0, err))				\
+		DMERR(msg, ##arg);					\
+	swake_up(&(wc)->freelist_wait);					\
+} while (0)
+
+#define writecache_has_error(wc)	(unlikely(READ_ONCE((wc)->error)))
+
+static void writecache_flush_all_metadata(struct dm_writecache *wc)
+{
+	if (WC_MODE_PMEM(wc)) {
+		persistent_memory_flush(wc,
+			sb(wc), offsetof(struct wc_memory_superblock, entries[wc->n_blocks]));
+	} else {
+		memset(wc->dirty_bitmap, -1, wc->dirty_bitmap_size);
+	}
+}
+
+static void writecache_flush_region(struct dm_writecache *wc, void *ptr, size_t size)
+{
+	if (WC_MODE_PMEM(wc))
+		persistent_memory_flush(wc, ptr, size);
+	else
+		__set_bit(((char *)ptr - (char *)wc->memory_map) / BITMAP_GRANULARITY,
+			  wc->dirty_bitmap);
+}
+
+static void writecache_disk_flush(struct dm_writecache *wc, struct dm_dev *dev);
+
+struct io_notify {
+	struct dm_writecache *wc;
+	struct completion c;
+	atomic_t count;
+};
+
+static void writecache_notify_io(unsigned long error, void *context)
+{
+	struct io_notify *endio = context;
+
+	if (unlikely(error != 0))
+		writecache_error(endio->wc, -EIO, "error writing metadata");
+	BUG_ON(atomic_read(&endio->count) <= 0);
+	if (atomic_dec_and_test(&endio->count))
+		complete(&endio->c);
+}
+
+static void ssd_commit_flushed(struct dm_writecache *wc)
+{
+	struct dm_io_region region;
+	struct dm_io_request req;
+	struct io_notify endio = {
+		wc,
+		COMPLETION_INITIALIZER_ONSTACK(endio.c),
+		ATOMIC_INIT(1),
+	};
+	unsigned bitmap_bits = wc->dirty_bitmap_size * BITS_PER_LONG;
+	unsigned i = 0;
+
+	while (1) {
+		unsigned j;
+		i = find_next_bit(wc->dirty_bitmap, bitmap_bits, i);
+		if (unlikely(i == bitmap_bits))
+			break;
+		j = find_next_zero_bit(wc->dirty_bitmap, bitmap_bits, i);
+
+		region.bdev = wc->ssd_dev->bdev;
+		region.sector = (sector_t)i * (BITMAP_GRANULARITY >> SECTOR_SHIFT);
+		region.count = (sector_t)(j - i) * (BITMAP_GRANULARITY >> SECTOR_SHIFT);
+
+		if (unlikely(region.sector >= wc->metadata_sectors))
+			break;
+		if (unlikely(region.sector + region.count > wc->metadata_sectors))
+			region.count = wc->metadata_sectors - region.sector;
+
+		atomic_inc(&endio.count);
+		req.bi_op = REQ_OP_WRITE;
+		req.bi_op_flags = REQ_SYNC;
+		req.mem.type = DM_IO_VMA;
+		req.mem.ptr.vma = (char *)wc->memory_map + (size_t)i * BITMAP_GRANULARITY;
+		req.client = wc->dm_io;
+		req.notify.fn = writecache_notify_io;
+		req.notify.context = &endio;
+
+		/* writing via async dm-io (implied by notify.fn above) won't return an error */
+	(void) dm_io(&req, 1, &region, NULL);
+		i = j;
+	}
+
+	writecache_notify_io(0, &endio);
+	wait_for_completion_io(&endio.c);
+
+	writecache_disk_flush(wc, wc->ssd_dev);
+
+	memset(wc->dirty_bitmap, 0, wc->dirty_bitmap_size);
+}
+
+static void writecache_commit_flushed(struct dm_writecache *wc)
+{
+	if (WC_MODE_PMEM(wc))
+		persistent_memory_commit_flushed();
+	else
+		ssd_commit_flushed(wc);
+}
+
+static void writecache_disk_flush(struct dm_writecache *wc, struct dm_dev *dev)
+{
+	int r;
+	struct dm_io_region region;
+	struct dm_io_request req;
+
+	region.bdev = dev->bdev;
+	region.sector = 0;
+	region.count = 0;
+	req.bi_op = REQ_OP_WRITE;
+	req.bi_op_flags = REQ_PREFLUSH;
+	req.mem.type = DM_IO_KMEM;
+	req.mem.ptr.addr = NULL;
+	req.client = wc->dm_io;
+	req.notify.fn = NULL;
+
+	r = dm_io(&req, 1, &region, NULL);
+	if (unlikely(r))
+		writecache_error(wc, r, "error flushing metadata: %d", r);
+}
+
+static void writecache_wait_for_ios(struct dm_writecache *wc, int direction)
+{
+	swait_event(wc->bio_in_progress_wait[direction],
+		   !atomic_read(&wc->bio_in_progress[direction]));
+}
+
+#define WFE_RETURN_FOLLOWING	1
+#define WFE_LOWEST_SEQ		2
+
+static struct wc_entry *writecache_find_entry(struct dm_writecache *wc,
+					      uint64_t block, int flags)
+{
+	struct wc_entry *e;
+	struct rb_node *node = wc->tree.rb_node;
+
+	if (unlikely(!node))
+		return NULL;
+
+	while (1) {
+		e = container_of(node, struct wc_entry, rb_node);
+		if (read_original_sector(wc, e) == block)
+			break;
+		node = (read_original_sector(wc, e) >= block ?
+			e->rb_node.rb_left : e->rb_node.rb_right);
+		if (unlikely(!node)) {
+			if (!(flags & WFE_RETURN_FOLLOWING)) {
+				return NULL;
+			}
+			if (read_original_sector(wc, e) >= block) {
+				break;
+			} else {
+				node = rb_next(&e->rb_node);
+				if (unlikely(!node)) {
+					return NULL;
+				}
+				e = container_of(node, struct wc_entry, rb_node);
+				break;
+			}
+		}
+	}
+
+	while (1) {
+		struct wc_entry *e2;
+		if (flags & WFE_LOWEST_SEQ)
+			node = rb_prev(&e->rb_node);
+		else
+			node = rb_next(&e->rb_node);
+		if (!node)
+			return e;
+		e2 = container_of(node, struct wc_entry, rb_node);
+		if (read_original_sector(wc, e2) != block)
+			return e;
+		e = e2;
+	}
+}
+
+static void writecache_insert_entry(struct dm_writecache *wc, struct wc_entry *ins)
+{
+	struct wc_entry *e;
+	struct rb_node **node = &wc->tree.rb_node, *parent = NULL;
+
+	while (*node) {
+		e = container_of(*node, struct wc_entry, rb_node);
+		parent = &e->rb_node;
+		if (read_original_sector(wc, e) > read_original_sector(wc, ins))
+			node = &parent->rb_left;
+		else
+			node = &parent->rb_right;
+	}
+	rb_link_node(&ins->rb_node, parent, node);
+	rb_insert_color(&ins->rb_node, &wc->tree);
+	list_add(&ins->lru, &wc->lru);
+}
+
+static void writecache_unlink(struct dm_writecache *wc, struct wc_entry *e)
+{
+	list_del(&e->lru);
+	rb_erase(&e->rb_node, &wc->tree);
+}
+
+static void writecache_add_to_freelist(struct dm_writecache *wc, struct wc_entry *e)
+{
+	if (WC_MODE_SORT_FREELIST(wc)) {
+		struct rb_node **node = &wc->freetree.rb_node, *parent = NULL;
+		if (unlikely(!*node))
+			wc->current_free = e;
+		while (*node) {
+			parent = *node;
+			if (&e->rb_node < *node)
+				node = &parent->rb_left;
+			else
+				node = &parent->rb_right;
+		}
+		rb_link_node(&e->rb_node, parent, node);
+		rb_insert_color(&e->rb_node, &wc->freetree);
+	} else {
+		list_add_tail(&e->lru, &wc->freelist);
+	}
+	wc->freelist_size++;
+}
+
+static struct wc_entry *writecache_pop_from_freelist(struct dm_writecache *wc)
+{
+	struct wc_entry *e;
+
+	if (WC_MODE_SORT_FREELIST(wc)) {
+		struct rb_node *next;
+		if (unlikely(!wc->current_free))
+			return NULL;
+		e = wc->current_free;
+		next = rb_next(&e->rb_node);
+		rb_erase(&e->rb_node, &wc->freetree);
+		if (unlikely(!next))
+			next = rb_first(&wc->freetree);
+		wc->current_free = next ? container_of(next, struct wc_entry, rb_node) : NULL;
+	} else {
+		if (unlikely(list_empty(&wc->freelist)))
+			return NULL;
+		e = container_of(wc->freelist.next, struct wc_entry, lru);
+		list_del(&e->lru);
+	}
+	wc->freelist_size--;
+	if (unlikely(wc->freelist_size <= wc->freelist_high_watermark))
+		queue_work(wc->writeback_wq, &wc->writeback_work);
+
+	return e;
+}
+
+static void writecache_free_entry(struct dm_writecache *wc, struct wc_entry *e)
+{
+	writecache_unlink(wc, e);
+	writecache_add_to_freelist(wc, e);
+	clear_seq_count(wc, e);
+	writecache_flush_region(wc, memory_entry(wc, e), sizeof(struct wc_memory_entry));
+	if (unlikely(swait_active(&wc->freelist_wait)))
+		swake_up(&wc->freelist_wait);
+}
+
+static void __writecache_wait_on_freelist(struct dm_writecache *wc, bool measure, int line)
+{
+	DECLARE_SWAITQUEUE(wait);
+#ifdef WC_MEASURE_LATENCY
+	ktime_t before, after;
+#endif
+
+	prepare_to_swait(&wc->freelist_wait, &wait, TASK_UNINTERRUPTIBLE);
+	wc_unlock(wc);
+#ifdef WC_MEASURE_LATENCY
+	if (measure)
+		before = ktime_get();
+#endif
+	io_schedule();
+	finish_swait(&wc->freelist_wait, &wait);
+#ifdef WC_MEASURE_LATENCY
+	if (measure) {
+		after = ktime_get();
+		if (unlikely(after - before > wc->max_freelist_wait)) {
+			wc->max_freelist_wait = after - before;
+			printk(KERN_DEBUG "dm-writecache: waiting on freelist for %lld.%03lldus at %d\n",
+			       wc->max_freelist_wait / 1000, wc->max_freelist_wait % 1000, line);
+		}
+	}
+#endif
+	wc_lock(wc);
+}
+#define writecache_wait_on_freelist(wc)		__writecache_wait_on_freelist(wc, true, __LINE__)
+#define writecache_wait_on_freelist_long(wc)	__writecache_wait_on_freelist(wc, false, __LINE__)
+
+static void writecache_poison_lists(struct dm_writecache *wc)
+{
+	/*
+	 * Catch incorrect access to these values while the device is suspended.
+	 */
+	memset(&wc->tree, -1, sizeof wc->tree);
+	wc->lru.next = LIST_POISON1;
+	wc->lru.prev = LIST_POISON2;
+	wc->freelist.next = LIST_POISON1;
+	wc->freelist.prev = LIST_POISON2;
+}
+
+static void writecache_flush_entry(struct dm_writecache *wc, struct wc_entry *e)
+{
+	writecache_flush_region(wc, memory_entry(wc, e), sizeof(struct wc_memory_entry));
+#ifndef EAGER_DATA_FLUSH
+	if (WC_MODE_PMEM(wc))
+		writecache_flush_region(wc, memory_data(wc, e), wc->block_size);
+#endif
+}
+
+static bool writecache_entry_is_committed(struct dm_writecache *wc, struct wc_entry *e)
+{
+	return read_seq_count(wc, e) < wc->seq_count;
+}
+
+static void writecache_flush(struct dm_writecache *wc)
+{
+	struct wc_entry *e, *e2;
+	bool need_flush_after_free;
+
+	wc->uncommitted_blocks = 0;
+	del_timer(&wc->autocommit_timer);
+
+	if (list_empty(&wc->lru))
+		return;
+
+	e = container_of(wc->lru.next, struct wc_entry, lru);
+	if (writecache_entry_is_committed(wc, e)) {
+		if (wc->overwrote_committed) {
+			writecache_wait_for_ios(wc, WRITE);
+			writecache_disk_flush(wc, wc->ssd_dev);
+			wc->overwrote_committed = false;
+		}
+		return;
+	}
+	while (1) {
+		writecache_flush_entry(wc, e);
+		if (unlikely(e->lru.next == &wc->lru))
+			break;
+		e2 = container_of(e->lru.next, struct wc_entry, lru);
+		if (writecache_entry_is_committed(wc, e2))
+			break;
+		e = e2;
+		cond_resched();
+	}
+	writecache_commit_flushed(wc);
+
+	writecache_wait_for_ios(wc, WRITE);
+
+	wc->seq_count++;
+	NT_STORE(sb(wc)->seq_count, cpu_to_le64(wc->seq_count));
+	writecache_flush_region(wc, &sb(wc)->seq_count, sizeof sb(wc)->seq_count);
+	writecache_commit_flushed(wc);
+
+	wc->overwrote_committed = false;
+
+	need_flush_after_free = false;
+	while (1) {
+		/* Free another committed entry with lower seq-count */
+		struct rb_node *rb_node = rb_prev(&e->rb_node);
+
+		if (rb_node) {
+			e2 = container_of(rb_node, struct wc_entry, rb_node);
+			if (read_original_sector(wc, e2) == read_original_sector(wc, e) &&
+			    likely(!e2->write_in_progress)) {
+				writecache_free_entry(wc, e2);
+				need_flush_after_free = true;
+			}
+		}
+		if (unlikely(e->lru.prev == &wc->lru))
+			break;
+		e = container_of(e->lru.prev, struct wc_entry, lru);
+		cond_resched();
+	}
+
+	if (need_flush_after_free)
+		writecache_commit_flushed(wc);
+}
+
+static void writecache_flush_work(struct work_struct *work)
+{
+	struct dm_writecache *wc = container_of(work, struct dm_writecache, flush_work);
+	wc_lock(wc);
+	writecache_flush(wc);
+	wc_unlock(wc);
+}
+
+static void writecache_autocommit_timer(struct timer_list *t)
+{
+	struct dm_writecache *wc = from_timer(wc, t, autocommit_timer);
+	if (!writecache_has_error(wc))
+		queue_work(wc->writeback_wq, &wc->flush_work);
+}
+
+static void writecache_schedule_autocommit(struct dm_writecache *wc)
+{
+	if (!timer_pending(&wc->autocommit_timer))
+		mod_timer(&wc->autocommit_timer, jiffies + wc->autocommit_jiffies);
+}
+
+static void writecache_discard(struct dm_writecache *wc, sector_t start, sector_t end)
+{
+	struct wc_entry *e;
+	bool discarded_something = false;
+
+	e = writecache_find_entry(wc, start, WFE_RETURN_FOLLOWING | WFE_LOWEST_SEQ);
+	if (unlikely(!e))
+		return;
+
+	while (read_original_sector(wc, e) < end) {
+		struct rb_node *node = rb_next(&e->rb_node);
+
+		if (likely(!e->write_in_progress)) {
+			if (!discarded_something) {
+				writecache_wait_for_ios(wc, READ);
+				writecache_wait_for_ios(wc, WRITE);
+				discarded_something = true;
+			}
+			writecache_free_entry(wc, e);
+		}
+
+		if (!node)
+			break;
+
+		e = container_of(node, struct wc_entry, rb_node);
+	}
+
+	if (discarded_something)
+		writecache_commit_flushed(wc);
+}
+
+static bool writecache_wait_for_writeback(struct dm_writecache *wc)
+{
+	if (wc->writeback_size) {
+		writecache_wait_on_freelist(wc);
+		return true;
+	}
+	return false;
+}
+
+static void writecache_suspend(struct dm_target *ti)
+{
+	struct dm_writecache *wc = ti->private;
+	bool flush_on_suspend;
+
+	del_timer_sync(&wc->autocommit_timer);
+
+	wc_lock(wc);
+	writecache_flush(wc);
+	flush_on_suspend = wc->flush_on_suspend;
+	if (flush_on_suspend) {
+		wc->flush_on_suspend = false;
+		wc->writeback_all++;
+		queue_work(wc->writeback_wq, &wc->writeback_work);
+	}
+	wc_unlock(wc);
+
+	flush_workqueue(wc->writeback_wq);
+
+	wc_lock(wc);
+	if (flush_on_suspend) {
+		wc->writeback_all--;
+	}
+	while (writecache_wait_for_writeback(wc));
+
+	if (WC_MODE_PMEM(wc))
+		persistent_memory_flush_cache(wc->memory_map, wc->memory_map_size);
+
+	writecache_poison_lists(wc);
+
+	wc_unlock_long(wc);
+}
+
+static int writecache_alloc_entries(struct dm_writecache *wc)
+{
+	size_t b;
+	if (wc->entries)
+		return 0;
+	wc->entries = vmalloc(sizeof(struct wc_entry) * wc->n_blocks);
+	if (!wc->entries)
+		return -ENOMEM;
+	for (b = 0; b < wc->n_blocks; b++) {
+		struct wc_entry *e = &wc->entries[b];
+		e->index = b;
+		e->write_in_progress = false;
+	}
+	return 0;
+}
+
+static void writecache_resume(struct dm_target *ti)
+{
+	struct dm_writecache *wc = ti->private;
+	size_t b;
+	bool need_flush = false;
+	__le64 sb_seq_count;
+	int r;
+
+	wc_lock(wc);
+
+	if (WC_MODE_PMEM(wc))
+		persistent_memory_invalidate_cache(wc->memory_map, wc->memory_map_size);
+
+	wc->tree = RB_ROOT;
+	INIT_LIST_HEAD(&wc->lru);
+	if (WC_MODE_SORT_FREELIST(wc)) {
+		wc->freetree = RB_ROOT;
+		wc->current_free = NULL;
+	} else {
+		INIT_LIST_HEAD(&wc->freelist);
+	}
+	wc->freelist_size = 0;
+
+	r = memcpy_mcsafe(&sb_seq_count, &sb(wc)->seq_count, sizeof(uint64_t));
+	if (r) {
+		writecache_error(wc, r, "hardware memory error when reading superblock: %d", r);
+		sb_seq_count = cpu_to_le64(0);
+	}
+	wc->seq_count = le64_to_cpu(sb_seq_count);
+
+#ifdef DM_WRITECACHE_HANDLE_HARDWARE_ERRORS
+	for (b = 0; b < wc->n_blocks; b++) {
+		struct wc_entry *e = &wc->entries[b];
+		struct wc_memory_entry wme;
+		if (writecache_has_error(wc)) {
+			e->original_sector = -1;
+			e->seq_count = -1;
+			continue;
+		}
+		r = memcpy_mcsafe(&wme, memory_entry(wc, e), sizeof(struct wc_memory_entry));
+		if (r) {
+			writecache_error(wc, r, "hardware memory error when reading metadata entry %lu: %d",
+					 (unsigned long)b, r);
+			e->original_sector = -1;
+			e->seq_count = -1;
+		} else {
+			e->original_sector = le64_to_cpu(wme.original_sector);
+			e->seq_count = le64_to_cpu(wme.seq_count);
+		}
+	}
+#endif
+	for (b = 0; b < wc->n_blocks; b++) {
+		struct wc_entry *e = &wc->entries[b];
+		if (!writecache_entry_is_committed(wc, e)) {
+			if (read_seq_count(wc, e) != -1) {
+erase_this:
+				clear_seq_count(wc, e);
+				need_flush = true;
+			}
+			writecache_add_to_freelist(wc, e);
+		} else {
+			struct wc_entry *old;
+
+			old = writecache_find_entry(wc, read_original_sector(wc, e), 0);
+			if (!old) {
+				writecache_insert_entry(wc, e);
+			} else {
+				if (read_seq_count(wc, old) == read_seq_count(wc, e)) {
+					writecache_error(wc, -EINVAL,
+						 "two identical entries, position %llu, sector %llu, sequence %llu",
+						 (unsigned long long)b, (unsigned long long)read_original_sector(wc, e),
+						 (unsigned long long)read_seq_count(wc, e));
+				}
+				if (read_seq_count(wc, old) > read_seq_count(wc, e)) {
+					goto erase_this;
+				} else {
+					writecache_free_entry(wc, old);
+					writecache_insert_entry(wc, e);
+					need_flush = true;
+				}
+			}
+		}
+		cond_resched();
+	}
+
+	if (need_flush) {
+		writecache_flush_all_metadata(wc);
+		writecache_commit_flushed(wc);
+	}
+
+	wc_unlock_long(wc);
+}
+
+static int process_flush_mesg(unsigned argc, char **argv, struct dm_writecache *wc)
+{
+	if (argc != 1)
+		return -EINVAL;
+
+	wc_lock(wc);
+	if (dm_suspended(wc->ti)) {
+		wc_unlock(wc);
+		return -EBUSY;
+	}
+	if (writecache_has_error(wc)) {
+		wc_unlock(wc);
+		return -EIO;
+	}
+
+	writecache_flush(wc);
+	wc->writeback_all++;
+	queue_work(wc->writeback_wq, &wc->writeback_work);
+	wc_unlock(wc);
+
+	flush_workqueue(wc->writeback_wq);
+
+	wc_lock(wc);
+	wc->writeback_all--;
+	if (writecache_has_error(wc)) {
+		wc_unlock(wc);
+		return -EIO;
+	}
+	wc_unlock(wc);
+
+	return 0;
+}
+
+static int process_flush_on_suspend_mesg(unsigned argc, char **argv, struct dm_writecache *wc)
+{
+	if (argc != 1)
+		return -EINVAL;
+
+	wc_lock(wc);
+	wc->flush_on_suspend = true;
+	wc_unlock(wc);
+
+	return 0;
+}
+
+static int writecache_message(struct dm_target *ti, unsigned argc, char **argv,
+			      char *result, unsigned maxlen)
+{
+	int r = -EINVAL;
+	struct dm_writecache *wc = ti->private;
+
+	if (!strcasecmp(argv[0], "flush"))
+		r = process_flush_mesg(argc, argv, wc);
+	else if (!strcasecmp(argv[0], "flush_on_suspend"))
+		r = process_flush_on_suspend_mesg(argc, argv, wc);
+	else
+		DMWARN("unrecognised message received: %s", argv[0]);
+
+	return r;
+}
+
+static void bio_copy_block(struct dm_writecache *wc, struct bio *bio, void *data)
+{
+	void *buf;
+	unsigned long flags;
+	unsigned size;
+	int rw = bio_data_dir(bio);
+	unsigned remaining_size = wc->block_size;
+
+	do {
+		struct bio_vec bv = bio_iter_iovec(bio, bio->bi_iter);
+		buf = bvec_kmap_irq(&bv, &flags);
+		size = bv.bv_len;
+		if (unlikely(size > remaining_size))
+			size = remaining_size;
+
+		if (rw == READ) {
+			int r;
+			r = memcpy_mcsafe(buf, data, size);
+			flush_dcache_page(bio_page(bio));
+			if (unlikely(r)) {
+				writecache_error(wc, r, "hardware memory error when reading data: %d", r);
+				bio->bi_status = BLK_STS_IOERR;
+			}
+		} else {
+			flush_dcache_page(bio_page(bio));
+#ifdef EAGER_DATA_FLUSH
+			memcpy_flushcache(data, buf, size);
+#else
+			memcpy(data, buf, size);
+#endif
+		}
+
+		bvec_kunmap_irq(buf, &flags);
+
+		data = (char *)data + size;
+		remaining_size -= size;
+		bio_advance(bio, size);
+	} while (unlikely(remaining_size));
+}
+
+static int writecache_flush_thread(void *data)
+{
+	struct dm_writecache *wc = data;
+
+	while (!kthread_should_stop()) {
+		struct bio *bio = wc->flush_bio;
+
+		if (likely(bio)) {
+			if (bio_op(bio) == REQ_OP_DISCARD)
+				writecache_discard(wc, bio->bi_iter.bi_sector, bio_end_sector(bio));
+			else
+				writecache_flush(wc);
+		}
+
+		set_current_state(TASK_INTERRUPTIBLE);
+		/* for debugging - catch uninitialized use */
+		wc->flush_bio = (void *)0x600 + POISON_POINTER_DELTA;
+		complete(&wc->flush_completion);
+
+		schedule();
+	}
+
+	set_current_state(TASK_RUNNING);
+
+	return 0;
+}
+
+static void writecache_offload_bio(struct dm_writecache *wc, struct bio *bio)
+{
+	wc->flush_bio = bio;
+	reinit_completion(&wc->flush_completion);
+	wake_up_process(wc->flush_thread);
+	wait_for_completion_io(&wc->flush_completion);
+}
+
+/* FIXME: all the gotos in writecache_map() suggest the need for refactoring */
+static int writecache_map(struct dm_target *ti, struct bio *bio)
+{
+	struct wc_entry *e;
+	struct dm_writecache *wc = ti->private;
+
+	bio->bi_private = NULL;
+
+	wc_lock(wc);
+
+	if (unlikely(bio->bi_opf & REQ_PREFLUSH)) {
+		if (writecache_has_error(wc))
+			goto unlock_error;
+		if (WC_MODE_PMEM(wc))
+			writecache_flush(wc);
+		else
+			writecache_offload_bio(wc, bio);
+		goto unlock_ok_flush;
+	}
+
+	bio->bi_iter.bi_sector = dm_target_offset(ti, bio->bi_iter.bi_sector);
+
+	if (unlikely((((unsigned)bio->bi_iter.bi_sector | bio_sectors(bio)) &
+				(wc->block_size / 512 - 1)) != 0)) {
+		DMWARN("I/O is not aligned, sector %llu, size %u, block size %u",
+			(unsigned long long)bio->bi_iter.bi_sector,
+			bio->bi_iter.bi_size, wc->block_size);
+		goto unlock_error;
+	}
+
+	if (unlikely(bio_op(bio) == REQ_OP_DISCARD)) {
+		if (writecache_has_error(wc))
+			goto unlock_error;
+		if (WC_MODE_PMEM(wc))
+			writecache_discard(wc, bio->bi_iter.bi_sector, bio_end_sector(bio));
+		else
+			writecache_offload_bio(wc, bio);
+		goto unlock_remap_origin;
+	}
+
+	if (bio_data_dir(bio) == READ) {
+next_block:
+		e = writecache_find_entry(wc, bio->bi_iter.bi_sector, WFE_RETURN_FOLLOWING);
+		if (e && read_original_sector(wc, e) == bio->bi_iter.bi_sector) {
+			if (WC_MODE_PMEM(wc)) {
+				bio_copy_block(wc, bio, memory_data(wc, e));
+				if (bio->bi_iter.bi_size)
+					goto next_block;
+				goto unlock_ok_read;
+			} else {
+				dm_accept_partial_bio(bio, wc->block_size >> SECTOR_SHIFT);
+				bio_set_dev(bio, wc->ssd_dev->bdev);
+				bio->bi_iter.bi_sector = cache_sector(wc, e);
+				if (!writecache_entry_is_committed(wc, e))
+					writecache_wait_for_ios(wc, WRITE);
+				goto unlock_remap;
+			}
+		} else {
+			if (e) {
+				sector_t next_boundary =
+					read_original_sector(wc, e) - bio->bi_iter.bi_sector;
+				if (next_boundary < bio->bi_iter.bi_size >> SECTOR_SHIFT) {
+					dm_accept_partial_bio(bio, next_boundary);
+				}
+			}
+			goto unlock_remap_origin;
+		}
+	} else {
+		do {
+			if (writecache_has_error(wc))
+				goto unlock_error;
+			e = writecache_find_entry(wc, bio->bi_iter.bi_sector, 0);
+			if (e) {
+				if (!writecache_entry_is_committed(wc, e))
+					goto bio_copy;
+				if (!WC_MODE_PMEM(wc) && !e->write_in_progress) {
+					wc->overwrote_committed = true;
+					goto bio_copy;
+				}
+			}
+			e = writecache_pop_from_freelist(wc);
+			if (unlikely(!e)) {
+				writecache_wait_on_freelist(wc);
+				continue;
+			}
+			write_original_sector_seq_count(wc, e, bio->bi_iter.bi_sector, wc->seq_count);
+			writecache_insert_entry(wc, e);
+			wc->uncommitted_blocks++;
+bio_copy:
+			if (WC_MODE_PMEM(wc)) {
+				bio_copy_block(wc, bio, memory_data(wc, e));
+			} else {
+				dm_accept_partial_bio(bio, wc->block_size >> SECTOR_SHIFT);
+				bio_set_dev(bio, wc->ssd_dev->bdev);
+				bio->bi_iter.bi_sector = cache_sector(wc, e);
+				if (unlikely(wc->uncommitted_blocks >= wc->autocommit_blocks)) {
+					wc->uncommitted_blocks = 0;
+					queue_work(wc->writeback_wq, &wc->flush_work);
+				} else {
+					writecache_schedule_autocommit(wc);
+				}
+				goto unlock_remap;
+			}
+		} while (bio->bi_iter.bi_size);
+
+		if (unlikely(wc->uncommitted_blocks >= wc->autocommit_blocks)) {
+			writecache_flush(wc);
+		} else {
+			writecache_schedule_autocommit(wc);
+		}
+
+		goto unlock_ok_write;
+	}
+
+unlock_remap_origin:
+	bio_set_dev(bio, wc->dev->bdev);
+	wc_unlock(wc);
+	return DM_MAPIO_REMAPPED;
+
+unlock_remap:
+	/* make sure that writecache_end_io decrements bio_in_progress: */
+	bio->bi_private = (void *)1;
+	atomic_inc(&wc->bio_in_progress[bio_data_dir(bio)]);
+	wc_unlock(wc);
+	return DM_MAPIO_REMAPPED;
+
+unlock_ok_flush:
+#ifdef WC_MEASURE_LATENCY
+	wc_unlock(wc);
+	bio_endio(bio);
+	return DM_MAPIO_SUBMITTED;
+#endif
+
+unlock_ok_read:
+#ifdef WC_MEASURE_LATENCY
+	wc_unlock(wc);
+	bio_endio(bio);
+	return DM_MAPIO_SUBMITTED;
+#endif
+
+unlock_ok_write:
+	wc_unlock(wc);
+	bio_endio(bio);
+	return DM_MAPIO_SUBMITTED;
+
+unlock_error:
+	wc_unlock(wc);
+	bio_io_error(bio);
+	return DM_MAPIO_SUBMITTED;
+}
+
+static int writecache_end_io(struct dm_target *ti, struct bio *bio, blk_status_t *status)
+{
+	struct dm_writecache *wc = ti->private;
+
+	if (bio->bi_private != NULL) {
+		int dir = bio_data_dir(bio);
+		if (atomic_dec_and_test(&wc->bio_in_progress[dir]))
+			if (unlikely(swait_active(&wc->bio_in_progress_wait[dir])))
+				swake_up(&wc->bio_in_progress_wait[dir]);
+	}
+	return 0;
+}
+
+static int writecache_iterate_devices(struct dm_target *ti,
+				      iterate_devices_callout_fn fn, void *data)
+{
+	struct dm_writecache *wc = ti->private;
+
+	return fn(ti, wc->dev, 0, ti->len, data);
+}
+
+static void writecache_io_hints(struct dm_target *ti, struct queue_limits *limits)
+{
+	struct dm_writecache *wc = ti->private;
+
+	if (limits->logical_block_size < wc->block_size)
+		limits->logical_block_size = wc->block_size;
+
+	if (limits->physical_block_size < wc->block_size)
+		limits->physical_block_size = wc->block_size;
+
+	if (limits->io_min < wc->block_size)
+		limits->io_min = wc->block_size;
+}
+
+
+static void writecache_writeback_endio(struct bio *bio)
+{
+	struct writeback_struct *wb = container_of(bio, struct writeback_struct, bio);
+	struct dm_writecache *wc = wb->wc;
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&wc->endio_thread_wait.lock, flags);
+	list_add_tail(&wb->endio_entry, &wc->endio_list);
+	swake_up_locked(&wc->endio_thread_wait);
+	raw_spin_unlock_irqrestore(&wc->endio_thread_wait.lock, flags);
+}
+
+static void writecache_copy_endio(int read_err, unsigned long write_err, void *ptr)
+{
+	struct copy_struct *c = ptr;
+	struct dm_writecache *wc = c->wc;
+
+	c->error = likely(!(read_err | write_err)) ? 0 : -EIO;
+
+	raw_spin_lock_irq(&wc->endio_thread_wait.lock);
+	list_add_tail(&c->endio_entry, &wc->endio_list);
+	swake_up_locked(&wc->endio_thread_wait);
+	raw_spin_unlock_irq(&wc->endio_thread_wait.lock);
+}
+
+static void __writecache_endio_pmem(struct dm_writecache *wc, struct list_head *list)
+{
+	unsigned i;
+	struct writeback_struct *wb;
+	struct wc_entry *e;
+	unsigned long n_walked = 0;
+
+	do {
+		wb = list_entry(list->next, struct writeback_struct, endio_entry);
+		list_del(&wb->endio_entry);
+
+		if (unlikely(wb->bio.bi_status != BLK_STS_OK))
+			writecache_error(wc, blk_status_to_errno(wb->bio.bi_status),
+					"write error %d", wb->bio.bi_status);
+		i = 0;
+		do {
+			e = wb->wc_list[i];
+			BUG_ON(!e->write_in_progress);
+			e->write_in_progress = false;
+			INIT_LIST_HEAD(&e->lru);
+			if (!writecache_has_error(wc))
+				writecache_free_entry(wc, e);
+			BUG_ON(!wc->writeback_size);
+			wc->writeback_size--;
+			n_walked++;
+			if (unlikely(n_walked >= ENDIO_LATENCY)) {
+				writecache_commit_flushed(wc);
+				wc_unlock(wc);
+				wc_lock(wc);
+				n_walked = 0;
+			}
+		} while (++i < wb->wc_list_n);
+
+		if (wb->wc_list != wb->wc_list_inline)
+			kfree(wb->wc_list);
+		bio_put(&wb->bio);
+	} while (!list_empty(list));
+}
+
+static void __writecache_endio_ssd(struct dm_writecache *wc, struct list_head *list)
+{
+	struct copy_struct *c;
+	struct wc_entry *e;
+
+	do {
+		c = list_entry(list->next, struct copy_struct, endio_entry);
+		list_del(&c->endio_entry);
+
+		if (unlikely(c->error))
+			writecache_error(wc, c->error, "copy error");
+
+		e = c->e;
+		do {
+			BUG_ON(!e->write_in_progress);
+			e->write_in_progress = false;
+			INIT_LIST_HEAD(&e->lru);
+			if (!writecache_has_error(wc))
+				writecache_free_entry(wc, e);
+
+			BUG_ON(!wc->writeback_size);
+			wc->writeback_size--;
+			e++;
+		} while (--c->n_entries);
+		mempool_free(c, wc->copy_pool);
+	} while (!list_empty(list));
+}
+
+static int writecache_endio_thread(void *data)
+{
+	struct dm_writecache *wc = data;
+
+	while (1) {
+		DECLARE_SWAITQUEUE(wait);
+		struct list_head list;
+
+		raw_spin_lock_irq(&wc->endio_thread_wait.lock);
+continue_locked:
+		if (!list_empty(&wc->endio_list))
+			goto pop_from_list;
+		set_current_state(TASK_INTERRUPTIBLE);
+		__prepare_to_swait(&wc->endio_thread_wait, &wait);
+		raw_spin_unlock_irq(&wc->endio_thread_wait.lock);
+
+		if (unlikely(kthread_should_stop())) {
+			finish_swait(&wc->endio_thread_wait, &wait);
+			break;
+		}
+
+		schedule();
+
+		raw_spin_lock_irq(&wc->endio_thread_wait.lock);
+		__finish_swait(&wc->endio_thread_wait, &wait);
+		goto continue_locked;
+
+pop_from_list:
+		list = wc->endio_list;
+		list.next->prev = list.prev->next = &list;
+		INIT_LIST_HEAD(&wc->endio_list);
+		raw_spin_unlock_irq(&wc->endio_thread_wait.lock);
+
+		if (!WC_MODE_FUA(wc))
+			writecache_disk_flush(wc, wc->dev);
+
+		wc_lock(wc);
+
+		if (WC_MODE_PMEM(wc)) {
+			__writecache_endio_pmem(wc, &list);
+		} else {
+			__writecache_endio_ssd(wc, &list);
+			writecache_wait_for_ios(wc, READ);
+		}
+
+		writecache_commit_flushed(wc);
+
+		wc_unlock(wc);
+	}
+
+	return 0;
+}
+
+static bool wc_add_block(struct writeback_struct *wb, struct wc_entry *e, gfp_t gfp)
+{
+	struct dm_writecache *wc = wb->wc;
+	unsigned block_size = wc->block_size;
+	void *address = memory_data(wc, e);
+
+	persistent_memory_flush_cache(address, block_size);
+	return bio_add_page(&wb->bio, persistent_memory_page(address),
+			    block_size, persistent_memory_page_offset(address)) != 0;
+}
+
+struct writeback_list {
+	struct list_head list;
+	size_t size;
+};
+
+static void __writeback_throttle(struct dm_writecache *wc, struct writeback_list *wbl)
+{
+	if (unlikely(wc->max_writeback_jobs)) {
+		if (READ_ONCE(wc->writeback_size) - wbl->size >= wc->max_writeback_jobs) {
+			wc_lock(wc);
+			while (wc->writeback_size - wbl->size >= wc->max_writeback_jobs) {
+				writecache_wait_on_freelist_long(wc);
+			}
+			wc_unlock(wc);
+		}
+	}
+	cond_resched();
+}
+
+static void __writecache_writeback_pmem(struct dm_writecache *wc, struct writeback_list *wbl)
+{
+	struct wc_entry *e, *f;
+	struct bio *bio;
+	struct writeback_struct *wb;
+	unsigned max_pages;
+
+	while (wbl->size) {
+		wbl->size--;
+		e = container_of(wbl->list.prev, struct wc_entry, lru);
+		list_del(&e->lru);
+
+		max_pages = e->wc_list_contiguous;
+
+		bio = bio_alloc_bioset(GFP_NOIO, max_pages, wc->bio_set);
+		wb = container_of(bio, struct writeback_struct, bio);
+		wb->wc = wc;
+		wb->bio.bi_end_io = writecache_writeback_endio;
+		bio_set_dev(&wb->bio, wc->dev->bdev);
+		wb->bio.bi_iter.bi_sector = read_original_sector(wc, e);
+		wb->page_offset = PAGE_SIZE;
+		if (max_pages > WB_LIST_INLINE) {
+			wb->wc_list = kmalloc(max_pages * sizeof(struct wc_entry *),
+					      GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
+			if (unlikely(!wb->wc_list))
+				goto use_inline_list;
+		} else {
+use_inline_list:
+			wb->wc_list = wb->wc_list_inline;
+			max_pages = WB_LIST_INLINE;
+		}
+
+		BUG_ON(!wc_add_block(wb, e, GFP_NOIO));
+
+		wb->wc_list[0] = e;
+		wb->wc_list_n = 1;
+
+		while (wbl->size && wb->wc_list_n < max_pages) {
+			f = container_of(wbl->list.prev, struct wc_entry, lru);
+			if (read_original_sector(wc, f) !=
+			    read_original_sector(wc, e) + (wc->block_size >> SECTOR_SHIFT))
+				break;
+			if (!wc_add_block(wb, f, GFP_NOWAIT | __GFP_NOWARN))
+				break;
+			wbl->size--;
+			list_del(&f->lru);
+			wb->wc_list[wb->wc_list_n++] = f;
+			e = f;
+		}
+		bio_set_op_attrs(&wb->bio, REQ_OP_WRITE, WC_MODE_FUA(wc) * REQ_FUA);
+		if (writecache_has_error(wc)) {
+			bio->bi_status = BLK_STS_IOERR;
+			bio_endio(&wb->bio);
+		} else {
+			submit_bio(&wb->bio);
+		}
+
+		__writeback_throttle(wc, wbl);
+	}
+}
+
+static void __writecache_writeback_ssd(struct dm_writecache *wc, struct writeback_list *wbl)
+{
+	struct wc_entry *e, *f;
+	struct dm_io_region from, to;
+	struct copy_struct *c;
+
+	while (wbl->size) {
+		unsigned n_sectors;
+
+		wbl->size--;
+		e = container_of(wbl->list.prev, struct wc_entry, lru);
+		list_del(&e->lru);
+
+		n_sectors = e->wc_list_contiguous << (wc->block_size_bits - SECTOR_SHIFT);
+
+		from.bdev = wc->ssd_dev->bdev;
+		from.sector = cache_sector(wc, e);
+		from.count = n_sectors;
+		to.bdev = wc->dev->bdev;
+		to.sector = read_original_sector(wc, e);
+		to.count = n_sectors;
+
+		c = mempool_alloc(wc->copy_pool, GFP_NOIO);
+		c->wc = wc;
+		c->e = e;
+		c->n_entries = e->wc_list_contiguous;
+
+		while ((n_sectors -= wc->block_size >> SECTOR_SHIFT)) {
+			wbl->size--;
+			f = container_of(wbl->list.prev, struct wc_entry, lru);
+			BUG_ON(f != e + 1);
+			list_del(&f->lru);
+			e = f;
+		}
+
+		dm_kcopyd_copy(wc->dm_kcopyd, &from, 1, &to, 0, writecache_copy_endio, c);
+
+		__writeback_throttle(wc, wbl);
+	}
+}
+
+static void writecache_writeback(struct work_struct *work)
+{
+	struct dm_writecache *wc = container_of(work, struct dm_writecache, writeback_work);
+	struct blk_plug plug;
+	struct wc_entry *e, *f, *g;
+	struct rb_node *node, *next_node;
+	struct list_head skipped;
+	struct writeback_list wbl;
+	unsigned long n_walked;
+
+	wc_lock(wc);
+restart:
+	if (writecache_has_error(wc)) {
+		wc_unlock(wc);
+		return;
+	}
+
+	if (unlikely(wc->writeback_all)) {
+		if (writecache_wait_for_writeback(wc))
+			goto restart;
+	}
+
+	if (wc->overwrote_committed) {
+		writecache_wait_for_ios(wc, WRITE);
+	}
+
+	n_walked = 0;
+	INIT_LIST_HEAD(&skipped);
+	INIT_LIST_HEAD(&wbl.list);
+	wbl.size = 0;
+	while (!list_empty(&wc->lru) &&
+	       (wc->writeback_all ||
+		wc->freelist_size + wc->writeback_size <= wc->freelist_high_watermark)) {
+
+		n_walked++;
+		if (unlikely(n_walked > WRITEBACK_LATENCY) &&
+		    likely(!wc->writeback_all) && likely(!dm_suspended(wc->ti))) {
+			queue_work(wc->writeback_wq, &wc->writeback_work);
+			break;
+		}
+
+		e = container_of(wc->lru.prev, struct wc_entry, lru);
+		BUG_ON(e->write_in_progress);
+		if (unlikely(!writecache_entry_is_committed(wc, e))) {
+			writecache_flush(wc);
+		}
+		node = rb_prev(&e->rb_node);
+		if (node) {
+			f = container_of(node, struct wc_entry, rb_node);
+			if (unlikely(read_original_sector(wc, f) ==
+				     read_original_sector(wc, e))) {
+				BUG_ON(!f->write_in_progress);
+				list_del(&e->lru);
+				list_add(&e->lru, &skipped);
+				cond_resched();
+				continue;
+			}
+		}
+		wc->writeback_size++;
+		list_del(&e->lru);
+		list_add(&e->lru, &wbl.list);
+		wbl.size++;
+		e->write_in_progress = true;
+		e->wc_list_contiguous = 1;
+
+		f = e;
+
+		while (1) {
+			next_node = rb_next(&f->rb_node);
+			if (unlikely(!next_node))
+				break;
+			g = container_of(next_node, struct wc_entry, rb_node);
+			if (read_original_sector(wc, g) ==
+			    read_original_sector(wc, f)) {
+				f = g;
+				continue;
+			}
+			if (read_original_sector(wc, g) !=
+			    read_original_sector(wc, f) + (wc->block_size >> SECTOR_SHIFT))
+				break;
+			if (unlikely(g->write_in_progress))
+				break;
+			if (unlikely(!writecache_entry_is_committed(wc, g)))
+				break;
+
+			if (!WC_MODE_PMEM(wc)) {
+				if (g != f + 1)
+					break;
+			}
+
+			n_walked++;
+			//if (unlikely(n_walked > WRITEBACK_LATENCY) && likely(!wc->writeback_all))
+			//	break;
+
+			wc->writeback_size++;
+			list_del(&g->lru);
+			list_add(&g->lru, &wbl.list);
+			wbl.size++;
+			g->write_in_progress = true;
+			g->wc_list_contiguous = BIO_MAX_PAGES;
+			f = g;
+			e->wc_list_contiguous++;
+			if (unlikely(e->wc_list_contiguous == BIO_MAX_PAGES))
+				break;
+		}
+		cond_resched();
+	}
+
+	if (!list_empty(&skipped)) {
+		list_splice_tail(&skipped, &wc->lru);
+		/*
+		 * If we didn't make any progress, we must wait until some
+		 * writeback finishes to avoid burning CPU in a loop
+		 */
+		if (unlikely(!wbl.size))
+			writecache_wait_for_writeback(wc);
+	}
+
+	wc_unlock(wc);
+
+	blk_start_plug(&plug);
+
+	if (WC_MODE_PMEM(wc))
+		__writecache_writeback_pmem(wc, &wbl);
+	else
+		__writecache_writeback_ssd(wc, &wbl);
+
+	blk_finish_plug(&plug);
+
+	if (unlikely(wc->writeback_all)) {
+		wc_lock(wc);
+		while (writecache_wait_for_writeback(wc));
+		wc_unlock(wc);
+	}
+}
+
+static int calculate_memory_size(uint64_t device_size, unsigned block_size,
+				 size_t *n_blocks_p, size_t *n_metadata_blocks_p)
+{
+	uint64_t n_blocks, offset;
+	struct wc_entry e;
+
+	n_blocks = device_size;
+	do_div(n_blocks, block_size + sizeof(struct wc_memory_entry));
+
+	while (1) {
+		if (!n_blocks)
+			return -ENOSPC;
+		/* Verify the following entries[n_blocks] won't overflow */
+		if (n_blocks >= (size_t)-sizeof(struct wc_memory_superblock) / sizeof(struct wc_memory_entry))
+			return -EFBIG;
+		offset = offsetof(struct wc_memory_superblock, entries[n_blocks]);
+		offset = (offset + block_size - 1) & ~(uint64_t)(block_size - 1);
+		if (offset + n_blocks * block_size <= device_size)
+			break;
+		n_blocks--;
+	}
+
+	/* check if the bit field overflows */
+	e.index = n_blocks;
+	if (e.index != n_blocks)
+		return -EFBIG;
+
+	if (n_blocks_p)
+		*n_blocks_p = n_blocks;
+	if (n_metadata_blocks_p)
+		*n_metadata_blocks_p = offset >> __ffs(block_size);
+	return 0;
+}
+
+static int init_memory(struct dm_writecache *wc)
+{
+	size_t b;
+	int r;
+
+	r = calculate_memory_size(wc->memory_map_size, wc->block_size, &wc->n_blocks, NULL);
+	if (r)
+		return r;
+
+	r = writecache_alloc_entries(wc);
+	if (r)
+		return r;
+
+	for (b = 0; b < ARRAY_SIZE(sb(wc)->padding); b++)
+		NT_STORE(sb(wc)->padding[b], cpu_to_le64(0));
+	NT_STORE(sb(wc)->version, cpu_to_le32(MEMORY_SUPERBLOCK_VERSION));
+	NT_STORE(sb(wc)->block_size, cpu_to_le32(wc->block_size));
+	NT_STORE(sb(wc)->n_blocks, cpu_to_le64(wc->n_blocks));
+	NT_STORE(sb(wc)->seq_count, cpu_to_le64(0));
+
+	for (b = 0; b < wc->n_blocks; b++)
+		write_original_sector_seq_count(wc, &wc->entries[b], -1, -1);
+
+	writecache_flush_all_metadata(wc);
+	writecache_commit_flushed(wc);
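+	/*
+	 * Write the magic last, so that the superblock only becomes valid
+	 * once all other metadata has been committed.
+	 */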
+	NT_STORE(sb(wc)->magic, cpu_to_le32(MEMORY_SUPERBLOCK_MAGIC));
+	writecache_flush_region(wc, &sb(wc)->magic, sizeof sb(wc)->magic);
+	writecache_commit_flushed(wc);
+
+	return 0;
+}
+
+static void writecache_dtr(struct dm_target *ti)
+{
+	struct dm_writecache *wc = ti->private;
+
+	if (!wc)
+		return;
+
+	if (wc->endio_thread)
+		kthread_stop(wc->endio_thread);
+
+	if (wc->flush_thread)
+		kthread_stop(wc->flush_thread);
+
+	if (wc->bio_set)
+		bioset_free(wc->bio_set);
+
+	mempool_destroy(wc->copy_pool);
+
+	if (wc->writeback_wq)
+		destroy_workqueue(wc->writeback_wq);
+
+	if (wc->dev)
+		dm_put_device(ti, wc->dev);
+
+	if (wc->ssd_dev)
+		dm_put_device(ti, wc->ssd_dev);
+
+	if (wc->entries)
+		vfree(wc->entries);
+
+	if (wc->memory_map) {
+		if (WC_MODE_PMEM(wc))
+			persistent_memory_release(wc);
+		else
+			vfree(wc->memory_map);
+	}
+
+	if (wc->dm_kcopyd)
+		dm_kcopyd_client_destroy(wc->dm_kcopyd);
+
+	if (wc->dm_io)
+		dm_io_client_destroy(wc->dm_io);
+
+	if (wc->dirty_bitmap)
+		vfree(wc->dirty_bitmap);
+
+	kfree(wc);
+}
+
+static int writecache_ctr(struct dm_target *ti, unsigned argc, char **argv)
+{
+	struct dm_writecache *wc;
+	struct dm_arg_set as;
+	const char *string;
+	unsigned opt_params;
+	size_t offset, data_size;
+	int i, r;
+	char dummy;
+	int high_wm_percent = HIGH_WATERMARK;
+	int low_wm_percent = LOW_WATERMARK;
+	uint64_t x;
+	struct wc_memory_superblock s;
+
+	static struct dm_arg _args[] = {
+		{0, 10, "Invalid number of feature args"},
+	};
+
+	as.argc = argc;
+	as.argv = argv;
+
+	wc = kzalloc(sizeof(struct dm_writecache), GFP_KERNEL);
+	if (!wc) {
+		ti->error = "Cannot allocate writecache structure";
+		r = -ENOMEM;
+		goto bad;
+	}
+	ti->private = wc;
+	wc->ti = ti;
+
+	mutex_init(&wc->lock);
+	writecache_poison_lists(wc);
+	init_swait_queue_head(&wc->freelist_wait);
+	timer_setup(&wc->autocommit_timer, writecache_autocommit_timer, 0);
+
+	for (i = 0; i < 2; i++) {
+		atomic_set(&wc->bio_in_progress[i], 0);
+		init_swait_queue_head(&wc->bio_in_progress_wait[i]);
+	}
+
+	wc->dm_io = dm_io_client_create();
+	if (!wc->dm_io) {
+		r = -ENOMEM;
+		ti->error = "Unable to allocate dm-io client";
+		goto bad;
+	}
+
+	wc->writeback_wq = alloc_workqueue("writecache-writeback", WQ_MEM_RECLAIM, 1);
+	if (!wc->writeback_wq) {
+		r = -ENOMEM;
+		ti->error = "Could not allocate writeback workqueue";
+		goto bad;
+	}
+	INIT_WORK(&wc->writeback_work, writecache_writeback);
+	INIT_WORK(&wc->flush_work, writecache_flush_work);
+
+	init_swait_queue_head(&wc->endio_thread_wait);
+	INIT_LIST_HEAD(&wc->endio_list);
+	wc->endio_thread = kthread_create(writecache_endio_thread, wc, "writecache_endio");
+	if (IS_ERR(wc->endio_thread)) {
+		r = PTR_ERR(wc->endio_thread);
+		wc->endio_thread = NULL;
+		ti->error = "Couldn't spawn endio thread";
+		goto bad;
+	}
+	wake_up_process(wc->endio_thread);
+
+	/*
+	 * Parse the mode (pmem or ssd)
+	 */
+	string = dm_shift_arg(&as);
+	if (!string)
+		goto bad_arguments;
+
+	if (!strcasecmp(string, "s")) {
+#ifndef DM_WRITECACHE_ONLY_SSD
+		wc->pmem_mode = false;
+#endif
+	} else if (!strcasecmp(string, "p")) {
+#ifndef DM_WRITECACHE_ONLY_SSD
+		wc->pmem_mode = true;
+		wc->writeback_fua = true;
+#else
+		r = -EOPNOTSUPP;
+		ti->error = "Persistent memory not supported on this architecture";
+		goto bad;
+#endif
+	} else {
+		goto bad_arguments;
+	}
+
+	if (WC_MODE_PMEM(wc)) {
+		wc->bio_set = bioset_create(BIO_POOL_SIZE,
+					    offsetof(struct writeback_struct, bio),
+					    BIOSET_NEED_BVECS);
+		if (!wc->bio_set) {
+			r = -ENOMEM;
+			ti->error = "Could not allocate bio set";
+			goto bad;
+		}
+	} else {
+		wc->copy_pool = mempool_create_kmalloc_pool(1, sizeof(struct copy_struct));
+		if (!wc->copy_pool) {
+			r = -ENOMEM;
+			ti->error = "Could not allocate mempool";
+			goto bad;
+		}
+	}
+
+	/*
+	 * Parse the origin data device
+	 */
+	string = dm_shift_arg(&as);
+	if (!string)
+		goto bad_arguments;
+	r = dm_get_device(ti, string, dm_table_get_mode(ti->table), &wc->dev);
+	if (r) {
+		ti->error = "Origin data device lookup failed";
+		goto bad;
+	}
+
+	/*
+	 * Parse cache data device (be it pmem or ssd)
+	 */
+	string = dm_shift_arg(&as);
+	if (!string)
+		goto bad_arguments;
+
+	r = dm_get_device(ti, string, dm_table_get_mode(ti->table), &wc->ssd_dev);
+	if (r) {
+		ti->error = "Cache data device lookup failed";
+		goto bad;
+	}
+	wc->memory_map_size = i_size_read(wc->ssd_dev->bdev->bd_inode);
+
+	if (WC_MODE_PMEM(wc)) {
+		r = persistent_memory_claim(wc);
+		if (r) {
+			ti->error = "Unable to map persistent memory for cache";
+			goto bad;
+		}
+	}
+
+	/*
+	 * Parse the cache block size
+	 */
+	string = dm_shift_arg(&as);
+	if (!string)
+		goto bad_arguments;
+	if (sscanf(string, "%u%c", &wc->block_size, &dummy) != 1 ||
+	    wc->block_size < 512 || wc->block_size > PAGE_SIZE ||
+	    (wc->block_size & (wc->block_size - 1))) {
+		r = -EINVAL;
+		ti->error = "Invalid block size";
+		goto bad;
+	}
+	wc->block_size_bits = __ffs(wc->block_size);
+
+	wc->max_writeback_jobs = MAX_WRITEBACK_JOBS;
+	wc->autocommit_blocks = !WC_MODE_PMEM(wc) ? AUTOCOMMIT_BLOCKS_SSD : AUTOCOMMIT_BLOCKS_PMEM;
+	wc->autocommit_jiffies = msecs_to_jiffies(AUTOCOMMIT_MSEC);
+
+	/*
+	 * Parse optional arguments
+	 */
+	r = dm_read_arg_group(_args, &as, &opt_params, &ti->error);
+	if (r)
+		goto bad;
+
+	while (opt_params) {
+		string = dm_shift_arg(&as), opt_params--;
+		if (!strcasecmp(string, "high_watermark") && opt_params >= 1) {
+			string = dm_shift_arg(&as), opt_params--;
+			if (sscanf(string, "%d%c", &high_wm_percent, &dummy) != 1)
+				goto invalid_optional;
+			if (high_wm_percent < 0 || high_wm_percent > 100)
+				goto invalid_optional;
+			wc->high_wm_percent_set = true;
+		} else if (!strcasecmp(string, "low_watermark") && opt_params >= 1) {
+			string = dm_shift_arg(&as), opt_params--;
+			if (sscanf(string, "%d%c", &low_wm_percent, &dummy) != 1)
+				goto invalid_optional;
+			if (low_wm_percent < 0 || low_wm_percent > 100)
+				goto invalid_optional;
+			wc->low_wm_percent_set = true;
+		} else if (!strcasecmp(string, "writeback_jobs") && opt_params >= 1) {
+			string = dm_shift_arg(&as), opt_params--;
+			if (sscanf(string, "%u%c", &wc->max_writeback_jobs, &dummy) != 1)
+				goto invalid_optional;
+			wc->max_writeback_jobs_set = true;
+		} else if (!strcasecmp(string, "autocommit_blocks") && opt_params >= 1) {
+			string = dm_shift_arg(&as), opt_params--;
+			if (sscanf(string, "%u%c", &wc->autocommit_blocks, &dummy) != 1)
+				goto invalid_optional;
+			wc->autocommit_blocks_set = true;
+		} else if (!strcasecmp(string, "autocommit_time") && opt_params >= 1) {
+			unsigned autocommit_msecs;
+			string = dm_shift_arg(&as), opt_params--;
+			if (sscanf(string, "%u%c", &autocommit_msecs, &dummy) != 1)
+				goto invalid_optional;
+			if (autocommit_msecs > 3600000)
+				goto invalid_optional;
+			wc->autocommit_jiffies = msecs_to_jiffies(autocommit_msecs);
+			wc->autocommit_time_set = true;
+		} else if (!strcasecmp(string, "fua")) {
+			if (WC_MODE_PMEM(wc)) {
+#ifndef DM_WRITECACHE_ONLY_SSD
+				wc->writeback_fua = true;
+				wc->writeback_fua_set = true;
+#endif
+			} else goto invalid_optional;
+		} else if (!strcasecmp(string, "nofua")) {
+			if (WC_MODE_PMEM(wc)) {
+#ifndef DM_WRITECACHE_ONLY_SSD
+				wc->writeback_fua = false;
+				wc->writeback_fua_set = true;
+#endif
+			} else goto invalid_optional;
+		} else {
+invalid_optional:
+			r = -EINVAL;
+			ti->error = "Invalid optional argument";
+			goto bad;
+		}
+	}
+
+	if (!WC_MODE_PMEM(wc)) {
+		struct dm_io_region region;
+		struct dm_io_request req;
+		size_t n_blocks, n_metadata_blocks;
+		uint64_t n_bitmap_bits;
+
+		init_completion(&wc->flush_completion);
+		wc->flush_thread = kthread_create(writecache_flush_thread, wc, "dm_writecache_flush");
+		if (IS_ERR(wc->flush_thread)) {
+			r = PTR_ERR(wc->flush_thread);
+			wc->flush_thread = NULL;
+			ti->error = "Couldn't spawn flush thread";
+			goto bad;
+		}
+		writecache_offload_bio(wc, NULL);
+
+		r = calculate_memory_size(wc->memory_map_size, wc->block_size,
+					  &n_blocks, &n_metadata_blocks);
+		if (r) {
+			ti->error = "Invalid device size";
+			goto bad;
+		}
+
+		n_bitmap_bits = (((uint64_t)n_metadata_blocks << wc->block_size_bits) +
+				 BITMAP_GRANULARITY - 1) / BITMAP_GRANULARITY;
+		/* this is a limitation of the test_bit functions */
+		if (n_bitmap_bits > 1U << 31) {
+			r = -EFBIG;
+			ti->error = "Invalid device size";
+			goto bad;
+		}
+
+		wc->memory_map = vmalloc(n_metadata_blocks << wc->block_size_bits);
+		if (!wc->memory_map) {
+			r = -ENOMEM;
+			ti->error = "Unable to allocate memory for metadata";
+			goto bad;
+		}
+
+		wc->dm_kcopyd = dm_kcopyd_client_create(&dm_kcopyd_throttle);
+		if (!wc->dm_kcopyd) {
+			r = -ENOMEM;
+			ti->error = "Unable to allocate dm-kcopyd client";
+			goto bad;
+		}
+
+		wc->metadata_sectors = n_metadata_blocks << (wc->block_size_bits - SECTOR_SHIFT);
+		wc->dirty_bitmap_size = (n_bitmap_bits + BITS_PER_LONG - 1) /
+			BITS_PER_LONG * sizeof(unsigned long);
+		wc->dirty_bitmap = vzalloc(wc->dirty_bitmap_size);
+		if (!wc->dirty_bitmap) {
+			r = -ENOMEM;
+			ti->error = "Unable to allocate dirty bitmap";
+			goto bad;
+		}
+
+		region.bdev = wc->ssd_dev->bdev;
+		region.sector = 0;
+		region.count = wc->metadata_sectors;
+		req.bi_op = REQ_OP_READ;
+		req.bi_op_flags = REQ_SYNC;
+		req.mem.type = DM_IO_VMA;
+		req.mem.ptr.vma = (char *)wc->memory_map;
+		req.client = wc->dm_io;
+		req.notify.fn = NULL;
+
+		r = dm_io(&req, 1, &region, NULL);
+		if (r) {
+			ti->error = "Unable to read metadata";
+			goto bad;
+		}
+	}
+
+	r = memcpy_mcsafe(&s, sb(wc), sizeof(struct wc_memory_superblock));
+	if (r) {
+		ti->error = "Hardware memory error when reading superblock";
+		goto bad;
+	}
+	if (!le32_to_cpu(s.magic) && !le32_to_cpu(s.version)) {
+		r = init_memory(wc);
+		if (r) {
+			ti->error = "Unable to initialize device";
+			goto bad;
+		}
+		r = memcpy_mcsafe(&s, sb(wc), sizeof(struct wc_memory_superblock));
+		if (r) {
+			ti->error = "Hardware memory error when reading superblock";
+			goto bad;
+		}
+	}
+
+	if (le32_to_cpu(s.magic) != MEMORY_SUPERBLOCK_MAGIC) {
+		ti->error = "Invalid magic in the superblock";
+		r = -EINVAL;
+		goto bad;
+	}
+
+	if (le32_to_cpu(s.version) != MEMORY_SUPERBLOCK_VERSION) {
+		ti->error = "Invalid version in the superblock";
+		r = -EINVAL;
+		goto bad;
+	}
+
+	if (le32_to_cpu(s.block_size) != wc->block_size) {
+		ti->error = "Block size does not match superblock";
+		r = -EINVAL;
+		goto bad;
+	}
+
+	wc->n_blocks = le64_to_cpu(s.n_blocks);
+
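+	/*
+	 * Check that the metadata and data areas derived from n_blocks fit
+	 * into the device without overflowing size_t arithmetic.
+	 */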
+	offset = wc->n_blocks * sizeof(struct wc_memory_entry);
+	if (offset / sizeof(struct wc_memory_entry) != le64_to_cpu(sb(wc)->n_blocks)) {
+overflow:
+		ti->error = "Overflow in size calculation";
+		r = -EINVAL;
+		goto bad;
+	}
+	offset += sizeof(struct wc_memory_superblock);
+	if (offset < sizeof(struct wc_memory_superblock))
+		goto overflow;
+	offset = (offset + wc->block_size - 1) & ~(size_t)(wc->block_size - 1);
+	data_size = wc->n_blocks * (size_t)wc->block_size;
+	if (!offset || (data_size / wc->block_size != wc->n_blocks) ||
+	    (offset + data_size < offset))
+		goto overflow;
+	if (offset + data_size > wc->memory_map_size) {
+		ti->error = "Memory area is too small";
+		r = -EINVAL;
+		goto bad;
+	}
+
+	wc->metadata_sectors = offset >> SECTOR_SHIFT;
+	wc->block_start = (char *)sb(wc) + offset;
+
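+	/*
+	 * The watermarks are specified as a percentage of used blocks;
+	 * convert them to the complementary number of free blocks, rounded
+	 * to the nearest block.
+	 */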
+	x = (uint64_t)wc->n_blocks * (100 - high_wm_percent);
+	x += 50;
+	do_div(x, 100);
+	wc->freelist_high_watermark = x;
+	x = (uint64_t)wc->n_blocks * (100 - low_wm_percent);
+	x += 50;
+	do_div(x, 100);
+	wc->freelist_low_watermark = x;
+
+	r = writecache_alloc_entries(wc);
+	if (r) {
+		ti->error = "Cannot allocate memory";
+		goto bad;
+	}
+
+	ti->num_flush_bios = 1;
+	ti->flush_supported = true;
+	ti->num_discard_bios = 1;
+
+	if (WC_MODE_PMEM(wc))
+		persistent_memory_flush_cache(wc->memory_map, wc->memory_map_size);
+
+	return 0;
+
+bad_arguments:
+	r = -EINVAL;
+	ti->error = "Bad arguments";
+bad:
+	writecache_dtr(ti);
+	return r;
+}
+
+static void writecache_status(struct dm_target *ti, status_type_t type,
+			      unsigned status_flags, char *result, unsigned maxlen)
+{
+	struct dm_writecache *wc = ti->private;
+	unsigned extra_args;
+	unsigned sz = 0;
+	uint64_t x;
+
+	switch (type) {
+	case STATUSTYPE_INFO:
+		DMEMIT("%ld %llu %llu %llu", writecache_has_error(wc),
+		       (unsigned long long)wc->n_blocks, (unsigned long long)wc->freelist_size,
+		       (unsigned long long)wc->writeback_size);
+		break;
+	case STATUSTYPE_TABLE:
+		DMEMIT("%c %s %s %u ", WC_MODE_PMEM(wc) ? 'p' : 's',
+				wc->dev->name, wc->ssd_dev->name, wc->block_size);
+		extra_args = 0;
+		if (wc->high_wm_percent_set)
+			extra_args += 2;
+		if (wc->low_wm_percent_set)
+			extra_args += 2;
+		if (wc->max_writeback_jobs_set)
+			extra_args += 2;
+		if (wc->autocommit_blocks_set)
+			extra_args += 2;
+		if (wc->autocommit_time_set)
+			extra_args += 2;
+#ifndef DM_WRITECACHE_ONLY_SSD
+		if (wc->writeback_fua_set)
+			extra_args++;
+#endif
+		DMEMIT("%u", extra_args);
+		if (wc->high_wm_percent_set) {
+			x = (uint64_t)wc->freelist_high_watermark * 100;
+			x += wc->n_blocks / 2;
+			do_div(x, (size_t)wc->n_blocks);
+			DMEMIT(" high_watermark %u", 100 - (unsigned)x);
+		}
+		if (wc->low_wm_percent_set) {
+			x = (uint64_t)wc->freelist_low_watermark * 100;
+			x += wc->n_blocks / 2;
+			do_div(x, (size_t)wc->n_blocks);
+			DMEMIT(" low_watermark %u", 100 - (unsigned)x);
+		}
+		if (wc->max_writeback_jobs_set) {
+			DMEMIT(" writeback_jobs %u", wc->max_writeback_jobs);
+		}
+		if (wc->autocommit_blocks_set) {
+			DMEMIT(" autocommit_blocks %u", wc->autocommit_blocks);
+		}
+		if (wc->autocommit_time_set) {
+			DMEMIT(" autocommit_time %u", jiffies_to_msecs(wc->autocommit_jiffies));
+		}
+#ifndef DM_WRITECACHE_ONLY_SSD
+		if (wc->writeback_fua_set) {
+			DMEMIT(" %sfua", wc->writeback_fua ? "" : "no");
+		}
+#endif
+		break;
+	}
+}
+
+static struct target_type writecache_target = {
+	.name			= "writecache",
+	.version		= {1, 0, 0},
+	.module			= THIS_MODULE,
+	.ctr			= writecache_ctr,
+	.dtr			= writecache_dtr,
+	.status			= writecache_status,
+	.postsuspend		= writecache_suspend,
+	.resume			= writecache_resume,
+	.message		= writecache_message,
+	.map			= writecache_map,
+	.end_io			= writecache_end_io,
+	.iterate_devices	= writecache_iterate_devices,
+	.io_hints		= writecache_io_hints,
+};
+
+static int __init dm_writecache_init(void)
+{
+	int r;
+
+	r = dm_register_target(&writecache_target);
+	if (r < 0) {
+		DMERR("register failed %d", r);
+		return r;
+	}
+
+	return 0;
+}
+
+static void __exit dm_writecache_exit(void)
+{
+	dm_unregister_target(&writecache_target);
+}
+
+module_init(dm_writecache_init);
+module_exit(dm_writecache_exit);
+
+MODULE_DESCRIPTION(DM_NAME " writecache target");
+MODULE_AUTHOR("Mikulas Patocka <dm-devel@redhat.com>");
+MODULE_LICENSE("GPL");
Index: linux-2.6/Documentation/device-mapper/writecache.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/Documentation/device-mapper/writecache.txt	2018-05-15 07:09:32.000000000 +0200
@@ -0,0 +1,84 @@
+The writecache target caches writes on persistent memory or on SSD. It
+doesn't cache reads, because reads are supposed to be cached in the page
+cache in normal RAM.
+
+When the device is constructed, the first sector should be zeroed, or it
+should contain a valid superblock from a previous invocation.
+
+Constructor parameters (an example table line follows the list):
+1. type of the cache device - "p" or "s"
+	p - persistent memory
+	s - SSD
+2. the underlying device that will be cached
+3. the cache device
+4. block size (4096 is recommended; the maximum block size is the page
+   size)
+5. the number of optional parameters (the parameters with an argument
+   count as two)
+	high_watermark n	(default: 50)
+		start writeback when the number of used blocks reaches this
+		watermark
+	low_watermark x		(default: 45)
+		stop writeback when the number of used blocks drops below
+		this watermark
+	writeback_jobs n	(default: unlimited)
+		limit the number of blocks that are in flight during
+		writeback. Setting this value reduces writeback
+		throughput, but it may improve the latency of read requests
+	autocommit_blocks n	(default: 64 for pmem, 65536 for ssd)
+		when the application writes this number of blocks without
+		issuing a FLUSH request, the blocks are automatically
+		committed
+	autocommit_time ms	(default: 1000)
+		autocommit time in milliseconds. The data is automatically
+		committed if this time passes and no FLUSH request is
+		received
+	fua			(default: on)
+		applicable only to persistent memory - use the FUA flag
+		when writing data from persistent memory back to the
+		underlying device
+	nofua
+		applicable only to persistent memory - don't use the FUA
+		flag when writing back data and send the FLUSH request
+		afterwards
+		- some underlying devices perform better with fua, some
+		  with nofua. The user should benchmark both options
+
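+An example table line (the device names and the origin size of 41943040
+sectors are illustrative only):
+	0 41943040 writecache s /dev/origin /dev/fast-ssd 4096 0
+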
+Status (an example line follows the list):
+1. error indicator - 0 if there was no error, otherwise error number
+2. the number of blocks
+3. the number of free blocks
+4. the number of blocks under writeback
+
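+An example status line, with illustrative values: "0 262144 250000 128"
+(no error, 262144 blocks total, 250000 blocks free, 128 under writeback).
+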
+Messages:
+	flush
+		flush the cache device. The message returns successfully
+		if the cache device was flushed without an error
+	flush_on_suspend
+		flush the cache device on next suspend. Use this message
+		when you are going to remove the cache device. The proper
+		sequence for removing the cache device (example below) is:
+		1. send the "flush_on_suspend" message
+		2. load an inactive table with a linear target that maps
+		   to the underlying device
+		3. suspend the device
+		4. ask for status and verify that there are no errors
+		5. resume the device, so that it will use the linear
+		   target
+		6. the cache device is now inactive and it can be deleted
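+
+	An illustrative dmsetup sequence (the device name "wc" and the
+	table values are examples only):
+		dmsetup message wc 0 flush_on_suspend
+		dmsetup load wc --table "0 41943040 linear /dev/origin 0"
+		dmsetup suspend wc
+		dmsetup status wc
+		dmsetup resume wc
+		dmsetup remove wc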

^ permalink raw reply	[flat|nested] 108+ messages in thread

* [patch 4/4] dm-writecache: use new API for flushing
  2018-05-19  5:25 [patch 0/4] dm-writecache patches Mikulas Patocka
                   ` (2 preceding siblings ...)
  2018-05-19  5:25 ` [patch 3/4] dm-writecache Mikulas Patocka
@ 2018-05-19  5:25 ` Mikulas Patocka
  2018-05-22  6:39     ` Christoph Hellwig
  2018-05-25  3:12   ` Dan Williams
  3 siblings, 2 replies; 108+ messages in thread
From: Mikulas Patocka @ 2018-05-19  5:25 UTC (permalink / raw)
  To: Mikulas Patocka, Mike Snitzer, Dan Williams; +Cc: dm-devel

[-- Attachment #1: dm-writecache-interface.patch --]
[-- Type: text/plain, Size: 7451 bytes --]

Use new API for flushing persistent memory.

The problem is this:
* on X86-64, non-temporal stores have the best performance
* ARM64 doesn't have non-temporal stores, so we must flush cache. We
  should flush cache as late as possible, because it performs better this
  way.

We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit
data persistently, all three functions must be called.

The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign
(unlike pmem_memcpy) guarantees that 8-byte values are written atomically.

On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
pmem_commit is wmb.

On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
pmem_commit is empty.
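
For illustration, the intended calling sequence looks like this (a sketch
only; "dest" stands for any location in the memory-mapped cache):

	pmem_memcpy(dest, src, len);	/* write the data */
	pmem_flush(dest, len);		/* flush caches (no-op on X86) */
	pmem_commit();			/* commit (wmb on X86, no-op on ARM64) */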

Signed-off-by: Mike Snitzer <msnitzer@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/md/dm-writecache.c |  100 +++++++++++++++++++++++++--------------------
 1 file changed, 56 insertions(+), 44 deletions(-)

Index: linux-2.6/drivers/md/dm-writecache.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-writecache.c	2018-05-19 06:20:28.000000000 +0200
+++ linux-2.6/drivers/md/dm-writecache.c	2018-05-19 07:10:26.000000000 +0200
@@ -14,6 +14,7 @@
 #include <linux/dm-kcopyd.h>
 #include <linux/dax.h>
 #include <linux/pfn_t.h>
+#include <linux/libnvdimm.h>
 
 #define DM_MSG_PREFIX "writecache"
 
@@ -47,14 +48,48 @@
  * On ARM64, cache flushing is more efficient.
  */
 #if defined(CONFIG_X86_64)
-#define EAGER_DATA_FLUSH
-#define NT_STORE(dest, src)				\
-do {							\
-	typeof(src) val = (src);			\
-	memcpy_flushcache(&(dest), &val, sizeof(src));	\
+
+static void pmem_memcpy(void *dest, void *src, size_t len)
+{
+	memcpy_flushcache(dest, src, len);
+}
+
+#define __pmem_assign(dest, src, uniq)				\
+do {								\
+	typeof(dest) uniq = (src);				\
+	memcpy_flushcache(&(dest), &uniq, sizeof(dest));	\
 } while (0)
+
+#define pmem_assign(dest, src)					\
+	__pmem_assign(dest, src, __UNIQUE_ID(pmem_assign))
+
+static void pmem_flush(void *dest, size_t len)
+{
+}
+
+static void pmem_commit(void)
+{
+	wmb();
+}
+
 #else
-#define NT_STORE(dest, src)	WRITE_ONCE(dest, src)
+
+static void pmem_memcpy(void *dest, void *src, size_t len)
+{
+	memcpy(dest, src, len);
+}
+
+#define pmem_assign(dest, src)		WRITE_ONCE(dest, src)
+
+static void pmem_flush(void *dest, size_t len)
+{
+	arch_wb_cache_pmem(dest, len);
+}
+
+static void pmem_commit(void)
+{
+}
+
 #endif
 
 #if defined(__HAVE_ARCH_MEMCPY_MCSAFE) && !defined(DM_WRITECACHE_ONLY_SSD)
@@ -105,7 +140,7 @@ struct wc_entry {
 };
 
 #ifndef DM_WRITECACHE_ONLY_SSD
-#define WC_MODE_PMEM(wc)			((wc)->pmem_mode)
+#define WC_MODE_PMEM(wc)			(likely((wc)->pmem_mode))
 #define WC_MODE_FUA(wc)				((wc)->writeback_fua)
 #else
 #define WC_MODE_PMEM(wc)			false
@@ -400,21 +435,6 @@ static void persistent_memory_invalidate
 		invalidate_kernel_vmap_range(ptr, size);
 }
 
-static void persistent_memory_flush(struct dm_writecache *wc, void *ptr, size_t size)
-{
-#ifndef EAGER_DATA_FLUSH
-	dax_flush(wc->ssd_dev->dax_dev, ptr, size);
-#endif
-}
-
-static void persistent_memory_commit_flushed(void)
-{
-#ifdef EAGER_DATA_FLUSH
-	/* needed since memcpy_flushcache is used instead of dax_flush */
-	wmb();
-#endif
-}
-
 static struct wc_memory_superblock *sb(struct dm_writecache *wc)
 {
 	return wc->memory_map;
@@ -462,21 +482,20 @@ static void clear_seq_count(struct dm_wr
 #ifdef DM_WRITECACHE_HANDLE_HARDWARE_ERRORS
 	e->seq_count = -1;
 #endif
-	NT_STORE(memory_entry(wc, e)->seq_count, cpu_to_le64(-1));
+	pmem_assign(memory_entry(wc, e)->seq_count, cpu_to_le64(-1));
 }
 
 static void write_original_sector_seq_count(struct dm_writecache *wc, struct wc_entry *e,
 					    uint64_t original_sector, uint64_t seq_count)
 {
-	struct wc_memory_entry *me_p, me;
+	struct wc_memory_entry me;
 #ifdef DM_WRITECACHE_HANDLE_HARDWARE_ERRORS
 	e->original_sector = original_sector;
 	e->seq_count = seq_count;
 #endif
-	me_p = memory_entry(wc, e);
 	me.original_sector = cpu_to_le64(original_sector);
 	me.seq_count = cpu_to_le64(seq_count);
-	NT_STORE(*me_p, me);
+	pmem_assign(*memory_entry(wc, e), me);
 }
 
 #define writecache_error(wc, err, msg, arg...)				\
@@ -491,8 +510,7 @@ do {									\
 static void writecache_flush_all_metadata(struct dm_writecache *wc)
 {
 	if (WC_MODE_PMEM(wc)) {
-		persistent_memory_flush(wc,
-			sb(wc), offsetof(struct wc_memory_superblock, entries[wc->n_blocks]));
+		pmem_flush(sb(wc), offsetof(struct wc_memory_superblock, entries[wc->n_blocks]));
 	} else {
 		memset(wc->dirty_bitmap, -1, wc->dirty_bitmap_size);
 	}
@@ -501,7 +519,7 @@ static void writecache_flush_all_metadat
 static void writecache_flush_region(struct dm_writecache *wc, void *ptr, size_t size)
 {
 	if (WC_MODE_PMEM(wc))
-		persistent_memory_flush(wc, ptr, size);
+		pmem_flush(ptr, size);
 	else
 		__set_bit(((char *)ptr - (char *)wc->memory_map) / BITMAP_GRANULARITY,
 			  wc->dirty_bitmap);
@@ -579,7 +597,7 @@ static void ssd_commit_flushed(struct dm
 static void writecache_commit_flushed(struct dm_writecache *wc)
 {
 	if (WC_MODE_PMEM(wc))
-		persistent_memory_commit_flushed();
+		pmem_commit();
 	else
 		ssd_commit_flushed(wc);
 }
@@ -788,10 +806,8 @@ static void writecache_poison_lists(stru
 static void writecache_flush_entry(struct dm_writecache *wc, struct wc_entry *e)
 {
 	writecache_flush_region(wc, memory_entry(wc, e), sizeof(struct wc_memory_entry));
-#ifndef EAGER_DATA_FLUSH
 	if (WC_MODE_PMEM(wc))
 		writecache_flush_region(wc, memory_data(wc, e), wc->block_size);
-#endif
 }
 
 static bool writecache_entry_is_committed(struct dm_writecache *wc, struct wc_entry *e)
@@ -834,7 +850,7 @@ static void writecache_flush(struct dm_w
 	writecache_wait_for_ios(wc, WRITE);
 
 	wc->seq_count++;
-	NT_STORE(sb(wc)->seq_count, cpu_to_le64(wc->seq_count));
+	pmem_assign(sb(wc)->seq_count, cpu_to_le64(wc->seq_count));
 	writecache_flush_region(wc, &sb(wc)->seq_count, sizeof sb(wc)->seq_count);
 	writecache_commit_flushed(wc);
 
@@ -1152,11 +1168,7 @@ static void bio_copy_block(struct dm_wri
 			}
 		} else {
 			flush_dcache_page(bio_page(bio));
-#ifdef EAGER_DATA_FLUSH
-			memcpy_flushcache(data, buf, size);
-#else
-			memcpy(data, buf, size);
-#endif
+			pmem_memcpy(data, buf, size);
 		}
 
 		bvec_kunmap_irq(buf, &flags);
@@ -1850,18 +1862,18 @@ static int init_memory(struct dm_writeca
 		return r;
 
 	for (b = 0; b < ARRAY_SIZE(sb(wc)->padding); b++)
-		NT_STORE(sb(wc)->padding[b], cpu_to_le64(0));
-	NT_STORE(sb(wc)->version, cpu_to_le32(MEMORY_SUPERBLOCK_VERSION));
-	NT_STORE(sb(wc)->block_size, cpu_to_le32(wc->block_size));
-	NT_STORE(sb(wc)->n_blocks, cpu_to_le64(wc->n_blocks));
-	NT_STORE(sb(wc)->seq_count, cpu_to_le64(0));
+		pmem_assign(sb(wc)->padding[b], cpu_to_le64(0));
+	pmem_assign(sb(wc)->version, cpu_to_le32(MEMORY_SUPERBLOCK_VERSION));
+	pmem_assign(sb(wc)->block_size, cpu_to_le32(wc->block_size));
+	pmem_assign(sb(wc)->n_blocks, cpu_to_le64(wc->n_blocks));
+	pmem_assign(sb(wc)->seq_count, cpu_to_le64(0));
 
 	for (b = 0; b < wc->n_blocks; b++)
 		write_original_sector_seq_count(wc, &wc->entries[b], -1, -1);
 
 	writecache_flush_all_metadata(wc);
 	writecache_commit_flushed(wc);
-	NT_STORE(sb(wc)->magic, cpu_to_le32(MEMORY_SUPERBLOCK_MAGIC));
+	pmem_assign(sb(wc)->magic, cpu_to_le32(MEMORY_SUPERBLOCK_MAGIC));
 	writecache_flush_region(wc, &sb(wc)->magic, sizeof sb(wc)->magic);
 	writecache_commit_flushed(wc);
 

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 1/4] x86: optimize memcpy_flushcache
  2018-05-19  5:25 ` [patch 1/4] x86: optimize memcpy_flushcache Mikulas Patocka
@ 2018-05-19 14:21   ` Dan Williams
  2018-05-24 18:20     ` [PATCH v2] " Mike Snitzer
  0 siblings, 1 reply; 108+ messages in thread
From: Dan Williams @ 2018-05-19 14:21 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Mike Snitzer, X86 ML, Ingo Molnar, device-mapper development,
	Thomas Gleixner

[ add x86 folks for their review / ack ]

On Fri, May 18, 2018 at 10:25 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> I use memcpy_flushcache in my persistent memory driver for metadata
> updates and it turns out that the overhead of memcpy_flushcache causes 2%
> performance degradation compared to "movnti" instruction explicitly coded
> using inline assembler.
>
> This patch recognizes memcpy_flushcache calls with constant short length
> and turns them into inline assembler - so that I don't have to use inline
> assembler in the driver.
>
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
>
> ---
>  arch/x86/include/asm/string_64.h |   20 +++++++++++++++++++-
>  arch/x86/lib/usercopy_64.c       |    4 ++--
>  2 files changed, 21 insertions(+), 3 deletions(-)
>
> Index: linux-2.6/arch/x86/include/asm/string_64.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/string_64.h     2018-05-18 21:21:15.000000000 +0200
> +++ linux-2.6/arch/x86/include/asm/string_64.h  2018-05-18 21:21:15.000000000 +0200
> @@ -147,7 +147,25 @@ memcpy_mcsafe(void *dst, const void *src
>
>  #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
>  #define __HAVE_ARCH_MEMCPY_FLUSHCACHE 1
> -void memcpy_flushcache(void *dst, const void *src, size_t cnt);
> +void __memcpy_flushcache(void *dst, const void *src, size_t cnt);
> +static __always_inline void memcpy_flushcache(void *dst, const void *src, size_t cnt)
> +{
> +       if (__builtin_constant_p(cnt)) {
> +               switch (cnt) {
> +                       case 4:
> +                               asm ("movntil %1, %0" : "=m"(*(u32 *)dst) : "r"(*(u32 *)src));
> +                               return;
> +                       case 8:
> +                               asm ("movntiq %1, %0" : "=m"(*(u64 *)dst) : "r"(*(u64 *)src));
> +                               return;
> +                       case 16:
> +                               asm ("movntiq %1, %0" : "=m"(*(u64 *)dst) : "r"(*(u64 *)src));
> +                               asm ("movntiq %1, %0" : "=m"(*(u64 *)(dst + 8)) : "r"(*(u64 *)(src + 8)));
> +                               return;
> +               }
> +       }
> +       __memcpy_flushcache(dst, src, cnt);
> +}
>  #endif
>
>  #endif /* __KERNEL__ */
> Index: linux-2.6/arch/x86/lib/usercopy_64.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/lib/usercopy_64.c   2018-05-18 21:21:15.000000000 +0200
> +++ linux-2.6/arch/x86/lib/usercopy_64.c        2018-05-18 22:09:49.000000000 +0200
> @@ -133,7 +133,7 @@ long __copy_user_flushcache(void *dst, c
>         return rc;
>  }
>
> -void memcpy_flushcache(void *_dst, const void *_src, size_t size)
> +void __memcpy_flushcache(void *_dst, const void *_src, size_t size)
>  {
>         unsigned long dest = (unsigned long) _dst;
>         unsigned long source = (unsigned long) _src;
> @@ -196,7 +196,7 @@ void memcpy_flushcache(void *_dst, const
>                 clean_cache_range((void *) dest, size);
>         }
>  }
> -EXPORT_SYMBOL_GPL(memcpy_flushcache);
> +EXPORT_SYMBOL_GPL(__memcpy_flushcache);
>
>  void memcpy_page_flushcache(char *to, struct page *page, size_t offset,
>                 size_t len)
>

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 2/4] swait: export the symbols __prepare_to_swait and __finish_swait
  2018-05-19  5:25 ` [patch 2/4] swait: export the symbols __prepare_to_swait and __finish_swait Mikulas Patocka
@ 2018-05-22  6:34   ` Christoph Hellwig
  2018-05-22 18:52     ` Mike Snitzer
  0 siblings, 1 reply; 108+ messages in thread
From: Christoph Hellwig @ 2018-05-22  6:34 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Mike Snitzer, peterz, wagi, dm-devel, tglx, Dan Williams

On Sat, May 19, 2018 at 07:25:05AM +0200, Mikulas Patocka wrote:
> In order to reduce locking overhead, I use the spinlock in
> swait_queue_head to protect not only the wait queue, but also the list of
> events. Consequently, I need to use unlocked functions __prepare_to_swait
> and __finish_swait. These functions are declared in the file
> include/linux/swait.h, but they are not exported, and so they are not
> useable from kernel modules.

Please CC the author and maintainers of the swait code.

My impression is that this is the wrong thing to do.  The swait code
is supposed to be simple and self contained, and if you want to do
anything else use normal waitqueues.

> 
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> 
> ---
>  kernel/sched/swait.c |    2 ++
>  1 file changed, 2 insertions(+)
> 
> Index: linux-2.6/kernel/sched/swait.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched/swait.c	2018-04-16 21:10:05.000000000 +0200
> +++ linux-2.6/kernel/sched/swait.c	2018-04-16 21:10:05.000000000 +0200
> @@ -75,6 +75,7 @@ void __prepare_to_swait(struct swait_que
>  	if (list_empty(&wait->task_list))
>  		list_add(&wait->task_list, &q->task_list);
>  }
> +EXPORT_SYMBOL(__prepare_to_swait);
>  
>  void prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait, int state)
>  {
> @@ -104,6 +105,7 @@ void __finish_swait(struct swait_queue_h
>  	if (!list_empty(&wait->task_list))
>  		list_del_init(&wait->task_list);
>  }
> +EXPORT_SYMBOL(__finish_swait);
>  
>  void finish_swait(struct swait_queue_head *q, struct swait_queue *wait)
>  {
> 
---end quoted text---

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 3/4] dm-writecache
  2018-05-19  5:25 ` [patch 3/4] dm-writecache Mikulas Patocka
@ 2018-05-22  6:37   ` Christoph Hellwig
  0 siblings, 0 replies; 108+ messages in thread
From: Christoph Hellwig @ 2018-05-22  6:37 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Mike Snitzer, Dan Williams, dm-devel

On Sat, May 19, 2018 at 07:25:06AM +0200, Mikulas Patocka wrote:
> The dm-writecache target.
> 
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

You'll need to actually describe your new code in the changelog.

> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6/drivers/md/dm-writecache.c	2018-05-17 02:46:44.000000000 +0200
> @@ -0,0 +1,2414 @@
> +/*
> + * Copyright (C) 2018 Red Hat. All rights reserved.
> + *
> + * This file is released under the GPL.
> + */

New code needs a SPDX header.

> +/*
> + * On X86, non-temporal stores are more efficient than cache flushing.
> + * On ARM64, cache flushing is more efficient.
> + */
> +#if defined(CONFIG_X86_64)
> +#define EAGER_DATA_FLUSH
> +#define NT_STORE(dest, src)				\
> +do {							\
> +	typeof(src) val = (src);			\
> +	memcpy_flushcache(&(dest), &val, sizeof(src));	\
> +} while (0)
> +#else
> +#define NT_STORE(dest, src)	WRITE_ONCE(dest, src)
> +#endif

No per-arch hacks in the driver, please; this needs a proper Kconfig
symbol provided by the architectures.

> +struct wc_entry {
> +	struct rb_node rb_node;
> +	struct list_head lru;
> +	unsigned short wc_list_contiguous;
> +	bool write_in_progress
> +#if BITS_PER_LONG == 64
> +		:1
> +#endif
> +	;
> +	unsigned long index
> +#if BITS_PER_LONG == 64
> +		:47
> +#endif

Hacks like this shouldn't normally exist, but if you absolutely
need them you need to explain why in a comment.

I haven't had time to do a full review due to my backlog; I'll try to
find some time later today or tomorrow.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [dm-devel] [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-22  6:39     ` Christoph Hellwig
  0 siblings, 0 replies; 108+ messages in thread
From: Christoph Hellwig @ 2018-05-22  6:39 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Mike Snitzer, dm-devel, linux-nvdimm

On Sat, May 19, 2018 at 07:25:07AM +0200, Mikulas Patocka wrote:
> Use new API for flushing persistent memory.

The sentence doesn't make much sense.  'A new API', 'A better
abstraction' maybe?

> 
> The problem is this:
> * on X86-64, non-temporal stores have the best performance
> * ARM64 doesn't have non-temporal stores, so we must flush cache. We
>   should flush cache as late as possible, because it performs better this
>   way.
> 
> We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit
> data persistently, all three functions must be called.
> 
> The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign
> (unlike pmem_memcpy) guarantees that 8-byte values are written atomically.
> 
> On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
> pmem_commit is wmb.
> 
> On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
> pmem_commit is empty.

All these should be provided by the pmem layer, and be properly
documented.  And be sorted before adding your new target that uses
them.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-22 18:41       ` Mike Snitzer
  0 siblings, 0 replies; 108+ messages in thread
From: Mike Snitzer @ 2018-05-22 18:41 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Mikulas Patocka, dm-devel, linux-nvdimm

On Tue, May 22 2018 at  2:39am -0400,
Christoph Hellwig <hch@infradead.org> wrote:

> On Sat, May 19, 2018 at 07:25:07AM +0200, Mikulas Patocka wrote:
> > Use new API for flushing persistent memory.
> 
> The sentence doesnt make much sense.  'A new API', 'A better
> abstraction' maybe?
> 
> > 
> > The problem is this:
> > * on X86-64, non-temporal stores have the best performance
> > * ARM64 doesn't have non-temporal stores, so we must flush cache. We
> >   should flush cache as late as possible, because it performs better this
> >   way.
> > 
> > We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit
> > data persistently, all three functions must be called.
> > 
> > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign
> > (unlike pmem_memcpy) guarantees that 8-byte values are written atomically.
> > 
> > On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
> > pmem_commit is wmb.
> > 
> > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
> > pmem_commit is empty.
> 
> All these should be provided by the pmem layer, and be properly
> documented.  And be sorted before adding your new target that uses
> them.

I don't see that as a hard requirement.  Mikulas did the work to figure
out what is more optimal on x86_64 vs arm64.  It makes a difference for
his target and that is sufficient to carry it locally until it is
either elevated to pmem or obsoleted.

We cannot even get x86 and swait maintainers to reply to repeat requests
for review.  Stacking up further deps on pmem isn't high on my list.

Mike

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 2/4] swait: export the symbols __prepare_to_swait and __finish_swait
  2018-05-22  6:34   ` Christoph Hellwig
@ 2018-05-22 18:52     ` Mike Snitzer
  2018-05-23  9:21       ` Peter Zijlstra
  0 siblings, 1 reply; 108+ messages in thread
From: Mike Snitzer @ 2018-05-22 18:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: peterz, wagi, dm-devel, Mikulas Patocka, Dan Williams, tglx

On Tue, May 22 2018 at  2:34am -0400,
Christoph Hellwig <hch@infradead.org> wrote:

> On Sat, May 19, 2018 at 07:25:05AM +0200, Mikulas Patocka wrote:
> > In order to reduce locking overhead, I use the spinlock in
> > swait_queue_head to protect not only the wait queue, but also the list of
> > events. Consequently, I need to use unlocked functions __prepare_to_swait
> > and __finish_swait. These functions are declared in the file
> > include/linux/swait.h, but they are not exported, and so they are not
> > useable from kernel modules.
> 
> Please CC the author and maintainers of the swait code.
> 
> My impression is that this is the wrong thing to do.  The swait code
> is supposed to be simple and self contained, and if you want to do
> anything else use normal waitqueues.

You said the same thing last time around.  I've since cc'd Peter and
Thomas and haven't heard back, see:
https://www.redhat.com/archives/dm-devel/2018-May/msg00048.html

The entire point of exporting these symbols is to allow use of the
"simple waitqueue" code to optimize -- without resorting to using normal
waitqueues.

Mike

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-22 19:00         ` Dan Williams
  0 siblings, 0 replies; 108+ messages in thread
From: Dan Williams @ 2018-05-22 19:00 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Christoph Hellwig, device-mapper development, Mikulas Patocka,
	linux-nvdimm

On Tue, May 22, 2018 at 11:41 AM, Mike Snitzer <snitzer@redhat.com> wrote:
> On Tue, May 22 2018 at  2:39am -0400,
> Christoph Hellwig <hch@infradead.org> wrote:
>
>> On Sat, May 19, 2018 at 07:25:07AM +0200, Mikulas Patocka wrote:
>> > Use new API for flushing persistent memory.
>>
>> The sentence doesnt make much sense.  'A new API', 'A better
>> abstraction' maybe?
>>
>> >
>> > The problem is this:
>> > * on X86-64, non-temporal stores have the best performance
>> > * ARM64 doesn't have non-temporal stores, so we must flush cache. We
>> >   should flush cache as late as possible, because it performs better this
>> >   way.
>> >
>> > We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit
>> > data persistently, all three functions must be called.
>> >
>> > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign
>> > (unlike pmem_memcpy) guarantees that 8-byte values are written atomically.
>> >
>> > On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
>> > pmem_commit is wmb.
>> >
>> > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
>> > pmem_commit is empty.
>>
>> All these should be provided by the pmem layer, and be properly
>> documented.  And be sorted before adding your new target that uses
>> them.
>
> I don't see that as a hard requirement.  Mikulas did the work to figure
> out what is more optimal on x86_64 vs amd64.  It makes a difference for
> his target and that is sufficient to carry it locally until/when it is
> either elevated to pmem.
>
> We cannot even get x86 and swait maintainers to reply to repeat requests
> for review.  Stacking up further deps on pmem isn't high on my list.
>

Except I'm being responsive. I agree with Christoph that we should
build pmem helpers at an architecture level and not per-driver. Let's
make this driver depend on ARCH_HAS_PMEM_API and require ARM to catch
up to x86 in this space. We already have PowerPC enabling PMEM API, so
I don't see an unreasonable barrier to ask the same of ARM. This patch
is not even cc'd to linux-arm-kernel. Has the subject been broached
with them?

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-22 19:19           ` Mike Snitzer
  0 siblings, 0 replies; 108+ messages in thread
From: Mike Snitzer @ 2018-05-22 19:19 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, device-mapper development, Mikulas Patocka,
	linux-nvdimm

On Tue, May 22 2018 at  3:00pm -0400,
Dan Williams <dan.j.williams@intel.com> wrote:

> On Tue, May 22, 2018 at 11:41 AM, Mike Snitzer <snitzer@redhat.com> wrote:
> > On Tue, May 22 2018 at  2:39am -0400,
> > Christoph Hellwig <hch@infradead.org> wrote:
> >
> >> On Sat, May 19, 2018 at 07:25:07AM +0200, Mikulas Patocka wrote:
> >> > Use new API for flushing persistent memory.
> >>
> >> The sentence doesnt make much sense.  'A new API', 'A better
> >> abstraction' maybe?
> >>
> >> >
> >> > The problem is this:
> >> > * on X86-64, non-temporal stores have the best performance
> >> > * ARM64 doesn't have non-temporal stores, so we must flush cache. We
> >> >   should flush cache as late as possible, because it performs better this
> >> >   way.
> >> >
> >> > We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit
> >> > data persistently, all three functions must be called.
> >> >
> >> > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign
> >> > (unlike pmem_memcpy) guarantees that 8-byte values are written atomically.
> >> >
> >> > On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
> >> > pmem_commit is wmb.
> >> >
> >> > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
> >> > pmem_commit is empty.
> >>
> >> All these should be provided by the pmem layer, and be properly
> >> documented.  And be sorted before adding your new target that uses
> >> them.
> >
> > I don't see that as a hard requirement.  Mikulas did the work to figure
> > out what is more optimal on x86_64 vs amd64.  It makes a difference for
> > his target and that is sufficient to carry it locally until/when it is
> > either elevated to pmem.
> >
> > We cannot even get x86 and swait maintainers to reply to repeat requests
> > for review.  Stacking up further deps on pmem isn't high on my list.
> >
> 
> Except I'm being responsive.

Except you're looking to immediately punt to linux-arm-kernel ;)

> I agree with Christoph that we should
> build pmem helpers at an architecture level and not per-driver. Let's
> make this driver depend on ARCH_HAS_PMEM_API and require ARM to catch
> up to x86 in this space. We already have PowerPC enabling PMEM API, so
> I don't see an unreasonable barrier to ask the same of ARM. This patch
> is not even cc'd to linux-arm-kernel. Has the subject been broached
> with them?

No idea.  Not by me.

The thing is, I'm no expert in pmem.  You are.  Coordinating the change
with ARM et al feels unnecessarily limiting and quickly moves outside my
control.

Serious question: Why can't this code land in this dm-writecache target
and then be lifted (or obsoleted)?

But if you think it worthwhile to force ARM to step up then fine.  That
does limit the availability of using writecache on ARM while they get
the PMEM API together.

I'll do whatever you want.. just put the smack down and tell me how it
is ;)

Thanks,
Mike

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-22 19:27             ` Dan Williams
  0 siblings, 0 replies; 108+ messages in thread
From: Dan Williams @ 2018-05-22 19:27 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Christoph Hellwig, device-mapper development, Mikulas Patocka,
	linux-nvdimm

On Tue, May 22, 2018 at 12:19 PM, Mike Snitzer <snitzer@redhat.com> wrote:
> On Tue, May 22 2018 at  3:00pm -0400,
> Dan Williams <dan.j.williams@intel.com> wrote:
>
>> On Tue, May 22, 2018 at 11:41 AM, Mike Snitzer <snitzer@redhat.com> wrote:
>> > On Tue, May 22 2018 at  2:39am -0400,
>> > Christoph Hellwig <hch@infradead.org> wrote:
>> >
>> >> On Sat, May 19, 2018 at 07:25:07AM +0200, Mikulas Patocka wrote:
>> >> > Use new API for flushing persistent memory.
>> >>
>> >> The sentence doesnt make much sense.  'A new API', 'A better
>> >> abstraction' maybe?
>> >>
>> >> >
>> >> > The problem is this:
>> >> > * on X86-64, non-temporal stores have the best performance
>> >> > * ARM64 doesn't have non-temporal stores, so we must flush cache. We
>> >> >   should flush cache as late as possible, because it performs better this
>> >> >   way.
>> >> >
>> >> > We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit
>> >> > data persistently, all three functions must be called.
>> >> >
>> >> > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign
>> >> > (unlike pmem_memcpy) guarantees that 8-byte values are written atomically.
>> >> >
>> >> > On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
>> >> > pmem_commit is wmb.
>> >> >
>> >> > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
>> >> > pmem_commit is empty.
>> >>
>> >> All these should be provided by the pmem layer, and be properly
>> >> documented.  And be sorted before adding your new target that uses
>> >> them.
>> >
>> > I don't see that as a hard requirement.  Mikulas did the work to figure
>> > out what is more optimal on x86_64 vs arm64.  It makes a difference for
>> > his target and that is sufficient to carry it locally until/when it is
>> > eventually elevated to pmem.
>> >
>> > We cannot even get x86 and swait maintainers to reply to repeat requests
>> > for review.  Stacking up further deps on pmem isn't high on my list.
>> >
>>
>> Except I'm being responsive.
>
> Except you're looking to immediately punt to linux-arm-kernel ;)

Well, I'm not, not really. I'm saying drop ARM support, it's not ready.

>
>> I agree with Christoph that we should
>> build pmem helpers at an architecture level and not per-driver. Let's
>> make this driver depend on ARCH_HAS_PMEM_API and require ARM to catch
>> up to x86 in this space. We already have PowerPC enabling PMEM API, so
>> I don't see an unreasonable barrier to ask the same of ARM. This patch
>> is not even cc'd to linux-arm-kernel. Has the subject been broached
>> with them?
>
> No idea.  Not by me.
>
> The thing is, I'm no expert in pmem.  You are.  Coordinating the change
> with ARM et al feels unnecessarily limiting and quickly moves outside my
> control.
>
> Serious question: Why can't this code land in this dm-writecache target
> and then be lifted (or obsoleted)?

Because we already have an API, and we don't want to promote local
solutions to global problems, or carry unnecessary technical debt.

>
> But if you think it worthwhile to force ARM to step up then fine.  That
> does limit the availability of using writecache on ARM while they get
> the PMEM API together.
>
> I'll do whatever you want.. just put the smack down and tell me how it
> is ;)

I'd say just control the variables you can control. Drop the ARM
support if you want to move forward, and propose extensions / updates
to the pmem api for x86; I'll help push those since I was involved
in pushing the x86 pmem api in the first instance. That way you don't
need to touch this driver as new archs add their pmem api enabling.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-22 20:52               ` Mike Snitzer
  0 siblings, 0 replies; 108+ messages in thread
From: Mike Snitzer @ 2018-05-22 20:52 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, device-mapper development, Mikulas Patocka,
	linux-nvdimm

On Tue, May 22 2018 at  3:27pm -0400,
Dan Williams <dan.j.williams@intel.com> wrote:

> On Tue, May 22, 2018 at 12:19 PM, Mike Snitzer <snitzer@redhat.com> wrote:
> > On Tue, May 22 2018 at  3:00pm -0400,
> > Dan Williams <dan.j.williams@intel.com> wrote:
> >
> >> On Tue, May 22, 2018 at 11:41 AM, Mike Snitzer <snitzer@redhat.com> wrote:
> >> > On Tue, May 22 2018 at  2:39am -0400,
> >> > Christoph Hellwig <hch@infradead.org> wrote:
> >> >
> >> >> On Sat, May 19, 2018 at 07:25:07AM +0200, Mikulas Patocka wrote:
> >> >> > Use new API for flushing persistent memory.
> >> >>
> >> >> The sentence doesn't make much sense.  'A new API', 'A better
> >> >> abstraction' maybe?
> >> >>
> >> >> >
> >> >> > The problem is this:
> >> >> > * on X86-64, non-temporal stores have the best performance
> >> >> > * ARM64 doesn't have non-temporal stores, so we must flush cache. We
> >> >> >   should flush cache as late as possible, because it performs better this
> >> >> >   way.
> >> >> >
> >> >> > We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit
> >> >> > data persistently, all three functions must be called.
> >> >> >
> >> >> > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign
> >> >> > (unlike pmem_memcpy) guarantees that 8-byte values are written atomically.
> >> >> >
> >> >> > On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
> >> >> > pmem_commit is wmb.
> >> >> >
> >> >> > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
> >> >> > pmem_commit is empty.
> >> >>
> >> >> All these should be provided by the pmem layer, and be properly
> >> >> documented.  And be sorted before adding your new target that uses
> >> >> them.
> >> >
> >> > I don't see that as a hard requirement.  Mikulas did the work to figure
> >> > out what is more optimal on x86_64 vs arm64.  It makes a difference for
> >> > his target and that is sufficient to carry it locally until/when it is
> >> > eventually elevated to pmem.
> >> >
> >> > We cannot even get x86 and swait maintainers to reply to repeat requests
> >> > for review.  Stacking up further deps on pmem isn't high on my list.
> >> >
> >>
> >> Except I'm being responsive.
> >
> > Except you're looking to immediately punt to linux-arm-kernel ;)
> 
> Well, I'm not, not really. I'm saying drop ARM support, it's not ready.
> 
> >
> >> I agree with Christoph that we should
> >> build pmem helpers at an architecture level and not per-driver. Let's
> >> make this driver depend on ARCH_HAS_PMEM_API and require ARM to catch
> >> up to x86 in this space. We already have PowerPC enabling PMEM API, so
> >> I don't see an unreasonable barrier to ask the same of ARM. This patch
> >> is not even cc'd to linux-arm-kernel. Has the subject been broached
> >> with them?
> >
> > No idea.  Not by me.
> >
> > The thing is, I'm no expert in pmem.  You are.  Coordinating the change
> > with ARM et al feels unnecessarily limiting and quickly moves outside my
> > control.
> >
> > Serious question: Why can't this code land in this dm-writecache target
> > and then be lifted (or obsoleted)?
> 
> Because we already have an API, and we don't want to promote local
> solutions to global problems, or carry unnecessary technical debt.
> 
> >
> > But if you think it worthwhile to force ARM to step up then fine.  That
> > does limit the availability of using writecache on ARM while they get
> > the PMEM API together.
> >
> > I'll do whatever you want.. just put the smack down and tell me how it
> > is ;)
> 
> I'd say just control the variables you can control. Drop the ARM
> support if you want to move forward, and propose extensions / updates
> to the pmem api for x86; I'll help push those since I was involved
> in pushing the x86 pmem api in the first instance. That way you don't
> need to touch this driver as new archs add their pmem api enabling.

Looking at Mikulas' wrapper API that you and hch are calling into
question:

For ARM it is using arch/arm64/mm/flush.c:arch_wb_cache_pmem().
(And ARM does seem to be providing CONFIG_ARCH_HAS_PMEM_API.)

Whereas x86_64 is using memcpy_flushcache() as provided by
CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE.
(Yet ARM does provide arch/arm64/lib/uaccess_flushcache.c:memcpy_flushcache)

Just seems this isn't purely about ARM lacking on an API level (given on
x86_64 Mikulas isn't only using CONFIG_ARCH_HAS_PMEM_API).

Seems this is more to do with x86_64 having efficient Non-temporal
stores?

Anyway, I'm still trying to appreciate the details here before I can
make any forward progress.

Mike

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [dm-devel] [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-22 22:53                 ` Jeff Moyer
  0 siblings, 0 replies; 108+ messages in thread
From: Jeff Moyer @ 2018-05-22 22:53 UTC (permalink / raw)
  To: Mike Snitzer, Mikulas Patocka
  Cc: Christoph Hellwig, device-mapper development, linux-nvdimm

Hi, Mike,

Mike Snitzer <snitzer@redhat.com> writes:

> Looking at Mikulas' wrapper API that you and hch are calling into
> question:
>
> For ARM it is using arch/arm64/mm/flush.c:arch_wb_cache_pmem().
> (And ARM does seem to be providing CONFIG_ARCH_HAS_PMEM_API.)
>
> Whereas x86_64 is using memcpy_flushcache() as provided by
> CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE.
> (Yet ARM does provide arch/arm64/lib/uaccess_flushcache.c:memcpy_flushcache)
>
> Just seems this isn't purely about ARM lacking on an API level (given on
> x86_64 Mikulas isn't only using CONFIG_ARCH_HAS_PMEM_API).
>
> Seems this is more to do with x86_64 having efficient Non-temporal
> stores?

Yeah, I think you've got that all right.

> Anyway, I'm still trying to appreciate the details here before I can
> make any forward progress.

Making data persistent on x64 requires 3 steps:
1) copy the data into pmem   (store instructions)
2) flush the cache lines associated with the data (clflush, clflushopt, clwb)
3) wait on the flush to complete (sfence)

I'm not sure if other architectures require step 3.  Mikulas'
implementation seems to imply that arm64 doesn't require the fence.

The current pmem api provides:

memcpy*           -- step 1
memcpy_flushcache -- this combines steps 1 and 2
dax_flush         -- step 2
wmb*              -- step 3

* not strictly part of the pmem api

So, if you didn't care about performance, you could write generic code
that only used memcpy, dax_flush, and wmb (assuming other arches
actually need the wmb).  What Mikulas did was to abstract out an API
that could be called by generic code that would work optimally on all
architectures.
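
For example, the slow-but-generic sequence would look something like this
sketch (dst, src, len and the dax_dev are illustrative names):

	memcpy(dst, src, len);		/* step 1: cached stores into pmem */
	dax_flush(dax_dev, dst, len);	/* step 2: write back the cache lines */
	wmb();				/* step 3: wait for the flush to complete */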

This looks like a worthwhile addition to the PMEM API, to me.  Mikulas,
what do you think about refactoring the code as Christoph suggested?

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 2/4] swait: export the symbols __prepare_to_swait and __finish_swait
  2018-05-22 18:52     ` Mike Snitzer
@ 2018-05-23  9:21       ` Peter Zijlstra
  2018-05-23 15:10         ` Mike Snitzer
  0 siblings, 1 reply; 108+ messages in thread
From: Peter Zijlstra @ 2018-05-23  9:21 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Sebastian Andrzej Siewior, wagi, Christoph Hellwig, dm-devel,
	Mikulas Patocka, Dan Williams, tglx

On Tue, May 22, 2018 at 02:52:54PM -0400, Mike Snitzer wrote:
> On Tue, May 22 2018 at  2:34am -0400,
> Christoph Hellwig <hch@infradead.org> wrote:

> > Please CC the author and maintainers of the swait code.
> > 
> > My impression is that this is the wrong thing to do.  The swait code
> > is supposed to be simple and self contained, and if you want to do
> > anything else use normal waitqueues.
> 
> You said the same thing last time around.  I've since cc'd Peter and
> Thomas and haven't heard back, see:
> https://www.redhat.com/archives/dm-devel/2018-May/msg00048.html

Yeah, sorry, got lost :/

> The entire point of exporting these symbols is to allow use of the
> "simple waitqueue" code to optimize -- without resorting to using normal
> waitqueues.

So I don't immediately object to exporting them; however I do share some
of hch's concerns. The reason swait exists is to be deterministic (for
RT) -- something that regular wait code cannot be.

And by (ab)using / exporting the wait internal lock you risk losing
that. So while I don't think the proposed usage is bad, it is possible to
create badness.

So if we're going to export them, someone needs to keep an eye on things
and ensure the lock isn't abused.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 2/4] swait: export the symbols __prepare_to_swait and __finish_swait
  2018-05-23  9:21       ` Peter Zijlstra
@ 2018-05-23 15:10         ` Mike Snitzer
  2018-05-23 18:10           ` [PATCH v2] swait: export " Mike Snitzer
  0 siblings, 1 reply; 108+ messages in thread
From: Mike Snitzer @ 2018-05-23 15:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sebastian Andrzej Siewior, wagi, Christoph Hellwig, dm-devel,
	Mikulas Patocka, Dan Williams, tglx

On Wed, May 23 2018 at  5:21am -0400,
Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, May 22, 2018 at 02:52:54PM -0400, Mike Snitzer wrote:
> > On Tue, May 22 2018 at  2:34am -0400,
> > Christoph Hellwig <hch@infradead.org> wrote:
> 
> > > Please CC the author and maintainers of the swait code.
> > > 
> > > My impression is that this is the wrong thing to do.  The swait code
> > > is supposed to be simple and self contained, and if you want to do
> > > anything else use normal waitqueues.
> > 
> > You said the same thing last time around.  I've since cc'd Peter and
> > Thomas and haven't heard back, see:
> > https://www.redhat.com/archives/dm-devel/2018-May/msg00048.html
> 
> Yeah, sorry, got lost :/

np

> > The entire point of exporting these symbols is to allow use of the
> > "simple waitqueue" code to optimize -- without resorting to using normal
> > waitqueues.
> 
> So I don't immediately object to exporting them; however I do share some
> of hch's concerns. The reason swait exists is to be deterministic (for
> RT) -- something that regular wait code cannot be.
> 
> And by (ab)using / exporting the wait internal lock you risk loosing
> that. So I don't think the proposed usage is bad, it is possible to
> create badness.

Understood.

> So if we're going to export them; someone needs to keep an eye on things
> and ensure the lock isn't abused.

I'll update the patch header and swait.h to reflect these requirements
and send out a new patch.

If you could then reply with your explicit Ack I can stage it for 4.18
via linux-dm.git to ease cross tree dependencies (given dm-writecache
depends on these exports) -- provided you're OK with me doing that.

Thanks,
Mike

^ permalink raw reply	[flat|nested] 108+ messages in thread

* [PATCH v2] swait: export symbols __prepare_to_swait and __finish_swait
  2018-05-23 15:10         ` Mike Snitzer
@ 2018-05-23 18:10           ` Mike Snitzer
  2018-05-23 20:38             ` Mikulas Patocka
  2018-05-24 14:10             ` Peter Zijlstra
  0 siblings, 2 replies; 108+ messages in thread
From: Mike Snitzer @ 2018-05-23 18:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sebastian Andrzej Siewior, wagi, Christoph Hellwig, dm-devel,
	Mikulas Patocka, Dan Williams, tglx

[Peter, in this v2 I switched to using _GPL for the exports and updated
the patch header.  As covered in previous mail, please let me know if
you're OK with me staging this change for 4.18 via linux-dm.git with
your Ack, thanks]

From: Mikulas Patocka <mpatocka@redhat.com>
Subject: [PATCH] swait: export symbols __prepare_to_swait and __finish_swait

__prepare_to_swait and __finish_swait are declared in
include/linux/swait.h but they are not exported, so they are not usable
from kernel modules.

A new consumer of swait (in dm-writecache) reduces its locking overhead
by using the spinlock in swait_queue_head to protect not only the wait
queue, but also the list of events.  Consequently, this swait consuming
kernel module needs to use these unlocked functions.
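
The intended waiter-side pattern is roughly the following sketch (the
event list and its surrounding context are illustrative, not the actual
dm-writecache code):

	static DECLARE_SWAIT_QUEUE_HEAD(q);
	static LIST_HEAD(events);

	DECLARE_SWAITQUEUE(wait);

	raw_spin_lock_irq(&q.lock);
	while (list_empty(&events)) {
		/* the event list is checked under the same q.lock */
		__prepare_to_swait(&q, &wait);	/* q.lock already held */
		set_current_state(TASK_UNINTERRUPTIBLE);
		raw_spin_unlock_irq(&q.lock);
		schedule();
		raw_spin_lock_irq(&q.lock);
	}
	__finish_swait(&q, &wait);
	raw_spin_unlock_irq(&q.lock);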

Peter Zijlstra explained:
  "The reason swait exists is to be deterministic (for RT) -- something
  that regular wait code cannot be.
  And by (ab)using / exporting the wait internal lock you risk losing
  that. So while I don't think the proposed [dm-writecache] usage is bad, it
  is possible to create badness.
  So if we're going to export them, someone needs to keep an eye on things
  and ensure the lock isn't abused."

So while this new use of the wait internal lock doesn't jeopardize the
realtime requirements of swait, these exports do open swait's internal
locking up to being abused.  As such, EXPORT_SYMBOL_GPL is used because
any future consumers of __prepare_to_swait and __finish_swait must
always be thoroughly scrutinized.

Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 kernel/sched/swait.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/swait.c b/kernel/sched/swait.c
index b6fb2c3b3ff7..5d891b65ada5 100644
--- a/kernel/sched/swait.c
+++ b/kernel/sched/swait.c
@@ -75,6 +75,7 @@ void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait)
 	if (list_empty(&wait->task_list))
 		list_add(&wait->task_list, &q->task_list);
 }
+EXPORT_SYMBOL_GPL(__prepare_to_swait);
 
 void prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait, int state)
 {
@@ -104,6 +105,7 @@ void __finish_swait(struct swait_queue_head *q, struct swait_queue *wait)
 	if (!list_empty(&wait->task_list))
 		list_del_init(&wait->task_list);
 }
+EXPORT_SYMBOL_GPL(__finish_swait);
 
 void finish_swait(struct swait_queue_head *q, struct swait_queue *wait)
 {
-- 
2.15.0

^ permalink raw reply related	[flat|nested] 108+ messages in thread

* Re: [PATCH v2] swait: export symbols __prepare_to_swait and __finish_swait
  2018-05-23 18:10           ` [PATCH v2] swait: export " Mike Snitzer
@ 2018-05-23 20:38             ` Mikulas Patocka
  2018-05-23 21:51               ` Mike Snitzer
  2018-05-24 14:10             ` Peter Zijlstra
  1 sibling, 1 reply; 108+ messages in thread
From: Mikulas Patocka @ 2018-05-23 20:38 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Peter Zijlstra, Sebastian Andrzej Siewior, wagi,
	Christoph Hellwig, dm-devel, Dan Williams, tglx



On Wed, 23 May 2018, Mike Snitzer wrote:

> [Peter, in this v2 I switched to using _GPL for the exports and updated
> the patch header.  As covered in previous mail, please let me know if
> you're OK with me staging this change for 4.18 via linux-dm.git with
> your Ack, thanks]
> 
> From: Mikulas Patocka <mpatocka@redhat.com>
> Subject: [PATCH] swait: export symbols __prepare_to_swait and __finish_swait
> 
> __prepare_to_swait and __finish_swait are declared in
> include/linux/swait.h but they are not exported, so they are not usable
> from kernel modules.
> 
> A new consumer of swait (in dm-writecache) reduces its locking overhead
> by using the spinlock in swait_queue_head to protect not only the wait
> queue, but also the list of events.  Consequently, this swait consuming
> kernel module needs to use these unlocked functions.
> 
> Peter Zijlstra explained:
>   "The reason swait exists is to be deterministic (for RT) -- something
>   that regular wait code cannot be.
>   And by (ab)using / exporting the wait internal lock you risk losing
>   that. So while I don't think the proposed [dm-writecache] usage is bad, it
>   is possible to create badness.
>   So if we're going to export them, someone needs to keep an eye on things
>   and ensure the lock isn't abused."
> 
> So while this new use of the wait internal lock doesn't jeopardize the
> realtime requirements of swait, these exports do open swait's internal
> locking up to being abused.  As such, EXPORT_SYMBOL_GPL is used because
> any future consumers of __prepare_to_swait and __finish_swait must
> always be thoroughly scrutinized.
> 
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> ---
>  kernel/sched/swait.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/sched/swait.c b/kernel/sched/swait.c
> index b6fb2c3b3ff7..5d891b65ada5 100644
> --- a/kernel/sched/swait.c
> +++ b/kernel/sched/swait.c
> @@ -75,6 +75,7 @@ void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait)
>  	if (list_empty(&wait->task_list))
>  		list_add(&wait->task_list, &q->task_list);
>  }
> +EXPORT_SYMBOL_GPL(__prepare_to_swait);
>  
>  void prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait, int state)
>  {
> @@ -104,6 +105,7 @@ void __finish_swait(struct swait_queue_head *q, struct swait_queue *wait)
>  	if (!list_empty(&wait->task_list))
>  		list_del_init(&wait->task_list);
>  }
> +EXPORT_SYMBOL_GPL(__finish_swait);
>  
>  void finish_swait(struct swait_queue_head *q, struct swait_queue *wait)
>  {
> -- 
> 2.15.0

Then you should also export swake_up_locked with EXPORT_SYMBOL_GPL,
because swake_up_locked is unusable without __prepare_to_swait and
__finish_swait.
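
The waker side has to do something like this sketch (ev and the event
list are illustrative) to be of any use:

	raw_spin_lock_irq(&q.lock);
	list_add(&ev->list, &events);	/* publish the event... */
	swake_up_locked(&q);		/* ...and wake up under the same lock */
	raw_spin_unlock_irq(&q.lock);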

Mikulas

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [dm-devel] [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-23 20:57                   ` Mikulas Patocka
  0 siblings, 0 replies; 108+ messages in thread
From: Mikulas Patocka @ 2018-05-23 20:57 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Christoph Hellwig, device-mapper development, Mike Snitzer, linux-nvdimm



On Tue, 22 May 2018, Jeff Moyer wrote:

> Hi, Mike,
> 
> Mike Snitzer <snitzer@redhat.com> writes:
> 
> > Looking at Mikulas' wrapper API that you and hch are calling into
> > question:
> >
> > For ARM it is using arch/arm64/mm/flush.c:arch_wb_cache_pmem().
> > (And ARM does seem to be providing CONFIG_ARCH_HAS_PMEM_API.)
> >
> > Whereas x86_64 is using memcpy_flushcache() as provided by
> > CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE.
> > (Yet ARM does provide arch/arm64/lib/uaccess_flushcache.c:memcpy_flushcache)
> >
> > Just seems this isn't purely about ARM lacking on an API level (given on
> > x86_64 Mikulas isn't only using CONFIG_ARCH_HAS_PMEM_API).
> >
> > Seems this is more to do with x86_64 having efficient Non-temporal
> > stores?
> 
> Yeah, I think you've got that all right.
> 
> > Anyway, I'm still trying to appreciate the details here before I can
> > make any forward progress.
> 
> Making data persistent on x64 requires 3 steps:
> 1) copy the data into pmem   (store instructions)
> 2) flush the cache lines associated with the data (clflush, clflushopt, clwb)
> 3) wait on the flush to complete (sfence)

In theory it works this way. In practice, this sequence is useless because 
the cache flushing instructions are horribly slow.

So, the dm-writecache driver uses non-temporal stores instead of cache 
flushing.

Now, the problem with arm64 is that it doesn't have non-temporal stores. 
So, memcpy_flushcache on arm64 does cached stores and flushes the cache 
afterwards. And this eager flushing is slower than late flushing. On arm64,
you want to do cached stores, then do something else, and flush the cache 
as late as possible.

> I'm not sure if other architectures require step 3.  Mikulas'
> implementation seems to imply that arm64 doesn't require the fence.

I suppose that arch_wb_cache_pmem() does whatever it needs to do to flush 
the cache. If not, add something like arch_wb_cache_pmem_commit().

> The current pmem api provides:
> 
> memcpy*           -- step 1
> memcpy_flushcache -- this combines steps 1 and 2
> dax_flush         -- step 2
> wmb*              -- step 3
> 
> * not strictly part of the pmem api
> 
> So, if you didn't care about performance, you could write generic code
> that only used memcpy, dax_flush, and wmb (assuming other arches
> actually need the wmb).  What Mikulas did was to abstract out an API
> that could be called by generic code that would work optimally on all
> architectures.
> 
> This looks like a worthwhile addition to the PMEM API, to me.  Mikulas,
> what do you think about refactoring the code as Christoph suggested?

I sent this patch 
https://www.redhat.com/archives/dm-devel/2018-May/msg00054.html so that 
you can take the functions pmem_memcpy, pmem_assign, pmem_flush and 
pmem_commit and move them to the generic linux headers. If you want to do 
it, do it.

> Cheers,
> Jeff

Mikulas

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2] swait: export symbols __prepare_to_swait and __finish_swait
  2018-05-23 20:38             ` Mikulas Patocka
@ 2018-05-23 21:51               ` Mike Snitzer
  0 siblings, 0 replies; 108+ messages in thread
From: Mike Snitzer @ 2018-05-23 21:51 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Peter Zijlstra, Sebastian Andrzej Siewior, wagi,
	Christoph Hellwig, dm-devel, Dan Williams, tglx

On Wed, May 23 2018 at  4:38pm -0400,
Mikulas Patocka <mpatocka@redhat.com> wrote:

> 
> 
> On Wed, 23 May 2018, Mike Snitzer wrote:
> 
> > [Peter, in this v2 I switched to using _GPL for the exports and updated
> > the patch header.  As covered in previous mail, please let me know if
> > you're OK with me staging this change for 4.18 via linux-dm.git with
> > your Ack, thanks]
> > 
> > From: Mikulas Patocka <mpatocka@redhat.com>
> > Subject: [PATCH] swait: export symbols __prepare_to_swait and __finish_swait
> > 
> > __prepare_to_swait and __finish_swait are declared in
> > include/linux/swait.h but they are not exported, so they are not usable
> > from kernel modules.
> > 
> > A new consumer of swait (in dm-writecache) reduces its locking overhead
> > by using the spinlock in swait_queue_head to protect not only the wait
> > queue, but also the list of events.  Consequently, this swait consuming
> > kernel module needs to use these unlocked functions.
> > 
> > Peter Zijlstra explained:
> >   "The reason swait exists is to be deterministic (for RT) -- something
> >   that regular wait code cannot be.
> >   And by (ab)using / exporting the wait internal lock you risk losing
> >   that. So while I don't think the proposed [dm-writecache] usage is bad, it
> >   is possible to create badness.
> >   So if we're going to export them, someone needs to keep an eye on things
> >   and ensure the lock isn't abused."
> > 
> > So while this new use of the wait internal lock doesn't jeopardize the
> > realtime requirements of swait, these exports do open swait's internal
> > locking up to being abused.  As such, EXPORT_SYMBOL_GPL is used because
> > any future consumers of __prepare_to_swait and __finish_swait must
> > always be thoroughly scrutinized.
> > 
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> > Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> > ---
> >  kernel/sched/swait.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/kernel/sched/swait.c b/kernel/sched/swait.c
> > index b6fb2c3b3ff7..5d891b65ada5 100644
> > --- a/kernel/sched/swait.c
> > +++ b/kernel/sched/swait.c
> > @@ -75,6 +75,7 @@ void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait)
> >  	if (list_empty(&wait->task_list))
> >  		list_add(&wait->task_list, &q->task_list);
> >  }
> > +EXPORT_SYMBOL_GPL(__prepare_to_swait);
> >  
> >  void prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait, int state)
> >  {
> > @@ -104,6 +105,7 @@ void __finish_swait(struct swait_queue_head *q, struct swait_queue *wait)
> >  	if (!list_empty(&wait->task_list))
> >  		list_del_init(&wait->task_list);
> >  }
> > +EXPORT_SYMBOL_GPL(__finish_swait);
> >  
> >  void finish_swait(struct swait_queue_head *q, struct swait_queue *wait)
> >  {
> > -- 
> > 2.15.0
> 
> Then you should also export swake_up_locked with EXPORT_SYMBOL_GPL,
> because swake_up_locked is unusable without __prepare_to_swait and
> __finish_swait.

Point taken.  But if swake_up_locked is unusable without them, it is
implicitly _GPL once __prepare_to_swait and __finish_swait are exported
via _GPL.  Which is to say, I don't care to get into the games of
switching symbols from EXPORT_SYMBOL to EXPORT_SYMBOL_GPL unless
completely necessary.  Happy to leave well enough alone on this.

So I consider this v2 perfectly adequate for our needs.  And appreciate
any additional review/acks.

Thanks,
Mike

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-24  8:15           ` Mikulas Patocka
  0 siblings, 0 replies; 108+ messages in thread
From: Mikulas Patocka @ 2018-05-24  8:15 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, device-mapper development, Mike Snitzer, linux-nvdimm



On Tue, 22 May 2018, Dan Williams wrote:

> On Tue, May 22, 2018 at 11:41 AM, Mike Snitzer <snitzer@redhat.com> wrote:
> > On Tue, May 22 2018 at  2:39am -0400,
> > Christoph Hellwig <hch@infradead.org> wrote:
> >
> >> On Sat, May 19, 2018 at 07:25:07AM +0200, Mikulas Patocka wrote:
> >> > Use new API for flushing persistent memory.
> >>
> >> The sentence doesn't make much sense.  'A new API', 'A better
> >> abstraction' maybe?
> >>
> >> >
> >> > The problem is this:
> >> > * on X86-64, non-temporal stores have the best performance
> >> > * ARM64 doesn't have non-temporal stores, so we must flush cache. We
> >> >   should flush cache as late as possible, because it performs better this
> >> >   way.
> >> >
> >> > We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit
> >> > data persistently, all three functions must be called.
> >> >
> >> > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign
> >> > (unlike pmem_memcpy) guarantees that 8-byte values are written atomically.
> >> >
> >> > On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
> >> > pmem_commit is wmb.
> >> >
> >> > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
> >> > pmem_commit is empty.
> >>
> >> All these should be provided by the pmem layer, and be properly
> >> documented.  And be sorted before adding your new target that uses
> >> them.
> >
> > I don't see that as a hard requirement.  Mikulas did the work to figure
> > out what is more optimal on x86_64 vs arm64.  It makes a difference for
> > his target and that is sufficient to carry it locally until/when it is
> > eventually elevated to pmem.
> >
> > We cannot even get x86 and swait maintainers to reply to repeat requests
> > for review.  Stacking up further deps on pmem isn't high on my list.
> >
> 
> Except I'm being responsive. I agree with Christoph that we should
> build pmem helpers at an architecture level and not per-driver. Let's
> make this driver depend on ARCH_HAS_PMEM_API and require ARM to catch
> up to x86 in this space. We already have PowerPC enabling PMEM API, so
> I don't see an unreasonable barrier to ask the same of ARM. This patch
> is not even cc'd to linux-arm-kernel. Has the subject been broached
> with them?

The ARM code can't "catch up" with X86.

On X86 - non-temporal stores (i.e. memcpy_flushcache) are faster than 
cached write and cache flushing.

The ARM architecture doesn't have non-temporal stores. So, 
memcpy_flushcache on ARM does memcpy (that writes data to the cache) and 
then flushes the cache. But this eager cache flushing is slower than late
cache flushing.

The optimal code sequence on ARM to write to persistent memory is to call 
memcpy, then do something else, and then call arch_wb_cache_pmem as late 
as possible. And this ARM-optimized code sequence is just horribly slow on 
X86.
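
Concretely, the ARM-friendly pattern is something like this sketch
(destinations and lengths are illustrative):

	memcpy(pmem_dst1, buf1, len1);		/* cached stores */
	memcpy(pmem_dst2, buf2, len2);
	/* ... do something else ... */
	arch_wb_cache_pmem(pmem_dst1, len1);	/* flush as late as possible */
	arch_wb_cache_pmem(pmem_dst2, len2);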

This issue can't be "fixed" in ARM-specific source code. The ARM processors
have the characteristic that eager cache flushing is slower than late
cache flushing - and that's it - you can't change processor behavior.

If you don't want '#if defined(CONFIG_X86_64)' in the dm-writecache 
driver, then just take the functions that are in this conditional block 
and move them to some generic linux header.
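
For reference, the conditional block boils down to roughly this simplified
sketch (not the exact driver code):

	#if defined(CONFIG_X86_64)
	#define pmem_memcpy(dst, src, cnt) memcpy_flushcache(dst, src, cnt)
	static inline void pmem_flush(void *addr, size_t size)
	{
		/* nothing - the non-temporal stores bypassed the cache */
	}
	static inline void pmem_commit(void)
	{
		wmb();	/* order the non-temporal stores */
	}
	#else
	#define pmem_memcpy(dst, src, cnt) memcpy(dst, src, cnt)
	static inline void pmem_flush(void *addr, size_t size)
	{
		arch_wb_cache_pmem(addr, size);	/* flush as late as possible */
	}
	static inline void pmem_commit(void)
	{
		/* nothing - arch_wb_cache_pmem() is assumed to complete the flush */
	}
	#endif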

Mikulas

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2] swait: export symbols __prepare_to_swait and __finish_swait
  2018-05-23 18:10           ` [PATCH v2] swait: export " Mike Snitzer
  2018-05-23 20:38             ` Mikulas Patocka
@ 2018-05-24 14:10             ` Peter Zijlstra
  2018-05-24 15:09               ` Mike Snitzer
  1 sibling, 1 reply; 108+ messages in thread
From: Peter Zijlstra @ 2018-05-24 14:10 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Sebastian Andrzej Siewior, wagi, Christoph Hellwig, dm-devel,
	Mikulas Patocka, Dan Williams, tglx

On Wed, May 23, 2018 at 02:10:46PM -0400, Mike Snitzer wrote:
> [Peter, in this v2 I switched to using _GPL for the exports and updated
> the patch header.  As covered in previous mail, please let me know if
> you're OK with me staging this change for 4.18 via linux-dm.git with
> your Ack, thanks]
> 
> From: Mikulas Patocka <mpatocka@redhat.com>
> Subject: [PATCH] swait: export symbols __prepare_to_swait and __finish_swait
> 
> __prepare_to_swait and __finish_swait are declared in
> include/linux/swait.h but they are not exported, so they are not usable
> from kernel modules.
> 
> A new consumer of swait (in dm-writecache) reduces its locking overhead
> by using the spinlock in swait_queue_head to protect not only the wait
> queue, but also the list of events.  Consequently, this swait consuming
> kernel module needs to use these unlocked functions.
> 
> Peter Zijlstra explained:
>   "The reason swait exists is to be deterministic (for RT) -- something
>   that regular wait code cannot be.
>   And by (ab)using / exporting the wait internal lock you risk losing
>   that. So while I don't think the proposed [dm-writecache] usage is bad, it
>   is possible to create badness.
>   So if we're going to export them, someone needs to keep an eye on things
>   and ensure the lock isn't abused."
> 
> So while this new use of the wait internal lock doesn't jeopardize the
> realtime requirements of swait, these exports do open swait's internal
> locking up to being abused.  As such, EXPORT_SYMBOL_GPL is used because
> any future consumers of __prepare_to_swait and __finish_swait must
> always be thoroughly scrutinized.
> 

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> ---
>  kernel/sched/swait.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/sched/swait.c b/kernel/sched/swait.c
> index b6fb2c3b3ff7..5d891b65ada5 100644
> --- a/kernel/sched/swait.c
> +++ b/kernel/sched/swait.c
> @@ -75,6 +75,7 @@ void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait)
>  	if (list_empty(&wait->task_list))
>  		list_add(&wait->task_list, &q->task_list);
>  }
> +EXPORT_SYMBOL_GPL(__prepare_to_swait);
>  
>  void prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait, int state)
>  {
> @@ -104,6 +105,7 @@ void __finish_swait(struct swait_queue_head *q, struct swait_queue *wait)
>  	if (!list_empty(&wait->task_list))
>  		list_del_init(&wait->task_list);
>  }
> +EXPORT_SYMBOL_GPL(__finish_swait);
>  
>  void finish_swait(struct swait_queue_head *q, struct swait_queue *wait)
>  {
> -- 
> 2.15.0
> 

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2] swait: export symbols __prepare_to_swait and __finish_swait
  2018-05-24 14:10             ` Peter Zijlstra
@ 2018-05-24 15:09               ` Mike Snitzer
  0 siblings, 0 replies; 108+ messages in thread
From: Mike Snitzer @ 2018-05-24 15:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sebastian Andrzej Siewior, wagi, Christoph Hellwig, dm-devel,
	Mikulas Patocka, Dan Williams, tglx

On Thu, May 24 2018 at 10:10am -0400,
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, May 23, 2018 at 02:10:46PM -0400, Mike Snitzer wrote:
> > [Peter, in this v2 I switched to using _GPL for the exports and updated
> > the patch header.  As covered in previous mail, please let me know if
> > you're OK with me staging this change for 4.18 via linux-dm.git with
> > your Ack, thanks]
> > 
> > From: Mikulas Patocka <mpatocka@redhat.com>
> > Subject: [PATCH] swait: export symbols __prepare_to_swait and __finish_swait
> > 
> > __prepare_to_swait and __finish_swait are declared in
> > include/linux/swait.h but they are not exported, so they are not usable
> > from kernel modules.
> > 
> > A new consumer of swait (in dm-writecache) reduces its locking overhead
> > by using the spinlock in swait_queue_head to protect not only the wait
> > queue, but also the list of events.  Consequently, this swait consuming
> > kernel module needs to use these unlocked functions.
> > 
> > Peter Zijlstra explained:
> >   "The reason swait exists is to be deterministic (for RT) -- something
> >   that regular wait code cannot be.
> >   And by (ab)using / exporting the wait internal lock you risk losing
> >   that. So while I don't think the proposed [dm-writecache] usage is bad, it
> >   is possible to create badness.
> >   So if we're going to export them, someone needs to keep an eye on things
> >   and ensure the lock isn't abused."
> > 
> > So while this new use of the wait internal lock doesn't jeopardize the
> > realtime requirements of swait, these exports do open swait's internal
> > locking up to being abused.  As such, EXPORT_SYMBOL_GPL is used because
> > any future consumers of __prepare_to_swait and __finish_swait must
> > always be thoroughly scrutinized.
> > 
> 
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Thanks Peter.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* [PATCH v2] x86: optimize memcpy_flushcache
  2018-05-19 14:21   ` Dan Williams
@ 2018-05-24 18:20     ` Mike Snitzer
  2018-06-18 13:23         ` Mike Snitzer
  0 siblings, 1 reply; 108+ messages in thread
From: Mike Snitzer @ 2018-05-24 18:20 UTC (permalink / raw)
  To: Ingo Molnar, Thomas Gleixner
  Cc: Dan Williams, X86 ML, Mikulas Patocka, device-mapper development

[v2: revised header, reformatted asm, reduced indent in switch statement.
Ingo or Thomas: please review and consider picking this up for 4.18]

From: Mikulas Patocka <mpatocka@redhat.com>
Subject: [PATCH v2] x86: optimize memcpy_flushcache

In the context of constant short-length stores to persistent memory,
memcpy_flushcache suffers from a 2% performance degradation compared to
explicitly using the "movnti" instruction.

Optimize 4, 8, and 16 byte memcpy_flushcache calls to explicitly use the
movnti instruction with inline assembler.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 arch/x86/include/asm/string_64.h | 28 +++++++++++++++++++++++++++-
 arch/x86/lib/usercopy_64.c       |  4 ++--
 2 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 533f74c300c2..aaba83478cdc 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -147,7 +147,33 @@ memcpy_mcsafe(void *dst, const void *src, size_t cnt)
 
 #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
 #define __HAVE_ARCH_MEMCPY_FLUSHCACHE 1
-void memcpy_flushcache(void *dst, const void *src, size_t cnt);
+void __memcpy_flushcache(void *dst, const void *src, size_t cnt);
+static __always_inline void memcpy_flushcache(void *dst, const void *src, size_t cnt)
+{
+	if (__builtin_constant_p(cnt)) {
+		switch (cnt) {
+		case 4:
+			asm volatile("movntil %1, %0"
+				     : "=m" (*(u32 *)dst)
+				     : "r" (*(u32 *)src));
+			return;
+		case 8:
+			asm volatile("movntiq %1, %0"
+				     : "=m" (*(u64 *)dst)
+				     : "r" (*(u64 *)src));
+			return;
+		case 16:
+			asm volatile("movntiq %1, %0"
+				     : "=m" (*(u64 *)dst)
+				     : "r" (*(u64 *)src));
+			asm volatile("movntiq %1, %0"
+				     : "=m" (*(u64 *)(dst + 8))
+				     : "r" (*(u64 *)(src + 8)));
+			return;
+		}
+	}
+	__memcpy_flushcache(dst, src, cnt);
+}
 #endif
 
 #endif /* __KERNEL__ */
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index 75d3776123cc..26f515aa3529 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -133,7 +133,7 @@ long __copy_user_flushcache(void *dst, const void __user *src, unsigned size)
 	return rc;
 }
 
-void memcpy_flushcache(void *_dst, const void *_src, size_t size)
+void __memcpy_flushcache(void *_dst, const void *_src, size_t size)
 {
 	unsigned long dest = (unsigned long) _dst;
 	unsigned long source = (unsigned long) _src;
@@ -196,7 +196,7 @@ void memcpy_flushcache(void *_dst, const void *_src, size_t size)
 		clean_cache_range((void *) dest, size);
 	}
 }
-EXPORT_SYMBOL_GPL(memcpy_flushcache);
+EXPORT_SYMBOL_GPL(__memcpy_flushcache);
 
 void memcpy_page_flushcache(char *to, struct page *page, size_t offset,
 		size_t len)
-- 
2.15.0
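
[Illustration: with this patch, a call with a constant length compiles
straight to the non-temporal store.  A hedged sketch - "sb" is a
hypothetical pointer into persistent memory, not something from the
patch:]

	u64 seq = 123;

	/* sb: hypothetical pmem-resident superblock.  cnt == 8 is a
	 * compile-time constant, so this expands to a single movntiq
	 * instead of a call to __memcpy_flushcache(). */
	memcpy_flushcache(&sb->seq_count, &seq, 8);

	/* a variable length still takes the out-of-line path */
	memcpy_flushcache(dst, src, len);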

^ permalink raw reply related	[flat|nested] 108+ messages in thread

* Re: [patch 4/4] dm-writecache: use new API for flushing
  2018-05-19  5:25 ` [patch 4/4] dm-writecache: use new API for flushing Mikulas Patocka
  2018-05-22  6:39     ` Christoph Hellwig
@ 2018-05-25  3:12   ` Dan Williams
  2018-05-25  6:17     ` Mikulas Patocka
  1 sibling, 1 reply; 108+ messages in thread
From: Dan Williams @ 2018-05-25  3:12 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Mike Snitzer, device-mapper development

On Fri, May 18, 2018 at 10:25 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> Use new API for flushing persistent memory.
>
> The problem is this:
> * on X86-64, non-temporal stores have the best performance
> * ARM64 doesn't have non-temporal stores, so we must flush cache. We
>   should flush cache as late as possible, because it performs better this
>   way.
>
> We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit
> data persistently, all three functions must be called.
>
> The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign
> (unlike pmem_memcpy) guarantees that 8-byte values are written atomically.
>
> On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
> pmem_commit is wmb.
>
> On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
> pmem_commit is empty.

I don't want to grow driver-local wrappers for pmem. You should use
memcpy_flushcache() directly and if an architecture does not define
memcpy_flushcache() then don't allow building dm-writecache, i.e. this
driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't
see a need to add a standalone flush operation if all relevant archs
provide memcpy_flushcache(). As for commit, I'd say just use wmb()
directly since all archs define it. Alternatively we could introduce
memcpy_flushcache_relaxed() to be the un-ordered version of the copy
routine and memcpy_flushcache() would imply a wmb().
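
[A sketch of this alternative - memcpy_flushcache_relaxed() does not
exist in the tree, it is only being proposed here; this is one way the
split could look:]

	static __always_inline void
	memcpy_flushcache_relaxed(void *dst, const void *src, size_t cnt)
	{
		__memcpy_flushcache(dst, src, cnt);	/* copy + flush, no fence */
	}

	static __always_inline void
	memcpy_flushcache(void *dst, const void *src, size_t cnt)
	{
		memcpy_flushcache_relaxed(dst, src, cnt);
		wmb();			/* the ordered variant implies the fence */
	}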

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 4/4] dm-writecache: use new API for flushing
  2018-05-25  3:12   ` Dan Williams
@ 2018-05-25  6:17     ` Mikulas Patocka
  2018-05-25 12:51         ` Mike Snitzer
  0 siblings, 1 reply; 108+ messages in thread
From: Mikulas Patocka @ 2018-05-25  6:17 UTC (permalink / raw)
  To: Dan Williams; +Cc: Mike Snitzer, device-mapper development



On Thu, 24 May 2018, Dan Williams wrote:

> On Fri, May 18, 2018 at 10:25 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> > Use new API for flushing persistent memory.
> >
> > The problem is this:
> > * on X86-64, non-temporal stores have the best performance
> > * ARM64 doesn't have non-temporal stores, so we must flush cache. We
> >   should flush cache as late as possible, because it performs better this
> >   way.
> >
> > We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit
> > data persistently, all three functions must be called.
> >
> > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign
> > (unlike pmem_memcpy) guarantees that 8-byte values are written atomically.
> >
> > On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
> > pmem_commit is wmb.
> >
> > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
> > pmem_commit is empty.
> 
> I don't want to grow driver-local wrappers for pmem. You should use
> > memcpy_flushcache() directly and if an architecture does not define
> memcpy_flushcache() then don't allow building dm-writecache, i.e. this
> driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't
> see a need to add a standalone flush operation if all relevant archs
> provide memcpy_flushcache(). As for commit, I'd say just use wmb()
> directly since all archs define it. Alternatively we could introduce
> memcpy_flushcache_relaxed() to be the un-ordered version of the copy
> routine and memcpy_flushcache() would imply a wmb().

But memcpy_flushcache() on ARM64 is slow.

Mikulas

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-25 12:51         ` Mike Snitzer
  0 siblings, 0 replies; 108+ messages in thread
From: Mike Snitzer @ 2018-05-25 12:51 UTC (permalink / raw)
  To: Dan Williams; +Cc: device-mapper development, Mikulas Patocka, linux-nvdimm

On Fri, May 25 2018 at  2:17am -0400,
Mikulas Patocka <mpatocka@redhat.com> wrote:

> 
> 
> On Thu, 24 May 2018, Dan Williams wrote:
> 
> > On Fri, May 18, 2018 at 10:25 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> > > Use new API for flushing persistent memory.
> > >
> > > The problem is this:
> > > * on X86-64, non-temporal stores have the best performance
> > > * ARM64 doesn't have non-temporal stores, so we must flush cache. We
> > >   should flush cache as late as possible, because it performs better this
> > >   way.
> > >
> > > We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit
> > > data persistently, all three functions must be called.
> > >
> > > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign
> > > (unlike pmem_memcpy) guarantees that 8-byte values are written atomically.
> > >
> > > On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
> > > pmem_commit is wmb.
> > >
> > > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
> > > pmem_commit is empty.
> > 
> > I don't want to grow driver-local wrappers for pmem. You should use
> > memcpy_flushcache() directly and if an architecture does not define
> > memcpy_flushcache() then don't allow building dm-writecache, i.e. this
> > driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't
> > see a need to add a standalone flush operation if all relevant archs
> > provide memcpy_flushcache(). As for commit, I'd say just use wmb()
> > directly since all archs define it. Alternatively we could introduce
> > memcpy_flushcache_relaxed() to be the un-ordered version of the copy
> > routine and memcpy_flushcache() would imply a wmb().
> 
> But memcpy_flushcache() on ARM64 is slow.

Yes, Dan can you please take some time to fully appreciate what this
small wrapper API is providing (maybe you've done that, but your recent
reply is mixed-message).  Seems you're keeping the innovation it
provides at arm's length.  Basically the PMEM APIs you've helped
construct are lacking, and forcing DM developers to own fixing them is
an inversion that only serves to stonewall.

Please, less time on stonewalling and more time lifting this wrapper
API; otherwise the dm-writecache local wrapper API is a near-term means
to an end.

I revised the dm-writecache patch header yesterday, please review:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.18&id=2105231db61b08752bc4247d2fe7838657700b0d
(the last paragraph in particular, I intend to move forward with this
wrapper unless someone on the PMEM side of the house steps up and lifts
it up between now and when the 4.18 merge window opens)

dm: add writecache target
The writecache target caches writes on persistent memory or SSD.
It is intended for databases or other programs that need extremely low
commit latency.

The writecache target doesn't cache reads because reads are supposed to
be cached in page cache in normal RAM.

The following describes the approach used to provide the most efficient
flushing of persistent memory on X86_64 vs ARM64:

* On X86_64 non-temporal stores (i.e. memcpy_flushcache) are faster
  than cached writes and cache flushing.

* The ARM64 architecture doesn't have non-temporal stores. So,
  memcpy_flushcache on ARM does memcpy (that writes data to the cache)
  and then flushes the cache.  But this eager cache flushing is slower
  than late cache flushing.

The optimal code sequence on ARM to write to persistent memory is to
call memcpy, then do something else, and then call arch_wb_cache_pmem as
late as possible. And this ARM-optimized code sequence is just horribly
slow on X86.

This issue can't be "fixed" in ARM-specific source code. The ARM
processor has characteristics such that eager cache flushing is slower
than late cache flushing - and that's it - you can't change processor
behavior.

We introduce a wrapper API for flushing persistent memory with functions
pmem_memcpy, pmem_flush and pmem_commit. To commit data persistently,
all three functions must be called.

The macro pmem_assign may be used instead of pmem_memcpy.  pmem_assign
(unlike pmem_memcpy) guarantees that 8-byte values are written
atomically.

On X86_64, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
pmem_commit is wmb.

On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
pmem_commit is empty.

It is clear that this wrapper API for flushing persistent memory needs
to be elevated out of this dm-writecache driver.  But that can happen
later without requiring DM developers to blaze new trails on pmem
specific implementation details/quirks (pmem developers need to clean up
their APIs given they are already spread across CONFIG_ARCH_HAS_PMEM_API
and CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE and they absolutely don't take
into account the duality of the different programming models needed to
achieve optimal cross-architecture use of persistent memory).

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <msnitzer@redhat.com>
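
[A condensed sketch of the wrapper API exactly as described above; the
actual dm-writecache.c definitions may differ in detail:]

	#if defined(CONFIG_X86_64)
	#define pmem_memcpy(dst, src, n)  memcpy_flushcache(dst, src, n)
	#define pmem_flush(addr, n)       do { } while (0)	/* no-op */
	#define pmem_commit()             wmb()
	#elif defined(CONFIG_ARM64)
	#define pmem_memcpy(dst, src, n)  memcpy(dst, src, n)
	#define pmem_flush(addr, n)       arch_wb_cache_pmem(addr, n)
	#define pmem_commit()             do { } while (0)	/* no-op */
	#endif

	/* committing data persistently always takes all three steps */
	pmem_memcpy(pmem_dst, buf, len);
	pmem_flush(pmem_dst, len);
	pmem_commit();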

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-25 15:57           ` Dan Williams
  0 siblings, 0 replies; 108+ messages in thread
From: Dan Williams @ 2018-05-25 15:57 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: device-mapper development, Mikulas Patocka, linux-nvdimm

On Fri, May 25, 2018 at 5:51 AM, Mike Snitzer <snitzer@redhat.com> wrote:
> On Fri, May 25 2018 at  2:17am -0400,
> Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>>
>>
>> On Thu, 24 May 2018, Dan Williams wrote:
>>
>> > On Fri, May 18, 2018 at 10:25 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>> > > Use new API for flushing persistent memory.
>> > >
>> > > The problem is this:
>> > > * on X86-64, non-temporal stores have the best performance
>> > > * ARM64 doesn't have non-temporal stores, so we must flush cache. We
>> > >   should flush cache as late as possible, because it performs better this
>> > >   way.
>> > >
>> > > We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit
>> > > data persistently, all three functions must be called.
>> > >
>> > > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign
>> > > (unlike pmem_memcpy) guarantees that 8-byte values are written atomically.
>> > >
>> > > On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
>> > > pmem_commit is wmb.
>> > >
>> > > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
>> > > pmem_commit is empty.
>> >
>> > I don't want to grow driver-local wrappers for pmem. You should use
>> > memcpy_flushcache() directly and if an architecture does not define
>> > memcpy_flushcache() then don't allow building dm-writecache, i.e. this
>> > driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't
>> > see a need to add a standalone flush operation if all relevant archs
>> > provide memcpy_flushcache(). As for commit, I'd say just use wmb()
>> > directly since all archs define it. Alternatively we could introduce
>> > memcpy_flushcache_relaxed() to be the un-ordered version of the copy
>> > routine and memcpy_flushcache() would imply a wmb().
>>
>> But memcpy_flushcache() on ARM64 is slow.
>
> Yes, Dan can you please take some time to fully appreciate what this
> small wrapper API is providing (maybe you've done that, but your recent
> reply is mixed-message).  Seems you're keeping the innovation it
> provides at arm's length.  Basically the PMEM APIs you've helped
> construct are lacking, and forcing DM developers to own fixing them is
> an inversion that only serves to stonewall.
>
> Please, less time on stonewalling and more time lifting this wrapper
> API; otherwise the dm-writecache local wrapper API is a near-term means
> to an end.
>
> I revised the dm-writecache patch header yesterday, please review:
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.18&id=2105231db61b08752bc4247d2fe7838657700b0d
> (the last paragraph in particular, I intend to move forward with this
> wrapper unless someone on the PMEM side of the house steps up and lifts
> it up between now and when the 4.18 merge window opens)
>
> dm: add writecache target
> The writecache target caches writes on persistent memory or SSD.
> It is intended for databases or other programs that need extremely low
> commit latency.
>
> The writecache target doesn't cache reads because reads are supposed to
> be cached in page cache in normal RAM.
>
> The following describes the approach used to provide the most efficient
> flushing of persistent memory on X86_64 vs ARM64:
>
> * On X86_64 non-temporal stores (i.e. memcpy_flushcache) are faster
>   than cached writes and cache flushing.
>
> * The ARM64 architecture doesn't have non-temporal stores. So,
>   memcpy_flushcache on ARM does memcpy (that writes data to the cache)
>   and then flushes the cache.  But this eager cache flushing is slower
>   than late cache flushing.
>
> The optimal code sequence on ARM to write to persistent memory is to
> call memcpy, then do something else, and then call arch_wb_cache_pmem as
> late as possible. And this ARM-optimized code sequence is just horribly
> slow on X86.
>
> This issue can't be "fixed" in ARM-specific source code. The ARM
> processor has characteristics such that eager cache flushing is slower
> than late cache flushing - and that's it - you can't change processor
> behavior.
>
> We introduce a wrapper API for flushing persistent memory with functions
> pmem_memcpy, pmem_flush and pmem_commit. To commit data persistently,
> all three functions must be called.
>
> The macro pmem_assign may be used instead of pmem_memcpy.  pmem_assign
> (unlike pmem_memcpy) guarantees that 8-byte values are written
> atomically.
>
> On X86_64, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
> pmem_commit is wmb.
>
> On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
> pmem_commit is empty.
>
> It is clear that this wrapper API for flushing persistent memory needs
> to be elevated out of this dm-writecache driver.  But that can happen
> later without requiring DM developers to blaze new trails on pmem
> specific implementation details/quirks (pmem developers need to clean up
> their APIs given they are already spread across CONFIG_ARCH_HAS_PMEM_API
> and CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE and they absolutely don't take
> into account the duality of the different programming models needed to
> achieve optimal cross-architecture use of persistent memory).

Right, so again, what is wrong with memcpy_flushcache_relaxed() +
wmb() or otherwise making memcpy_flushcache() ordered. I do not see
that as a trailblazing requirement, I see that as typical review and a
reduction of the operation space that you are proposing.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-26  7:02             ` Mikulas Patocka
  0 siblings, 0 replies; 108+ messages in thread
From: Mikulas Patocka @ 2018-05-26  7:02 UTC (permalink / raw)
  To: Dan Williams; +Cc: device-mapper development, Mike Snitzer, linux-nvdimm



On Fri, 25 May 2018, Dan Williams wrote:

> On Fri, May 25, 2018 at 5:51 AM, Mike Snitzer <snitzer@redhat.com> wrote:
> > On Fri, May 25 2018 at  2:17am -0400,
> > Mikulas Patocka <mpatocka@redhat.com> wrote:
> >
> >> On Thu, 24 May 2018, Dan Williams wrote:
> >>
> >> > I don't want to grow driver-local wrappers for pmem. You should use
> >> > memcpy_flushcache() directly and if an architecture does not define
> >> > memcpy_flushcache() then don't allow building dm-writecache, i.e. this
> >> > driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't
> >> > see a need to add a standalone flush operation if all relevant archs
> >> > provide memcpy_flushcache(). As for commit, I'd say just use wmb()
> >> > directly since all archs define it. Alternatively we could introduce
> >> > memcpy_flushcache_relaxed() to be the un-ordered version of the copy
> >> > routine and memcpy_flushcache() would imply a wmb().
> >>
> >> But memcpy_flushcache() on ARM64 is slow.
> 
> Right, so again, what is wrong with memcpy_flushcache_relaxed() +
> wmb() or otherwise making memcpy_flushcache() ordered. I do not see
> that as a trailblazing requirement, I see that as typical review and a
> reduction of the operation space that you are proposing.

memcpy_flushcache on ARM64 is generally the wrong thing to do, because it is
slower than memcpy and explicit cache flush.

Suppose that you want to write data to a block device and make it 
persistent. So you send a WRITE bio and then a FLUSH bio.

Now - how to implement these two bios on persistent memory:

On X86, the WRITE bio does memcpy_flushcache() and the FLUSH bio does 
wmb() - this is the optimal implementation.

But on ARM64, memcpy_flushcache() is suboptimal. On ARM64, the optimal 
implementation is that the WRITE bio does just memcpy() and the FLUSH bio 
does arch_wb_cache_pmem() on the affected range.

Why is memcpy_flushcache() suboptimal on ARM? The ARM architecture
doesn't have non-temporal stores. So, memcpy_flushcache() is implemented 
as memcpy() followed by a cache flush.

Now - if you flush the cache immediately after memcpy, the cache is full
of dirty lines and the cache-flushing code has to write these lines back 
and that is slow.

If you flush the cache some time after memcpy (i.e. when the FLUSH bio is 
received), the processor already flushed some part of the cache on its 
own, so the cache-flushing function has less work to do and it is faster.
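
[To make the two sequences concrete - a sketch only; the bio plumbing
is elided and the function names are illustrative:]

	/* x86-64: flush during the WRITE bio, fence on the FLUSH bio */
	static void pmem_write_bio_x86(void *pmem, const void *data, size_t len)
	{
		memcpy_flushcache(pmem, data, len);	/* non-temporal stores */
	}

	static void pmem_flush_bio_x86(void)
	{
		wmb();
	}

	/* arm64: plain copy on WRITE, flush as late as possible */
	static void pmem_write_bio_arm64(void *pmem, const void *data, size_t len)
	{
		memcpy(pmem, data, len);	/* data lands in the cache */
	}

	static void pmem_flush_bio_arm64(void *pmem, size_t len)
	{
		arch_wb_cache_pmem(pmem, len);	/* fewer dirty lines remain by now */
	}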

So the conclusion is - don't use memcpy_flushcache on ARM. This problem 
cannot be fixed by a better implementation of memcpy_flushcache.

Mikulas

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-26 15:26               ` Dan Williams
  0 siblings, 0 replies; 108+ messages in thread
From: Dan Williams @ 2018-05-26 15:26 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: device-mapper development, Mike Snitzer, linux-nvdimm

On Sat, May 26, 2018 at 12:02 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
> On Fri, 25 May 2018, Dan Williams wrote:
>
>> On Fri, May 25, 2018 at 5:51 AM, Mike Snitzer <snitzer@redhat.com> wrote:
>> > On Fri, May 25 2018 at  2:17am -0400,
>> > Mikulas Patocka <mpatocka@redhat.com> wrote:
>> >
>> >> On Thu, 24 May 2018, Dan Williams wrote:
>> >>
>> >> > I don't want to grow driver-local wrappers for pmem. You should use
>> >> > memcpy_flushcache() directly and if an architecture does not define
>> >> > memcpy_flushcache() then don't allow building dm-writecache, i.e. this
>> >> > driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't
>> >> > see a need to add a standalone flush operation if all relevant archs
>> >> > provide memcpy_flushcache(). As for commit, I'd say just use wmb()
>> >> > directly since all archs define it. Alternatively we could introduce
>> >> > memcpy_flushcache_relaxed() to be the un-ordered version of the copy
>> >> > routine and memcpy_flushcache() would imply a wmb().
>> >>
>> >> But memcpy_flushcache() on ARM64 is slow.
>>
>> Right, so again, what is wrong with memcpy_flushcache_relaxed() +
>> wmb() or otherwise making memcpy_flushcache() ordered. I do not see
>> that as a trailblazing requirement, I see that as typical review and a
>> reduction of the operation space that you are proposing.
>
> memcpy_flushcache on ARM64 is generally the wrong thing to do, because it is
> slower than memcpy and explicit cache flush.
>
> Suppose that you want to write data to a block device and make it
> persistent. So you send a WRITE bio and then a FLUSH bio.
>
> Now - how to implement these two bios on persistent memory:
>
> On X86, the WRITE bio does memcpy_flushcache() and the FLUSH bio does
> wmb() - this is the optimal implementation.
>
> But on ARM64, memcpy_flushcache() is suboptimal. On ARM64, the optimal
> implementation is that the WRITE bio does just memcpy() and the FLUSH bio
> does arch_wb_cache_pmem() on the affected range.
>
> Why is memcpy_flushcache() suboptimal on ARM? The ARM architecture
> doesn't have non-temporal stores. So, memcpy_flushcache() is implemented
> as memcpy() followed by a cache flush.
>
> Now - if you flush the cache immediately after memcpy, the cache is full
> of dirty lines and the cache-flushing code has to write these lines back
> and that is slow.
>
> If you flush the cache some time after memcpy (i.e. when the FLUSH bio is
> received), the processor already flushed some part of the cache on its
> own, so the cache-flushing function has less work to do and it is faster.
>
> So the conclusion is - don't use memcpy_flushcache on ARM. This problem
> cannot be fixed by a better implementation of memcpy_flushcache.

It sounds like ARM might be better off mapping its pmem as
write-through rather than write-back, and skipping the explicit cache
management altogether. You speak of "optimal" and "sub-optimal", but
what would be clearer is fio measurements of the relative IOPs and
latency profiles of the different approaches. The reason I am
continuing to push here is that reducing the operation space from
'copy-flush-commit' to just 'copy' or 'copy-commit' simplifies the
maintenance long term.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-28 13:32                 ` Mikulas Patocka
  0 siblings, 0 replies; 108+ messages in thread
From: Mikulas Patocka @ 2018-05-28 13:32 UTC (permalink / raw)
  To: Dan Williams; +Cc: device-mapper development, Mike Snitzer, linux-nvdimm



On Sat, 26 May 2018, Dan Williams wrote:

> On Sat, May 26, 2018 at 12:02 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> >
> >
> > On Fri, 25 May 2018, Dan Williams wrote:
> >
> >> On Fri, May 25, 2018 at 5:51 AM, Mike Snitzer <snitzer@redhat.com> wrote:
> >> > On Fri, May 25 2018 at  2:17am -0400,
> >> > Mikulas Patocka <mpatocka@redhat.com> wrote:
> >> >
> >> >> On Thu, 24 May 2018, Dan Williams wrote:
> >> >>
> >> >> > I don't want to grow driver-local wrappers for pmem. You should use
> >> >> > memcpy_flushcache() directly and if an architecture does not define
> >> >> > memcpy_flushcache() then don't allow building dm-writecache, i.e. this
> >> >> > driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't
> >> >> > see a need to add a standalone flush operation if all relevant archs
> >> >> > provide memcpy_flushcache(). As for commit, I'd say just use wmb()
> >> >> > directly since all archs define it. Alternatively we could introduce
> >> >> > memcpy_flushcache_relaxed() to be the un-ordered version of the copy
> >> >> > routine and memcpy_flushcache() would imply a wmb().
> >> >>
> >> >> But memcpy_flushcache() on ARM64 is slow.
> >>
> >> Right, so again, what is wrong with memcpy_flushcache_relaxed() +
> >> wmb() or otherwise making memcpy_flushcache() ordered. I do not see
> >> that as a trailblazing requirement, I see that as typical review and a
> >> reduction of the operation space that you are proposing.
> >
> > memcpy_flushcache on ARM64 is generally the wrong thing to do, because it is
> > slower than memcpy and explicit cache flush.
> >
> > Suppose that you want to write data to a block device and make it
> > persistent. So you send a WRITE bio and then a FLUSH bio.
> >
> > Now - how to implement these two bios on persistent memory:
> >
> > On X86, the WRITE bio does memcpy_flushcache() and the FLUSH bio does
> > wmb() - this is the optimal implementation.
> >
> > But on ARM64, memcpy_flushcache() is suboptimal. On ARM64, the optimal
> > implementation is that the WRITE bio does just memcpy() and the FLUSH bio
> > does arch_wb_cache_pmem() on the affected range.
> >
> > Why is memcpy_flushcache() suboptimal on ARM? The ARM architecture
> > doesn't have non-temporal stores. So, memcpy_flushcache() is implemented
> > as memcpy() followed by a cache flush.
> >
> > Now - if you flush the cache immediately after memcpy, the cache is full
> > of dirty lines and the cache-flushing code has to write these lines back
> > and that is slow.
> >
> > If you flush the cache some time after memcpy (i.e. when the FLUSH bio is
> > received), the processor already flushed some part of the cache on its
> > own, so the cache-flushing function has less work to do and it is faster.
> >
> > So the conclusion is - don't use memcpy_flushcache on ARM. This problem
> > cannot be fixed by a better implementation of memcpy_flushcache.
> 
> It sounds like ARM might be better off with mapping its pmem as
> write-through rather than write-back, and skip the explicit cache

I doubt it would perform well - write combining merges the writes into
larger segments - and write-through doesn't.

> management altogether. You speak of "optimal" and "sub-optimal", but
> what would be more clear is fio measurements of the relative IOPs and
> latency profiles of the different approaches. The reason I am
> continuing to push here is that reducing the operation space from
> 'copy-flush-commit' to just 'copy' or 'copy-commit' simplifies the
> maintenance long term.

I measured it (with nvme backing store) and late cache flushing has 12% 
better performance than eager flushing with memcpy_flushcache().

131836 4k iops - vs - 117016.
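
[Arithmetic check: 131836 / 117016 = ~1.127, i.e. late flushing
delivered about 12.7% more 4k IOPS than eager flushing in this test.]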

Mikulas

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-28 13:52               ` Mikulas Patocka
  0 siblings, 0 replies; 108+ messages in thread
From: Mikulas Patocka @ 2018-05-28 13:52 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, device-mapper development, Mike Snitzer, linux-nvdimm



On Tue, 22 May 2018, Dan Williams wrote:

> >> Except I'm being responsive.
> >
> > Except you're looking to immediately punt to linux-arm-kernel ;)
> 
> Well, I'm not, not really. I'm saying drop ARM support, it's not ready.

This is the worst thing to do - because once late cache flushing is 
dropped from the dm-writecache target, it could hardly be reintroduced 
again.

> >> I agree with Christoph that we should
> >> build pmem helpers at an architecture level and not per-driver. Let's
> >> make this driver depend on ARCH_HAS_PMEM_API and require ARM to catch
> >> up to x86 in this space. We already have PowerPC enabling PMEM API, so
> >> I don't see an unreasonable barrier to ask the same of ARM. This patch
> >> is not even cc'd to linux-arm-kernel. Has the subject been broached
> >> with them?
> >
> > No idea.  Not by me.
> >
> > The thing is, I'm no expert in pmem.  You are.  Coordinating the change
> > with ARM et al feels unnecessarily limiting and quickly moves outside my
> > control.
> >
> > Serious question: Why can't this code land in this dm-writecache target
> > and then be lifted (or obsoleted)?
> 
> Because we already have an API, and we don't want to promote local
> solutions to global problems, or carry  unnecessary technical debt.
> 
> >
> > But if you think it worthwhile to force ARM to step up then fine.  That
> > does limit the availability of using writecache on ARM while they get
> > the PMEM API together.
> >
> > I'll do whatever you want.. just put the smack down and tell me how it
> > is ;)
> 
> I'd say just control the variables you can control. Drop the ARM
> support if you want to move forward and propose extensions / updates

What do we gain by dropping it?

> to the pmem api for x86 and I'll help push those since I was involved
> in pushing the x86 pmem api in the first instance. That way you don't
> need to touch this driver as new archs add their pmem api enabling.

The pmem API is x86-centric - that's the problem.

Mikulas

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-28 17:41                 ` Dan Williams
  0 siblings, 0 replies; 108+ messages in thread
From: Dan Williams @ 2018-05-28 17:41 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Christoph Hellwig, device-mapper development, Mike Snitzer, linux-nvdimm

On Mon, May 28, 2018 at 6:52 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
> On Tue, 22 May 2018, Dan Williams wrote:
>
>> >> Except I'm being responsive.
>> >
>> > Except you're looking to immediately punt to linux-arm-kernel ;)
>>
>> Well, I'm not, not really. I'm saying drop ARM support, it's not ready.
>
> This is the worst thing to do - because once late cache flushing is
> dropped from the dm-writecache target, it could hardly be reintroduced
> again.
>
>> >> I agree with Christoph that we should
>> >> build pmem helpers at an architecture level and not per-driver. Let's
>> >> make this driver depend on ARCH_HAS_PMEM_API and require ARM to catch
>> >> up to x86 in this space. We already have PowerPC enabling PMEM API, so
>> >> I don't see an unreasonable barrier to ask the same of ARM. This patch
>> >> is not even cc'd to linux-arm-kernel. Has the subject been broached
>> >> with them?
>> >
>> > No idea.  Not by me.
>> >
>> > The thing is, I'm no expert in pmem.  You are.  Coordinating the change
>> > with ARM et al feels unnecessarily limiting and quickly moves outside my
>> > control.
>> >
>> > Serious question: Why can't this code land in this dm-writecache target
>> > and then be lifted (or obsoleted)?
>>
>> Because we already have an API, and we don't want to promote local
>> solutions to global problems, or carry unnecessary technical debt.
>>
>> >
>> > But if you think it worthwhile to force ARM to step up then fine.  That
>> > does limit the availability of using writecache on ARM while they get
>> > the PMEM API together.
>> >
>> > I'll do whatever you want.. just put the smack down and tell me how it
>> > is ;)
>>
>> I'd say just control the variables you can control. Drop the ARM
>> support if you want to move forward and propose extensions / updates
>
> What do we gain by dropping it?
>
>> to the pmem api for x86 and I'll help push those since I was involved
>> in pushing the x86 pmem api in the first instance. That way you don't
>> need to touch this driver as new archs add their pmem api enabling.
>
> The pmem API is x86-centric - that's the problem.

When I read your patch I came away with the impression that ARM had
not added memcpy_flushcache() yet and you were working around that
fact. Now that I look, ARM *does* define memcpy_flushcache() and
you're avoiding it. You use memcpy+arch_wb_pmem where arch_wb_pmem on
ARM64 is defined as __clean_dcache_area_pop(dst, cnt). The ARM
memcpy_flushcache() implementation is:

    memcpy(dst, src, cnt);
    __clean_dcache_area_pop(dst, cnt);

So, I do not see how what you're doing is any less work unless you are
flushing less than you copy?

If memcpy_flushcache() is slower than memcpy + arch_wb_pmem then the
ARM implementation is broken and that needs to be addressed not worked
around in a driver.
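
For concreteness, the two schemes being compared reduce to roughly the
following (a sketch only - the wrapper names are made up, this is not
code from the patch):

    /* Scheme 1: eager flush - what ARM64's memcpy_flushcache() does
     * per write: copy, then clean the dcache over the same range. */
    static void write_block_eager(void *pmem, const void *buf, size_t len)
    {
            memcpy_flushcache(pmem, buf, len);
    }

    /* Scheme 2: late flush - plain copy per write, one deferred dcache
     * clean over the whole dirty range when the FLUSH bio arrives. */
    static void write_block_late(void *pmem, const void *buf, size_t len)
    {
            memcpy(pmem, buf, len);
    }

    static void flush_bio_late(void *dirty_base, size_t dirty_len)
    {
            arch_wb_cache_pmem(dirty_base, dirty_len);
    }

Both schemes clean the same range in the end; any difference has to come
from what the hardware does between the copy and the clean.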

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-28 18:14                   ` Dan Williams
  0 siblings, 0 replies; 108+ messages in thread
From: Dan Williams @ 2018-05-28 18:14 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: device-mapper development, Mike Snitzer, linux-nvdimm

On Mon, May 28, 2018 at 6:32 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
> On Sat, 26 May 2018, Dan Williams wrote:
>
>> On Sat, May 26, 2018 at 12:02 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>> >
>> >
>> > On Fri, 25 May 2018, Dan Williams wrote:
>> >
>> >> On Fri, May 25, 2018 at 5:51 AM, Mike Snitzer <snitzer@redhat.com> wrote:
>> >> > On Fri, May 25 2018 at  2:17am -0400,
>> >> > Mikulas Patocka <mpatocka@redhat.com> wrote:
>> >> >
>> >> >> On Thu, 24 May 2018, Dan Williams wrote:
>> >> >>
>> >> >> > I don't want to grow driver-local wrappers for pmem. You should use
>> >> >> > memcpy_flushcache directly() and if an architecture does not define
>> >> >> > memcpy_flushcache() then don't allow building dm-writecache, i.e. this
>> >> >> > driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't
>> >> >> > see a need to add a standalone flush operation if all relevant archs
>> >> >> > provide memcpy_flushcache(). As for commit, I'd say just use wmb()
>> >> >> > directly since all archs define it. Alternatively we could introduce
>> >> >> > memcpy_flushcache_relaxed() to be the un-ordered version of the copy
>> >> >> > routine and memcpy_flushcache() would imply a wmb().
>> >> >>
>> >> >> But memcpy_flushcache() on ARM64 is slow.
>> >>
>> >> Right, so again, what is wrong with memcpy_flushcache_relaxed() +
>> >> wmb() or otherwise making memcpy_flushcache() ordered. I do not see
>> >> that as a trailblazing requirement, I see that as typical review and a
>> >> reduction of the operation space that you are proposing.
>> >
>> > memcpy_flushcache on ARM64 is generally the wrong thing to do, because it is
>> > slower than memcpy and explicit cache flush.
>> >
>> > Suppose that you want to write data to a block device and make it
>> > persistent. So you send a WRITE bio and then a FLUSH bio.
>> >
>> > Now - how to implement these two bios on persistent memory:
>> >
>> > On X86, the WRITE bio does memcpy_flushcache() and the FLUSH bio does
>> > wmb() - this is the optimal implementation.
>> >
>> > But on ARM64, memcpy_flushcache() is suboptimal. On ARM64, the optimal
>> > implementation is that the WRITE bio does just memcpy() and the FLUSH bio
>> > does arch_wb_cache_pmem() on the affected range.
>> >
>> > Why is memcpy_flushcache() suboptimal on ARM? The ARM architecture
>> > doesn't have non-temporal stores. So, memcpy_flushcache() is implemented
>> > as memcpy() followed by a cache flush.
>> >
>> > Now - if you flush the cache immediately after memcpy, the cache is full
>> > of dirty lines and the cache-flushing code has to write these lines back
>> > and that is slow.
>> >
>> > If you flush the cache some time after memcpy (i.e. when the FLUSH bio is
>> > received), the processor already flushed some part of the cache on its
>> > own, so the cache-flushing function has less work to do and it is faster.
>> >
>> > So the conclusion is - don't use memcpy_flushcache on ARM. This problem
>> > cannot be fixed by a better implementation of memcpy_flushcache.
>>
>> It sounds like ARM might be better off with mapping its pmem as
>> write-through rather than write-back, and skip the explicit cache
>
> I doubt it would perform well - write combining combines the writes into
> larger segments - and write-through doesn't.
>

Last I checked, write-through caching does not disable write combining.

>> management altogether. You speak of "optimal" and "sub-optimal", but
>> what would be more clear is fio measurements of the relative IOPs and
>> latency profiles of the different approaches. The reason I am
>> continuing to push here is that reducing the operation space from
>> 'copy-flush-commit' to just 'copy' or 'copy-commit' simplifies the
>> maintenance long term.
>
> I measured it (with nvme backing store) and late cache flushing has 12%
> better performance than eager flushing with memcpy_flushcache().

I assume what you're seeing is ARM64 over-flushing the amount of dirty
data so it becomes more efficient to do an amortized flush at the end?
However, that effectively makes memcpy_flushcache() unusable in the
way it can be used on x86. You claimed that ARM does not support
non-temporal stores, but it does, see the STNP instruction. I do not
want to see arch-specific optimizations in drivers, so either
write-through mappings are a potential answer to remove the need to
explicitly manage flushing, or just implement STNP hacks in
memcpy_flushcache() like you did with MOVNT on x86.

> 131836 4k iops - vs - 117016.

To be clear this is memcpy_flushcache() vs memcpy + flush?
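
For reference, a hypothetical sketch of what an STNP-based non-temporal
path could look like on ARM64 - illustrative only, not actual kernel
code, and whether the hint actually helps would have to be measured:

    /* Copy 16 bytes with STNP (store pair, non-temporal hint). Unlike
     * x86 MOVNT this is only a cacheability hint; it provides no extra
     * ordering, so barriers are still needed for persistence. */
    static inline void stnp_copy16(void *dst, const void *src)
    {
            u64 lo = ((const u64 *)src)[0];
            u64 hi = ((const u64 *)src)[1];

            asm volatile("stnp %x0, %x1, [%2]"
                         : : "r" (lo), "r" (hi), "r" (dst) : "memory");
    }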

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-30 13:07                     ` Mikulas Patocka
  0 siblings, 0 replies; 108+ messages in thread
From: Mikulas Patocka @ 2018-05-30 13:07 UTC (permalink / raw)
  To: Dan Williams; +Cc: device-mapper development, Mike Snitzer, linux-nvdimm



On Mon, 28 May 2018, Dan Williams wrote:

> On Mon, May 28, 2018 at 6:32 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> >
> >
> > On Sat, 26 May 2018, Dan Williams wrote:
> >
> >> On Sat, May 26, 2018 at 12:02 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> >> >
> >> >
> >> > On Fri, 25 May 2018, Dan Williams wrote:
> >> >
> >> >> On Fri, May 25, 2018 at 5:51 AM, Mike Snitzer <snitzer@redhat.com> wrote:
> >> >> > On Fri, May 25 2018 at  2:17am -0400,
> >> >> > Mikulas Patocka <mpatocka@redhat.com> wrote:
> >> >> >
> >> >> >> On Thu, 24 May 2018, Dan Williams wrote:
> >> >> >>
> >> >> >> > I don't want to grow driver-local wrappers for pmem. You should use
> >> >> >> > memcpy_flushcache directly() and if an architecture does not define
> >> >> >> > memcpy_flushcache() then don't allow building dm-writecache, i.e. this
> >> >> >> > driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't
> >> >> >> > see a need to add a standalone flush operation if all relevant archs
> >> >> >> > provide memcpy_flushcache(). As for commit, I'd say just use wmb()
> >> >> >> > directly since all archs define it. Alternatively we could introduce
> >> >> >> > memcpy_flushcache_relaxed() to be the un-ordered version of the copy
> >> >> >> > routine and memcpy_flushcache() would imply a wmb().
> >> >> >>
> >> >> >> But memcpy_flushcache() on ARM64 is slow.
> >> >>
> >> >> Right, so again, what is wrong with memcpy_flushcache_relaxed() +
> >> >> wmb() or otherwise making memcpy_flushcache() ordered. I do not see
> >> >> that as a trailblazing requirement, I see that as typical review and a
> >> >> reduction of the operation space that you are proposing.
> >> >
> >> > memcpy_flushcache on ARM64 is generally the wrong thing to do, because it is
> >> > slower than memcpy and explicit cache flush.
> >> >
> >> > Suppose that you want to write data to a block device and make it
> >> > persistent. So you send a WRITE bio and then a FLUSH bio.
> >> >
> >> > Now - how to implement these two bios on persistent memory:
> >> >
> >> > On X86, the WRITE bio does memcpy_flushcache() and the FLUSH bio does
> >> > wmb() - this is the optimal implementation.
> >> >
> >> > But on ARM64, memcpy_flushcache() is suboptimal. On ARM64, the optimal
> >> > implementation is that the WRITE bio does just memcpy() and the FLUSH bio
> >> > does arch_wb_cache_pmem() on the affected range.
> >> >
> >> > Why is memcpy_flushcache() suboptimal on ARM? The ARM architecture
> >> > doesn't have non-temporal stores. So, memcpy_flushcache() is implemented
> >> > as memcpy() followed by a cache flush.
> >> >
> >> > Now - if you flush the cache immediately after memcpy, the cache is full
> >> > of dirty lines and the cache-flushing code has to write these lines back
> >> > and that is slow.
> >> >
> >> > If you flush the cache some time after memcpy (i.e. when the FLUSH bio is
> >> > received), the processor already flushed some part of the cache on its
> >> > own, so the cache-flushing function has less work to do and it is faster.
> >> >
> >> > So the conclusion is - don't use memcpy_flushcache on ARM. This problem
> >> > cannot be fixed by a better implementation of memcpy_flushcache.
> >>
> >> It sounds like ARM might be better off with mapping its pmem as
> >> write-through rather than write-back, and skip the explicit cache
> >
> > I doubt it would perform well - write combining combines the writes into
> > larger segments - and write-through doesn't.
> >
> 
> Last I checked, write-through caching does not disable write combining.
> 
> >> management altogether. You speak of "optimal" and "sub-optimal", but
> >> what would be more clear is fio measurements of the relative IOPs and
> >> latency profiles of the different approaches. The reason I am
> >> continuing to push here is that reducing the operation space from
> >> 'copy-flush-commit' to just 'copy' or 'copy-commit' simplifies the
> >> maintenance long term.
> >
> > I measured it (with nvme backing store) and late cache flushing has 12%
> > better performance than eager flushing with memcpy_flushcache().
> 
> I assume what you're seeing is ARM64 over-flushing the amount of dirty
> data so it becomes more efficient to do an amortized flush at the end?
> However, that effectively makes memcpy_flushcache() unusable in the
> way it can be used on x86. You claimed that ARM does not support
> non-temporal stores, but it does, see the STNP instruction. I do not
> want to see arch-specific optimizations in drivers, so either
> write-through mappings are a potential answer to remove the need to
> explicitly manage flushing, or just implement STNP hacks in
> memcpy_flushcache() like you did with MOVNT on x86.
> 
> > 131836 4k iops - vs - 117016.
> 
> To be clear this is memcpy_flushcache() vs memcpy + flush?

I found out what caused the difference. I used dax_flush in the version of 
dm-writecache that I had on the ARM machine (with kernel 4.14, because 
it is the last version where dax on a ramdisk works) - and I thought that 
dax_flush flushes the cache, but it doesn't.

When I replaced dax_flush with arch_wb_cache_pmem, the performance 
difference between early flushing and late flushing disappeared.
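
For reference, dax_flush() in kernels of that era was gated on the dax
device advertising a write cache - roughly the following (abbreviated;
the exact form varies by version), and a ramdisk does not set that
flag, so the call silently does nothing there:

    /* drivers/dax/super.c (abbreviated): no-op unless the device has
     * a write cache that needs managing. */
    void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
    {
            if (unlikely(!dax_write_cache_enabled(dax_dev)))
                    return;

            arch_wb_cache_pmem(addr, size);
    }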

So I think we can remove this per-architecture switch from dm-writecache.

Mikulas

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-30 13:16                       ` Mike Snitzer
  0 siblings, 0 replies; 108+ messages in thread
From: Mike Snitzer @ 2018-05-30 13:16 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: device-mapper development, linux-nvdimm

On Wed, May 30 2018 at  9:07am -0400,
Mikulas Patocka <mpatocka@redhat.com> wrote:

> 
> 
> On Mon, 28 May 2018, Dan Williams wrote:
> 
> > On Mon, May 28, 2018 at 6:32 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> > >
> > > I measured it (with nvme backing store) and late cache flushing has 12%
> > > better performance than eager flushing with memcpy_flushcache().
> > 
> > I assume what you're seeing is ARM64 over-flushing the amount of dirty
> > data so it becomes more efficient to do an amortized flush at the end?
> > However, that effectively makes memcpy_flushcache() unusable in the
> > way it can be used on x86. You claimed that ARM does not support
> > non-temporal stores, but it does, see the STNP instruction. I do not
> > want to see arch specific optimizations in drivers, so either
> > write-through mappings is a potential answer to remove the need to
> > explicitly manage flushing, or just implement STNP hacks in
> > memcpy_flushcache() like you did with MOVNT on x86.
> > 
> > > 131836 4k iops - vs - 117016.
> > 
> > To be clear this is memcpy_flushcache() vs memcpy + flush?
> 
> I found out what caused the difference. I used dax_flush in the version of 
> dm-writecache that I had on the ARM machine (with kernel 4.14, because 
> it is the last version where dax on a ramdisk works) - and I thought that 
> dax_flush flushes the cache, but it doesn't.
> 
> When I replaced dax_flush with arch_wb_cache_pmem, the performance 
> difference between early flushing and late flushing disappeared.
> 
> So I think we can remove this per-architecture switch from dm-writecache.

That is really great news, can you submit an incremental patch that
layers on top of the linux-dm.git 'dm-4.18' branch?

Thanks,
Mike

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-30 13:21                         ` Mikulas Patocka
  0 siblings, 0 replies; 108+ messages in thread
From: Mikulas Patocka @ 2018-05-30 13:21 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: device-mapper development, linux-nvdimm



On Wed, 30 May 2018, Mike Snitzer wrote:

> That is really great news, can you submit an incremental patch that
> layers on top of the linux-dm.git 'dm-4.18' branch?
> 
> Thanks,
> Mike

I've sent the current version that I have. I fixed the bugs that were 
reported here (missing DAX, dm_bufio_client_create, __branch_check__ 
long->int truncation).

Mikulas

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-30 13:26                           ` Mike Snitzer
  0 siblings, 0 replies; 108+ messages in thread
From: Mike Snitzer @ 2018-05-30 13:26 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: device-mapper development, linux-nvdimm

On Wed, May 30 2018 at  9:21am -0400,
Mikulas Patocka <mpatocka@redhat.com> wrote:

> 
> 
> On Wed, 30 May 2018, Mike Snitzer wrote:
> 
> > That is really great news, can you submit an incremental patch that
> > layers on top of the linux-dm.git 'dm-4.18' branch?
> > 
> > Thanks,
> > Mike
> 
> I've sent the current version that I have. I fixed the bugs that were 
> reported here (missing DAX, dm_bufio_client_create, __branch_check__ 
> long->int truncation).

OK, but a monolithic dm-writecache.c is no longer useful to me.  I can
drop Arnd's gcc warning fix (with the idea that Ingo or Steve will take
your __branch_check__ patch).  Not sure what the dm_bufio_client_create
fix is... must've missed a report about that.

Anyway, point is we're on to a different phase of dm-writecache.c's
development.  I've picked it up and am trying to get it ready for the
4.18 merge window (likely opening Sunday).  Therefore it needs to be in
a git tree, and incremental changes overlayed.  I cannot be rebasing at
this late stage in the 4.18 development window.

Thanks,
Mike

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-30 13:33                             ` Mikulas Patocka
  0 siblings, 0 replies; 108+ messages in thread
From: Mikulas Patocka @ 2018-05-30 13:33 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: device-mapper development, linux-nvdimm



On Wed, 30 May 2018, Mike Snitzer wrote:

> On Wed, May 30 2018 at  9:21am -0400,
> Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > 
> > 
> > On Wed, 30 May 2018, Mike Snitzer wrote:
> > 
> > > That is really great news, can you submit an incremental patch that
> > > layers on top of the linux-dm.git 'dm-4.18' branch?
> > > 
> > > Thanks,
> > > Mike
> > 
> > I've sent the current version that I have. I fixed the bugs that were 
> > reported here (missing DAX, dm_bufio_client_create, __branch_check__ 
> > long->int truncation).
> 
> OK, but a monolithic dm-writecache.c is no longer useful to me.  I can
> drop Arnd's gcc warning fix (with the idea that Ingo or Steve will take
> your __branch_check__ patch).  Not sure what the dm_bufio_client_create
> fix is... must've missed a report about that.
> 
> Anyway, point is we're on to a different phase of dm-writecache.c's
> development.  I've picked it up and am trying to get it ready for the
> 4.18 merge window (likely opening Sunday).  Therefore it needs to be in
> a git tree, and incremental changes overlayed.  I cannot be rebasing at
> this late stage in the 4.18 development window.
> 
> Thanks,
> Mike

I downloaded dm-writecache from your git repository some time ago - but 
you have changed a lot of useless things (e.g. reordering the fields in the 
structure) since then - so you'll have to merge the changes.

You dropped the latency measuring code - why? We still would like to 
benchmark the driver.

Mikulas

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [dm-devel] [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-30 13:42                   ` Jeff Moyer
  0 siblings, 0 replies; 108+ messages in thread
From: Jeff Moyer @ 2018-05-30 13:42 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, device-mapper development, Mikulas Patocka,
	Mike Snitzer, linux-nvdimm

Dan Williams <dan.j.williams@intel.com> writes:

> When I read your patch I came away with the impression that ARM had
> not added memcpy_flushcache() yet and you were working around that
> fact. Now that I look, ARM *does* define memcpy_flushcache() and
> you're avoiding it. You use memcpy+arch_wb_pmem where arch_wb_pmem on
> ARM64 is defined as __clean_dcache_area_pop(dst, cnt). The ARM
> memcpy_flushcache() implementation is:
>
>     memcpy(dst, src, cnt);
>     __clean_dcache_area_pop(dst, cnt);
>
> So, I do not see how what you're doing is any less work unless you are
> flushing less than you copy?
>
> If memcpy_flushcache() is slower than memcpy + arch_wb_pmem then the
> ARM implementation is broken and that needs to be addressed not worked
> around in a driver.

I think Mikulas wanted to batch up multiple copies and flush at the
end.  According to his commit message, that batching gained him 2%
performance.

-Jeff

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [dm-devel] [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-30 13:51                     ` Mikulas Patocka
  0 siblings, 0 replies; 108+ messages in thread
From: Mikulas Patocka @ 2018-05-30 13:51 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Christoph Hellwig, device-mapper development, Mike Snitzer, linux-nvdimm



On Wed, 30 May 2018, Jeff Moyer wrote:

> Dan Williams <dan.j.williams@intel.com> writes:
> 
> > When I read your patch I came away with the impression that ARM had
> > not added memcpy_flushcache() yet and you were working around that
> > fact. Now that I look, ARM *does* define memcpy_flushcache() and
> > you're avoiding it. You use memcpy+arch_wb_pmem where arch_wb_pmem on
> > ARM64 is defined as __clean_dcache_area_pop(dst, cnt). The ARM
> > memcpy_flushcache() implementation is:
> >
> >     memcpy(dst, src, cnt);
> >     __clean_dcache_area_pop(dst, cnt);
> >
> > So, I do not see how what you're doing is any less work unless you are
> > flushing less than you copy?
> >
> > If memcpy_flushcache() is slower than memcpy + arch_wb_pmem then the
> > ARM implementation is broken and that needs to be addressed not worked
> > around in a driver.
> 
> I think Mikulas wanted to batch up multiple copies and flush at the
> end.  According to his commit message, that batching gained him 2%
> performance.
> 
> -Jeff

No - this 2% difference is inlined memcpy_flushcache() vs out-of-line 
memcpy_flushcache().

I thought that dax_flush() performed 12% better than memcpy_flushcache() - 
but the reason it performed better was that it was not flushing the 
cache at all.

Mikulas

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [dm-devel] [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-30 13:52                     ` Jeff Moyer
  0 siblings, 0 replies; 108+ messages in thread
From: Jeff Moyer @ 2018-05-30 13:52 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, device-mapper development, Mikulas Patocka,
	Mike Snitzer, linux-nvdimm

Jeff Moyer <jmoyer@redhat.com> writes:

> Dan Williams <dan.j.williams@intel.com> writes:
>
>> When I read your patch I came away with the impression that ARM had
>> not added memcpy_flushcache() yet and you were working around that
>> fact. Now that I look, ARM *does* define memcpy_flushcache() and
>> you're avoiding it. You use memcpy+arch_wb_pmem where arch_wb_pmem on
>> ARM64 is defined as __clean_dcache_area_pop(dst, cnt). The ARM
>> memcpy_flushcache() implementation is:
>>
>>     memcpy(dst, src, cnt);
>>     __clean_dcache_area_pop(dst, cnt);
>>
>> So, I do not see how what you're doing is any less work unless you are
>> flushing less than you copy?
>>
>> If memcpy_flushcache() is slower than memcpy + arch_wb_pmem then the
>> ARM implementation is broken and that needs to be addressed not worked
>> around in a driver.
>
> I think Mikulas wanted to batch up multiple copies and flush at the
> end.  According to his commit message, that batching gained him 2%
> performance.

Nevermind me, I just caught up with the rest of the thread.  :)

-Jeff

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-30 13:54                               ` Mike Snitzer
  0 siblings, 0 replies; 108+ messages in thread
From: Mike Snitzer @ 2018-05-30 13:54 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: device-mapper development, linux-nvdimm

On Wed, May 30 2018 at  9:33am -0400,
Mikulas Patocka <mpatocka@redhat.com> wrote:

> 
> 
> On Wed, 30 May 2018, Mike Snitzer wrote:
> 
> > On Wed, May 30 2018 at  9:21am -0400,
> > Mikulas Patocka <mpatocka@redhat.com> wrote:
> > 
> > > 
> > > 
> > > On Wed, 30 May 2018, Mike Snitzer wrote:
> > > 
> > > > That is really great news, can you submit an incremental patch that
> > > > layers on top of the linux-dm.git 'dm-4.18' branch?
> > > > 
> > > > Thanks,
> > > > Mike
> > > 
> > > I've sent the current version that I have. I fixed the bugs that were 
> > > reported here (missing DAX, dm_bufio_client_create, __branch_check__ 
> > > long->int truncation).
> > 
> > OK, but a monolithic dm-writecache.c is no longer useful to me.  I can
> > drop Arnd's gcc warning fix (with the idea that Ingo or Steve will take
> > your __branch_check__ patch).  Not sure what the dm_bufio_client_create
> > fix is... must've missed a report about that.
> > 
> > Anyway, point is we're on to a different phase of dm-writecache.c's
> > development.  I've picked it up and am trying to get it ready for the
> > 4.18 merge window (likely opening Sunday).  Therefore it needs to be in
> > a git tree, and incremental changes overlayed.  I cannot be rebasing at
> > this late stage in the 4.18 development window.
> > 
> > Thanks,
> > Mike
> 
> I downloaded dm-writecache from your git repository some time ago - but 
> you have changed a lot of useless things (e.g. reordering the fields in the 
> structure) since then - so you'll have to merge the changes.

Fine, I'll deal with it.  Reordering the fields eliminated holes in the
structure and reduced the number of struct members spanning cache lines.
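
The effect, as a generic illustration (made-up fields, not the actual
struct dm_writecache layout):

    /* Declaration order leaves 13 bytes of padding on 64-bit: */
    struct example_holes {
            bool    a;      /* 1 byte, then a 7-byte hole */
            u64     b;      /* 8 bytes */
            u16     c;      /* 2 bytes, then a 6-byte hole */
            u64     d;      /* 8 bytes */
    };                      /* sizeof == 32 */

    /* Largest members first - the holes disappear: */
    struct example_packed {
            u64     b;
            u64     d;
            u16     c;
            bool    a;      /* 19 bytes of data, padded to 24 */
    };                      /* sizeof == 24 */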
 
> You dropped the latency measuring code - why? We still would like to 
> benchmark the driver.

I saved that patch.  It is a development-only patch.  I tried to have
you work on formalizing the code further by making the functions
actually get called, and by wrapping the code with
CONFIG_DM_WRITECACHE_MEASURE_LATENCY.  In the end I dropped it for now
and we'll just have to apply the patch when we need to benchmark.

Here is the current patch if you'd like to improve it (e.g. actually
call measure_latency_start and measure_latency_end in places) -- I
seem to have lost my incremental change to switch over to
CONFIG_DM_WRITECACHE_MEASURE_LATENCY (likely due to rebase); can worry
about that later.

This is based on the dm-4.18 branch.

From 20a7c123271741cb7260154b68730942417e803a Mon Sep 17 00:00:00 2001
From: Mikulas Patocka <mpatocka@redhat.com>
Date: Tue, 22 May 2018 15:54:53 -0400
Subject: [PATCH] dm writecache: add the ability to measure some latencies

Developer-only code that won't go upstream (as-is).

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/md/dm-writecache.c | 94 ++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 87 insertions(+), 7 deletions(-)

diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
index 844c4fb2fcfc..e733a14faf8f 100644
--- a/drivers/md/dm-writecache.c
+++ b/drivers/md/dm-writecache.c
@@ -28,6 +28,8 @@
 #define AUTOCOMMIT_BLOCKS_PMEM		64
 #define AUTOCOMMIT_MSEC			1000
 
+//#define WC_MEASURE_LATENCY
+
 #define BITMAP_GRANULARITY	65536
 #if BITMAP_GRANULARITY < PAGE_SIZE
 #undef BITMAP_GRANULARITY
@@ -217,6 +219,15 @@ struct dm_writecache {
 	struct dm_kcopyd_client *dm_kcopyd;
 	unsigned long *dirty_bitmap;
 	unsigned dirty_bitmap_size;
+
+#ifdef WC_MEASURE_LATENCY
+	ktime_t lock_acquired_time;
+	ktime_t max_lock_held;
+	ktime_t max_lock_wait;
+	ktime_t max_freelist_wait;
+	ktime_t measure_latency_time;
+	ktime_t max_measure_latency;
+#endif
 };
 
 #define WB_LIST_INLINE		16
@@ -243,15 +254,60 @@ struct copy_struct {
 DECLARE_DM_KCOPYD_THROTTLE_WITH_MODULE_PARM(dm_writecache_throttle,
 					    "A percentage of time allocated for data copying");
 
-static void wc_lock(struct dm_writecache *wc)
+static inline void measure_latency_start(struct dm_writecache *wc)
+{
+#ifdef WC_MEASURE_LATENCY
+	wc->measure_latency_time = ktime_get();
+#endif
+}
+
+static inline void measure_latency_end(struct dm_writecache *wc, unsigned long n)
 {
+#ifdef WC_MEASURE_LATENCY
+	ktime_t now = ktime_get();
+	if (now - wc->measure_latency_time > wc->max_measure_latency) {
+		wc->max_measure_latency = now - wc->measure_latency_time;
+		printk(KERN_DEBUG "dm-writecache: measured latency %lld.%03lldus, %lu steps\n",
+		       wc->max_measure_latency / 1000, wc->max_measure_latency % 1000, n);
+	}
+#endif
+}
+
+static void __wc_lock(struct dm_writecache *wc, int line)
+{
+#ifdef WC_MEASURE_LATENCY
+	ktime_t before, after;
+	before = ktime_get();
+#endif
 	mutex_lock(&wc->lock);
+#ifdef WC_MEASURE_LATENCY
+	after = ktime_get();
+	if (unlikely(after - before > wc->max_lock_wait)) {
+		wc->max_lock_wait = after - before;
+		printk(KERN_DEBUG "dm-writecache: waiting for lock for %lld.%03lldus at %d\n",
+		       wc->max_lock_wait / 1000, wc->max_lock_wait % 1000, line);
+		after = ktime_get();
+	}
+	wc->lock_acquired_time = after;
+#endif
 }
+#define wc_lock(wc)	__wc_lock(wc, __LINE__)
 
-static void wc_unlock(struct dm_writecache *wc)
+static void __wc_unlock(struct dm_writecache *wc, int line)
 {
+#ifdef WC_MEASURE_LATENCY
+	ktime_t now = ktime_get();
+	if (now - wc->lock_acquired_time > wc->max_lock_held) {
+		wc->max_lock_held = now - wc->lock_acquired_time;
+		printk(KERN_DEBUG "dm-writecache: lock held for %lld.%03lldus at %d\n",
+		       wc->max_lock_held / 1000, wc->max_lock_held % 1000, line);
+	}
+#endif
 	mutex_unlock(&wc->lock);
 }
+#define wc_unlock(wc)	__wc_unlock(wc, __LINE__)
+
+#define wc_unlock_long(wc)	mutex_unlock(&wc->lock)
 
 #if IS_ENABLED(CONFIG_DAX_DRIVER)
 static int persistent_memory_claim(struct dm_writecache *wc)
@@ -293,6 +349,10 @@ static int persistent_memory_claim(struct dm_writecache *wc)
 		r = -EOPNOTSUPP;
 		goto err2;
 	}
+#ifdef WC_MEASURE_LATENCY
+	printk(KERN_DEBUG "dm-writecache: device %s, pfn %016llx\n",
+	       wc->ssd_dev->name, pfn.val);
+#endif
 	if (da != p) {
 		long i;
 		wc->memory_map = NULL;
@@ -701,16 +761,35 @@ static void writecache_free_entry(struct dm_writecache *wc, struct wc_entry *e)
 		swake_up(&wc->freelist_wait);
 }
 
-static void writecache_wait_on_freelist(struct dm_writecache *wc)
+static void __writecache_wait_on_freelist(struct dm_writecache *wc, bool measure, int line)
 {
 	DECLARE_SWAITQUEUE(wait);
+#ifdef WC_MEASURE_LATENCY
+	ktime_t before, after;
+#endif
 
 	prepare_to_swait(&wc->freelist_wait, &wait, TASK_UNINTERRUPTIBLE);
 	wc_unlock(wc);
+#ifdef WC_MEASURE_LATENCY
+	if (measure)
+		before = ktime_get();
+#endif
 	io_schedule();
 	finish_swait(&wc->freelist_wait, &wait);
+#ifdef WC_MEASURE_LATENCY
+	if (measure) {
+		after = ktime_get();
+		if (unlikely(after - before > wc->max_freelist_wait)) {
+			wc->max_freelist_wait = after - before;
+			printk(KERN_DEBUG "dm-writecache: waiting on freelist for %lld.%03lldus at %d\n",
+			       wc->max_freelist_wait / 1000, wc->max_freelist_wait % 1000, line);
+		}
+	}
+#endif
 	wc_lock(wc);
 }
+#define writecache_wait_on_freelist(wc)		__writecache_wait_on_freelist(wc, true, __LINE__)
+#define writecache_wait_on_freelist_long(wc)	__writecache_wait_on_freelist(wc, false, __LINE__)
 
 static void writecache_poison_lists(struct dm_writecache *wc)
 {
@@ -890,7 +969,7 @@ static void writecache_suspend(struct dm_target *ti)
 
 	writecache_poison_lists(wc);
 
-	wc_unlock(wc);
+	wc_unlock_long(wc);
 }
 
 static int writecache_alloc_entries(struct dm_writecache *wc)
@@ -1001,7 +1080,7 @@ static void writecache_resume(struct dm_target *ti)
 		writecache_commit_flushed(wc);
 	}
 
-	wc_unlock(wc);
+	wc_unlock_long(wc);
 }
 
 static int process_flush_mesg(unsigned argc, char **argv, struct dm_writecache *wc)
@@ -1472,8 +1551,9 @@ static void __writeback_throttle(struct dm_writecache *wc, struct writeback_list
 	if (unlikely(wc->max_writeback_jobs)) {
 		if (READ_ONCE(wc->writeback_size) - wbl->size >= wc->max_writeback_jobs) {
 			wc_lock(wc);
-			while (wc->writeback_size - wbl->size >= wc->max_writeback_jobs)
-				writecache_wait_on_freelist(wc);
+			while (wc->writeback_size - wbl->size >= wc->max_writeback_jobs) {
+				writecache_wait_on_freelist_long(wc);
+			}
 			wc_unlock(wc);
 		}
 	}
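
For benchmarking, a minimal sketch of how the two measurement hooks above
could be wired up - the patch itself does not call them anywhere yet, and
the loop below is purely illustrative:

	unsigned long n = 0;

	measure_latency_start(wc);
	while (do_one_writeback_step(wc))	/* hypothetical unit of work */
		n++;
	measure_latency_end(wc, n);		/* reports if a new maximum was observed */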

^ permalink raw reply related	[flat|nested] 108+ messages in thread

* Re: [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-30 14:09                                 ` Mikulas Patocka
  0 siblings, 0 replies; 108+ messages in thread
From: Mikulas Patocka @ 2018-05-30 14:09 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: device-mapper development, linux-nvdimm



On Wed, 30 May 2018, Mike Snitzer wrote:

> On Wed, May 30 2018 at  9:33am -0400,
> Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > 
> > 
> > On Wed, 30 May 2018, Mike Snitzer wrote:
> > 
> > > On Wed, May 30 2018 at  9:21am -0400,
> > > Mikulas Patocka <mpatocka@redhat.com> wrote:
> > > 
> > > > 
> > > > 
> > > > On Wed, 30 May 2018, Mike Snitzer wrote:
> > > > 
> > > > > That is really great news, can you submit an incremental patch that
> > > > > layers on top of the linux-dm.git 'dm-4.18' branch?
> > > > > 
> > > > > Thanks,
> > > > > Mike
> > > > 
> > > > I've sent the current version that I have. I fixed the bugs that were 
> > > > reported here (missing DAX, dm_bufio_client_create, __branch_check__ 
> > > > long->int truncation).
> > > 
> > > OK, but a monolithic dm-writecache.c is no longer useful to me.  I can
> > > drop Arnd's gcc warning fix (with the idea that Ingo or Steve will take
> > > your __branch_check__ patch).  Not sure what the dm_bufio_client_create
> > > fix is... must've missed a report about that.
> > > 
> > > Anyway, the point is we're onto a different phase of dm-writecache.c's
> > > development.  I've picked it up and am trying to get it ready for the
> > > 4.18 merge window (likely opening Sunday).  Therefore it needs to be in
> > > a git tree, and incremental changes overlaid.  I cannot be rebasing at
> > > this late stage in the 4.18 development window.
> > > 
> > > Thanks,
> > > Mike
> > 
> > I downloaded dm-writecache from your git repository some time ago - but 
> > you changed a lot of useless things (e.g. reordering the fields in the 
> > structure) since that time - so, you'll have to merge the changes.
> 
> Fine I'll deal with it.  reordering the fields eliminated holes in the
> structure and reduced struct members spanning cache lines.
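
For illustration, a minimal example of the kind of reordering being
described - the struct is made up, not taken from dm-writecache:

	struct before {		/* x86-64: 4-byte hole after 'a', 24 bytes */
		u32 a;
		u64 b;
		u32 c;		/* plus 4 bytes of tail padding */
	};

	struct after {		/* same fields, no holes, 16 bytes */
		u64 b;
		u32 a;
		u32 c;
	};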

And what about this?
#define WC_MODE_PMEM(wc)                        ((wc)->pmem_mode)

The code I had simply allowed the compiler to optimize out the 
persistent-memory code when DM_WRITECACHE_ONLY_SSD is defined - and you 
deleted it.

Most architectures don't have persistent memory and the dm-writecache 
driver could work in ssd-only mode on them. On these architectures, I 
define
#define WC_MODE_PMEM(wc)                        false
- and the compiler will just automatically remove the tests for that 
condition and the unused branch. It also eliminates unused static 
functions.
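
A minimal sketch of that pattern - the two commit helpers are hypothetical
names, present only to show the dead-code elimination:

#ifdef DM_WRITECACHE_ONLY_SSD
#define WC_MODE_PMEM(wc)	false
#else
#define WC_MODE_PMEM(wc)	((wc)->pmem_mode)
#endif

	if (WC_MODE_PMEM(wc))
		persistent_memory_commit(wc);	/* statically dead in SSD-only builds */
	else
		ssd_commit(wc);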

Mikulas

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-30 14:21                                   ` Mike Snitzer
  0 siblings, 0 replies; 108+ messages in thread
From: Mike Snitzer @ 2018-05-30 14:21 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: device-mapper development, linux-nvdimm

On Wed, May 30 2018 at 10:09am -0400,
Mikulas Patocka <mpatocka@redhat.com> wrote:

> 
> 
> On Wed, 30 May 2018, Mike Snitzer wrote:
> 
> > On Wed, May 30 2018 at  9:33am -0400,
> > Mikulas Patocka <mpatocka@redhat.com> wrote:
> > 
> > > 
> > > 
> > > On Wed, 30 May 2018, Mike Snitzer wrote:
> > > 
> > > > On Wed, May 30 2018 at  9:21am -0400,
> > > > Mikulas Patocka <mpatocka@redhat.com> wrote:
> > > > 
> > > > > 
> > > > > 
> > > > > On Wed, 30 May 2018, Mike Snitzer wrote:
> > > > > 
> > > > > > That is really great news, can you submit an incremental patch that
> > > > > > layers on top of the linux-dm.git 'dm-4.18' branch?
> > > > > > 
> > > > > > Thanks,
> > > > > > Mike
> > > > > 
> > > > > I've sent the current version that I have. I fixed the bugs that were 
> > > > > reported here (missing DAX, dm_bufio_client_create, __branch_check__ 
> > > > > long->int truncation).
> > > > 
> > > > OK, but a monolithic dm-writecache.c is no longer useful to me.  I can
> > > > drop Arnd's gcc warning fix (with the idea that Ingo or Steve will take
> > > > your __branch_check__ patch).  Not sure what the dm_bufio_client_create
> > > > fix is... must've missed a report about that.
> > > > 
> > > > Anyway, the point is we're onto a different phase of dm-writecache.c's
> > > > development.  I've picked it up and am trying to get it ready for the
> > > > 4.18 merge window (likely opening Sunday).  Therefore it needs to be in
> > > > a git tree, and incremental changes overlaid.  I cannot be rebasing at
> > > > this late stage in the 4.18 development window.
> > > > 
> > > > Thanks,
> > > > Mike
> > > 
> > > I downloaded dm-writecache from your git repository some time ago - but 
> > > you changed a lot of useless things (e.g. reordering the fields in the 
> > > structure) since that time - so, you'll have to merge the changes.
> > 
> > Fine I'll deal with it.  reordering the fields eliminated holes in the
> > structure and reduced struct members spanning cache lines.
> 
> And what about this?
> #define WC_MODE_PMEM(wc)                        ((wc)->pmem_mode)
> 
> The code I had simply allowed the compiler to optimize out the 
> persistent-memory code when DM_WRITECACHE_ONLY_SSD is defined - and you 
> deleted it.
> 
> Most architectures don't have persistent memory and the dm-writecache 
> driver could work in ssd-only mode on them. On these architectures, I 
> define
> #define WC_MODE_PMEM(wc)                        false
> - and the compiler will just automatically remove the tests for that 
> condition and the unused branch. It also eliminates unused static 
> functions.

This level of microoptimization can be backfilled.  But as it was, there
were too many #defines.  And I'm really not concerned with eliminating
unused static functions for this case.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-30 14:46                                     ` Mikulas Patocka
  0 siblings, 0 replies; 108+ messages in thread
From: Mikulas Patocka @ 2018-05-30 14:46 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: device-mapper development, linux-nvdimm



On Wed, 30 May 2018, Mike Snitzer wrote:

> On Wed, May 30 2018 at 10:09am -0400,
> Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > 
> > 
> > On Wed, 30 May 2018, Mike Snitzer wrote:
> > 
> > > On Wed, May 30 2018 at  9:33am -0400,
> > > Mikulas Patocka <mpatocka@redhat.com> wrote:
> > > 
> > > > 
> > > > 
> > > > On Wed, 30 May 2018, Mike Snitzer wrote:
> > > > 
> > > > > On Wed, May 30 2018 at  9:21am -0400,
> > > > > Mikulas Patocka <mpatocka@redhat.com> wrote:
> > > > > 
> > > > > > 
> > > > > > 
> > > > > > On Wed, 30 May 2018, Mike Snitzer wrote:
> > > > > > 
> > > > > > > That is really great news, can you submit an incremental patch that
> > > > > > > layers on top of the linux-dm.git 'dm-4.18' branch?
> > > > > > > 
> > > > > > > Thanks,
> > > > > > > Mike
> > > > > > 
> > > > > > I've sent the current version that I have. I fixed the bugs that were 
> > > > > > reported here (missing DAX, dm_bufio_client_create, __branch_check__ 
> > > > > > long->int truncation).
> > > > > 
> > > > > OK, but a monolithic dm-writecache.c is no longer useful to me.  I can
> > > > > drop Arnd's gcc warning fix (with the idea that Ingo or Steve will take
> > > > > your __branch_check__ patch).  Not sure what the dm_bufio_client_create
> > > > > fix is... must've missed a report about that.
> > > > > 
> > > > > Anyway, the point is we're onto a different phase of dm-writecache.c's
> > > > > development.  I've picked it up and am trying to get it ready for the
> > > > > 4.18 merge window (likely opening Sunday).  Therefore it needs to be in
> > > > > a git tree, and incremental changes overlaid.  I cannot be rebasing at
> > > > > this late stage in the 4.18 development window.
> > > > > 
> > > > > Thanks,
> > > > > Mike
> > > > 
> > > > I downloaded dm-writecache from your git repository some time ago - but 
> > > > you changed a lot of useless things (e.g. reordering the fields in the 
> > > > structure) since that time - so, you'll have to merge the changes.
> > > 
> > > Fine I'll deal with it.  reordering the fields eliminated holes in the
> > > structure and reduced struct members spanning cache lines.
> > 
> > And what about this?
> > #define WC_MODE_PMEM(wc)                        ((wc)->pmem_mode)
> > 
> > The code I had simply allowed the compiler to optimize out the 
> > persistent-memory code when DM_WRITECACHE_ONLY_SSD is defined - and you 
> > deleted it.
> > 
> > Most architectures don't have persistent memory and the dm-writecache 
> > driver could work in ssd-only mode on them. On these architectures, I 
> > define
> > #define WC_MODE_PMEM(wc)                        false
> > - and the compiler will just automatically remove the tests for that 
> > condition and the unused branch. It also eliminates unused static 
> > functions.
> 
> This level of microoptimization can be backfilled.  But as it was, there
> were too many #defines.  And I'm really not concerned with eliminating
> unused static functions for this case.

I don't see why "too many defines" would be a problem.

If I compile it with and without pmem support, the difference is 
15 kB vs. 12 kB. If we look at just one function (writecache_map), the 
difference is 1595 bytes vs. 1280 bytes. So, it produces real savings 
in code size.

The performance problem is not caused by a condition that always jumps 
the same way (that is predicted by the CPU and causes no delays in the 
pipeline) - the problem is that a bigger function consumes more i-cache. 
There is no reason to include code that can't be executed.


Note that we should also redefine pmem_assign on architectures that don't 
support persistent memory:
#ifndef DM_WRITECACHE_ONLY_SSD
#define pmem_assign(dest, src)                                          \
do {                                                                    \
        typeof(dest) uniq = (src);                                      \
        memcpy_flushcache(&(dest), &uniq, sizeof(dest));                \
} while (0)
#else
#define pmem_assign(dest, src)          ((dest) = (src))
#endif

I.e. we should not call memcpy_flushcache if we can't have persistent 
memory. Cache flushing is slow and we should not do it if we don't have 
to.
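
For clarity, the flushcache form of the macro expands roughly like this for
a 64-bit field (e and seq are illustrative names, not from the driver):

	/* pmem_assign(e->seq, seq + 1) behaves like: */
	u64 uniq = seq + 1;		/* src evaluated once, converted to the field type */
	memcpy_flushcache(&e->seq, &uniq, sizeof(e->seq));	/* store pushed out past the CPU cache */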

Mikulas

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-30 15:58                       ` Dan Williams
  0 siblings, 0 replies; 108+ messages in thread
From: Dan Williams @ 2018-05-30 15:58 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: device-mapper development, Mike Snitzer, linux-nvdimm

On Wed, May 30, 2018 at 6:07 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
> On Mon, 28 May 2018, Dan Williams wrote:
>
>> On Mon, May 28, 2018 at 6:32 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>> >
>> >
>> > On Sat, 26 May 2018, Dan Williams wrote:
>> >
>> >> On Sat, May 26, 2018 at 12:02 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>> >> >
>> >> >
>> >> > On Fri, 25 May 2018, Dan Williams wrote:
>> >> >
>> >> >> On Fri, May 25, 2018 at 5:51 AM, Mike Snitzer <snitzer@redhat.com> wrote:
>> >> >> > On Fri, May 25 2018 at  2:17am -0400,
>> >> >> > Mikulas Patocka <mpatocka@redhat.com> wrote:
>> >> >> >
>> >> >> >> On Thu, 24 May 2018, Dan Williams wrote:
>> >> >> >>
>> >> >> >> > I don't want to grow driver-local wrappers for pmem. You should use
>> >> >> >> > memcpy_flushcache directly() and if an architecture does not define
>> >> >> >> > memcpy_flushcache() then don't allow building dm-writecache, i.e. this
>> >> >> >> > driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't
>> >> >> >> > see a need to add a standalone flush operation if all relevant archs
>> >> >> >> > provide memcpy_flushcache(). As for commit, I'd say just use wmb()
>> >> >> >> > directly since all archs define it. Alternatively we could introduce
>> >> >> >> > memcpy_flushcache_relaxed() to be the un-ordered version of the copy
>> >> >> >> > routine and memcpy_flushcache() would imply a wmb().
>> >> >> >>
>> >> >> >> But memcpy_flushcache() on ARM64 is slow.
>> >> >>
>> >> >> Right, so again, what is wrong with memcpy_flushcache_relaxed() +
>> >> >> wmb() or otherwise making memcpy_flushcache() ordered. I do not see
>> >> >> that as a trailblazing requirement, I see that as typical review and a
>> >> >> reduction of the operation space that you are proposing.
>> >> >
>> >> > memcpy_flushcache on ARM64 is generally the wrong thing to do, because it is
>> >> > slower than memcpy and an explicit cache flush.
>> >> >
>> >> > Suppose that you want to write data to a block device and make it
>> >> > persistent. So you send a WRITE bio and then a FLUSH bio.
>> >> >
>> >> > Now - how to implement these two bios on persistent memory:
>> >> >
>> >> > On X86, the WRITE bio does memcpy_flushcache() and the FLUSH bio does
>> >> > wmb() - this is the optimal implementation.
>> >> >
>> >> > But on ARM64, memcpy_flushcache() is suboptimal. On ARM64, the optimal
>> >> > implementation is that the WRITE bio does just memcpy() and the FLUSH bio
>> >> > does arch_wb_cache_pmem() on the affected range.
>> >> >
>> >> > Why is memcpy_flushcache() suboptimal on ARM? The ARM architecture 
>> >> > doesn't have non-temporal stores. So, memcpy_flushcache() is implemented
>> >> > as memcpy() followed by a cache flush.
>> >> >
>> >> > Now - if you flush the cache immediately after memcpy, the cache is full
>> >> > of dirty lines and the cache-flushing code has to write these lines back
>> >> > and that is slow.
>> >> >
>> >> > If you flush the cache some time after memcpy (i.e. when the FLUSH bio is
>> >> > received), the processor already flushed some part of the cache on its
>> >> > own, so the cache-flushing function has less work to do and it is faster.
>> >> >
>> >> > So the conclusion is - don't use memcpy_flushcache on ARM. This problem
>> >> > cannot be fixed by a better implementation of memcpy_flushcache.
>> >>
>> >> It sounds like ARM might be better off with mapping its pmem as
>> >> write-through rather than write-back, and skip the explicit cache
>> >
>> > I doubt it would perform well - write combining combines the writes into
>> > larger segments - and write-through doesn't.
>> >
>>
>> Last I checked, write-through caching does not disable write combining.
>>
>> >> management altogether. You speak of "optimal" and "sub-optimal", but
>> >> what would be more clear is fio measurements of the relative IOPs and
>> >> latency profiles of the different approaches. The reason I am
>> >> continuing to push here is that reducing the operation space from
>> >> 'copy-flush-commit' to just 'copy' or 'copy-commit' simplifies the
>> >> maintenance long term.
>> >
>> > I measured it (with nvme backing store) and late cache flushing has 12%
>> > better performance than eager flushing with memcpy_flushcache().
>>
>> I assume what you're seeing is ARM64 over-flushing the amount of dirty
>> data so it becomes more efficient to do an amortized flush at the end?
>> However, that effectively makes memcpy_flushcache() unusable in the
>> way it can be used on x86. You claimed that ARM does not support
>> non-temporal stores, but it does, see the STNP instruction. I do not
>> want to see arch specific optimizations in drivers, so either
>> write-through mappings is a potential answer to remove the need to
>> explicitly manage flushing, or just implement STNP hacks in
>> memcpy_flushcache() like you did with MOVNT on x86.
>>
>> > 131836 4k iops - vs - 117016.
>>
>> To be clear this is memcpy_flushcache() vs memcpy + flush?
>
> I found out what caused the difference. I used dax_flush on the version of
> dm-writecache that I had on the ARM machine (with the kernel 4.14, because
> it is the last version where dax on ramdisk works) - and I thought that
> dax_flush flushes the cache, but it doesn't.
>
> When I replaced dax_flush with arch_wb_cache_pmem, the performance
> difference between early flushing and late flushing disappeared.
>
> So I think we can remove this per-architecture switch from dm-writecache.

Great find! Thanks for the due diligence. Feel free to add:

    Acked-by: Dan Williams <dan.j.williams@intel.com>

...on the reworks to unify ARM and x86.
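
For reference, the two strategies weighed in the quoted discussion, as a
minimal sketch (the function names are the ones used in the thread; bio
plumbing and error handling are omitted):

	/* Eager, x86-style: non-temporal copy per WRITE, barrier on FLUSH */
	memcpy_flushcache(pmem_dst, buf, len);	/* WRITE bio */
	wmb();					/* FLUSH bio */

	/* Late, the ARM64 variant measured above: plain copy per WRITE,
	   write back the dirtied range only when the FLUSH bio arrives */
	memcpy(pmem_dst, buf, len);		/* WRITE bio */
	arch_wb_cache_pmem(pmem_dst, len);	/* FLUSH bio */
	wmb();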

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-30 22:39                         ` Dan Williams
  0 siblings, 0 replies; 108+ messages in thread
From: Dan Williams @ 2018-05-30 22:39 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: device-mapper development, Mike Snitzer, linux-nvdimm

On Wed, May 30, 2018 at 8:58 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Wed, May 30, 2018 at 6:07 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>>
>>
>> On Mon, 28 May 2018, Dan Williams wrote:
>>
>>> On Mon, May 28, 2018 at 6:32 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>>> >
>>> >
>>> > On Sat, 26 May 2018, Dan Williams wrote:
>>> >
>>> >> On Sat, May 26, 2018 at 12:02 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>>> >> >
>>> >> >
>>> >> > On Fri, 25 May 2018, Dan Williams wrote:
>>> >> >
>>> >> >> On Fri, May 25, 2018 at 5:51 AM, Mike Snitzer <snitzer@redhat.com> wrote:
>>> >> >> > On Fri, May 25 2018 at  2:17am -0400,
>>> >> >> > Mikulas Patocka <mpatocka@redhat.com> wrote:
>>> >> >> >
>>> >> >> >> On Thu, 24 May 2018, Dan Williams wrote:
>>> >> >> >>
>>> >> >> >> > I don't want to grow driver-local wrappers for pmem. You should use
>>> >> >> >> > memcpy_flushcache directly() and if an architecture does not define
>>> >> >> >> > memcpy_flushcache() then don't allow building dm-writecache, i.e. this
>>> >> >> >> > driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't
>>> >> >> >> > see a need to add a standalone flush operation if all relevant archs
>>> >> >> >> > provide memcpy_flushcache(). As for commit, I'd say just use wmb()
>>> >> >> >> > directly since all archs define it. Alternatively we could introduce
>>> >> >> >> > memcpy_flushcache_relaxed() to be the un-ordered version of the copy
>>> >> >> >> > routine and memcpy_flushcache() would imply a wmb().
>>> >> >> >>
>>> >> >> >> But memcpy_flushcache() on ARM64 is slow.
>>> >> >>
>>> >> >> Right, so again, what is wrong with memcpy_flushcache_relaxed() +
>>> >> >> wmb() or otherwise making memcpy_flushcache() ordered. I do not see
>>> >> >> that as a trailblazing requirement, I see that as typical review and a
>>> >> >> reduction of the operation space that you are proposing.
>>> >> >
>>> >> > memcpy_flushcache on ARM64 is generally the wrong thing to do, because it is
>>> >> > slower than memcpy and an explicit cache flush.
>>> >> >
>>> >> > Suppose that you want to write data to a block device and make it
>>> >> > persistent. So you send a WRITE bio and then a FLUSH bio.
>>> >> >
>>> >> > Now - how to implement these two bios on persistent memory:
>>> >> >
>>> >> > On X86, the WRITE bio does memcpy_flushcache() and the FLUSH bio does
>>> >> > wmb() - this is the optimal implementation.
>>> >> >
>>> >> > But on ARM64, memcpy_flushcache() is suboptimal. On ARM64, the optimal
>>> >> > implementation is that the WRITE bio does just memcpy() and the FLUSH bio
>>> >> > does arch_wb_cache_pmem() on the affected range.
>>> >> >
>>> >> > Why is memcpy_flushcache() suboptimal on ARM? The ARM architecture 
>>> >> > doesn't have non-temporal stores. So, memcpy_flushcache() is implemented
>>> >> > as memcpy() followed by a cache flush.
>>> >> >
>>> >> > Now - if you flush the cache immediately after memcpy, the cache is full
>>> >> > of dirty lines and the cache-flushing code has to write these lines back
>>> >> > and that is slow.
>>> >> >
>>> >> > If you flush the cache some time after memcpy (i.e. when the FLUSH bio is
>>> >> > received), the processor already flushed some part of the cache on its
>>> >> > own, so the cache-flushing function has less work to do and it is faster.
>>> >> >
>>> >> > So the conclusion is - don't use memcpy_flushcache on ARM. This problem
>>> >> > cannot be fixed by a better implementation of memcpy_flushcache.
>>> >>
>>> >> It sounds like ARM might be better off with mapping its pmem as
>>> >> write-through rather than write-back, and skip the explicit cache
>>> >
>>> > I doubt it would perform well - write combining combines the writes into
>>> > larger segments - and write-through doesn't.
>>> >
>>>
>>> Last I checked, write-through caching does not disable write combining.
>>>
>>> >> management altogether. You speak of "optimal" and "sub-optimal", but
>>> >> what would be more clear is fio measurements of the relative IOPs and
>>> >> latency profiles of the different approaches. The reason I am
>>> >> continuing to push here is that reducing the operation space from
>>> >> 'copy-flush-commit' to just 'copy' or 'copy-commit' simplifies the
>>> >> maintenance long term.
>>> >
>>> > I measured it (with nvme backing store) and late cache flushing has 12%
>>> > better performance than eager flushing with memcpy_flushcache().
>>>
>>> I assume what you're seeing is ARM64 over-flushing the amount of dirty
>>> data so it becomes more efficient to do an amortized flush at the end?
>>> However, that effectively makes memcpy_flushcache() unusable in the
>>> way it can be used on x86. You claimed that ARM does not support
>>> non-temporal stores, but it does, see the STNP instruction. I do not
>>> want to see arch specific optimizations in drivers, so either
>>> write-through mappings is a potential answer to remove the need to
>>> explicitly manage flushing, or just implement STNP hacks in
>>> memcpy_flushcache() like you did with MOVNT on x86.
>>>
>>> > 131836 4k iops - vs - 117016.
>>>
>>> To be clear this is memcpy_flushcache() vs memcpy + flush?
>>
>> I found out what caused the difference. I used dax_flush on the version of
>> dm-writecache that I had on the ARM machine (with the kernel 4.14, because
>> it is the last version where dax on ramdisk works) - and I thought that
>> dax_flush flushes the cache, but it doesn't.
>>
>> When I replaced dax_flush with arch_wb_cache_pmem, the performance
>> difference between early flushing and late flushing disappeared.
>>
>> So I think we can remove this per-architecture switch from dm-writecache.
>
> Great find! Thanks for the due diligence. Feel free to add:
>
>     Acked-by: Dan Williams <dan.j.williams@intel.com>
>
> ...on the reworks to unify ARM and x86.

One more note. The side effect of not using dax_flush() is that you
may end up flushing caches on systems where the platform has asserted
it will take responsibility for flushing caches at power loss. If/when
those systems become more prevalent, we may want to think of a way
to combine the non-temporal optimization and the cache-flush-bypass
optimizations. However, that is something that can wait for a later
change beyond 4.18.
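
A hypothetical sketch of such a combined policy - dax_write_cache_enabled()
does exist in the dax core, but treating it as the "platform flushes caches
at power loss" signal here is an assumption:

	if (dax_write_cache_enabled(wc->ssd_dev->dax_dev))
		memcpy_flushcache(dst, src, len);	/* CPU cache contents are at risk */
	else
		memcpy(dst, src, len);			/* platform-protected: skip the flush */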

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-30 22:39                         ` Dan Williams
  0 siblings, 0 replies; 108+ messages in thread
From: Dan Williams @ 2018-05-30 22:39 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: device-mapper development, Mike Snitzer, linux-nvdimm

On Wed, May 30, 2018 at 8:58 AM, Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> On Wed, May 30, 2018 at 6:07 AM, Mikulas Patocka <mpatocka-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>>
>>
>> On Mon, 28 May 2018, Dan Williams wrote:
>>
>>> On Mon, May 28, 2018 at 6:32 AM, Mikulas Patocka <mpatocka-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>>> >
>>> >
>>> > On Sat, 26 May 2018, Dan Williams wrote:
>>> >
>>> >> On Sat, May 26, 2018 at 12:02 AM, Mikulas Patocka <mpatocka-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>>> >> >
>>> >> >
>>> >> > On Fri, 25 May 2018, Dan Williams wrote:
>>> >> >
>>> >> >> On Fri, May 25, 2018 at 5:51 AM, Mike Snitzer <snitzer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>>> >> >> > On Fri, May 25 2018 at  2:17am -0400,
>>> >> >> > Mikulas Patocka <mpatocka-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>>> >> >> >
>>> >> >> >> On Thu, 24 May 2018, Dan Williams wrote:
>>> >> >> >>
>>> >> >> >> > I don't want to grow driver-local wrappers for pmem. You should use
>>> >> >> >> > memcpy_flushcache directly() and if an architecture does not define
>>> >> >> >> > memcpy_flushcache() then don't allow building dm-writecache, i.e. this
>>> >> >> >> > driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't
>>> >> >> >> > see a need to add a standalone flush operation if all relevant archs
>>> >> >> >> > provide memcpy_flushcache(). As for commit, I'd say just use wmb()
>>> >> >> >> > directly since all archs define it. Alternatively we could introduce
>>> >> >> >> > memcpy_flushcache_relaxed() to be the un-ordered version of the copy
>>> >> >> >> > routine and memcpy_flushcache() would imply a wmb().
>>> >> >> >>
>>> >> >> >> But memcpy_flushcache() on ARM64 is slow.
>>> >> >>
>>> >> >> Right, so again, what is wrong with memcpy_flushcache_relaxed() +
>>> >> >> wmb() or otherwise making memcpy_flushcache() ordered. I do not see
>>> >> >> that as a trailblazing requirement, I see that as typical review and a
>>> >> >> reduction of the operation space that you are proposing.
>>> >> >
>>> >> > memcpy_flushcache on ARM64 is generally wrong thing to do, because it is
>>> >> > slower than memcpy and explicit cache flush.
>>> >> >
>>> >> > Suppose that you want to write data to a block device and make it
>>> >> > persistent. So you send a WRITE bio and then a FLUSH bio.
>>> >> >
>>> >> > Now - how to implement these two bios on persistent memory:
>>> >> >
>>> >> > On X86, the WRITE bio does memcpy_flushcache() and the FLUSH bio does
>>> >> > wmb() - this is the optimal implementation.
>>> >> >
>>> >> > But on ARM64, memcpy_flushcache() is suboptimal. On ARM64, the optimal
>>> >> > implementation is that the WRITE bio does just memcpy() and the FLUSH bio
>>> >> > does arch_wb_cache_pmem() on the affected range.
>>> >> >
>>> >> > Why is memcpy_flushcache() is suboptimal on ARM? The ARM architecture
>>> >> > doesn't have non-temporal stores. So, memcpy_flushcache() is implemented
>>> >> > as memcpy() followed by a cache flush.
>>> >> >
>>> >> > Now - if you flush the cache immediatelly after memcpy, the cache is full
>>> >> > of dirty lines and the cache-flushing code has to write these lines back
>>> >> > and that is slow.
>>> >> >
>>> >> > If you flush the cache some time after memcpy (i.e. when the FLUSH bio is
>>> >> > received), the processor already flushed some part of the cache on its
>>> >> > own, so the cache-flushing function has less work to do and it is faster.
>>> >> >
>>> >> > So the conclusion is - don't use memcpy_flushcache on ARM. This problem
>>> >> > cannot be fixed by a better implementation of memcpy_flushcache.
>>> >>
>>> >> It sounds like ARM might be better off with mapping its pmem as
>>> >> write-through rather than write-back, and skip the explicit cache
>>> >
>>> > I doubt it would perform well - write combining combines the writes into a
>>> > larger segments - and write-through doesn't.
>>> >
>>>
>>> Last I checked write-through caching does not disable write combining
>>>
>>> >> management altogether. You speak of "optimal" and "sub-optimal", but
>>> >> what would be more clear is fio measurements of the relative IOPs and
>>> >> latency profiles of the different approaches. The reason I am
>>> >> continuing to push here is that reducing the operation space from
>>> >> 'copy-flush-commit' to just 'copy' or 'copy-commit' simplifies the
>>> >> maintenance long term.
>>> >
>>> > I measured it (with nvme backing store) and late cache flushing has 12%
>>> > better performance than eager flushing with memcpy_flushcache().
>>>
>>> I assume what you're seeing is ARM64 over-flushing the amount of dirty
>>> data so it becomes more efficient to do an amortized flush at the end?
>>> However, that effectively makes memcpy_flushcache() unusable in the
>>> way it can be used on x86. You claimed that ARM does not support
>>> non-temporal stores, but it does, see the STNP instruction. I do not
>>> want to see arch specific optimizations in drivers, so either
>>> write-through mappings is a potential answer to remove the need to
>>> explicitly manage flushing, or just implement STNP hacks in
>>> memcpy_flushcache() like you did with MOVNT on x86.
>>>
>>> > 131836 4k iops - vs - 117016.
>>>
>>> To be clear this is memcpy_flushcache() vs memcpy + flush?
>>
>> I found out what caused the difference. I used dax_flush on the version of
>> dm-writecache that I had on the ARM machine (with the kernel 4.14, because
>> it is the last version where dax on ramdisk works) - and I thought that
>> dax_flush flushes the cache, but it doesn't.
>>
>> When I replaced dax_flush with arch_wb_cache_pmem, the performance
>> difference between early flushing and late flushing disappeared.
>>
>> So I think we can remove this per-architecture switch from dm-writecache.
>
> Great find! Thanks for the due diligence. Feel free to add:
>
>     Acked-by: Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
>
> ...on the reworks to unify ARM and x86.

One more note. The side effect of not using dax_flush() is that you
may end up flushing caches on systems where the platform has asserted
it will take responsibility for flushing caches at power loss. If /
when those systems become more prevalent we may want to think of a way
to combine the non-temporal optimization and the cache-flush-bypass
optimizations. However that is something that can wait for a later
change beyond 4.18.
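
For context, dax_flush() in this time frame reduces to roughly the following
(abridged from drivers/dax/super.c; check the tree for the exact form), which
is why it becomes a no-op when the platform asserts it flushes caches at
power loss:

	void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
	{
		if (dax_write_cache_enabled(dax_dev))
			arch_wb_cache_pmem(addr, size);
	}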

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-31  3:39                                   ` Mike Snitzer
  0 siblings, 0 replies; 108+ messages in thread
From: Mike Snitzer @ 2018-05-31  3:39 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: device-mapper development, linux-nvdimm

On Wed, May 30 2018 at 10:09P -0400,
Mikulas Patocka <mpatocka@redhat.com> wrote:
 
> And what about this?
> #define WC_MODE_PMEM(wc)                        ((wc)->pmem_mode)
> 
> The code that I had just allowed the compiler to optimize out 
> persistent-memory code if we have DM_WRITECACHE_ONLY_SSD defined - and you 
> deleted it.
> 
> Most architectures don't have persistent memory and the dm-writecache 
> driver could work in ssd-only mode on them. On these architectures, I 
> define
> #define WC_MODE_PMEM(wc)                        false
> - and the compiler will just automatically remove the tests for that 
> condition and the unused branch. It does also eliminate unused static 
> functions.

Here is the patch that I just folded into the rebased version of
dm-writecache now in the dm-4.18 branch.

(I rebased on top of Jens' latest block tree for 4.18 that now includes
the mempool_init changes, etc.)

---
 drivers/md/dm-writecache.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
index fcbfaf7c27ec..691b5ffb799f 100644
--- a/drivers/md/dm-writecache.c
+++ b/drivers/md/dm-writecache.c
@@ -34,13 +34,21 @@
 #define BITMAP_GRANULARITY	PAGE_SIZE
 #endif
 
+#if defined(CONFIG_ARCH_HAS_PMEM_API) && IS_ENABLED(CONFIG_DAX_DRIVER)
+#define DM_WRITECACHE_HAS_PMEM
+#endif
+
+#ifdef DM_WRITECACHE_HAS_PMEM
 #define pmem_assign(dest, src)					\
 do {								\
 	typeof(dest) uniq = (src);				\
 	memcpy_flushcache(&(dest), &uniq, sizeof(dest));	\
 } while (0)
+#else
+#define pmem_assign(dest, src)	((dest) = (src))
+#endif
 
-#if defined(__HAVE_ARCH_MEMCPY_MCSAFE) && defined(CONFIG_ARCH_HAS_PMEM_API)
+#if defined(__HAVE_ARCH_MEMCPY_MCSAFE) && defined(DM_WRITECACHE_HAS_PMEM)
 #define DM_WRITECACHE_HANDLE_HARDWARE_ERRORS
 #endif
 
@@ -87,8 +95,13 @@ struct wc_entry {
 #endif
 };
 
+#ifdef DM_WRITECACHE_HAS_PMEM
 #define WC_MODE_PMEM(wc)			((wc)->pmem_mode)
 #define WC_MODE_FUA(wc)				((wc)->writeback_fua)
+#else
+#define WC_MODE_PMEM(wc)			false
+#define WC_MODE_FUA(wc)				false
+#endif
 #define WC_MODE_SORT_FREELIST(wc)		(!WC_MODE_PMEM(wc))
 
 struct dm_writecache {
@@ -1857,7 +1870,7 @@ static int writecache_ctr(struct dm_target *ti, unsigned argc, char **argv)
 	if (!strcasecmp(string, "s")) {
 		wc->pmem_mode = false;
 	} else if (!strcasecmp(string, "p")) {
-#if defined(CONFIG_ARCH_HAS_PMEM_API) && IS_ENABLED(CONFIG_DAX_DRIVER)
+#ifdef DM_WRITECACHE_HAS_PMEM
 		wc->pmem_mode = true;
 		wc->writeback_fua = true;
 #else
-- 
2.15.0
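
The pattern under discussion is generic; a standalone sketch (all names below
are made up) of how a constant-false macro lets the compiler drop the pmem
branch, and with it any static function reachable only from that branch:

	struct cache { bool pmem_mode; };

	#ifdef HAS_PMEM				/* hypothetical switch */
	#define MODE_PMEM(c)	((c)->pmem_mode)
	#else
	#define MODE_PMEM(c)	false
	#endif

	static void pmem_commit(struct cache *c) { /* pmem-only path */ }
	static void ssd_commit(struct cache *c)  { /* ssd path */ }

	static void commit(struct cache *c)
	{
		if (MODE_PMEM(c))	/* constant false without HAS_PMEM */
			pmem_commit(c);	/* branch and callee eliminated */
		else
			ssd_commit(c);
	}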

^ permalink raw reply related	[flat|nested] 108+ messages in thread

* Re: [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-31  3:42                                       ` Mike Snitzer
  0 siblings, 0 replies; 108+ messages in thread
From: Mike Snitzer @ 2018-05-31  3:42 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: device-mapper development, linux-nvdimm

On Wed, May 30 2018 at 10:46P -0400,
Mikulas Patocka <mpatocka@redhat.com> wrote:

> 
> 
> On Wed, 30 May 2018, Mike Snitzer wrote:
> 
> > > > Fine I'll deal with it.  reordering the fields eliminated holes in the
> > > > structure and reduced struct members spanning cache lines.
> > > 
> > > And what about this?
> > > #define WC_MODE_PMEM(wc)                        ((wc)->pmem_mode)
> > > 
> > > The code that I had just allowed the compiler to optimize out 
> > > persistent-memory code if we have DM_WRITECACHE_ONLY_SSD defined - and you 
> > > deleted it.
> > > 
> > > Most architectures don't have persistent memory and the dm-writecache 
> > > driver could work in ssd-only mode on them. On these architectures, I 
> > > define
> > > #define WC_MODE_PMEM(wc)                        false
> > > - and the compiler will just automatically remove the tests for that 
> > > condition and the unused branch. It does also eliminate unused static 
> > > functions.
> > 
> > This level of microoptimization can be backfilled.  But as it was, there
> > were too many #defines.  And I'm really not concerned with eliminating
> > unused static functions for this case.
> 
> I don't see why "too many defines" would be a problem.
> 
> If I compile it with and without pmem support, the difference is 
> 15kB-vs-12kB. If we look at just one function (writecache_map), the 
> difference is 1595 bytes - vs - 1280 bytes. So, it produces real savings 
> in code size.
> 
> The problem with performance is not caused by a condition that always jumps 
> the same way (that is predicted by the CPU and it causes no delays in the 
> pipeline) - the problem is that a bigger function consumes more i-cache. 
> There is no reason to include code that can't be executed.

Please double check you see the reduced code size you're expecting using
the latest dm-writecache.c in linux-dm.git's dm-4.18 branch.

Thanks,
Mike

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-31  8:16                                     ` Mikulas Patocka
  0 siblings, 0 replies; 108+ messages in thread
From: Mikulas Patocka @ 2018-05-31  8:16 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: device-mapper development, linux-nvdimm



On Wed, 30 May 2018, Mike Snitzer wrote:

> On Wed, May 30 2018 at 10:09P -0400,
> Mikulas Patocka <mpatocka@redhat.com> wrote:
>  
> > And what about this?
> > #define WC_MODE_PMEM(wc)                        ((wc)->pmem_mode)
> > 
> > The code that I had just allowed the compiler to optimize out 
> > persistent-memory code if we have DM_WRITECACHE_ONLY_SSD defined - and you 
> > deleted it.
> > 
> > Most architectures don't have persistent memory and the dm-writecache 
> > driver could work in ssd-only mode on them. On these architectures, I 
> > define
> > #define WC_MODE_PMEM(wc)                        false
> > - and the compiler will just automatically remove the tests for that 
> > condition and the unused branch. It does also eliminate unused static 
> > functions.
> 
> Here is the patch that I just folded into the rebased version of
> dm-writecache now in the dm-4.18 branch.
> 
> (I rebased on top of Jens' latest block tree for 4.18 that now includes
> the mempool_init changes, etc.)
> 
> ---
>  drivers/md/dm-writecache.c | 17 +++++++++++++++--
>  1 file changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
> index fcbfaf7c27ec..691b5ffb799f 100644
> --- a/drivers/md/dm-writecache.c
> +++ b/drivers/md/dm-writecache.c
> @@ -34,13 +34,21 @@
>  #define BITMAP_GRANULARITY	PAGE_SIZE
>  #endif
>  
> +#if defined(CONFIG_ARCH_HAS_PMEM_API) && IS_ENABLED(CONFIG_DAX_DRIVER)
> +#define DM_WRITECACHE_HAS_PMEM
> +#endif
> +
> +#ifdef DM_WRITECACHE_HAS_PMEM
>  #define pmem_assign(dest, src)					\
>  do {								\
>  	typeof(dest) uniq = (src);				\
>  	memcpy_flushcache(&(dest), &uniq, sizeof(dest));	\
>  } while (0)
> +#else
> > +#define pmem_assign(dest, src)	((dest) = (src))
> +#endif
>  
> -#if defined(__HAVE_ARCH_MEMCPY_MCSAFE) && defined(CONFIG_ARCH_HAS_PMEM_API)
> +#if defined(__HAVE_ARCH_MEMCPY_MCSAFE) && defined(DM_WRITECACHE_HAS_PMEM)
>  #define DM_WRITECACHE_HANDLE_HARDWARE_ERRORS
>  #endif
>  
> @@ -87,8 +95,13 @@ struct wc_entry {
>  #endif
>  };
>  
> +#ifdef DM_WRITECACHE_HAS_PMEM
>  #define WC_MODE_PMEM(wc)			((wc)->pmem_mode)
>  #define WC_MODE_FUA(wc)				((wc)->writeback_fua)
> +#else
> +#define WC_MODE_PMEM(wc)			false
> +#define WC_MODE_FUA(wc)				false
> +#endif
>  #define WC_MODE_SORT_FREELIST(wc)		(!WC_MODE_PMEM(wc))
>  
>  struct dm_writecache {
> @@ -1857,7 +1870,7 @@ static int writecache_ctr(struct dm_target *ti, unsigned argc, char **argv)
>  	if (!strcasecmp(string, "s")) {
>  		wc->pmem_mode = false;
>  	} else if (!strcasecmp(string, "p")) {
> -#if defined(CONFIG_ARCH_HAS_PMEM_API) && IS_ENABLED(CONFIG_DAX_DRIVER)
> +#ifdef DM_WRITECACHE_HAS_PMEM
>  		wc->pmem_mode = true;
>  		wc->writeback_fua = true;
>  #else
> -- 
> 2.15.0

OK.

I think that persistent_memory_claim should also be conditioned based on 
DM_WRITECACHE_HAS_PMEM - i.e. if we have DM_WRITECACHE_HAS_PMEM, we don't 
need to use other ifdefs.

Is there some difference between "#if IS_ENABLED(CONFIG_OPTION)" and
"#if defined(CONFIG_OPTION)"?

Mikulas

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-31  8:19                           ` Mikulas Patocka
  0 siblings, 0 replies; 108+ messages in thread
From: Mikulas Patocka @ 2018-05-31  8:19 UTC (permalink / raw)
  To: Dan Williams; +Cc: device-mapper development, Mike Snitzer, linux-nvdimm



On Wed, 30 May 2018, Dan Williams wrote:

> > Great find! Thanks for the due diligence. Feel free to add:
> >
> >     Acked-by: Dan Williams <dan.j.williams@intel.com>
> >
> > ...on the reworks to unify ARM and x86.
> 
> One more note. The side effect of not using dax_flush() is that you
> may end up flushing caches on systems where the platform has asserted
> it will take responsibility for flushing caches at power loss. If /
> when those systems become more prevalent we may want to think of a way
> to combine the non-temporal optimization and the cache-flush-bypass
> optimizations. However that is something that can wait for a later
> change beyond 4.18.

We could define memcpy_flushpmem, that falls back to memcpy or 
memcpy_flushcache, depending on whether the platform flushes the caches at 
power loss or not.

Mikulas
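
A minimal sketch of that idea, assuming a hypothetical global flag reported
by platform firmware (neither the flag nor the helper exists in the tree at
this point):

	extern bool platform_flushes_cache_on_power_loss;	/* assumed */

	static inline void memcpy_flushpmem(void *dst, const void *src, size_t n)
	{
		if (platform_flushes_cache_on_power_loss)
			memcpy(dst, src, n);		/* cache contents survive */
		else
			memcpy_flushcache(dst, src, n);	/* explicit write-back */
	}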

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-31 12:09                                       ` Mike Snitzer
  0 siblings, 0 replies; 108+ messages in thread
From: Mike Snitzer @ 2018-05-31 12:09 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: device-mapper development, linux-nvdimm

On Thu, May 31 2018 at  4:16am -0400,
Mikulas Patocka <mpatocka@redhat.com> wrote:

> 
> 
> On Wed, 30 May 2018, Mike Snitzer wrote:
> 
> > On Wed, May 30 2018 at 10:09P -0400,
> > Mikulas Patocka <mpatocka@redhat.com> wrote:
> >  
> > > And what about this?
> > > #define WC_MODE_PMEM(wc)                        ((wc)->pmem_mode)
> > > 
> > > The code that I had just allowed the compiler to optimize out 
> > > persistent-memory code if we have DM_WRITECACHE_ONLY_SSD defined - and you 
> > > deleted it.
> > > 
> > > Most architectures don't have persistent memory and the dm-writecache 
> > > driver could work in ssd-only mode on them. On these architectures, I 
> > > define
> > > #define WC_MODE_PMEM(wc)                        false
> > > - and the compiler will just automatically remove the tests for that 
> > > condition and the unused branch. It does also eliminate unused static 
> > > functions.
> > 
> > Here is the patch that I just folded into the rebased version of
> > dm-writecache now in the dm-4.18 branch.
> > 
> > (I rebased on top of Jens' latest block tree for 4.18 that now includes
> > the mempool_init changes, etc.)
> > 
> > ---
> >  drivers/md/dm-writecache.c | 17 +++++++++++++++--
> >  1 file changed, 15 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
> > index fcbfaf7c27ec..691b5ffb799f 100644
> > --- a/drivers/md/dm-writecache.c
> > +++ b/drivers/md/dm-writecache.c
> > @@ -34,13 +34,21 @@
> >  #define BITMAP_GRANULARITY	PAGE_SIZE
> >  #endif
> >  
> > +#if defined(CONFIG_ARCH_HAS_PMEM_API) && IS_ENABLED(CONFIG_DAX_DRIVER)
> > +#define DM_WRITECACHE_HAS_PMEM
> > +#endif
> > +
> > +#ifdef DM_WRITECACHE_HAS_PMEM
> >  #define pmem_assign(dest, src)					\
> >  do {								\
> >  	typeof(dest) uniq = (src);				\
> >  	memcpy_flushcache(&(dest), &uniq, sizeof(dest));	\
> >  } while (0)
> > +#else
> > +#define pmem_assign(dest, src)	((dest) = (src))
> > +#endif
> >  
> > -#if defined(__HAVE_ARCH_MEMCPY_MCSAFE) && defined(CONFIG_ARCH_HAS_PMEM_API)
> > +#if defined(__HAVE_ARCH_MEMCPY_MCSAFE) && defined(DM_WRITECACHE_HAS_PMEM)
> >  #define DM_WRITECACHE_HANDLE_HARDWARE_ERRORS
> >  #endif
> >  
> > @@ -87,8 +95,13 @@ struct wc_entry {
> >  #endif
> >  };
> >  
> > +#ifdef DM_WRITECACHE_HAS_PMEM
> >  #define WC_MODE_PMEM(wc)			((wc)->pmem_mode)
> >  #define WC_MODE_FUA(wc)				((wc)->writeback_fua)
> > +#else
> > +#define WC_MODE_PMEM(wc)			false
> > +#define WC_MODE_FUA(wc)				false
> > +#endif
> >  #define WC_MODE_SORT_FREELIST(wc)		(!WC_MODE_PMEM(wc))
> >  
> >  struct dm_writecache {
> > @@ -1857,7 +1870,7 @@ static int writecache_ctr(struct dm_target *ti, unsigned argc, char **argv)
> >  	if (!strcasecmp(string, "s")) {
> >  		wc->pmem_mode = false;
> >  	} else if (!strcasecmp(string, "p")) {
> > -#if defined(CONFIG_ARCH_HAS_PMEM_API) && IS_ENABLED(CONFIG_DAX_DRIVER)
> > +#ifdef DM_WRITECACHE_HAS_PMEM
> >  		wc->pmem_mode = true;
> >  		wc->writeback_fua = true;
> >  #else
> > -- 
> > 2.15.0
> 
> OK.
> 
> I think that persistent_memory_claim should also be conditioned based on 
> DM_WRITECACHE_HAS_PMEM - i.e. if we have DM_WRITECACHE_HAS_PMEM, we don't 
> need to use other ifdefs.
> 
> Is there some difference between "#if IS_ENABLED(CONFIG_OPTION)" and
> "#if defined(CONFIG_OPTION)"?

Not that I'm aware of:
  #define IS_ENABLED(option) __or(IS_BUILTIN(option), IS_MODULE(option))
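
For bool options the two behave the same. For a tristate option built as a
module, only IS_ENABLED() is true, because a =m build defines
CONFIG_FOO_MODULE rather than CONFIG_FOO. Abridged from
include/linux/kconfig.h (exact form varies by kernel version):

  #define IS_BUILTIN(option)	__is_defined(option)
  #define IS_MODULE(option)	__is_defined(option##_MODULE)
  #define IS_ENABLED(option)	__or(IS_BUILTIN(option), IS_MODULE(option))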

Here is the incremental patch I just folded in:

diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
index fd3bc232b7d6..f2ae02f22c43 100644
--- a/drivers/md/dm-writecache.c
+++ b/drivers/md/dm-writecache.c
@@ -34,7 +34,7 @@
 #define BITMAP_GRANULARITY     PAGE_SIZE
 #endif

-#if defined(CONFIG_ARCH_HAS_PMEM_API) && IS_ENABLED(CONFIG_DAX_DRIVER)
+#if IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API) && IS_ENABLED(CONFIG_DAX_DRIVER)
 #define DM_WRITECACHE_HAS_PMEM
 #endif

@@ -218,7 +218,7 @@ static void wc_unlock(struct dm_writecache *wc)
        mutex_unlock(&wc->lock);
 }

-#if IS_ENABLED(CONFIG_DAX_DRIVER)
+#ifdef DM_WRITECACHE_HAS_PMEM
 static int persistent_memory_claim(struct dm_writecache *wc)
 {
        int r;

^ permalink raw reply related	[flat|nested] 108+ messages in thread

* Re: [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-31 14:51                             ` Dan Williams
  0 siblings, 0 replies; 108+ messages in thread
From: Dan Williams @ 2018-05-31 14:51 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: device-mapper development, Mike Snitzer, linux-nvdimm

On Thu, May 31, 2018 at 1:19 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
> On Wed, 30 May 2018, Dan Williams wrote:
>
>> > Great find! Thanks for the due diligence. Feel free to add:
>> >
>> >     Acked-by: Dan Williams <dan.j.williams@intel.com>
>> >
>> > ...on the reworks to unify ARM and x86.
>>
>> One more note. The side effect of not using dax_flush() is that you
>> may end up flushing caches on systems where the platform has asserted
>> it will take responsibility for flushing caches at power loss. If /
>> when those systems become more prevalent we may want to think of a way
>> to combine the non-temporal optimization and the cache-flush-bypass
>> optimizations. However that is something that can wait for a later
>> change beyond 4.18.
>
> We could define memcpy_flushpmem, that falls back to memcpy or
> memcpy_flushcache, depending on whether the platform flushes the caches at
> power loss or not.

The problem is that some platforms only power-fail protect a subset of
the physical address range, but yes, if the platform makes a global
assertion we can globally replace memcpy_flushpmem() with plain
memcpy.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-31 15:31                               ` Mikulas Patocka
  0 siblings, 0 replies; 108+ messages in thread
From: Mikulas Patocka @ 2018-05-31 15:31 UTC (permalink / raw)
  To: Dan Williams; +Cc: device-mapper development, Mike Snitzer, linux-nvdimm



On Thu, 31 May 2018, Dan Williams wrote:

> On Thu, May 31, 2018 at 1:19 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> >
> >
> > On Wed, 30 May 2018, Dan Williams wrote:
> >
> >> > Great find! Thanks for the due diligence. Feel free to add:
> >> >
> >> >     Acked-by: Dan Williams <dan.j.williams@intel.com>
> >> >
> >> > ...on the reworks to unify ARM and x86.
> >>
> >> One more note. The side effect of not using dax_flush() is that you
> >> may end up flushing caches on systems where the platform has asserted
> >> it will take responsibility for flushing caches at power loss. If /
> >> when those systems become more prevalent we may want to think of a way
> >> to combine the non-temporal optimization and the cache-flush-bypass
> >> optimizations. However that is something that can wait for a later
> >> change beyond 4.18.
> >
> > We could define memcpy_flushpmem, that falls back to memcpy or
> > memcpy_flushcache, depending on whether the platform flushes the caches at
> > power loss or not.
> 
> The problem is that some platforms only power fail protect a subset of
> the physical address range,

How can this be? A physical address may be cached on any CPU, so either 
there is enough power to flush all the CPUs' caches or there isn't.

What does the CPU design that protects only part of the physical address 
space look like?

> but yes, if the platform makes a global
> assertion we can globally replace memcpy_flushpmem() with plain
> memcpy.

Mikulas

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 4/4] dm-writecache: use new API for flushing
@ 2018-05-31 16:39                                 ` Dan Williams
  0 siblings, 0 replies; 108+ messages in thread
From: Dan Williams @ 2018-05-31 16:39 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: device-mapper development, Mike Snitzer, linux-nvdimm

On Thu, May 31, 2018 at 8:31 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
> On Thu, 31 May 2018, Dan Williams wrote:
>
>> On Thu, May 31, 2018 at 1:19 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>> >
>> >
>> > On Wed, 30 May 2018, Dan Williams wrote:
>> >
>> >> > Great find! Thanks for the due diligence. Feel free to add:
>> >> >
>> >> >     Acked-by: Dan Williams <dan.j.williams@intel.com>
>> >> >
>> >> > ...on the reworks to unify ARM and x86.
>> >>
>> >> One more note. The side effect of not using dax_flush() is that you
>> >> may end up flushing caches on systems where the platform has asserted
>> >> it will take responsibility for flushing caches at power loss. If /
>> >> when those systems become more prevalent we may want to think of a way
>> >> to combine the non-temporal optimization and the cache-flush-bypass
>> >> optimizations. However that is something that can wait for a later
>> >> change beyond 4.18.
>> >
>> > We could define memcpy_flushpmem, that falls back to memcpy or
>> > memcpy_flushcache, depending on whether the platform flushes the caches at
>> > power loss or not.
>>
>> The problem is that some platforms only power fail protect a subset of
>> the physical address range,
>
> How can this be? A psysical address may be cached on any CPU, so either
> there is enough power to flush all the CPUs' caches or there isn't.
>
> How does the CPU design that protects only a part of physical addresses
> look like?

It's not necessarily a CPU problem, it may be a problem of having
enough stored energy to run the software loops needed to flush caches.
There's also the consideration that a general purpose platform may mix
persistent memory technologies from different vendors, where some might
be flash-backed DRAM and some might be directly persistent media.

For now I don't think we need to worry about it, but I don't want to
make the assumption that this property is platform global given the
history of how persistent memory has been deployed to date.
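
A hedged sketch of what a per-range capability lookup might look like if it
were ever needed (everything below is hypothetical; no such interface exists
at this point):

	/* one entry per firmware-described persistent memory range */
	struct pmem_range_caps {
		phys_addr_t	start;
		phys_addr_t	end;
		bool		flush_on_power_loss;	/* platform-asserted */
	};

	static bool range_power_fail_safe(const struct pmem_range_caps *tbl,
					  int n, phys_addr_t addr, size_t len)
	{
		int i;

		for (i = 0; i < n; i++)
			if (addr >= tbl[i].start && addr + len <= tbl[i].end)
				return tbl[i].flush_on_power_loss;
		return false;	/* unknown ranges take the conservative path */
	}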

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [patch 4/4] dm-writecache: use new API for flushing
@ 2018-06-03 15:03                                         ` Mikulas Patocka
  0 siblings, 0 replies; 108+ messages in thread
From: Mikulas Patocka @ 2018-06-03 15:03 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: device-mapper development, linux-nvdimm



On Wed, 30 May 2018, Mike Snitzer wrote:

> On Wed, May 30 2018 at 10:46P -0400,
> Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > 
> > 
> > On Wed, 30 May 2018, Mike Snitzer wrote:
> > 
> > > > > Fine I'll deal with it.  reordering the fields eliminated holes in the
> > > > > structure and reduced struct members spanning cache lines.
> > > > 
> > > > And what about this?
> > > > #define WC_MODE_PMEM(wc)                        ((wc)->pmem_mode)
> > > > 
> > > > The code that I had just allowed the compiler to optimize out 
> > > > persistent-memory code if we have DM_WRITECACHE_ONLY_SSD defined - and you 
> > > > deleted it.
> > > > 
> > > > Most architectures don't have persistent memory and the dm-writecache 
> > > > driver could work in ssd-only mode on them. On these architectures, I 
> > > > define
> > > > #define WC_MODE_PMEM(wc)                        false
> > > > - and the compiler will just automatically remove the tests for that 
> > > > condition and the unused branch. It does also eliminate unused static 
> > > > functions.
> > > 
> > > This level of microoptimization can be backfilled.  But as it was, there
> > > were too many #defines.  And I'm really not concerned with eliminating
> > > unused static functions for this case.
> > 
> > I don't see why "too many defines" would be a problem.
> > 
> > If I compile it with and without pmem support, the difference is 
> > 15kB-vs-12kB. If we look at just one function (writecache_map), the 
> > difference is 1595 bytes - vs - 1280 bytes. So, it produces real savings 
> > in code size.
> > 
> > The problem with performance is not caused by a condition that always jumps 
> > the same way (that is predicted by the CPU and it causes no delays in the 
> > pipeline) - the problem is that a bigger function consumes more i-cache. 
> > There is no reason to include code that can't be executed.
> 
> Please double check you see the reduced code size you're expecting using
> the latest dm-writecache.c in linux-dm.git's dm-4.18 branch.
> 
> Thanks,
> Mike

I checked that - it's OK.

Mikulas

^ permalink raw reply	[flat|nested] 108+ messages in thread

* [PATCH v2 RESEND] x86: optimize memcpy_flushcache
  2018-05-24 18:20     ` [PATCH v2] " Mike Snitzer
@ 2018-06-18 13:23         ` Mike Snitzer
  0 siblings, 0 replies; 108+ messages in thread
From: Mike Snitzer @ 2018-06-18 13:23 UTC (permalink / raw)
  To: Ingo Molnar, Thomas Gleixner
  Cc: Mikulas Patocka, Dan Williams, device-mapper development, X86 ML,
	linux-kernel

From: Mikulas Patocka <mpatocka@redhat.com>
Subject: [PATCH v2] x86: optimize memcpy_flushcache

In the context of constant short length stores to persistent memory,
memcpy_flushcache suffers from a 2% performance degradation compared to
explicitly using the "movnti" instruction.

Optimize 4, 8, and 16 byte memcpy_flushcache calls to explicitly use the
movnti instruction with inline assembler.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 arch/x86/include/asm/string_64.h | 28 +++++++++++++++++++++++++++-
 arch/x86/lib/usercopy_64.c       |  4 ++--
 2 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 533f74c300c2..aaba83478cdc 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -147,7 +147,33 @@ memcpy_mcsafe(void *dst, const void *src, size_t cnt)
 
 #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
 #define __HAVE_ARCH_MEMCPY_FLUSHCACHE 1
-void memcpy_flushcache(void *dst, const void *src, size_t cnt);
+void __memcpy_flushcache(void *dst, const void *src, size_t cnt);
+static __always_inline void memcpy_flushcache(void *dst, const void *src, size_t cnt)
+{
+	if (__builtin_constant_p(cnt)) {
+		switch (cnt) {
+		case 4:
+			asm volatile("movntil %1, %0"
+				     : "=m" (*(u32 *)dst)
+				     : "r" (*(u32 *)src));
+			return;
+		case 8:
+			asm volatile("movntiq %1, %0"
+				     : "=m" (*(u64 *)dst)
+				     : "r" (*(u64 *)src));
+			return;
+		case 16:
+			asm volatile("movntiq %1, %0"
+				     : "=m" (*(u64 *)dst)
+				     : "r" (*(u64 *)src));
+			asm volatile("movntiq %1, %0"
+				     : "=m" (*(u64 *)(dst + 8))
+				     : "r" (*(u64 *)(src + 8)));
+			return;
+		}
+	}
+	__memcpy_flushcache(dst, src, cnt);
+}
 #endif
 
 #endif /* __KERNEL__ */
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index 75d3776123cc..26f515aa3529 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -133,7 +133,7 @@ long __copy_user_flushcache(void *dst, const void __user *src, unsigned size)
 	return rc;
 }
 
-void memcpy_flushcache(void *_dst, const void *_src, size_t size)
+void __memcpy_flushcache(void *_dst, const void *_src, size_t size)
 {
 	unsigned long dest = (unsigned long) _dst;
 	unsigned long source = (unsigned long) _src;
@@ -196,7 +196,7 @@ void memcpy_flushcache(void *_dst, const void *_src, size_t size)
 		clean_cache_range((void *) dest, size);
 	}
 }
-EXPORT_SYMBOL_GPL(memcpy_flushcache);
+EXPORT_SYMBOL_GPL(__memcpy_flushcache);
 
 void memcpy_page_flushcache(char *to, struct page *page, size_t offset,
 		size_t len)
-- 
2.15.0
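
The constant-size fast path above is what callers like dm-writecache's
pmem_assign() hit, because sizeof(dest) is a compile-time constant. A minimal
illustration (the struct is made up):

	struct wc_meta { u64 seq_count; };	/* illustrative on-media field */

	static void set_seq_count(struct wc_meta *m, u64 v)
	{
		u64 uniq = v;
		/* cnt == 8 is constant here, so this compiles to one movntiq */
		memcpy_flushcache(&m->seq_count, &uniq, sizeof(m->seq_count));
	}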


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 RESEND] x86: optimize memcpy_flushcache
  2018-06-18 13:23         ` Mike Snitzer
  (?)
@ 2018-06-21 14:31         ` Ingo Molnar
  2018-06-22  1:19             ` Mikulas Patocka
  -1 siblings, 1 reply; 108+ messages in thread
From: Ingo Molnar @ 2018-06-21 14:31 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Thomas Gleixner, Mikulas Patocka, Dan Williams,
	device-mapper development, X86 ML, linux-kernel


* Mike Snitzer <snitzer@redhat.com> wrote:

> From: Mikulas Patocka <mpatocka@redhat.com>
> Subject: [PATCH v2] x86: optimize memcpy_flushcache
> 
> In the context of constant short length stores to persistent memory,
> memcpy_flushcache suffers from a 2% performance degradation compared to
> explicitly using the "movnti" instruction.
> 
> Optimize 4, 8, and 16 byte memcpy_flushcache calls to explicitly use the
> movnti instruction with inline assembler.

Linus requested asm optimizations to include actual benchmarks, so it would be 
nice to describe how this was tested, on what hardware, and what the before/after 
numbers are.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 RESEND] x86: optimize memcpy_flushcache
  2018-06-21 14:31         ` Ingo Molnar
@ 2018-06-22  1:19             ` Mikulas Patocka
  0 siblings, 0 replies; 108+ messages in thread
From: Mikulas Patocka @ 2018-06-22  1:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mike Snitzer, Thomas Gleixner, Dan Williams,
	device-mapper development, X86 ML, linux-kernel



On Thu, 21 Jun 2018, Ingo Molnar wrote:

> 
> * Mike Snitzer <snitzer@redhat.com> wrote:
> 
> > From: Mikulas Patocka <mpatocka@redhat.com>
> > Subject: [PATCH v2] x86: optimize memcpy_flushcache
> > 
> > In the context of constant short length stores to persistent memory,
> > memcpy_flushcache suffers from a 2% performance degradation compared to
> > explicitly using the "movnti" instruction.
> > 
> > Optimize 4, 8, and 16 byte memcpy_flushcache calls to explicitly use the
> > movnti instruction with inline assembler.
> 
> Linus requested asm optimizations to include actual benchmarks, so it would be 
> nice to describe how this was tested, on what hardware, and what the before/after 
> numbers are.
> 
> Thanks,
> 
> 	Ingo

It was tested on a 4-core Skylake machine with persistent memory being 
emulated using the memmap kernel option. The dm-writecache target used the 
emulated persistent memory as a cache and a SATA SSD as a backing device. 
The patch results in 2% improved throughput when writing data using dd.

I don't have access to the machine anymore.

Mikulas
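
For anyone reproducing this kind of setup: pmem emulation uses the memmap
boot parameter, and the writecache target is stacked with dmsetup. Device
names and sizes below are examples, not the configuration actually measured:

	# kernel command line: carve 4 GiB at physical offset 12 GiB as /dev/pmem0
	memmap=4G!12G

	# pmem cache in front of a SATA SSD backing device, 4096-byte blocks
	dmsetup create wc --table \
	  "0 $(blockdev --getsz /dev/sdb) writecache p /dev/sdb /dev/pmem0 4096 0"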

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 RESEND] x86: optimize memcpy_flushcache
  2018-06-22  1:19             ` Mikulas Patocka
@ 2018-06-22  1:30             ` Ingo Molnar
  2018-08-08 21:22               ` [PATCH v3 " Mikulas Patocka
  -1 siblings, 1 reply; 108+ messages in thread
From: Ingo Molnar @ 2018-06-22  1:30 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Mike Snitzer, Thomas Gleixner, Dan Williams,
	device-mapper development, X86 ML, linux-kernel


* Mikulas Patocka <mpatocka@redhat.com> wrote:

> On Thu, 21 Jun 2018, Ingo Molnar wrote:
> 
> > 
> > * Mike Snitzer <snitzer@redhat.com> wrote:
> > 
> > > From: Mikulas Patocka <mpatocka@redhat.com>
> > > Subject: [PATCH v2] x86: optimize memcpy_flushcache
> > > 
> > > In the context of constant short length stores to persistent memory,
> > > memcpy_flushcache suffers from a 2% performance degradation compared to
> > > explicitly using the "movnti" instruction.
> > > 
> > > Optimize 4, 8, and 16 byte memcpy_flushcache calls to explicitly use the
> > > movnti instruction with inline assembler.
> > 
> > Linus requested asm optimizations to include actual benchmarks, so it would be 
> > nice to describe how this was tested, on what hardware, and what the before/after 
> > numbers are.
> > 
> > Thanks,
> > 
> > 	Ingo
> 
> It was tested on a 4-core Skylake machine with persistent memory emulated
> using the memmap kernel option. The dm-writecache target used the
> emulated persistent memory as a cache and a SATA SSD as the backing
> device. The patch results in a 2% throughput improvement when writing
> data using dd.
> 
> I don't have access to the machine anymore.

I think this information is enough, but do we know how well memmap emulation
represents true persistent memory speed and cache management characteristics?
It might be representative - but I don't know for sure, and probably neither
do most readers of the changelog.

So could you please put all this into an updated changelog, and also add a short 
description that outlines exactly which codepaths end up using this method in a 
typical persistent memory setup? All filesystem ops - or only reads, etc?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 108+ messages in thread

* [PATCH v3 RESEND] x86: optimize memcpy_flushcache
  2018-06-22  1:30             ` Ingo Molnar
@ 2018-08-08 21:22               ` Mikulas Patocka
  2018-09-10 13:18                 ` Ingo Molnar
  2018-09-11  6:22                 ` [tip:x86/asm] x86/asm: Optimize memcpy_flushcache() tip-bot for Mikulas Patocka
  0 siblings, 2 replies; 108+ messages in thread
From: Mikulas Patocka @ 2018-08-08 21:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mike Snitzer, Thomas Gleixner, Dan Williams,
	device-mapper development, X86 ML, linux-kernel



On Fri, 22 Jun 2018, Ingo Molnar wrote:

> 
> * Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > On Thu, 21 Jun 2018, Ingo Molnar wrote:
> > 
> > > 
> > > * Mike Snitzer <snitzer@redhat.com> wrote:
> > > 
> > > > From: Mikulas Patocka <mpatocka@redhat.com>
> > > > Subject: [PATCH v2] x86: optimize memcpy_flushcache
> > > > 
> > > > In the context of constant short length stores to persistent memory,
> > > > memcpy_flushcache suffers from a 2% performance degradation compared to
> > > > explicitly using the "movnti" instruction.
> > > > 
> > > > Optimize 4, 8, and 16 byte memcpy_flushcache calls to explicitly use the
> > > > movnti instruction with inline assembler.
> > > 
> > > Linus requested asm optimizations to include actual benchmarks, so it would be 
> > > nice to describe how this was tested, on what hardware, and what the before/after 
> > > numbers are.
> > > 
> > > Thanks,
> > > 
> > > 	Ingo
> > 
> > It was tested on a 4-core Skylake machine with persistent memory emulated
> > using the memmap kernel option. The dm-writecache target used the
> > emulated persistent memory as a cache and a SATA SSD as the backing
> > device. The patch results in a 2% throughput improvement when writing
> > data using dd.
> > 
> > I don't have access to the machine anymore.
> 
> I think this information is enough, but do we know how well memmap emulation
> represents true persistent memory speed and cache management characteristics?
> It might be representative - but I don't know for sure, and probably neither
> do most readers of the changelog.
> 
> So could you please put all this into an updated changelog, and also add a short 
> description that outlines exactly which codepaths end up using this method in a 
> typical persistent memory setup? All filesystem ops - or only reads, etc?
> 
> Thanks,
> 
> 	Ingo

Here I resend it:


From: Mikulas Patocka <mpatocka@redhat.com>
Subject: [PATCH] x86: optimize memcpy_flushcache

I use memcpy_flushcache in my persistent memory driver for metadata
updates; there are many 8-byte and 16-byte updates, and it turns out that
the overhead of memcpy_flushcache causes a 2% performance degradation
compared to the "movnti" instruction explicitly coded using inline
assembler.

The tests were done on a Skylake processor with persistent memory emulated
using the "memmap" kernel parameter. dd was used to copy data to the
dm-writecache target.

This patch recognizes memcpy_flushcache calls with a constant short length
and turns them into inline assembler - so that I don't have to use inline
assembler in the driver.
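
A minimal sketch of how callers split between the two paths (the
structure and function names below are illustrative, not dm-writecache
code); since movnti stores are weakly ordered, callers still need a store
fence before the data can be considered persistent:

#include <linux/string.h>
#include <linux/types.h>

/* Illustrative 16-byte metadata record in persistent memory. */
struct pmem_entry_example {
	u64 seq;
	u64 block;
};

static void commit_entry(struct pmem_entry_example *pmem_slot,
			 const struct pmem_entry_example *e)
{
	/* sizeof(*e) == 16 is a compile-time constant, so this inlines
	 * into two movnti stores instead of calling __memcpy_flushcache(). */
	memcpy_flushcache(pmem_slot, e, sizeof(*e));
}

static void commit_data(void *pmem, const void *buf, size_t len)
{
	/* len is not a compile-time constant: this falls back to the
	 * out-of-line __memcpy_flushcache(). */
	memcpy_flushcache(pmem, buf, len);
}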

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 arch/x86/include/asm/string_64.h |   20 +++++++++++++++++++-
 arch/x86/lib/usercopy_64.c       |    4 ++--
 2 files changed, 21 insertions(+), 3 deletions(-)

Index: linux-2.6/arch/x86/include/asm/string_64.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/string_64.h
+++ linux-2.6/arch/x86/include/asm/string_64.h
@@ -149,7 +149,25 @@ memcpy_mcsafe(void *dst, const void *src
 
 #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
 #define __HAVE_ARCH_MEMCPY_FLUSHCACHE 1
-void memcpy_flushcache(void *dst, const void *src, size_t cnt);
+void __memcpy_flushcache(void *dst, const void *src, size_t cnt);
+static __always_inline void memcpy_flushcache(void *dst, const void *src, size_t cnt)
+{
+	if (__builtin_constant_p(cnt)) {
+		switch (cnt) {
+			case 4:
+				asm ("movntil %1, %0" : "=m"(*(u32 *)dst) : "r"(*(u32 *)src));
+				return;
+			case 8:
+				asm ("movntiq %1, %0" : "=m"(*(u64 *)dst) : "r"(*(u64 *)src));
+				return;
+			case 16:
+				asm ("movntiq %1, %0" : "=m"(*(u64 *)dst) : "r"(*(u64 *)src));
+				asm ("movntiq %1, %0" : "=m"(*(u64 *)(dst + 8)) : "r"(*(u64 *)(src + 8)));
+				return;
+		}
+	}
+	__memcpy_flushcache(dst, src, cnt);
+}
 #endif
 
 #endif /* __KERNEL__ */
Index: linux-2.6/arch/x86/lib/usercopy_64.c
===================================================================
--- linux-2.6.orig/arch/x86/lib/usercopy_64.c
+++ linux-2.6/arch/x86/lib/usercopy_64.c
@@ -153,7 +153,7 @@ long __copy_user_flushcache(void *dst, c
 	return rc;
 }
 
-void memcpy_flushcache(void *_dst, const void *_src, size_t size)
+void __memcpy_flushcache(void *_dst, const void *_src, size_t size)
 {
 	unsigned long dest = (unsigned long) _dst;
 	unsigned long source = (unsigned long) _src;
@@ -216,7 +216,7 @@ void memcpy_flushcache(void *_dst, const
 		clean_cache_range((void *) dest, size);
 	}
 }
-EXPORT_SYMBOL_GPL(memcpy_flushcache);
+EXPORT_SYMBOL_GPL(__memcpy_flushcache);
 
 void memcpy_page_flushcache(char *to, struct page *page, size_t offset,
 		size_t len)

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v3 RESEND] x86: optimize memcpy_flushcache
  2018-08-08 21:22               ` [PATCH v3 " Mikulas Patocka
@ 2018-09-10 13:18                 ` Ingo Molnar
  2018-09-11  6:22                 ` [tip:x86/asm] x86/asm: Optimize memcpy_flushcache() tip-bot for Mikulas Patocka
  1 sibling, 0 replies; 108+ messages in thread
From: Ingo Molnar @ 2018-09-10 13:18 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Mike Snitzer, Thomas Gleixner, Dan Williams,
	device-mapper development, X86 ML, linux-kernel


* Mikulas Patocka <mpatocka@redhat.com> wrote:

> Here I resend it:
> 
> 
> From: Mikulas Patocka <mpatocka@redhat.com>
> Subject: [PATCH] x86: optimize memcpy_flushcache
> 
> I use memcpy_flushcache in my persistent memory driver for metadata
> updates; there are many 8-byte and 16-byte updates, and it turns out that
> the overhead of memcpy_flushcache causes a 2% performance degradation
> compared to the "movnti" instruction explicitly coded using inline
> assembler.
> 
> The tests were done on a Skylake processor with persistent memory emulated
> using the "memmap" kernel parameter. dd was used to copy data to the
> dm-writecache target.
> 
> This patch recognizes memcpy_flushcache calls with a constant short length
> and turns them into inline assembler - so that I don't have to use inline
> assembler in the driver.
> 
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> 
> ---
>  arch/x86/include/asm/string_64.h |   20 +++++++++++++++++++-
>  arch/x86/lib/usercopy_64.c       |    4 ++--
>  2 files changed, 21 insertions(+), 3 deletions(-)

Applied to tip:x86/asm, thanks!

I'll push it out later today after some testing.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 108+ messages in thread

* [tip:x86/asm] x86/asm: Optimize memcpy_flushcache()
  2018-08-08 21:22               ` [PATCH v3 " Mikulas Patocka
  2018-09-10 13:18                 ` Ingo Molnar
@ 2018-09-11  6:22                 ` tip-bot for Mikulas Patocka
  1 sibling, 0 replies; 108+ messages in thread
From: tip-bot for Mikulas Patocka @ 2018-09-11  6:22 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, hpa, peterz, tglx, mpatocka, snitzer, linux-kernel,
	torvalds, dan.j.williams, dm-devel

Commit-ID:  02101c45ec5b19d607af7372680f5259050b4e9c
Gitweb:     https://git.kernel.org/tip/02101c45ec5b19d607af7372680f5259050b4e9c
Author:     Mikulas Patocka <mpatocka@redhat.com>
AuthorDate: Wed, 8 Aug 2018 17:22:16 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 10 Sep 2018 15:17:12 +0200

x86/asm: Optimize memcpy_flushcache()

I use memcpy_flushcache() in my persistent memory driver for metadata
updates; there are many 8-byte and 16-byte updates, and it turns out that
the overhead of memcpy_flushcache causes a 2% performance degradation
compared to the "movnti" instruction explicitly coded using inline
assembler.

The tests were done on a Skylake processor with persistent memory emulated
using the "memmap" kernel parameter. dd was used to copy data to the
dm-writecache target.

This patch recognizes memcpy_flushcache calls with a constant short length
and turns them into inline assembler - so that I don't have to use inline
assembler in the driver.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: device-mapper development <dm-devel@redhat.com>
Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1808081720460.24747@file01.intranet.prod.int.rdu2.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/string_64.h | 20 +++++++++++++++++++-
 arch/x86/lib/usercopy_64.c       |  4 ++--
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index d33f92b9fa22..7ad41bfcc16c 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -149,7 +149,25 @@ memcpy_mcsafe(void *dst, const void *src, size_t cnt)
 
 #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
 #define __HAVE_ARCH_MEMCPY_FLUSHCACHE 1
-void memcpy_flushcache(void *dst, const void *src, size_t cnt);
+void __memcpy_flushcache(void *dst, const void *src, size_t cnt);
+static __always_inline void memcpy_flushcache(void *dst, const void *src, size_t cnt)
+{
+	if (__builtin_constant_p(cnt)) {
+		switch (cnt) {
+			case 4:
+				asm ("movntil %1, %0" : "=m"(*(u32 *)dst) : "r"(*(u32 *)src));
+				return;
+			case 8:
+				asm ("movntiq %1, %0" : "=m"(*(u64 *)dst) : "r"(*(u64 *)src));
+				return;
+			case 16:
+				asm ("movntiq %1, %0" : "=m"(*(u64 *)dst) : "r"(*(u64 *)src));
+				asm ("movntiq %1, %0" : "=m"(*(u64 *)(dst + 8)) : "r"(*(u64 *)(src + 8)));
+				return;
+		}
+	}
+	__memcpy_flushcache(dst, src, cnt);
+}
 #endif
 
 #endif /* __KERNEL__ */
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index 9c5606d88f61..c50a1d815a37 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -153,7 +153,7 @@ long __copy_user_flushcache(void *dst, const void __user *src, unsigned size)
 	return rc;
 }
 
-void memcpy_flushcache(void *_dst, const void *_src, size_t size)
+void __memcpy_flushcache(void *_dst, const void *_src, size_t size)
 {
 	unsigned long dest = (unsigned long) _dst;
 	unsigned long source = (unsigned long) _src;
@@ -216,7 +216,7 @@ void memcpy_flushcache(void *_dst, const void *_src, size_t size)
 		clean_cache_range((void *) dest, size);
 	}
 }
-EXPORT_SYMBOL_GPL(memcpy_flushcache);
+EXPORT_SYMBOL_GPL(__memcpy_flushcache);
 
 void memcpy_page_flushcache(char *to, struct page *page, size_t offset,
 		size_t len)

^ permalink raw reply related	[flat|nested] 108+ messages in thread

end of thread, other threads:[~2018-09-11  6:22 UTC | newest]

Thread overview: 108+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-19  5:25 [patch 0/4] dm-writecache patches Mikulas Patocka
2018-05-19  5:25 ` [patch 1/4] x86: optimize memcpy_flushcache Mikulas Patocka
2018-05-19 14:21   ` Dan Williams
2018-05-24 18:20     ` [PATCH v2] " Mike Snitzer
2018-06-18 13:23       ` [PATCH v2 RESEND] " Mike Snitzer
2018-06-18 13:23         ` Mike Snitzer
2018-06-21 14:31         ` Ingo Molnar
2018-06-22  1:19           ` Mikulas Patocka
2018-06-22  1:19             ` Mikulas Patocka
2018-06-22  1:30             ` Ingo Molnar
2018-08-08 21:22               ` [PATCH v3 " Mikulas Patocka
2018-09-10 13:18                 ` Ingo Molnar
2018-09-11  6:22                 ` [tip:x86/asm] x86/asm: Optimize memcpy_flushcache() tip-bot for Mikulas Patocka
2018-05-19  5:25 ` [patch 2/4] swait: export the symbols __prepare_to_swait and __finish_swait Mikulas Patocka
2018-05-22  6:34   ` Christoph Hellwig
2018-05-22 18:52     ` Mike Snitzer
2018-05-23  9:21       ` Peter Zijlstra
2018-05-23 15:10         ` Mike Snitzer
2018-05-23 18:10           ` [PATCH v2] swait: export " Mike Snitzer
2018-05-23 20:38             ` Mikulas Patocka
2018-05-23 21:51               ` Mike Snitzer
2018-05-24 14:10             ` Peter Zijlstra
2018-05-24 15:09               ` Mike Snitzer
2018-05-19  5:25 ` [patch 3/4] dm-writecache Mikulas Patocka
2018-05-22  6:37   ` Christoph Hellwig
2018-05-19  5:25 ` [patch 4/4] dm-writecache: use new API for flushing Mikulas Patocka
2018-05-22  6:39   ` [dm-devel] " Christoph Hellwig
2018-05-22  6:39     ` Christoph Hellwig
2018-05-22 18:41     ` Mike Snitzer
2018-05-22 18:41       ` Mike Snitzer
2018-05-22 19:00       ` Dan Williams
2018-05-22 19:00         ` Dan Williams
2018-05-22 19:19         ` Mike Snitzer
2018-05-22 19:19           ` Mike Snitzer
2018-05-22 19:27           ` Dan Williams
2018-05-22 19:27             ` Dan Williams
2018-05-22 20:52             ` Mike Snitzer
2018-05-22 20:52               ` Mike Snitzer
2018-05-22 22:53               ` [dm-devel] " Jeff Moyer
2018-05-22 22:53                 ` Jeff Moyer
2018-05-23 20:57                 ` Mikulas Patocka
2018-05-23 20:57                   ` Mikulas Patocka
2018-05-28 13:52             ` Mikulas Patocka
2018-05-28 13:52               ` Mikulas Patocka
2018-05-28 17:41               ` Dan Williams
2018-05-28 17:41                 ` Dan Williams
2018-05-30 13:42                 ` [dm-devel] " Jeff Moyer
2018-05-30 13:42                   ` Jeff Moyer
2018-05-30 13:51                   ` Mikulas Patocka
2018-05-30 13:51                     ` Mikulas Patocka
2018-05-30 13:52                   ` Jeff Moyer
2018-05-30 13:52                     ` Jeff Moyer
2018-05-24  8:15         ` Mikulas Patocka
2018-05-24  8:15           ` Mikulas Patocka
2018-05-25  3:12   ` Dan Williams
2018-05-25  6:17     ` Mikulas Patocka
2018-05-25 12:51       ` Mike Snitzer
2018-05-25 12:51         ` Mike Snitzer
2018-05-25 15:57         ` Dan Williams
2018-05-25 15:57           ` Dan Williams
2018-05-26  7:02           ` Mikulas Patocka
2018-05-26  7:02             ` Mikulas Patocka
2018-05-26 15:26             ` Dan Williams
2018-05-26 15:26               ` Dan Williams
2018-05-28 13:32               ` Mikulas Patocka
2018-05-28 13:32                 ` Mikulas Patocka
2018-05-28 18:14                 ` Dan Williams
2018-05-28 18:14                   ` Dan Williams
2018-05-30 13:07                   ` Mikulas Patocka
2018-05-30 13:07                     ` Mikulas Patocka
2018-05-30 13:16                     ` Mike Snitzer
2018-05-30 13:16                       ` Mike Snitzer
2018-05-30 13:21                       ` Mikulas Patocka
2018-05-30 13:21                         ` Mikulas Patocka
2018-05-30 13:26                         ` Mike Snitzer
2018-05-30 13:26                           ` Mike Snitzer
2018-05-30 13:33                           ` Mikulas Patocka
2018-05-30 13:33                             ` Mikulas Patocka
2018-05-30 13:54                             ` Mike Snitzer
2018-05-30 13:54                               ` Mike Snitzer
2018-05-30 14:09                               ` Mikulas Patocka
2018-05-30 14:09                                 ` Mikulas Patocka
2018-05-30 14:21                                 ` Mike Snitzer
2018-05-30 14:21                                   ` Mike Snitzer
2018-05-30 14:46                                   ` Mikulas Patocka
2018-05-30 14:46                                     ` Mikulas Patocka
2018-05-31  3:42                                     ` Mike Snitzer
2018-05-31  3:42                                       ` Mike Snitzer
2018-06-03 15:03                                       ` Mikulas Patocka
2018-06-03 15:03                                         ` Mikulas Patocka
2018-05-31  3:39                                 ` Mike Snitzer
2018-05-31  3:39                                   ` Mike Snitzer
2018-05-31  8:16                                   ` Mikulas Patocka
2018-05-31  8:16                                     ` Mikulas Patocka
2018-05-31 12:09                                     ` Mike Snitzer
2018-05-31 12:09                                       ` Mike Snitzer
2018-05-30 15:58                     ` Dan Williams
2018-05-30 15:58                       ` Dan Williams
2018-05-30 22:39                       ` Dan Williams
2018-05-30 22:39                         ` Dan Williams
2018-05-31  8:19                         ` Mikulas Patocka
2018-05-31  8:19                           ` Mikulas Patocka
2018-05-31 14:51                           ` Dan Williams
2018-05-31 14:51                             ` Dan Williams
2018-05-31 15:31                             ` Mikulas Patocka
2018-05-31 15:31                               ` Mikulas Patocka
2018-05-31 16:39                               ` Dan Williams
2018-05-31 16:39                                 ` Dan Williams
