* [PATCH] staging: Add dm-writeboost
@ 2013-09-01 11:10 Akira Hayakawa
  2013-09-16 21:53 ` Mike Snitzer
  0 siblings, 1 reply; 25+ messages in thread
From: Akira Hayakawa @ 2013-09-01 11:10 UTC (permalink / raw)
  To: gregkh
  Cc: ruby.wktk, linux-kernel, dm-devel, agk, m.chehab, devel, akpm,
	joe, cesarb

This patch introduces dm-writeboost to the staging tree.

dm-writeboost is log-structured caching software.
It batches incoming random writes into one big sequential write
to a cache device.

Unlike other block caching software such as dm-cache and bcache,
dm-writeboost focuses on bursty writes.
Since the implementation is optimized for writes,
a benchmark using fio shows that it sustains 259MB/s of random
writes on an SSD whose sequential write throughput is 266MB/s,
i.e. only a 3% loss.

Furthermore,
because it writes to the SSD cache device sequentially,
the lifetime of the device is maximized.

The merit of putting this software in the staging tree is
to make it easier to get feedback from users
and thus polish the code.

Signed-off-by: Akira Hayakawa <ruby.wktk@gmail.com>
---
 MAINTAINERS                                     |    7 +
 drivers/staging/Kconfig                         |    2 +
 drivers/staging/Makefile                        |    1 +
 drivers/staging/dm-writeboost/Kconfig           |    8 +
 drivers/staging/dm-writeboost/Makefile          |    1 +
 drivers/staging/dm-writeboost/TODO              |   11 +
 drivers/staging/dm-writeboost/dm-writeboost.c   | 3445 +++++++++++++++++++++++
 drivers/staging/dm-writeboost/dm-writeboost.txt |  133 +
 8 files changed, 3608 insertions(+)
 create mode 100644 drivers/staging/dm-writeboost/Kconfig
 create mode 100644 drivers/staging/dm-writeboost/Makefile
 create mode 100644 drivers/staging/dm-writeboost/TODO
 create mode 100644 drivers/staging/dm-writeboost/dm-writeboost.c
 create mode 100644 drivers/staging/dm-writeboost/dm-writeboost.txt

diff --git a/MAINTAINERS b/MAINTAINERS
index d167c03..975e4b0 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8023,6 +8023,13 @@ M:	Arnaud Patard <arnaud.patard@rtp-net.org>
 S:	Odd Fixes
 F:	drivers/staging/xgifb/
 
+STAGING - LOG STRUCTURED CACHING
+M:	Akira Hayakawa <ruby.wktk@gmail.com>
+S:	Maintained
+L:	dm-devel@redhat.com
+W:	https://github.com/akiradeveloper/dm-writeboost
+F:	drivers/staging/dm-writeboost/
+
 STARFIRE/DURALAN NETWORK DRIVER
 M:	Ion Badulescu <ionut@badula.org>
 S:	Odd Fixes
diff --git a/drivers/staging/Kconfig b/drivers/staging/Kconfig
index 3626dbc8..6d639fc 100644
--- a/drivers/staging/Kconfig
+++ b/drivers/staging/Kconfig
@@ -148,4 +148,6 @@ source "drivers/staging/dgnc/Kconfig"
 
 source "drivers/staging/dgap/Kconfig"
 
+source "drivers/staging/dm-writeboost/Kconfig"
+
 endif # STAGING
diff --git a/drivers/staging/Makefile b/drivers/staging/Makefile
index d1b4b80..f26010b 100644
--- a/drivers/staging/Makefile
+++ b/drivers/staging/Makefile
@@ -66,3 +66,4 @@ obj-$(CONFIG_USB_BTMTK)		+= btmtk_usb/
 obj-$(CONFIG_XILLYBUS)		+= xillybus/
 obj-$(CONFIG_DGNC)			+= dgnc/
 obj-$(CONFIG_DGAP)			+= dgap/
+obj-$(CONFIG_DM_WRITEBOOST)	+= dm-writeboost/
diff --git a/drivers/staging/dm-writeboost/Kconfig b/drivers/staging/dm-writeboost/Kconfig
new file mode 100644
index 0000000..fc33f63
--- /dev/null
+++ b/drivers/staging/dm-writeboost/Kconfig
@@ -0,0 +1,8 @@
+config DM_WRITEBOOST
+	tristate "Log-structured Caching"
+	depends on BLK_DEV_DM
+	default n
+	---help---
+	  A caching layer that batches random writes into one big
+	  sequential write to a cache device in a log-structured
+	  manner.
diff --git a/drivers/staging/dm-writeboost/Makefile b/drivers/staging/dm-writeboost/Makefile
new file mode 100644
index 0000000..d4e3100
--- /dev/null
+++ b/drivers/staging/dm-writeboost/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_DM_WRITEBOOST) += dm-writeboost.o
diff --git a/drivers/staging/dm-writeboost/TODO b/drivers/staging/dm-writeboost/TODO
new file mode 100644
index 0000000..638c1ee
--- /dev/null
+++ b/drivers/staging/dm-writeboost/TODO
@@ -0,0 +1,11 @@
+TODO:
+    - Get feedback from 3rd party users.
+    - Get reviewed by Mike Snitzer.
+    - Audit userspace interfaces to make sure they are sane.
+      Should use the same approach that has proven successful.
+    - Fix the documentation. Learn from other targets.
+    - Add more comments inline
+      to explain what the code does and how it works.
+
+Please send patches to Greg Kroah-Hartman <gregkh@linuxfoundation.org> and Cc:
+Akira Hayakawa <ruby.wktk@gmail.com>
diff --git a/drivers/staging/dm-writeboost/dm-writeboost.c b/drivers/staging/dm-writeboost/dm-writeboost.c
new file mode 100644
index 0000000..43e108b
--- /dev/null
+++ b/drivers/staging/dm-writeboost/dm-writeboost.c
@@ -0,0 +1,3445 @@
+/*
+ * dm-writeboost.c : Log-structured Caching for Linux.
+ * Copyright (C) 2012-2013 Akira Hayakawa <ruby.wktk@gmail.com>
+ *
+ * This file is released under the GPL.
+ */
+
+#define DM_MSG_PREFIX "writeboost"
+
+#include <linux/module.h>
+#include <linux/version.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/mutex.h>
+#include <linux/sched.h>
+#include <linux/timer.h>
+#include <linux/device-mapper.h>
+#include <linux/dm-io.h>
+
+#define WBERR(f, args...) \
+	DMERR("err@%d " f, __LINE__, ## args)
+#define WBWARN(f, args...) \
+	DMWARN("warn@%d " f, __LINE__, ## args)
+#define WBINFO(f, args...) \
+	DMINFO("info@%d " f, __LINE__, ## args)
+
+/*
+ * The segment size is (1 << x) sectors, where 4 <= x <= 11;
+ * dm-writeboost supports segment sizes up to 1MB.
+ *
+ * All the comments below assume
+ * the maximum 1MB segment size.
+ */
+#define WB_SEGMENTSIZE_ORDER 11
+
+/*
+ * By default,
+ * we allocate 64 * 1MB RAM buffers statically.
+ */
+#define NR_RAMBUF_POOL 64
+
+/*
+ * The first 4KB (1 << 3 sectors) of each segment
+ * is reserved for metadata.
+ */
+#define NR_CACHES_INSEG ((1 << (WB_SEGMENTSIZE_ORDER - 3)) - 1)
+
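+/*
+ * kmalloc that cannot fail: if the allocation fails, warn,
+ * sleep for a millisecond and retry until it succeeds.
+ * Use only where returning an error is not an option.
+ */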
+static void *do_kmalloc_retry(size_t size, gfp_t flags, int lineno)
+{
+	size_t count = 0;
+	void *p;
+
+retry_alloc:
+	p = kmalloc(size, flags);
+	if (!p) {
+		count++;
+		WBWARN("L%d size:%zu, count:%zu",
+		       lineno, size, count);
+		schedule_timeout_interruptible(msecs_to_jiffies(1));
+		goto retry_alloc;
+	}
+	return p;
+}
+#define kmalloc_retry(size, flags) \
+	do_kmalloc_retry((size), (flags), __LINE__)
+
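+/*
+ * A big, logically flat array emulated as a set of small
+ * kmalloc'ed chunks ("parts"), so that no single allocation
+ * is larger than ALLOC_SIZE bytes.
+ */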
+struct part {
+	void *memory;
+};
+
+struct arr {
+	struct part *parts;
+	size_t nr_elems;
+	size_t elemsize;
+};
+
+#define ALLOC_SIZE (1 << 16)
+static size_t nr_elems_in_part(struct arr *arr)
+{
+	return ALLOC_SIZE / arr->elemsize;
+};
+
+static size_t nr_parts(struct arr *arr)
+{
+	return dm_div_up(arr->nr_elems, nr_elems_in_part(arr));
+}
+
+static struct arr *make_arr(size_t elemsize, size_t nr_elems)
+{
+	size_t i, j;
+	struct part *part;
+
+	struct arr *arr = kmalloc(sizeof(*arr), GFP_KERNEL);
+	if (!arr) {
+		WBERR();
+		return NULL;
+	}
+
+	arr->elemsize = elemsize;
+	arr->nr_elems = nr_elems;
+	arr->parts = kmalloc(sizeof(struct part) * nr_parts(arr), GFP_KERNEL);
+	if (!arr->parts) {
+		WBERR();
+		goto bad_alloc_parts;
+	}
+
+	for (i = 0; i < nr_parts(arr); i++) {
+		part = arr->parts + i;
+		part->memory = kmalloc(ALLOC_SIZE, GFP_KERNEL);
+		if (!part->memory) {
+			WBERR();
+			for (j = 0; j < i; j++) {
+				part = arr->parts + j;
+				kfree(part->memory);
+			}
+			goto bad_alloc_parts_memory;
+		}
+	}
+	return arr;
+
+bad_alloc_parts_memory:
+	kfree(arr->parts);
+bad_alloc_parts:
+	kfree(arr);
+	return NULL;
+}
+
+static void kill_arr(struct arr *arr)
+{
+	size_t i;
+	for (i = 0; i < nr_parts(arr); i++) {
+		struct part *part = arr->parts + i;
+		kfree(part->memory);
+	}
+	kfree(arr->parts);
+	kfree(arr);
+}
+
+static void *arr_at(struct arr *arr, size_t i)
+{
+	size_t n = nr_elems_in_part(arr);
+	size_t j = i / n;
+	size_t k = i % n;
+	struct part *part = arr->parts + j;
+	return part->memory + (arr->elemsize * k);
+}
+
+static struct dm_io_client *wb_io_client;
+
+struct safe_io {
+	struct work_struct work;
+	int err;
+	unsigned long err_bits;
+	struct dm_io_request *io_req;
+	unsigned num_regions;
+	struct dm_io_region *regions;
+};
+static struct workqueue_struct *safe_io_wq;
+
+static void safe_io_proc(struct work_struct *work)
+{
+	struct safe_io *io = container_of(work, struct safe_io, work);
+	io->err_bits = 0;
+	io->err = dm_io(io->io_req, io->num_regions, io->regions,
+			&io->err_bits);
+}
+
+/*
+ * dm_io wrapper.
+ * @thread: run the operation in a workqueue thread to avoid deadlock.
+ */
+static int dm_safe_io_internal(
+		struct dm_io_request *io_req,
+		unsigned num_regions, struct dm_io_region *regions,
+		unsigned long *err_bits, bool thread, int lineno)
+{
+	int err;
+	dev_t dev;
+
+	if (thread) {
+		struct safe_io io = {
+			.io_req = io_req,
+			.regions = regions,
+			.num_regions = num_regions,
+		};
+
+		INIT_WORK_ONSTACK(&io.work, safe_io_proc);
+
+		queue_work(safe_io_wq, &io.work);
+		flush_work(&io.work);
+
+		err = io.err;
+		if (err_bits)
+			*err_bits = io.err_bits;
+	} else {
+		err = dm_io(io_req, num_regions, regions, err_bits);
+	}
+
+	dev = regions->bdev->bd_dev;
+
+	/* The dm_io routines permit a NULL err_bits pointer. */
+	if (err || (err_bits && *err_bits)) {
+		unsigned long eb;
+		if (!err_bits)
+			eb = (~(unsigned long)0);
+		else
+			eb = *err_bits;
+		WBERR("L%d err(%d, %lu), rw(%d), sector(%lu), dev(%u:%u)",
+		      lineno, err, eb,
+		      io_req->bi_rw, regions->sector,
+		      MAJOR(dev), MINOR(dev));
+	}
+
+	return err;
+}
+#define dm_safe_io(io_req, num_regions, regions, err_bits, thread) \
+	dm_safe_io_internal((io_req), (num_regions), (regions), \
+			    (err_bits), (thread), __LINE__)
+
+static void dm_safe_io_retry_internal(
+		struct dm_io_request *io_req,
+		unsigned num_regions, struct dm_io_region *regions,
+		bool thread, int lineno)
+{
+	int err, count = 0;
+	unsigned long err_bits;
+	dev_t dev;
+
+retry_io:
+	err_bits = 0;
+	err = dm_safe_io_internal(io_req, num_regions, regions, &err_bits,
+				  thread, lineno);
+
+	dev = regions->bdev->bd_dev;
+	if (err || err_bits) {
+		count++;
+		WBWARN("L%d count(%d)", lineno, count);
+
+		schedule_timeout_interruptible(msecs_to_jiffies(1000));
+		goto retry_io;
+	}
+
+	if (count) {
+		WBWARN("L%d rw(%d), sector(%lu), dev(%u:%u)",
+		       lineno,
+		       io_req->bi_rw, regions->sector,
+		       MAJOR(dev), MINOR(dev));
+	}
+}
+#define dm_safe_io_retry(io_req, num_regions, regions, thread) \
+	dm_safe_io_retry_internal((io_req), (num_regions), (regions), \
+				  (thread), __LINE__)
+
+/*
+ * A device_id of 0
+ * is reserved for an invalid cache block.
+ */
+typedef u8 device_id;
+
+struct wb_device {
+	struct kobject kobj;
+
+	u8 migrate_threshold;
+
+	struct wb_cache *cache;
+
+	device_id id;
+	struct dm_dev *device;
+
+	atomic64_t nr_dirty_caches;
+
+	struct mapped_device *md;
+};
+
+/*
+ * A cache_id of 0
+ * is reserved to mean no cache.
+ */
+typedef u8 cache_id;
+
+/*
+ * dm-writeboost can manage
+ * only (1 << 7) - 1
+ * virtual devices and cache devices.
+ * An id of 0 is reserved for special purposes.
+ */
+#define WB_NR_SLOTS (1 << 7)
+
+cache_id cache_id_ptr;
+
+struct wb_cache *wb_caches[WB_NR_SLOTS];
+
+struct wb_device *wb_devices[WB_NR_SLOTS];
+
+/*
+ * Type for cache line index.
+ *
+ * dm-writeboost can support a cache device
+ * of size less than 4KB * (1 << 32),
+ * that is, 16TB.
+ */
+typedef u32 cache_nr;
+
+/*
+ * Accounts for a 4KB cache line,
+ * which consists of 8 sectors,
+ * each tracked by its own dirty bit.
+ */
+struct metablock {
+	sector_t sector;
+
+	cache_nr idx; /* Const */
+
+	struct hlist_node ht_list;
+
+	/*
+	 * 8-bit dirtiness flag,
+	 * one bit per sector in the cache line.
+	 *
+	 * In the current implementation,
+	 * we recover only dirty caches
+	 * in crash recovery.
+	 *
+	 * Adding a recover flag
+	 * to also recover clean caches
+	 * would badly complicate the code,
+	 * and would be nearly meaningless
+	 * because caches are likely to be dirty anyway.
+	 */
+	u8 dirty_bits;
+
+	device_id device_id;
+};
+
+static void inc_nr_dirty_caches(device_id id)
+{
+	struct wb_device *o = wb_devices[id];
+	BUG_ON(!o);
+	atomic64_inc(&o->nr_dirty_caches);
+}
+
+static void dec_nr_dirty_caches(device_id id)
+{
+	struct wb_device *o = wb_devices[id];
+	BUG_ON(!o);
+	atomic64_dec(&o->nr_dirty_caches);
+}
+
+/*
+ * On-disk metablock
+ */
+struct metablock_device {
+	sector_t sector;
+	device_id device_id;
+
+	u8 dirty_bits;
+
+	u32 lap;
+} __packed;
+
+struct rambuffer {
+	void *data;
+	struct completion done;
+};
+
+#define SZ_MAX (~(size_t)0)
+struct segment_header {
+	struct metablock mb_array[NR_CACHES_INSEG];
+
+	/*
+	 * IDs increase monotonically.
+	 * ID 0 marks an invalid segment;
+	 * valid IDs are >= 1.
+	 */
+	size_t global_id;
+
+	/*
+	 * A segment can be flushed only half-done.
+	 * length is the number of
+	 * metablocks that must be taken into account
+	 * when resuming.
+	 */
+	u8 length;
+
+	cache_nr start_idx; /* Const */
+	sector_t start_sector; /* Const */
+
+	struct list_head migrate_list;
+
+	struct completion flush_done;
+
+	struct completion migrate_done;
+
+	spinlock_t lock;
+
+	atomic_t nr_inflight_ios;
+};
+
+#define lockseg(seg, flags) spin_lock_irqsave(&(seg)->lock, flags)
+#define unlockseg(seg, flags) spin_unlock_irqrestore(&(seg)->lock, flags)
+
+static void cleanup_mb_if_dirty(struct segment_header *seg,
+				struct metablock *mb)
+{
+	unsigned long flags;
+
+	bool b = false;
+	lockseg(seg, flags);
+	if (mb->dirty_bits) {
+		mb->dirty_bits = 0;
+		b = true;
+	}
+	unlockseg(seg, flags);
+
+	if (b)
+		dec_nr_dirty_caches(mb->device_id);
+}
+
+static u8 atomic_read_mb_dirtiness(struct segment_header *seg,
+				   struct metablock *mb)
+{
+	unsigned long flags;
+	u8 r;
+
+	lockseg(seg, flags);
+	r = mb->dirty_bits;
+	unlockseg(seg, flags);
+
+	return r;
+}
+
+/*
+ * On-disk segment header.
+ * At most 4KB in total.
+ */
+struct segment_header_device {
+	/* --- At most 512 bytes for atomicity. --- */
+	size_t global_id;
+	u8 length;
+	u32 lap; /* Initially 0. 1 for the first lap. */
+	/* -------------------------------------- */
+	/* This array must be located at the tail. */
+	struct metablock_device mbarr[NR_CACHES_INSEG];
+} __packed;
+
+struct lookup_key {
+	device_id device_id;
+	sector_t sector;
+};
+
+enum STATFLAG {
+	STAT_WRITE = 0,
+	STAT_HIT,
+	STAT_ON_BUFFER,
+	STAT_FULLSIZE,
+};
+#define STATLEN (1 << 4)
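+/*
+ * I/O statistics are kept per combination of the four flags
+ * above, hence 2^4 = 16 counters; see inc_stat() for the
+ * bit encoding of the index.
+ */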
+
+struct ht_head {
+	struct hlist_head ht_list;
+};
+
+struct wb_cache {
+	struct kobject kobj;
+
+	cache_id id;
+	struct dm_dev *device;
+	struct mutex io_lock;
+	cache_nr nr_caches; /* Const */
+	size_t nr_segments; /* Const */
+	struct arr *segment_header_array;
+
+	/*
+	 * Chained hashtable
+	 */
+	struct arr *htable;
+	size_t htsize;
+	struct ht_head *null_head;
+
+	cache_nr cursor; /* Index most recently written */
+	struct segment_header *current_seg;
+	struct rambuffer *current_rambuf;
+	struct rambuffer *rambuf_pool;
+
+	size_t last_migrated_segment_id;
+	size_t last_flushed_segment_id;
+	size_t reserving_segment_id;
+
+	/*
+	 * For Flush daemon
+	 */
+	struct work_struct flush_work;
+	struct workqueue_struct *flush_wq;
+	spinlock_t flush_queue_lock;
+	struct list_head flush_queue;
+	wait_queue_head_t flush_wait_queue;
+
+	/*
+	 * For deferred ack for barriers.
+	 */
+	struct work_struct barrier_deadline_work;
+	struct timer_list barrier_deadline_timer;
+	struct bio_list barrier_ios;
+	unsigned long barrier_deadline_ms;
+
+	/*
+	 * For Migration daemon
+	 */
+	struct work_struct migrate_work;
+	struct workqueue_struct *migrate_wq;
+	bool allow_migrate;
+	bool force_migrate;
+
+	/*
+	 * For migration
+	 */
+	wait_queue_head_t migrate_wait_queue;
+	atomic_t migrate_fail_count;
+	atomic_t migrate_io_count;
+	bool migrate_dests[WB_NR_SLOTS];
+	size_t nr_max_batched_migration;
+	size_t nr_cur_batched_migration;
+	struct list_head migrate_list;
+	u8 *dirtiness_snapshot;
+	void *migrate_buffer;
+
+	bool on_terminate;
+
+	atomic64_t stat[STATLEN];
+
+	unsigned long update_interval;
+	unsigned long commit_super_block_interval;
+	unsigned long flush_current_buffer_interval;
+};
+
+static void inc_stat(struct wb_cache *cache,
+		     int rw, bool found, bool on_buffer, bool fullsize)
+{
+	atomic64_t *v;
+
+	int i = 0;
+	if (rw)
+		i |= (1 << STAT_WRITE);
+	if (found)
+		i |= (1 << STAT_HIT);
+	if (on_buffer)
+		i |= (1 << STAT_ON_BUFFER);
+	if (fullsize)
+		i |= (1 << STAT_FULLSIZE);
+
+	v = &cache->stat[i];
+	atomic64_inc(v);
+}
+
+static void clear_stat(struct wb_cache *cache)
+{
+	int i;
+	for (i = 0; i < STATLEN; i++) {
+		atomic64_t *v = &cache->stat[i];
+		atomic64_set(v, 0);
+	}
+}
+
+static struct metablock *mb_at(struct wb_cache *cache, cache_nr idx)
+{
+	size_t seg_idx = idx / NR_CACHES_INSEG;
+	struct segment_header *seg =
+		arr_at(cache->segment_header_array, seg_idx);
+	cache_nr idx_inseg = idx % NR_CACHES_INSEG;
+	return seg->mb_array + idx_inseg;
+}
+
+static void mb_array_empty_init(struct wb_cache *cache)
+{
+	size_t i;
+	for (i = 0; i < cache->nr_caches; i++) {
+		struct metablock *mb = mb_at(cache, i);
+		INIT_HLIST_NODE(&mb->ht_list);
+
+		mb->idx = i;
+		mb->dirty_bits = 0;
+	}
+}
+
+static int __must_check ht_empty_init(struct wb_cache *cache)
+{
+	cache_nr idx;
+	size_t i;
+	size_t nr_heads;
+	struct arr *arr;
+
+	cache->htsize = cache->nr_caches;
+	nr_heads = cache->htsize + 1;
+	arr = make_arr(sizeof(struct ht_head), nr_heads);
+	if (!arr) {
+		WBERR();
+		return -ENOMEM;
+	}
+
+	cache->htable = arr;
+
+	for (i = 0; i < nr_heads; i++) {
+		struct ht_head *hd = arr_at(arr, i);
+		INIT_HLIST_HEAD(&hd->ht_list);
+	}
+
+	/*
+	 * Our hashtable has one special bucket called null head.
+	 * Orphan metablocks are linked to the null head.
+	 */
+	cache->null_head = arr_at(cache->htable, cache->htsize);
+
+	for (idx = 0; idx < cache->nr_caches; idx++) {
+		struct metablock *mb = mb_at(cache, idx);
+		hlist_add_head(&mb->ht_list, &cache->null_head->ht_list);
+	}
+
+	return 0;
+}
+
+static cache_nr ht_hash(struct wb_cache *cache, struct lookup_key *key)
+{
+	return key->sector % cache->htsize;
+}
+
+static bool mb_hit(struct metablock *mb, struct lookup_key *key)
+{
+	return (mb->sector == key->sector) && (mb->device_id == key->device_id);
+}
+
+static void ht_del(struct wb_cache *cache, struct metablock *mb)
+{
+	struct ht_head *null_head;
+
+	hlist_del(&mb->ht_list);
+
+	null_head = cache->null_head;
+	hlist_add_head(&mb->ht_list, &null_head->ht_list);
+}
+
+static void ht_register(struct wb_cache *cache, struct ht_head *head,
+			struct lookup_key *key, struct metablock *mb)
+{
+	hlist_del(&mb->ht_list);
+	hlist_add_head(&mb->ht_list, &head->ht_list);
+
+	mb->device_id = key->device_id;
+	mb->sector = key->sector;
+};
+
+static struct metablock *ht_lookup(struct wb_cache *cache,
+				   struct ht_head *head, struct lookup_key *key)
+{
+	struct metablock *mb, *found = NULL;
+	hlist_for_each_entry(mb, &head->ht_list, ht_list) {
+		if (mb_hit(mb, key)) {
+			found = mb;
+			break;
+		}
+	}
+	return found;
+}
+
+static void discard_caches_inseg(struct wb_cache *cache,
+				 struct segment_header *seg)
+{
+	u8 i;
+	for (i = 0; i < NR_CACHES_INSEG; i++) {
+		struct metablock *mb = seg->mb_array + i;
+		ht_del(cache, mb);
+	}
+}
+
+static int __must_check init_segment_header_array(struct wb_cache *cache)
+{
+	size_t segment_idx, nr_segments = cache->nr_segments;
+	cache->segment_header_array =
+		make_arr(sizeof(struct segment_header), nr_segments);
+	if (!cache->segment_header_array) {
+		WBERR();
+		return -ENOMEM;
+	}
+
+	for (segment_idx = 0; segment_idx < nr_segments; segment_idx++) {
+		struct segment_header *seg =
+			arr_at(cache->segment_header_array, segment_idx);
+		seg->start_idx = NR_CACHES_INSEG * segment_idx;
+		seg->start_sector =
+			((segment_idx % nr_segments) + 1) *
+			(1 << WB_SEGMENTSIZE_ORDER);
+
+		seg->length = 0;
+
+		atomic_set(&seg->nr_inflight_ios, 0);
+
+		spin_lock_init(&seg->lock);
+
+		INIT_LIST_HEAD(&seg->migrate_list);
+
+		init_completion(&seg->flush_done);
+		complete_all(&seg->flush_done);
+
+		init_completion(&seg->migrate_done);
+		complete_all(&seg->migrate_done);
+	}
+
+	return 0;
+}
+
+static struct segment_header *get_segment_header_by_id(struct wb_cache *cache,
+						       size_t segment_id)
+{
+	struct segment_header *r =
+		arr_at(cache->segment_header_array,
+		       (segment_id - 1) % cache->nr_segments);
+	return r;
+}
+
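+/*
+ * The "lap" of a segment is how many times the log has wrapped
+ * around the cache device when that segment id was written
+ * (1 for the first pass).  Each on-disk metablock records the
+ * lap of its segment so that a torn segment write can be
+ * detected on recovery (see checkup_atomicity()).
+ */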
+static u32 calc_segment_lap(struct wb_cache *cache, size_t segment_id)
+{
+	u32 a = (segment_id - 1) / cache->nr_segments;
+	return a + 1;
+};
+
+static sector_t calc_mb_start_sector(struct segment_header *seg,
+				     cache_nr mb_idx)
+{
+	size_t k = 1 + (mb_idx % NR_CACHES_INSEG);
+	return seg->start_sector + (k << 3);
+}
+
+static u8 count_dirty_caches_remained(struct segment_header *seg)
+{
+	u8 i, count = 0;
+
+	struct metablock *mb;
+	for (i = 0; i < seg->length; i++) {
+		mb = seg->mb_array + i;
+		if (mb->dirty_bits)
+			count++;
+	}
+	return count;
+}
+
+static void prepare_segment_header_device(struct segment_header_device *dest,
+					  struct wb_cache *cache,
+					  struct segment_header *src)
+{
+	cache_nr i;
+	u8 left, right;
+
+	dest->global_id = src->global_id;
+	dest->length = src->length;
+	dest->lap = calc_segment_lap(cache, src->global_id);
+
+	left = src->length - 1;
+	right = (cache->cursor) % NR_CACHES_INSEG;
+	BUG_ON(left != right);
+
+	for (i = 0; i < src->length; i++) {
+		struct metablock *mb = src->mb_array + i;
+		struct metablock_device *mbdev = &dest->mbarr[i];
+		mbdev->device_id = mb->device_id;
+		mbdev->sector = mb->sector;
+		mbdev->dirty_bits = mb->dirty_bits;
+		mbdev->lap = dest->lap;
+	}
+}
+
+struct flush_context {
+	struct list_head flush_queue;
+	struct segment_header *seg;
+	struct rambuffer *rambuf;
+	struct bio_list barrier_ios;
+};
+
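+/*
+ * Flush daemon: pops flush_context entries off flush_queue and
+ * writes the corresponding RAM buffer (segment header plus data)
+ * to the cache device as one sequential write, then completes
+ * any barrier bios that were waiting on this flush.
+ */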
+static void flush_proc(struct work_struct *work)
+{
+	unsigned long flags;
+
+	struct wb_cache *cache =
+		container_of(work, struct wb_cache, flush_work);
+
+	while (true) {
+		struct flush_context *ctx;
+		struct segment_header *seg;
+		struct dm_io_request io_req;
+		struct dm_io_region region;
+
+		spin_lock_irqsave(&cache->flush_queue_lock, flags);
+		while (list_empty(&cache->flush_queue)) {
+			spin_unlock_irqrestore(&cache->flush_queue_lock, flags);
+			wait_event_interruptible_timeout(
+				cache->flush_wait_queue,
+				(!list_empty(&cache->flush_queue)),
+				msecs_to_jiffies(100));
+			spin_lock_irqsave(&cache->flush_queue_lock, flags);
+
+			if (cache->on_terminate)
+				return;
+		}
+
+		/* Pop the first entry */
+		ctx = list_first_entry(
+			&cache->flush_queue, struct flush_context, flush_queue);
+		list_del(&ctx->flush_queue);
+		spin_unlock_irqrestore(&cache->flush_queue_lock, flags);
+
+		seg = ctx->seg;
+
+		io_req = (struct dm_io_request) {
+			.client = wb_io_client,
+			.bi_rw = WRITE,
+			.notify.fn = NULL,
+			.mem.type = DM_IO_KMEM,
+			.mem.ptr.addr = ctx->rambuf->data,
+		};
+
+		region = (struct dm_io_region) {
+			.bdev = cache->device->bdev,
+			.sector = seg->start_sector,
+			.count = (seg->length + 1) << 3,
+		};
+
+		dm_safe_io_retry(&io_req, 1, &region, false);
+
+		cache->last_flushed_segment_id = seg->global_id;
+
+		complete_all(&seg->flush_done);
+
+		complete_all(&ctx->rambuf->done);
+
+		if (!bio_list_empty(&ctx->barrier_ios)) {
+			struct bio *bio;
+			blkdev_issue_flush(cache->device->bdev, GFP_NOIO, NULL);
+			while ((bio = bio_list_pop(&ctx->barrier_ios)))
+				bio_endio(bio, 0);
+
+			mod_timer(&cache->barrier_deadline_timer,
+				  msecs_to_jiffies(cache->barrier_deadline_ms));
+		}
+
+		kfree(ctx);
+	}
+}
+
+static void prepare_meta_rambuffer(void *rambuffer,
+				   struct wb_cache *cache,
+				   struct segment_header *seg)
+{
+	prepare_segment_header_device(rambuffer, cache, seg);
+}
+
+static void queue_flushing(struct wb_cache *cache)
+{
+	unsigned long flags;
+	struct segment_header *current_seg = cache->current_seg, *new_seg;
+	struct flush_context *ctx;
+	bool empty;
+	struct rambuffer *next_rambuf;
+	size_t next_id, n1 = 0, n2 = 0;
+
+	while (atomic_read(&current_seg->nr_inflight_ios)) {
+		n1++;
+		if (n1 == 100)
+			WBWARN();
+		schedule_timeout_interruptible(msecs_to_jiffies(1));
+	}
+
+	prepare_meta_rambuffer(cache->current_rambuf->data, cache,
+			       cache->current_seg);
+
+	INIT_COMPLETION(current_seg->migrate_done);
+	INIT_COMPLETION(current_seg->flush_done);
+
+	ctx = kmalloc_retry(sizeof(*ctx), GFP_NOIO);
+	INIT_LIST_HEAD(&ctx->flush_queue);
+	ctx->seg = current_seg;
+	ctx->rambuf = cache->current_rambuf;
+
+	bio_list_init(&ctx->barrier_ios);
+	bio_list_merge(&ctx->barrier_ios, &cache->barrier_ios);
+	bio_list_init(&cache->barrier_ios);
+
+	spin_lock_irqsave(&cache->flush_queue_lock, flags);
+	empty = list_empty(&cache->flush_queue);
+	list_add_tail(&ctx->flush_queue, &cache->flush_queue);
+	spin_unlock_irqrestore(&cache->flush_queue_lock, flags);
+	if (empty)
+		wake_up_interruptible(&cache->flush_wait_queue);
+
+	next_id = current_seg->global_id + 1;
+	new_seg = get_segment_header_by_id(cache, next_id);
+	new_seg->global_id = next_id;
+
+	while (atomic_read(&new_seg->nr_inflight_ios)) {
+		n2++;
+		if (n2 == 100)
+			WBWARN();
+		schedule_timeout_interruptible(msecs_to_jiffies(1));
+	}
+
+	BUG_ON(count_dirty_caches_remained(new_seg));
+
+	discard_caches_inseg(cache, new_seg);
+
+	/* Set the cursor to the last of the flushed segment. */
+	cache->cursor = current_seg->start_idx + (NR_CACHES_INSEG - 1);
+	new_seg->length = 0;
+
+	next_rambuf = cache->rambuf_pool + (next_id % NR_RAMBUF_POOL);
+	wait_for_completion(&next_rambuf->done);
+	INIT_COMPLETION(next_rambuf->done);
+
+	cache->current_rambuf = next_rambuf;
+
+	cache->current_seg = new_seg;
+}
+
+static void migrate_mb(struct wb_cache *cache, struct segment_header *seg,
+		       struct metablock *mb, u8 dirty_bits, bool thread)
+{
+	struct wb_device *wb = wb_devices[mb->device_id];
+
+	if (!dirty_bits)
+		return;
+
+	if (dirty_bits == 255) {
+		void *buf = kmalloc_retry(1 << 12, GFP_NOIO);
+		struct dm_io_request io_req_r, io_req_w;
+		struct dm_io_region region_r, region_w;
+
+		io_req_r = (struct dm_io_request) {
+			.client = wb_io_client,
+			.bi_rw = READ,
+			.notify.fn = NULL,
+			.mem.type = DM_IO_KMEM,
+			.mem.ptr.addr = buf,
+		};
+		region_r = (struct dm_io_region) {
+			.bdev = cache->device->bdev,
+			.sector = calc_mb_start_sector(seg, mb->idx),
+			.count = (1 << 3),
+		};
+
+		dm_safe_io_retry(&io_req_r, 1, &region_r, thread);
+
+		io_req_w = (struct dm_io_request) {
+			.client = wb_io_client,
+			.bi_rw = WRITE_FUA,
+			.notify.fn = NULL,
+			.mem.type = DM_IO_KMEM,
+			.mem.ptr.addr = buf,
+		};
+		region_w = (struct dm_io_region) {
+			.bdev = wb->device->bdev,
+			.sector = mb->sector,
+			.count = (1 << 3),
+		};
+		dm_safe_io_retry(&io_req_w, 1, &region_w, thread);
+
+		kfree(buf);
+	} else {
+		void *buf = kmalloc_retry(1 << SECTOR_SHIFT, GFP_NOIO);
+		size_t i;
+		for (i = 0; i < 8; i++) {
+			bool bit_on = dirty_bits & (1 << i);
+			struct dm_io_request io_req_r, io_req_w;
+			struct dm_io_region region_r, region_w;
+			sector_t src;
+
+			if (!bit_on)
+				continue;
+
+			io_req_r = (struct dm_io_request) {
+				.client = wb_io_client,
+				.bi_rw = READ,
+				.notify.fn = NULL,
+				.mem.type = DM_IO_KMEM,
+				.mem.ptr.addr = buf,
+			};
+			/* A tmp variable just to keep within the 80-column limit */
+			src = calc_mb_start_sector(seg, mb->idx) + i;
+			region_r = (struct dm_io_region) {
+				.bdev = cache->device->bdev,
+				.sector = src,
+				.count = 1,
+			};
+			dm_safe_io_retry(&io_req_r, 1, &region_r, thread);
+
+			io_req_w = (struct dm_io_request) {
+				.client = wb_io_client,
+				.bi_rw = WRITE,
+				.notify.fn = NULL,
+				.mem.type = DM_IO_KMEM,
+				.mem.ptr.addr = buf,
+			};
+			region_w = (struct dm_io_region) {
+				.bdev = wb->device->bdev,
+				.sector = mb->sector + 1 * i,
+				.count = 1,
+			};
+			dm_safe_io_retry(&io_req_w, 1, &region_w, thread);
+		}
+		kfree(buf);
+	}
+}
+
+static void migrate_endio(unsigned long error, void *context)
+{
+	struct wb_cache *cache = context;
+
+	if (error)
+		atomic_inc(&cache->migrate_fail_count);
+
+	if (atomic_dec_and_test(&cache->migrate_io_count))
+		wake_up_interruptible(&cache->migrate_wait_queue);
+}
+
+static void submit_migrate_io(struct wb_cache *cache,
+			      struct segment_header *seg, size_t k)
+{
+	u8 i, j;
+	size_t a = NR_CACHES_INSEG * k;
+	void *p = cache->migrate_buffer + (NR_CACHES_INSEG << 12) * k;
+
+	for (i = 0; i < seg->length; i++) {
+		struct metablock *mb = seg->mb_array + i;
+
+		struct wb_device *wb = wb_devices[mb->device_id];
+		u8 dirty_bits = *(cache->dirtiness_snapshot + (a + i));
+
+		unsigned long offset;
+		void *base, *addr;
+
+		struct dm_io_request io_req_w;
+		struct dm_io_region region_w;
+
+		if (!dirty_bits)
+			continue;
+
+		offset = i << 12;
+		base = p + offset;
+
+		if (dirty_bits == 255) {
+			addr = base;
+			io_req_w = (struct dm_io_request) {
+				.client = wb_io_client,
+				.bi_rw = WRITE,
+				.notify.fn = migrate_endio,
+				.notify.context = cache,
+				.mem.type = DM_IO_VMA,
+				.mem.ptr.vma = addr,
+			};
+			region_w = (struct dm_io_region) {
+				.bdev = wb->device->bdev,
+				.sector = mb->sector,
+				.count = (1 << 3),
+			};
+			dm_safe_io_retry(&io_req_w, 1, &region_w, false);
+		} else {
+			for (j = 0; j < 8; j++) {
+				bool b = dirty_bits & (1 << j);
+				if (!b)
+					continue;
+
+				addr = base + (j << SECTOR_SHIFT);
+				io_req_w = (struct dm_io_request) {
+					.client = wb_io_client,
+					.bi_rw = WRITE,
+					.notify.fn = migrate_endio,
+					.notify.context = cache,
+					.mem.type = DM_IO_VMA,
+					.mem.ptr.vma = addr,
+				};
+				region_w = (struct dm_io_region) {
+					.bdev = wb->device->bdev,
+					.sector = mb->sector + j,
+					.count = 1,
+				};
+				dm_safe_io_retry(
+					&io_req_w, 1, &region_w, false);
+			}
+		}
+	}
+}
+
+static void memorize_dirty_state(struct wb_cache *cache,
+				 struct segment_header *seg, size_t k,
+				 size_t *migrate_io_count)
+{
+	u8 i, j;
+	size_t a = NR_CACHES_INSEG * k;
+	void *p = cache->migrate_buffer + (NR_CACHES_INSEG << 12) * k;
+	struct metablock *mb;
+
+	struct dm_io_request io_req_r = {
+		.client = wb_io_client,
+		.bi_rw = READ,
+		.notify.fn = NULL,
+		.mem.type = DM_IO_VMA,
+		.mem.ptr.vma = p,
+	};
+	struct dm_io_region region_r = {
+		.bdev = cache->device->bdev,
+		.sector = seg->start_sector + (1 << 3),
+		.count = seg->length << 3,
+	};
+	dm_safe_io_retry(&io_req_r, 1, &region_r, false);
+
+	/*
+	 * We take a snapshot of the dirtiness of the segments.
+	 * The snapshot is at least as dirty as
+	 * any future state of the segments,
+	 * so migrating based on the snapshot
+	 * never loses dirty data
+	 * that was already acknowledged.
+	 */
+	for (i = 0; i < seg->length; i++) {
+		mb = seg->mb_array + i;
+		*(cache->dirtiness_snapshot + (a + i)) =
+			atomic_read_mb_dirtiness(seg, mb);
+	}
+
+	for (i = 0; i < seg->length; i++) {
+		u8 dirty_bits;
+
+		mb = seg->mb_array + i;
+
+		dirty_bits = *(cache->dirtiness_snapshot + (a + i));
+
+		if (!dirty_bits)
+			continue;
+
+		*(cache->migrate_dests + mb->device_id) = true;
+
+		if (dirty_bits == 255) {
+			(*migrate_io_count)++;
+		} else {
+			for (j = 0; j < 8; j++) {
+				if (dirty_bits & (1 << j))
+					(*migrate_io_count)++;
+			}
+		}
+	}
+}
+
+static void cleanup_segment(struct wb_cache *cache, struct segment_header *seg)
+{
+	u8 i;
+	for (i = 0; i < seg->length; i++) {
+		struct metablock *mb = seg->mb_array + i;
+		cleanup_mb_if_dirty(seg, mb);
+	}
+}
+
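+/*
+ * Write back ("migrate") all the dirty data of the segments
+ * linked on cache->migrate_list to their backing devices,
+ * retrying until every write succeeds, and finally discard the
+ * migrated data area of each segment on the cache device.
+ */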
+static void migrate_linked_segments(struct wb_cache *cache)
+{
+	struct segment_header *seg;
+	u8 i;
+	size_t k, migrate_io_count = 0;
+
+	for (i = 0; i < WB_NR_SLOTS; i++)
+		*(cache->migrate_dests + i) = false;
+
+	k = 0;
+	list_for_each_entry(seg, &cache->migrate_list, migrate_list) {
+		memorize_dirty_state(cache, seg, k, &migrate_io_count);
+		k++;
+	}
+
+migrate_write:
+	atomic_set(&cache->migrate_io_count, migrate_io_count);
+	atomic_set(&cache->migrate_fail_count, 0);
+
+	k = 0;
+	list_for_each_entry(seg, &cache->migrate_list, migrate_list) {
+		submit_migrate_io(cache, seg, k);
+		k++;
+	}
+
+	wait_event_interruptible(cache->migrate_wait_queue,
+				 atomic_read(&cache->migrate_io_count) == 0);
+
+	if (atomic_read(&cache->migrate_fail_count)) {
+		WBWARN("%u writebacks failed. retry.",
+		       atomic_read(&cache->migrate_fail_count));
+		goto migrate_write;
+	}
+
+	BUG_ON(atomic_read(&cache->migrate_io_count));
+
+	list_for_each_entry(seg, &cache->migrate_list, migrate_list) {
+		cleanup_segment(cache, seg);
+	}
+
+	for (i = 1; i < WB_NR_SLOTS; i++) {
+		struct wb_device *wb;
+		bool b = *(cache->migrate_dests + i);
+		if (!b)
+			continue;
+
+		wb = wb_devices[i];
+		blkdev_issue_flush(wb->device->bdev, GFP_NOIO, NULL);
+	}
+
+	/*
+	 * Discarding the migrated regions
+	 * avoids unnecessary wear amplification in the future.
+	 *
+	 * Note that we must not discard
+	 * the metablock region:
+	 * whether a discarded block
+	 * reads back as a defined value
+	 * is vendor-dependent,
+	 * and unexpected metablock data
+	 * would corrupt the cache.
+	 */
+	list_for_each_entry(seg, &cache->migrate_list, migrate_list) {
+		blkdev_issue_discard(cache->device->bdev,
+				     seg->start_sector + (1 << 3),
+				     seg->length << 3,
+				     GFP_NOIO, 0);
+	}
+}
+
+static void migrate_proc(struct work_struct *work)
+{
+	struct wb_cache *cache =
+		container_of(work, struct wb_cache, migrate_work);
+
+	while (true) {
+		bool allow_migrate;
+		size_t i, nr_mig_candidates, nr_mig;
+		struct segment_header *seg, *tmp;
+
+		if (cache->on_terminate)
+			return;
+
+		/*
+		 * reserving_segment_id > 0 means
+		 * that immediate migration is requested.
+		 */
+		allow_migrate = cache->reserving_segment_id ||
+				cache->allow_migrate;
+
+		if (!allow_migrate) {
+			schedule_timeout_interruptible(msecs_to_jiffies(1000));
+			continue;
+		}
+
+		nr_mig_candidates = cache->last_flushed_segment_id -
+				    cache->last_migrated_segment_id;
+
+		if (!nr_mig_candidates) {
+			schedule_timeout_interruptible(msecs_to_jiffies(1000));
+			continue;
+		}
+
+		if (cache->nr_cur_batched_migration !=
+		    cache->nr_max_batched_migration){
+			vfree(cache->migrate_buffer);
+			kfree(cache->dirtiness_snapshot);
+			cache->nr_cur_batched_migration =
+				cache->nr_max_batched_migration;
+			cache->migrate_buffer =
+				vmalloc(cache->nr_cur_batched_migration *
+					(NR_CACHES_INSEG << 12));
+			cache->dirtiness_snapshot =
+				kmalloc_retry(cache->nr_cur_batched_migration *
+					      NR_CACHES_INSEG,
+					      GFP_NOIO);
+
+			BUG_ON(!cache->migrate_buffer);
+			BUG_ON(!cache->dirtiness_snapshot);
+		}
+
+		/*
+		 * Batched Migration:
+		 * We will migrate at most nr_max_batched_migration
+		 * segments at a time.
+		 */
+		nr_mig = min(nr_mig_candidates,
+			     cache->nr_cur_batched_migration);
+
+		for (i = 1; i <= nr_mig; i++) {
+			seg = get_segment_header_by_id(
+					cache,
+					cache->last_migrated_segment_id + i);
+			list_add_tail(&seg->migrate_list, &cache->migrate_list);
+		}
+
+		migrate_linked_segments(cache);
+
+		/*
+		 * (Locking)
+		 * This is the only line of code that changes
+		 * last_migrated_segment_id at runtime.
+		 */
+		cache->last_migrated_segment_id += nr_mig;
+
+		list_for_each_entry_safe(seg, tmp,
+					 &cache->migrate_list,
+					 migrate_list) {
+			complete_all(&seg->migrate_done);
+			list_del(&seg->migrate_list);
+		}
+	}
+}
+
+static void wait_for_migration(struct wb_cache *cache, size_t id)
+{
+	struct segment_header *seg = get_segment_header_by_id(cache, id);
+
+	cache->reserving_segment_id = id;
+	wait_for_completion(&seg->migrate_done);
+	cache->reserving_segment_id = 0;
+}
+
+struct superblock_device {
+	size_t last_migrated_segment_id;
+} __packed;
+
+static void commit_super_block(struct wb_cache *cache)
+{
+	struct superblock_device o;
+	void *buf;
+	struct dm_io_request io_req;
+	struct dm_io_region region;
+
+	o.last_migrated_segment_id = cache->last_migrated_segment_id;
+
+	buf = kmalloc_retry(1 << SECTOR_SHIFT, GFP_NOIO);
+	memcpy(buf, &o, sizeof(o));
+
+	io_req = (struct dm_io_request) {
+		.client = wb_io_client,
+		.bi_rw = WRITE_FUA,
+		.notify.fn = NULL,
+		.mem.type = DM_IO_KMEM,
+		.mem.ptr.addr = buf,
+	};
+	region = (struct dm_io_region) {
+		.bdev = cache->device->bdev,
+		.sector = 0,
+		.count = 1,
+	};
+	dm_safe_io_retry(&io_req, 1, &region, true);
+	kfree(buf);
+}
+
+static int __must_check read_superblock_device(struct superblock_device *dest,
+					       struct wb_cache *cache)
+{
+	int r = 0;
+	struct dm_io_request io_req;
+	struct dm_io_region region;
+
+	void *buf = kmalloc(1 << SECTOR_SHIFT, GFP_KERNEL);
+	if (!buf) {
+		WBERR();
+		return -ENOMEM;
+	}
+
+	io_req = (struct dm_io_request) {
+		.client = wb_io_client,
+		.bi_rw = READ,
+		.notify.fn = NULL,
+		.mem.type = DM_IO_KMEM,
+		.mem.ptr.addr = buf,
+	};
+	region = (struct dm_io_region) {
+		.bdev = cache->device->bdev,
+		.sector = 0,
+		.count = 1,
+	};
+	r = dm_safe_io(&io_req, 1, &region, NULL, true);
+	if (r) {
+		WBERR();
+		goto bad_io;
+	}
+	memcpy(dest, buf, sizeof(*dest));
+bad_io:
+	kfree(buf);
+	return r;
+}
+
+static sector_t calc_segment_header_start(size_t segment_idx)
+{
+	return (1 << WB_SEGMENTSIZE_ORDER) * (segment_idx + 1);
+}
+
+static int __must_check read_segment_header_device(
+		struct segment_header_device *dest,
+		struct wb_cache *cache, size_t segment_idx)
+{
+	int r = 0;
+	struct dm_io_request io_req;
+	struct dm_io_region region;
+	void *buf = kmalloc(1 << 12, GFP_KERNEL);
+	if (!buf) {
+		WBERR();
+		return -ENOMEM;
+	}
+
+	io_req = (struct dm_io_request) {
+		.client = wb_io_client,
+		.bi_rw = READ,
+		.notify.fn = NULL,
+		.mem.type = DM_IO_KMEM,
+		.mem.ptr.addr = buf,
+	};
+	region = (struct dm_io_region) {
+		.bdev = cache->device->bdev,
+		.sector = calc_segment_header_start(segment_idx),
+		.count = (1 << 3),
+	};
+	r = dm_safe_io(&io_req, 1, &region, NULL, false);
+	if (r) {
+		WBERR();
+		goto bad_io;
+	}
+	memcpy(dest, buf, sizeof(*dest));
+bad_io:
+	kfree(buf);
+	return r;
+}
+
+static void update_by_segment_header_device(struct wb_cache *cache,
+					    struct segment_header_device *src)
+{
+	cache_nr i;
+	struct segment_header *seg =
+		get_segment_header_by_id(cache, src->global_id);
+	seg->length = src->length;
+
+	INIT_COMPLETION(seg->migrate_done);
+
+	for (i = 0 ; i < src->length; i++) {
+		cache_nr k;
+		struct lookup_key key;
+		struct ht_head *head;
+		struct metablock *found, *mb = seg->mb_array + i;
+		struct metablock_device *mbdev = &src->mbarr[i];
+
+		if (!mbdev->dirty_bits)
+			continue;
+
+		mb->sector = mbdev->sector;
+		mb->device_id = mbdev->device_id;
+		mb->dirty_bits = mbdev->dirty_bits;
+
+		inc_nr_dirty_caches(mb->device_id);
+
+		key = (struct lookup_key) {
+			.device_id = mb->device_id,
+			.sector = mb->sector,
+		};
+
+		k = ht_hash(cache, &key);
+		head = arr_at(cache->htable, k);
+
+		found = ht_lookup(cache, head, &key);
+		if (found)
+			ht_del(cache, found);
+		ht_register(cache, head, &key, mb);
+	}
+}
+
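+/*
+ * A segment header is only trustworthy if every metablock in it
+ * carries the same lap as the header itself; otherwise the
+ * segment write was torn and the segment must not be replayed.
+ */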
+static bool checkup_atomicity(struct segment_header_device *header)
+{
+	u8 i;
+	for (i = 0; i < header->length; i++) {
+		struct metablock_device *o;
+		o = header->mbarr + i;
+		if (o->lap != header->lap)
+			return false;
+	}
+	return true;
+}
+
+static int __must_check recover_cache(struct wb_cache *cache)
+{
+	int r = 0;
+	struct segment_header_device *header;
+	struct segment_header *seg;
+	size_t i, j,
+	       max_id, oldest_id, last_flushed_id, init_segment_id,
+	       oldest_idx, nr_segments = cache->nr_segments;
+
+	struct superblock_device uninitialized_var(sup);
+	r = read_superblock_device(&sup, cache);
+	if (r) {
+		WBERR();
+		return r;
+	}
+
+	header = kmalloc(sizeof(*header), GFP_KERNEL);
+	if (!header) {
+		WBERR();
+		return -ENOMEM;
+	}
+
+	/*
+	 * Finding the oldest, non-zero id and its index.
+	 */
+
+	max_id = SZ_MAX;
+	oldest_id = max_id;
+	oldest_idx = 0;
+	for (i = 0; i < nr_segments; i++) {
+		r = read_segment_header_device(header, cache, i);
+		if (r) {
+			WBERR();
+			kfree(header);
+			return r;
+		}
+
+		if (header->global_id < 1)
+			continue;
+
+		if (header->global_id < oldest_id) {
+			oldest_idx = i;
+			oldest_id = header->global_id;
+		}
+	}
+
+	last_flushed_id = 0;
+
+	/*
+	 * This is an invariant.
+	 * We always start from the segment
+	 * that is right after the last_flushed_id.
+	 */
+	init_segment_id = last_flushed_id + 1;
+
+	/*
+	 * If no segment was flushed
+	 * then there is nothing to recover.
+	 */
+	if (oldest_id == max_id)
+		goto setup_init_segment;
+
+	/*
+	 * What we have to do in the next loop is to
+	 * revive the segments that have been
+	 * flushed but not yet migrated.
+	 */
+
+	/*
+	 * Example:
+	 * There are only 5 segments.
+	 * The segments we will consider are of id k+2 and k+3
+	 * because they are dirty but not migrated.
+	 *
+	 * id: [     k+3    ][  k+4   ][   k    ][     k+1     ][  k+2  ]
+	 *      last_flushed  init_seg  migrated  last_migrated  flushed
+	 */
+	for (i = oldest_idx; i < (nr_segments + oldest_idx); i++) {
+		j = i % nr_segments;
+		r = read_segment_header_device(header, cache, j);
+		if (r) {
+			WBERR();
+			kfree(header);
+			return r;
+		}
+
+		/*
+		 * A valid global_id is > 0.
+		 * Once we encounter a header whose global_id is
+		 * not newer than last_flushed_id (e.g. a zeroed one),
+		 * this segment and all the following ones
+		 * can be considered invalid.
+		 */
+		if (header->global_id <= last_flushed_id)
+			break;
+
+		if (!checkup_atomicity(header)) {
+			WBWARN("header atomicity broken id %zu",
+			       header->global_id);
+			break;
+		}
+
+		/*
+		 * Now the header is proven valid.
+		 */
+
+		last_flushed_id = header->global_id;
+		init_segment_id = last_flushed_id + 1;
+
+		/*
+		 * If the data is already on the backing store,
+		 * we ignore the segment.
+		 */
+		if (header->global_id <= sup.last_migrated_segment_id)
+			continue;
+
+		update_by_segment_header_device(cache, header);
+	}
+
+setup_init_segment:
+	kfree(header);
+
+	seg = get_segment_header_by_id(cache, init_segment_id);
+	seg->global_id = init_segment_id;
+	atomic_set(&seg->nr_inflight_ios, 0);
+
+	cache->last_flushed_segment_id = seg->global_id - 1;
+
+	cache->last_migrated_segment_id =
+		cache->last_flushed_segment_id > cache->nr_segments ?
+		cache->last_flushed_segment_id - cache->nr_segments : 0;
+
+	if (sup.last_migrated_segment_id > cache->last_migrated_segment_id)
+		cache->last_migrated_segment_id = sup.last_migrated_segment_id;
+
+	wait_for_migration(cache, seg->global_id);
+
+	discard_caches_inseg(cache, seg);
+
+	/*
+	 * The cursor is set to the first element of the segment.
+	 * This means that this element will not be used.
+	 */
+	cache->cursor = seg->start_idx;
+	seg->length = 1;
+
+	cache->current_seg = seg;
+
+	return 0;
+}
+
+static sector_t dm_devsize(struct dm_dev *dev)
+{
+	return i_size_read(dev->bdev->bd_inode) >> SECTOR_SHIFT;
+}
+
+static size_t calc_nr_segments(struct dm_dev *dev)
+{
+	sector_t devsize = dm_devsize(dev);
+
+	/*
+	 * Disk format
+	 *
+	 * Overall:
+	 * superblock(1MB) [segment(1MB)]+
+	 * We reserve the first segment (1MB) as the superblock.
+	 *
+	 * segment(1MB):
+	 * segment_header_device(4KB) metablock_device(4KB)*NR_CACHES_INSEG
+	 */
+	return devsize / (1 << WB_SEGMENTSIZE_ORDER) - 1;
+}
+
+struct format_segmd_context {
+	atomic64_t count;
+	int err;
+};
+
+static void format_segmd_endio(unsigned long error, void *__context)
+{
+	struct format_segmd_context *context = __context;
+	if (error)
+		context->err = 1;
+	atomic64_dec(&context->count);
+}
+
+static int __must_check format_cache_device(struct dm_dev *dev)
+{
+	size_t i, nr_segments = calc_nr_segments(dev);
+	struct format_segmd_context context;
+	struct dm_io_request io_req_sup;
+	struct dm_io_region region_sup;
+	void *buf;
+
+	int r = 0;
+
+	buf = kzalloc(1 << SECTOR_SHIFT, GFP_KERNEL);
+	if (!buf) {
+		WBERR();
+		return -ENOMEM;
+	}
+
+	io_req_sup = (struct dm_io_request) {
+		.client = wb_io_client,
+		.bi_rw = WRITE_FUA,
+		.notify.fn = NULL,
+		.mem.type = DM_IO_KMEM,
+		.mem.ptr.addr = buf,
+	};
+	region_sup = (struct dm_io_region) {
+		.bdev = dev->bdev,
+		.sector = 0,
+		.count = 1,
+	};
+	r = dm_safe_io(&io_req_sup, 1, &region_sup, NULL, false);
+	kfree(buf);
+
+	if (r) {
+		WBERR();
+		return r;
+	}
+
+	atomic64_set(&context.count, nr_segments);
+	context.err = 0;
+
+	buf = kzalloc(1 << 12, GFP_KERNEL);
+	if (!buf) {
+		WBERR();
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < nr_segments; i++) {
+		struct dm_io_request io_req_seg = {
+			.client = wb_io_client,
+			.bi_rw = WRITE,
+			.notify.fn = format_segmd_endio,
+			.notify.context = &context,
+			.mem.type = DM_IO_KMEM,
+			.mem.ptr.addr = buf,
+		};
+		struct dm_io_region region_seg = {
+			.bdev = dev->bdev,
+			.sector = calc_segment_header_start(i),
+			.count = (1 << 3),
+		};
+		r = dm_safe_io(&io_req_seg, 1, &region_seg, NULL, false);
+		if (r) {
+			WBERR();
+			break;
+		}
+	}
+	kfree(buf);
+
+	if (r) {
+		WBERR();
+		return r;
+	}
+
+	while (atomic64_read(&context.count))
+		schedule_timeout_interruptible(msecs_to_jiffies(100));
+
+	if (context.err) {
+		WBERR();
+		return -EIO;
+	}
+
+	return blkdev_issue_flush(dev->bdev, GFP_KERNEL, NULL);
+}
+
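+/*
+ * A metablock is "on the buffer" when it belongs to the segment
+ * currently being filled in RAM, i.e. its data has not yet been
+ * flushed to the cache device.
+ */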
+static bool is_on_buffer(struct wb_cache *cache, cache_nr mb_idx)
+{
+	cache_nr start = cache->current_seg->start_idx;
+	if (mb_idx < start)
+		return false;
+
+	if (mb_idx >= (start + NR_CACHES_INSEG))
+		return false;
+
+	return true;
+}
+
+static void bio_remap(struct bio *bio, struct dm_dev *dev, sector_t sector)
+{
+	bio->bi_bdev = dev->bdev;
+	bio->bi_sector = sector;
+}
+
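+/*
+ * Round a bio's sector down to the 4KB (8-sector) cache line
+ * boundary; cache lookups are keyed by the aligned sector.
+ */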
+static sector_t calc_cache_alignment(struct wb_cache *cache,
+				     sector_t bio_sector)
+{
+	return (bio_sector / (1 << 3)) * (1 << 3);
+}
+
+static void migrate_buffered_mb(struct wb_cache *cache,
+				struct metablock *mb, u8 dirty_bits)
+{
+	u8 i, k = 1 + (mb->idx % NR_CACHES_INSEG);
+	sector_t offset = (k << 3);
+
+	void *buf = kmalloc_retry(1 << SECTOR_SHIFT, GFP_NOIO);
+	for (i = 0; i < 8; i++) {
+		struct wb_device *wb;
+		struct dm_io_request io_req;
+		struct dm_io_region region;
+		void *src;
+		sector_t dest;
+
+		bool bit_on = dirty_bits & (1 << i);
+		if (!bit_on)
+			continue;
+
+		src = cache->current_rambuf->data +
+		      ((offset + i) << SECTOR_SHIFT);
+		memcpy(buf, src, 1 << SECTOR_SHIFT);
+
+		io_req = (struct dm_io_request) {
+			.client = wb_io_client,
+			.bi_rw = WRITE_FUA,
+			.notify.fn = NULL,
+			.mem.type = DM_IO_KMEM,
+			.mem.ptr.addr = buf,
+		};
+
+		wb = wb_devices[mb->device_id];
+		dest = mb->sector + 1 * i;
+		region = (struct dm_io_region) {
+			.bdev = wb->device->bdev,
+			.sector = dest,
+			.count = 1,
+		};
+
+		dm_safe_io_retry(&io_req, 1, &region, true);
+	}
+	kfree(buf);
+}
+
+static void queue_current_buffer(struct wb_cache *cache)
+{
+	/*
+	 * Before we get the next segment
+	 * we must wait until it is all clean.
+	 * A clean segment has no log to flush
+	 * and no dirty data to migrate.
+	 */
+	size_t next_id = cache->current_seg->global_id + 1;
+
+	struct segment_header *next_seg =
+		get_segment_header_by_id(cache, next_id);
+
+	wait_for_completion(&next_seg->flush_done);
+
+	wait_for_migration(cache, next_id);
+
+	queue_flushing(cache);
+}
+
+static void flush_current_buffer_sync(struct wb_cache *cache)
+{
+	struct segment_header *old_seg;
+
+	mutex_lock(&cache->io_lock);
+	old_seg = cache->current_seg;
+
+	queue_current_buffer(cache);
+	cache->cursor = (cache->cursor + 1) % cache->nr_caches;
+	cache->current_seg->length = 1;
+	mutex_unlock(&cache->io_lock);
+
+	wait_for_completion(&old_seg->flush_done);
+}
+
+static void flush_barrier_ios(struct work_struct *work)
+{
+	struct wb_cache *cache =
+		container_of(work, struct wb_cache,
+			     barrier_deadline_work);
+
+	if (bio_list_empty(&cache->barrier_ios))
+		return;
+
+	flush_current_buffer_sync(cache);
+}
+
+static void barrier_deadline_proc(unsigned long data)
+{
+	struct wb_cache *cache = (struct wb_cache *) data;
+	schedule_work(&cache->barrier_deadline_work);
+}
+
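+/*
+ * REQ_FLUSH/REQ_FUA bios are not completed immediately: they are
+ * queued and acknowledged in a batch, either when the current RAM
+ * buffer is flushed or when the barrier deadline timer fires,
+ * whichever comes first.
+ */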
+static void queue_barrier_io(struct wb_cache *cache, struct bio *bio)
+{
+	mutex_lock(&cache->io_lock);
+	bio_list_add(&cache->barrier_ios, bio);
+	mutex_unlock(&cache->io_lock);
+
+	if (!timer_pending(&cache->barrier_deadline_timer))
+		mod_timer(&cache->barrier_deadline_timer,
+			  msecs_to_jiffies(cache->barrier_deadline_ms));
+}
+
+struct per_bio_data {
+	void *ptr;
+};
+
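+/*
+ * Main map function.  Writes are copied into the current RAM
+ * buffer and acknowledged immediately (FLUSH/FUA bios go through
+ * the barrier queue); reads are served from the cache device or
+ * remapped to the backing device, depending on where the newest
+ * data for the sector lives.
+ */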
+static int writeboost_map(struct dm_target *ti, struct bio *bio)
+{
+	unsigned long flags;
+	struct wb_cache *cache;
+	struct segment_header *uninitialized_var(seg);
+	struct metablock *mb, *new_mb;
+	struct per_bio_data *map_context;
+	sector_t bio_count, bio_offset, s;
+	bool bio_fullsize, found, on_buffer,
+	     refresh_segment, b;
+	int rw;
+	struct lookup_key key;
+	struct ht_head *head;
+	cache_nr update_mb_idx, idx_inseg, k;
+	size_t start;
+	void *data;
+
+	struct wb_device *wb = ti->private;
+	struct dm_dev *orig = wb->device;
+
+	map_context = dm_per_bio_data(bio, ti->per_bio_data_size);
+	map_context->ptr = NULL;
+
+	if (!wb->cache) {
+		bio_remap(bio, orig, bio->bi_sector);
+		return DM_MAPIO_REMAPPED;
+	}
+
+	/*
+	 * We discard only the backing store because
+	 * blocks on the cache device are unlikely to be discarded.
+	 *
+	 * Discards usually arrive
+	 * long after the write,
+	 * by which time the block has likely been migrated.
+	 * Moreover,
+	 * we discard each segment at the end of migration,
+	 * and that is enough for discarding blocks.
+	 */
+	if (bio->bi_rw & REQ_DISCARD) {
+		bio_remap(bio, orig, bio->bi_sector);
+		return DM_MAPIO_REMAPPED;
+	}
+
+	cache = wb->cache;
+
+	if (bio->bi_rw & REQ_FLUSH) {
+		BUG_ON(bio->bi_size);
+		queue_barrier_io(cache, bio);
+		return DM_MAPIO_SUBMITTED;
+	}
+
+	bio_count = bio->bi_size >> SECTOR_SHIFT;
+	bio_fullsize = (bio_count == (1 << 3));
+	bio_offset = bio->bi_sector % (1 << 3);
+
+	rw = bio_data_dir(bio);
+
+	key = (struct lookup_key) {
+		.sector = calc_cache_alignment(cache, bio->bi_sector),
+		.device_id = wb->id,
+	};
+
+	k = ht_hash(cache, &key);
+	head = arr_at(cache->htable, k);
+
+	/*
+	 * (Locking)
+	 * Why a mutex?
+	 *
+	 * The reason we use a mutex instead of an rw_semaphore,
+	 * which would allow truly concurrent read access,
+	 * is that a mutex is even lighter than an rw_semaphore.
+	 * Since dm-writeboost is performance-centric software,
+	 * the overhead of an rw_semaphore matters.
+	 * All in all,
+	 * since the exclusive region in the read path is small
+	 * and cheap, using an rw_semaphore and letting reads
+	 * execute concurrently would not improve performance
+	 * as much as one might expect.
+	 */
+	mutex_lock(&cache->io_lock);
+	mb = ht_lookup(cache, head, &key);
+	if (mb) {
+		seg = ((void *) mb) - (mb->idx % NR_CACHES_INSEG) *
+				      sizeof(struct metablock);
+		atomic_inc(&seg->nr_inflight_ios);
+	}
+
+	found = (mb != NULL);
+	on_buffer = false;
+	if (found)
+		on_buffer = is_on_buffer(cache, mb->idx);
+
+	inc_stat(cache, rw, found, on_buffer, bio_fullsize);
+
+	if (!rw) {
+		u8 dirty_bits;
+
+		mutex_unlock(&cache->io_lock);
+
+		if (!found) {
+			bio_remap(bio, orig, bio->bi_sector);
+			return DM_MAPIO_REMAPPED;
+		}
+
+		dirty_bits = atomic_read_mb_dirtiness(seg, mb);
+
+		if (unlikely(on_buffer)) {
+
+			if (dirty_bits)
+				migrate_buffered_mb(cache, mb, dirty_bits);
+
+			/*
+			 * Cache classes:
+			 * Live and Stable
+			 *
+			 * Live:
+			 * The cache is on the RAM buffer.
+			 *
+			 * Stable:
+			 * The cache is no longer on the RAM buffer
+			 * but is at least queued in flush_queue.
+			 */
+
+			/*
+			 * (Locking)
+			 * Dirtiness of a live cache
+			 *
+			 * The dirtiness of a cache can only increase
+			 * while it is on the buffer; we call such a cache live.
+			 * This eases the locking because
+			 * we need not worry about the dirtiness of
+			 * a live cache fluctuating downward.
+			 */
+
+			atomic_dec(&seg->nr_inflight_ios);
+			bio_remap(bio, orig, bio->bi_sector);
+			return DM_MAPIO_REMAPPED;
+		}
+
+		wait_for_completion(&seg->flush_done);
+		if (likely(dirty_bits == 255)) {
+			bio_remap(bio,
+				  cache->device,
+				  calc_mb_start_sector(seg, mb->idx)
+				  + bio_offset);
+			map_context->ptr = seg;
+		} else {
+
+			/*
+			 * (Locking)
+			 * Dirtiness of a stable cache
+			 *
+			 * Unlike live caches, whose dirtiness
+			 * does not fluctuate downward,
+			 * stable caches, which are not on the buffer
+			 * but on the cache device,
+			 * may have their dirtiness decreased by processes
+			 * other than the migrate daemon.
+			 * This works fine
+			 * because migrating the same cache twice
+			 * does not break cache consistency.
+			 */
+
+			migrate_mb(cache, seg, mb, dirty_bits, true);
+			cleanup_mb_if_dirty(seg, mb);
+
+			atomic_dec(&seg->nr_inflight_ios);
+			bio_remap(bio, orig, bio->bi_sector);
+		}
+		return DM_MAPIO_REMAPPED;
+	}
+
+	if (found) {
+
+		if (unlikely(on_buffer)) {
+			mutex_unlock(&cache->io_lock);
+
+			update_mb_idx = mb->idx;
+			goto write_on_buffer;
+		} else {
+			u8 dirty_bits = atomic_read_mb_dirtiness(seg, mb);
+
+			/*
+			 * First clean up the previous cache
+			 * and migrate the cache if needed.
+			 */
+			bool needs_cleanup_prev_cache =
+				!bio_fullsize || !(dirty_bits == 255);
+
+			if (unlikely(needs_cleanup_prev_cache)) {
+				wait_for_completion(&seg->flush_done);
+				migrate_mb(cache, seg, mb, dirty_bits, true);
+			}
+
+			/*
+			 * Fullsize dirty cache
+			 * can be discarded without migration.
+			 */
+
+			cleanup_mb_if_dirty(seg, mb);
+
+			ht_del(cache, mb);
+
+			atomic_dec(&seg->nr_inflight_ios);
+			goto write_not_found;
+		}
+	}
+
+write_not_found:
+	;
+
+	/*
+	 * If cache->cursor is 254, 509, ...,
+	 * i.e. the last cache line in the segment,
+	 * we must flush the current segment and
+	 * get a new one.
+	 */
+	refresh_segment = !((cache->cursor + 1) % NR_CACHES_INSEG);
+
+	if (refresh_segment)
+		queue_current_buffer(cache);
+
+	cache->cursor = (cache->cursor + 1) % cache->nr_caches;
+
+	/*
+	 * update_mb_idx is the cache line index to update.
+	 */
+	update_mb_idx = cache->cursor;
+
+	seg = cache->current_seg;
+	atomic_inc(&seg->nr_inflight_ios);
+
+	new_mb = seg->mb_array + (update_mb_idx % NR_CACHES_INSEG);
+	new_mb->dirty_bits = 0;
+	ht_register(cache, head, &key, new_mb);
+	mutex_unlock(&cache->io_lock);
+
+	mb = new_mb;
+
+write_on_buffer:
+	;
+	idx_inseg = update_mb_idx % NR_CACHES_INSEG;
+	s = (idx_inseg + 1) << 3;
+
+	b = false;
+	lockseg(seg, flags);
+	if (!mb->dirty_bits) {
+		seg->length++;
+		BUG_ON(seg->length >  NR_CACHES_INSEG);
+		b = true;
+	}
+
+	if (likely(bio_fullsize)) {
+		mb->dirty_bits = 255;
+	} else {
+		u8 i;
+		u8 acc_bits = 0;
+		s += bio_offset;
+		for (i = bio_offset; i < (bio_offset+bio_count); i++)
+			acc_bits += (1 << i);
+
+		mb->dirty_bits |= acc_bits;
+	}
+
+	BUG_ON(!mb->dirty_bits);
+
+	unlockseg(seg, flags);
+
+	if (b)
+		inc_nr_dirty_caches(mb->device_id);
+
+	start = s << SECTOR_SHIFT;
+	data = bio_data(bio);
+
+	memcpy(cache->current_rambuf->data + start, data, bio->bi_size);
+	atomic_dec(&seg->nr_inflight_ios);
+
+	if (bio->bi_rw & REQ_FUA) {
+		queue_barrier_io(cache, bio);
+		return DM_MAPIO_SUBMITTED;
+	}
+
+	bio_endio(bio, 0);
+	return DM_MAPIO_SUBMITTED;
+}
+
+static int writeboost_end_io(struct dm_target *ti, struct bio *bio, int error)
+{
+	struct segment_header *seg;
+	struct per_bio_data *map_context =
+		dm_per_bio_data(bio, ti->per_bio_data_size);
+
+	if (!map_context->ptr)
+		return 0;
+
+	seg = map_context->ptr;
+	atomic_dec(&seg->nr_inflight_ios);
+
+	return 0;
+}
+
+static ssize_t var_show(unsigned long var, char *page)
+{
+	return sprintf(page, "%lu\n", var);
+}
+
+static int var_store(unsigned long *var, const char *page)
+{
+	char *p = (char *) page;
+	int r = kstrtoul(p, 10, var);
+	if (r) {
+		WBERR("could not parse the digits");
+		return r;
+	}
+	return 0;
+}
+
+#define validate_cond(cond) \
+	do { \
+		if (!(cond)) { \
+			WBERR("violated %s", #cond); \
+			return -EINVAL; \
+		} \
+	} while (false)
+
+static struct kobject *devices_kobj;
+
+struct device_sysfs_entry {
+	struct attribute attr;
+	ssize_t (*show)(struct wb_device *, char *);
+	ssize_t (*store)(struct wb_device *, const char *, size_t);
+};
+
+#define to_device(attr) container_of((attr), struct device_sysfs_entry, attr)
+static ssize_t device_attr_show(struct kobject *kobj, struct attribute *attr,
+				char *page)
+{
+	struct wb_device *device;
+
+	struct device_sysfs_entry *entry = to_device(attr);
+	if (!entry->show) {
+		WBERR();
+		return -EIO;
+	}
+
+	device = container_of(kobj, struct wb_device, kobj);
+	return entry->show(device, page);
+}
+
+static ssize_t device_attr_store(struct kobject *kobj, struct attribute *attr,
+				 const char *page, size_t len)
+{
+	struct wb_device *device;
+
+	struct device_sysfs_entry *entry = to_device(attr);
+	if (!entry->store) {
+		WBERR();
+		return -EIO;
+	}
+
+	device = container_of(kobj, struct wb_device, kobj);
+	return entry->store(device, page, len);
+}
+
+static cache_id cache_id_of(struct wb_device *device)
+{
+	cache_id id;
+	if (!device->cache)
+		id = 0;
+	else
+		id = device->cache->id;
+	return id;
+}
+
+static ssize_t cache_id_show(struct wb_device *device, char *page)
+{
+	return var_show(cache_id_of(device), (page));
+}
+
+static struct device_sysfs_entry cache_id_entry = {
+	.attr = { .name = "cache_id", .mode = S_IRUGO },
+	.show = cache_id_show,
+};
+
+static ssize_t dev_show(struct wb_device *device, char *page)
+{
+	return sprintf(page, "%s\n", dm_device_name(device->md));
+}
+
+static struct device_sysfs_entry dev_entry = {
+	.attr = { .name = "dev", .mode = S_IRUGO },
+	.show = dev_show,
+};
+
+static ssize_t device_no_show(struct wb_device *wb, char *page)
+{
+	return sprintf(page, "%s\n", wb->device->name);
+}
+
+static struct device_sysfs_entry device_no_entry = {
+	.attr = { .name = "device_no", .mode = S_IRUGO },
+	.show = device_no_show,
+};
+
+static ssize_t migrate_threshold_show(struct wb_device *device, char *page)
+{
+	return var_show(device->migrate_threshold, (page));
+}
+
+static ssize_t migrate_threshold_store(struct wb_device *device,
+				       const char *page, size_t count)
+{
+	unsigned long x;
+	int r = var_store(&x, page);
+	if (r) {
+		WBERR();
+		return r;
+	}
+	validate_cond(x <= 100);
+
+	device->migrate_threshold = x;
+	return count;
+}
+
+static struct device_sysfs_entry migrate_threshold_entry = {
+	.attr = { .name = "migrate_threshold", .mode = S_IRUGO | S_IWUSR },
+	.show = migrate_threshold_show,
+	.store = migrate_threshold_store,
+};
+
+static ssize_t nr_dirty_caches_show(struct wb_device *device, char *page)
+{
+	unsigned long val = atomic64_read(&device->nr_dirty_caches);
+	return var_show(val, page);
+}
+
+static struct device_sysfs_entry nr_dirty_caches_entry = {
+	.attr = { .name = "nr_dirty_caches", .mode = S_IRUGO },
+	.show = nr_dirty_caches_show,
+};
+
+static struct attribute *device_default_attrs[] = {
+	&cache_id_entry.attr,
+	&dev_entry.attr,
+	&device_no_entry.attr,
+	&migrate_threshold_entry.attr,
+	&nr_dirty_caches_entry.attr,
+	NULL,
+};
+
+static const struct sysfs_ops device_sysfs_ops = {
+	.show = device_attr_show,
+	.store = device_attr_store,
+};
+
+static void device_release(struct kobject *kobj) { return; }
+
+static struct kobj_type device_ktype = {
+	.sysfs_ops = &device_sysfs_ops,
+	.default_attrs = device_default_attrs,
+	.release = device_release,
+};
+
+static int parse_cache_id(char *s, u8 *cache_id)
+{
+	unsigned id;
+	if (sscanf(s, "%u", &id) != 1) {
+		WBERR();
+		return -EINVAL;
+	}
+	if (id >= WB_NR_SLOTS) {
+		WBERR();
+		return -EINVAL;
+	}
+	*cache_id = id;
+	return 0;
+}
+
+/*
+ * <device-id> <path> <cache-id>
+ *
+ * By replacing it with cache_id_ptr,
+ * cache-id could be removed from this constructor,
+ * which would deduplicate code between
+ * this constructor and the switch_to message.
+ *
+ * The reason this constructor takes cache-id
+ * is to pipe nicely with dmsetup table; that is,
+ *   dmsetup table SRC | dmsetup create DEST
+ * should clone the logical device.
+ * This is considered to be an implicit rule
+ * in device-mapper for dm-writeboost to follow.
+ * Other non-essential tunable parameters
+ * will not be cloned.
+ */
+static int writeboost_ctr(struct dm_target *ti, unsigned int argc, char **argv)
+{
+	struct wb_device *wb;
+	unsigned device_id;
+	u8 cache_id;
+	struct dm_dev *dev;
+
+	int r = dm_set_target_max_io_len(ti, (1 << 3));
+	if (r) {
+		WBERR();
+		return r;
+	}
+
+	wb = kzalloc(sizeof(*wb), GFP_KERNEL);
+	if (!wb) {
+		WBERR();
+		return -ENOMEM;
+	}
+
+	/*
+	 * EMC's textbook on storage systems says
+	 * storage should keep its disk utilization below 70%.
+	 */
+	wb->migrate_threshold = 70;
+
+	atomic64_set(&wb->nr_dirty_caches, 0);
+
+	if (sscanf(argv[0], "%u", &device_id) != 1) {
+		WBERR();
+		r = -EINVAL;
+		goto bad_device_id;
+	}
+	if (device_id >= WB_NR_SLOTS) {
+		WBERR();
+		r = -EINVAL;
+		goto bad_device_id;
+	}
+	wb->id = device_id;
+
+	r = dm_get_device(ti, argv[1], dm_table_get_mode(ti->table),
+			  &dev);
+	if (r) {
+		WBERR("%d", r);
+		goto bad_get_device;
+	}
+	wb->device = dev;
+
+	wb->cache = NULL;
+	cache_id = 0;
+	r = parse_cache_id(argv[2], &cache_id);
+	if (r) {
+		WBERR();
+		goto bad_cache_id;
+	}
+	if (cache_id) {
+		struct wb_cache *cache = wb_caches[cache_id];
+		if (!cache) {
+			WBERR("cache is not set for id(%u)", cache_id);
+			r = -EINVAL;
+			goto bad_no_cache;
+		}
+		wb->cache = wb_caches[cache_id];
+	}
+
+	wb_devices[wb->id] = wb;
+	ti->private = wb;
+
+	ti->per_bio_data_size = sizeof(struct per_bio_data);
+
+	ti->num_flush_bios = 1;
+	ti->num_discard_bios = 1;
+
+	ti->discard_zeroes_data_unsupported = true;
+
+	/*
+	 * /sys/module/dm_writeboost/devices/$id/$attribute
+	 *                                      /dev # -> Note
+	 *                                      /device
+	 */
+
+	/*
+	 * Reference to the mapped_device
+	 * is used to show device name (major:minor).
+	 * major:minor is used in admin scripts
+	 * to get the sysfs node of a wb_device.
+	 */
+	wb->md = dm_table_get_md(ti->table);
+
+	return 0;
+
+bad_no_cache:
+bad_cache_id:
+	dm_put_device(ti, wb->device);
+bad_get_device:
+bad_device_id:
+	kfree(wb);
+	return r;
+}
+
+static void writeboost_dtr(struct dm_target *ti)
+{
+	struct wb_device *wb = ti->private;
+	dm_put_device(ti, wb->device);
+	ti->private = NULL;
+	kfree(wb);
+}
+
+struct kobject *get_bdev_kobject(struct block_device *bdev)
+{
+	return &disk_to_dev(bdev->bd_disk)->kobj;
+}
+
+static int writeboost_message(struct dm_target *ti, unsigned argc, char **argv)
+{
+	int r;
+	struct wb_device *wb = ti->private;
+	char *cmd = argv[0];
+
+	/*
+	 * We must separate this add/remove sysfs code from .ctr
+	 * for a very complex reason.
+	 */
+	if (!strcasecmp(cmd, "add_sysfs")) {
+		struct kobject *dev_kobj;
+		r = kobject_init_and_add(&wb->kobj, &device_ktype,
+					 devices_kobj, "%u", wb->id);
+		if (r) {
+			WBERR();
+			return r;
+		}
+
+		dev_kobj = get_bdev_kobject(wb->device->bdev);
+		r = sysfs_create_link(&wb->kobj, dev_kobj, "device");
+		if (r) {
+			WBERR();
+			kobject_del(&wb->kobj);
+			kobject_put(&wb->kobj);
+			return r;
+		}
+
+		kobject_uevent(&wb->kobj, KOBJ_ADD);
+		return 0;
+	}
+
+	if (!strcasecmp(cmd, "remove_sysfs")) {
+		kobject_uevent(&wb->kobj, KOBJ_REMOVE);
+
+		sysfs_remove_link(&wb->kobj, "device");
+		kobject_del(&wb->kobj);
+		kobject_put(&wb->kobj);
+
+		wb_devices[wb->id] = NULL;
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
+static int writeboost_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
+			    struct bio_vec *biovec, int max_size)
+{
+	struct wb_device *wb = ti->private;
+	struct dm_dev *device = wb->device;
+	struct request_queue *q = bdev_get_queue(device->bdev);
+
+	if (!q->merge_bvec_fn)
+		return max_size;
+
+	bvm->bi_bdev = device->bdev;
+	return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
+}
+
+static int writeboost_iterate_devices(struct dm_target *ti,
+				      iterate_devices_callout_fn fn, void *data)
+{
+	struct wb_device *wb = ti->private;
+	struct dm_dev *orig = wb->device;
+	sector_t start = 0;
+	sector_t len = dm_devsize(orig);
+	return fn(ti, orig, start, len, data);
+}
+
+static void writeboost_io_hints(struct dm_target *ti,
+				struct queue_limits *limits)
+{
+	blk_limits_io_min(limits, 512);
+	blk_limits_io_opt(limits, 4096);
+}
+
+static void writeboost_status(struct dm_target *ti, status_type_t type,
+			      unsigned flags, char *result, unsigned maxlen)
+{
+	unsigned int sz = 0;
+	struct wb_device *wb = ti->private;
+
+	switch (type) {
+	case STATUSTYPE_INFO:
+		result[0] = '\0';
+		break;
+
+	case STATUSTYPE_TABLE:
+		DMEMIT("%d %s %d", wb->id, wb->device->name, cache_id_of(wb));
+		break;
+	}
+}
+
+static struct target_type writeboost_target = {
+	.name = "writeboost",
+	.version = {0, 1, 0},
+	.module = THIS_MODULE,
+	.map = writeboost_map,
+	.ctr = writeboost_ctr,
+	.dtr = writeboost_dtr,
+	.end_io = writeboost_end_io,
+	.merge = writeboost_merge,
+	.message = writeboost_message,
+	.status = writeboost_status,
+	.io_hints = writeboost_io_hints,
+	.iterate_devices = writeboost_iterate_devices,
+};
+
+static int writeboost_mgr_map(struct dm_target *ti, struct bio *bio)
+{
+	bio_endio(bio, 0);
+	return DM_MAPIO_SUBMITTED;
+}
+
+static int writeboost_mgr_ctr(struct dm_target *ti,
+			      unsigned int argc, char **argv)
+{
+	return 0;
+}
+
+static void writeboost_mgr_dtr(struct dm_target *ti) { return; }
+
+static struct kobject *caches_kobj;
+
+struct cache_sysfs_entry {
+	struct attribute attr;
+	ssize_t (*show)(struct wb_cache *, char *);
+	ssize_t (*store)(struct wb_cache *, const char *, size_t);
+};
+
+#define to_cache(attr) container_of((attr), struct cache_sysfs_entry, attr)
+static ssize_t cache_attr_show(struct kobject *kobj,
+			       struct attribute *attr, char *page)
+{
+	struct wb_cache *cache;
+
+	struct cache_sysfs_entry *entry = to_cache(attr);
+	if (!entry->show) {
+		WBERR();
+		return -EIO;
+	}
+
+	cache = container_of(kobj, struct wb_cache, kobj);
+	return entry->show(cache, page);
+}
+
+static ssize_t cache_attr_store(struct kobject *kobj, struct attribute *attr,
+				const char *page, size_t len)
+{
+	struct wb_cache *cache;
+
+	struct cache_sysfs_entry *entry = to_cache(attr);
+	if (!entry->store) {
+		WBERR();
+		return -EIO;
+	}
+
+	cache = container_of(kobj, struct wb_cache, kobj);
+	return entry->store(cache, page, len);
+}
+
+static ssize_t commit_super_block_interval_show(struct wb_cache *cache,
+						char *page)
+{
+	return var_show(cache->commit_super_block_interval, (page));
+}
+
+static ssize_t commit_super_block_interval_store(struct wb_cache *cache,
+						 const char *page, size_t count)
+{
+	unsigned long x;
+	int r = var_store(&x, page);
+	if (r) {
+		WBERR();
+		return r;
+	}
+	validate_cond(0 <= x);
+
+	cache->commit_super_block_interval = x;
+	return count;
+}
+
+static struct cache_sysfs_entry commit_super_block_interval_entry = {
+	.attr = { .name = "commit_super_block_interval",
+		  .mode = S_IRUGO | S_IWUSR },
+	.show = commit_super_block_interval_show,
+	.store = commit_super_block_interval_store,
+};
+
+static ssize_t nr_max_batched_migration_show(struct wb_cache *cache,
+					     char *page)
+{
+	return var_show(cache->nr_max_batched_migration, page);
+}
+
+static ssize_t nr_max_batched_migration_store(struct wb_cache *cache,
+					      const char *page, size_t count)
+{
+	unsigned long x;
+	int r = var_store(&x, page);
+	if (r) {
+		WBERR();
+		return r;
+	}
+	validate_cond(1 <= x);
+
+	cache->nr_max_batched_migration = x;
+	return count;
+}
+
+static struct cache_sysfs_entry nr_max_batched_migration_entry = {
+	.attr = { .name = "nr_max_batched_migration",
+		  .mode = S_IRUGO | S_IWUSR },
+	.show = nr_max_batched_migration_show,
+	.store = nr_max_batched_migration_store,
+};
+
+static ssize_t allow_migrate_show(struct wb_cache *cache, char *page)
+{
+	return var_show(cache->allow_migrate, (page));
+}
+
+static ssize_t allow_migrate_store(struct wb_cache *cache,
+				   const char *page, size_t count)
+{
+	unsigned long x;
+	int r = var_store(&x, page);
+	if (r) {
+		WBERR();
+		return r;
+	}
+	validate_cond(x == 0 || x == 1);
+
+	cache->allow_migrate = x;
+	return count;
+}
+
+static struct cache_sysfs_entry allow_migrate_entry = {
+	.attr = { .name = "allow_migrate", .mode = S_IRUGO | S_IWUSR },
+	.show = allow_migrate_show,
+	.store = allow_migrate_store,
+};
+
+static ssize_t force_migrate_show(struct wb_cache *cache, char *page)
+{
+	return var_show(cache->force_migrate, page);
+}
+
+static ssize_t force_migrate_store(struct wb_cache *cache,
+				   const char *page, size_t count)
+{
+	unsigned long x;
+	int r = var_store(&x, page);
+	if (r) {
+		WBERR();
+		return r;
+	}
+	validate_cond(x == 0 || x == 1);
+
+	cache->force_migrate = x;
+	return count;
+}
+
+static struct cache_sysfs_entry force_migrate_entry = {
+	.attr = { .name = "force_migrate", .mode = S_IRUGO | S_IWUSR },
+	.show = force_migrate_show,
+	.store = force_migrate_store,
+};
+
+static ssize_t update_interval_show(struct wb_cache *cache, char *page)
+{
+	return var_show(cache->update_interval, page);
+}
+
+static ssize_t update_interval_store(struct wb_cache *cache,
+				     const char *page, size_t count)
+{
+	unsigned long x;
+	int r = var_store(&x, page);
+	if (r) {
+		WBERR();
+		return r;
+	}
+	validate_cond(0 <= x);
+
+	cache->update_interval = x;
+	return count;
+}
+
+static struct cache_sysfs_entry update_interval_entry = {
+	.attr = { .name = "update_interval", .mode = S_IRUGO | S_IWUSR },
+	.show = update_interval_show,
+	.store = update_interval_store,
+};
+
+static ssize_t flush_current_buffer_interval_show(struct wb_cache *cache,
+						  char *page)
+{
+	return var_show(cache->flush_current_buffer_interval, page);
+}
+
+static ssize_t flush_current_buffer_interval_store(struct wb_cache *cache,
+						   const char *page,
+						   size_t count)
+{
+	unsigned long x;
+	int r = var_store(&x, page);
+	if (r) {
+		WBERR();
+		return r;
+	}
+	validate_cond(0 <= x);
+
+	cache->flush_current_buffer_interval = x;
+	return count;
+}
+
+static struct cache_sysfs_entry flush_current_buffer_interval_entry = {
+	.attr = { .name = "flush_current_buffer_interval",
+		  .mode = S_IRUGO | S_IWUSR },
+	.show = flush_current_buffer_interval_show,
+	.store = flush_current_buffer_interval_store,
+};
+
+static ssize_t commit_super_block_show(struct wb_cache *cache, char *page)
+{
+	return var_show(0, (page));
+}
+
+static ssize_t commit_super_block_store(struct wb_cache *cache,
+					const char *page, size_t count)
+{
+	unsigned long x;
+	int r = var_store(&x, page);
+	if (r) {
+		WBERR();
+		return r;
+	}
+	validate_cond(x == 1);
+
+	mutex_lock(&cache->io_lock);
+	commit_super_block(cache);
+	mutex_unlock(&cache->io_lock);
+
+	return count;
+}
+
+static struct cache_sysfs_entry commit_super_block_entry = {
+	.attr = { .name = "commit_super_block", .mode = S_IRUGO | S_IWUSR },
+	.show = commit_super_block_show,
+	.store = commit_super_block_store,
+};
+
+static ssize_t flush_current_buffer_show(struct wb_cache *cache, char *page)
+{
+	return var_show(0, (page));
+}
+
+static ssize_t flush_current_buffer_store(struct wb_cache *cache,
+					  const char *page, size_t count)
+{
+	unsigned long x;
+	int r = var_store(&x, page);
+	if (r) {
+		WBERR();
+		return r;
+	}
+	validate_cond(x == 1);
+
+	flush_current_buffer_sync(cache);
+	return count;
+}
+
+static struct cache_sysfs_entry flush_current_buffer_entry = {
+	.attr = { .name = "flush_current_buffer", .mode = S_IRUGO | S_IWUSR },
+	.show = flush_current_buffer_show,
+	.store = flush_current_buffer_store,
+};
+
+static ssize_t last_flushed_segment_id_show(struct wb_cache *cache, char *page)
+{
+	return var_show(cache->last_flushed_segment_id, (page));
+}
+
+static struct cache_sysfs_entry last_flushed_segment_id_entry = {
+	.attr = { .name = "last_flushed_segment_id", .mode = S_IRUGO },
+	.show = last_flushed_segment_id_show,
+};
+
+static ssize_t last_migrated_segment_id_show(struct wb_cache *cache, char *page)
+{
+	return var_show(cache->last_migrated_segment_id, (page));
+}
+
+static struct cache_sysfs_entry last_migrated_segment_id_entry = {
+	.attr = { .name = "last_migrated_segment_id", .mode = S_IRUGO },
+	.show = last_migrated_segment_id_show,
+};
+
+static ssize_t barrier_deadline_ms_show(struct wb_cache *cache, char *page)
+{
+	return var_show(cache->barrier_deadline_ms, (page));
+}
+
+static ssize_t barrier_deadline_ms_store(struct wb_cache *cache,
+					 const char *page, size_t count)
+{
+	unsigned long x;
+	int r = var_store(&x, page);
+	if (r) {
+		WBERR();
+		return r;
+	}
+	validate_cond(1 <= x);
+
+	cache->barrier_deadline_ms = x;
+	return count;
+}
+
+static struct cache_sysfs_entry barrier_deadline_ms_entry = {
+	.attr = { .name = "barrier_deadline_ms", .mode = S_IRUGO | S_IWUSR },
+	.show = barrier_deadline_ms_show,
+	.store = barrier_deadline_ms_store,
+};
+
+static struct attribute *cache_default_attrs[] = {
+	&commit_super_block_interval_entry.attr,
+	&nr_max_batched_migration_entry.attr,
+	&allow_migrate_entry.attr,
+	&commit_super_block_entry.attr,
+	&flush_current_buffer_entry.attr,
+	&flush_current_buffer_interval_entry.attr,
+	&force_migrate_entry.attr,
+	&update_interval_entry.attr,
+	&last_flushed_segment_id_entry.attr,
+	&last_migrated_segment_id_entry.attr,
+	&barrier_deadline_ms_entry.attr,
+	NULL,
+};
+
+static const struct sysfs_ops cache_sysfs_ops = {
+	.show = cache_attr_show,
+	.store = cache_attr_store,
+};
+
+static void cache_release(struct kobject *kobj) { return; }
+
+static struct kobj_type cache_ktype = {
+	.sysfs_ops = &cache_sysfs_ops,
+	.default_attrs = cache_default_attrs,
+	.release = cache_release,
+};
+
+static int __must_check init_rambuf_pool(struct wb_cache *cache)
+{
+	size_t i, j;
+	struct rambuffer *rambuf;
+
+	cache->rambuf_pool = kmalloc(sizeof(struct rambuffer) * NR_RAMBUF_POOL,
+				     GFP_KERNEL);
+	if (!cache->rambuf_pool) {
+		WBERR();
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < NR_RAMBUF_POOL; i++) {
+		rambuf = cache->rambuf_pool + i;
+		init_completion(&rambuf->done);
+		complete_all(&rambuf->done);
+
+		rambuf->data = kmalloc(
+			1 << (WB_SEGMENTSIZE_ORDER + SECTOR_SHIFT),
+			GFP_KERNEL);
+		if (!rambuf->data) {
+			WBERR();
+			for (j = 0; j < i; j++) {
+				rambuf = cache->rambuf_pool + j;
+				kfree(rambuf->data);
+			}
+			kfree(cache->rambuf_pool);
+			return -ENOMEM;
+		}
+	}
+
+	return 0;
+}
+
+static void free_rambuf_pool(struct wb_cache *cache)
+{
+	struct rambuffer *rambuf;
+	size_t i;
+	for (i = 0; i < NR_RAMBUF_POOL; i++) {
+		rambuf = cache->rambuf_pool + i;
+		kfree(rambuf->data);
+	}
+	kfree(cache->rambuf_pool);
+}
+
+static int writeboost_mgr_message(struct dm_target *ti,
+				  unsigned int argc, char **argv)
+{
+	char *cmd = argv[0];
+
+	/*
+	 * <path>
+	 * @path path to the cache device
+	 */
+	if (!strcasecmp(cmd, "format_cache_device")) {
+		int r;
+		struct dm_dev *dev;
+		if (dm_get_device(ti, argv[1], dm_table_get_mode(ti->table),
+				  &dev)) {
+			WBERR();
+			return -EINVAL;
+		}
+
+		r = format_cache_device(dev);
+
+		dm_put_device(ti, dev);
+		return r;
+	}
+
+	/*
+	 * <id>
+	 *
+	 * writeboost-mgr keeps a cursor that points to the
+	 * cache device to operate on.
+	 */
+	if (!strcasecmp(cmd, "switch_to")) {
+		u8 id;
+		int r = parse_cache_id(argv[1], &id);
+		if (r) {
+			WBERR();
+			return r;
+		}
+		cache_id_ptr = id;
+		return 0;
+	}
+
+	if (!strcasecmp(cmd, "clear_stat")) {
+		struct wb_cache *cache = wb_caches[cache_id_ptr];
+		if (!cache) {
+			WBERR();
+			return -EINVAL;
+		}
+
+		clear_stat(cache);
+		return 0;
+	}
+
+	/*
+	 * <path>
+	 */
+	if (!strcasecmp(cmd, "resume_cache")) {
+		int r = 0;
+		struct kobject *dev_kobj;
+		struct dm_dev *dev;
+
+		struct wb_cache *cache = kzalloc(sizeof(*cache), GFP_KERNEL);
+		if (!cache) {
+			WBERR();
+			return -ENOMEM;
+		}
+
+		if (dm_get_device(ti, argv[1], dm_table_get_mode(ti->table),
+				  &dev)) {
+			WBERR();
+			r = -EINVAL;
+			goto bad_get_device;
+		}
+
+		cache->id = cache_id_ptr;
+		cache->device = dev;
+		cache->nr_segments = calc_nr_segments(cache->device);
+		cache->nr_caches = cache->nr_segments * NR_CACHES_INSEG;
+		cache->on_terminate = false;
+		cache->allow_migrate = false;
+		cache->force_migrate = false;
+		cache->reserving_segment_id = 0;
+		mutex_init(&cache->io_lock);
+
+		/*
+		 * /sys/module/dm_writeboost/caches/
+		 *   $id/$attribute
+		 *      /device -> /sys/block/$name
+		 */
+		cache->update_interval = 1;
+		cache->commit_super_block_interval = 0;
+		cache->flush_current_buffer_interval = 0;
+		r = kobject_init_and_add(&cache->kobj, &cache_ktype,
+					 caches_kobj, "%u", cache->id);
+		if (r) {
+			WBERR();
+			goto bad_kobj_add;
+		}
+
+		dev_kobj = get_bdev_kobject(cache->device->bdev);
+		r = sysfs_create_link(&cache->kobj, dev_kobj, "device");
+		if (r) {
+			WBERR();
+			goto bad_device_lns;
+		}
+
+		kobject_uevent(&cache->kobj, KOBJ_ADD);
+
+		r = init_rambuf_pool(cache);
+		if (r) {
+			WBERR();
+			goto bad_init_rambuf_pool;
+		}
+		/*
+		 * Select an arbitrary one
+		 * as the initial RAM buffer.
+		 */
+		cache->current_rambuf = cache->rambuf_pool + 0;
+
+		r = init_segment_header_array(cache);
+		if (r) {
+			WBERR();
+			goto bad_alloc_segment_header_array;
+		}
+		mb_array_empty_init(cache);
+
+		r = ht_empty_init(cache);
+		if (r) {
+			WBERR();
+			goto bad_alloc_ht;
+		}
+
+		cache->migrate_buffer = vmalloc(NR_CACHES_INSEG << 12);
+		if (!cache->migrate_buffer) {
+			WBERR();
+			r = -ENOMEM;
+			goto bad_alloc_migrate_buffer;
+		}
+
+		cache->dirtiness_snapshot = kmalloc(
+				NR_CACHES_INSEG,
+				GFP_KERNEL);
+		if (!cache->dirtiness_snapshot) {
+			WBERR();
+			r = -ENOMEM;
+			goto bad_alloc_dirtiness_snapshot;
+		}
+
+		cache->migrate_wq = create_singlethread_workqueue("migratewq");
+		if (!cache->migrate_wq) {
+			WBERR();
+			r = -ENOMEM;
+			goto bad_migratewq;
+		}
+
+		INIT_WORK(&cache->migrate_work, migrate_proc);
+		init_waitqueue_head(&cache->migrate_wait_queue);
+		INIT_LIST_HEAD(&cache->migrate_list);
+		atomic_set(&cache->migrate_fail_count, 0);
+		atomic_set(&cache->migrate_io_count, 0);
+		cache->nr_max_batched_migration = 1;
+		cache->nr_cur_batched_migration = 1;
+		queue_work(cache->migrate_wq, &cache->migrate_work);
+
+		setup_timer(&cache->barrier_deadline_timer,
+			    barrier_deadline_proc, (unsigned long) cache);
+		bio_list_init(&cache->barrier_ios);
+		/*
+		 * The deadline is 3 ms by default.
+		 * It takes roughly 2.5 us to process one bio,
+		 * so 3 ms is long enough to process 255 bios.
+		 * If the buffer doesn't get full within 3 ms,
+		 * we can suspect that writes are starving
+		 * while waiting for a formerly submitted barrier to complete.
+		 */
+		cache->barrier_deadline_ms = 3;
+		INIT_WORK(&cache->barrier_deadline_work, flush_barrier_ios);
+
+		cache->flush_wq = create_singlethread_workqueue("flushwq");
+		if (!cache->flush_wq) {
+			WBERR();
+			r = -ENOMEM;
+			goto bad_flushwq;
+		}
+		spin_lock_init(&cache->flush_queue_lock);
+		INIT_WORK(&cache->flush_work, flush_proc);
+		INIT_LIST_HEAD(&cache->flush_queue);
+		init_waitqueue_head(&cache->flush_wait_queue);
+		queue_work(cache->flush_wq, &cache->flush_work);
+
+		r = recover_cache(cache);
+		if (r) {
+			WBERR();
+			goto bad_recover;
+		}
+
+		wb_caches[cache->id] = cache;
+
+		clear_stat(cache);
+
+		return 0;
+
+bad_recover:
+		cache->on_terminate = true;
+		cancel_work_sync(&cache->flush_work);
+		destroy_workqueue(cache->flush_wq);
+bad_flushwq:
+		cache->on_terminate = true;
+		cancel_work_sync(&cache->barrier_deadline_work);
+		cancel_work_sync(&cache->migrate_work);
+		destroy_workqueue(cache->migrate_wq);
+bad_migratewq:
+		kfree(cache->dirtiness_snapshot);
+bad_alloc_dirtiness_snapshot:
+		vfree(cache->migrate_buffer);
+bad_alloc_migrate_buffer:
+		kill_arr(cache->htable);
+bad_alloc_ht:
+		kill_arr(cache->segment_header_array);
+bad_alloc_segment_header_array:
+		free_rambuf_pool(cache);
+bad_init_rambuf_pool:
+		kobject_uevent(&cache->kobj, KOBJ_REMOVE);
+		sysfs_remove_link(&cache->kobj, "device");
+bad_device_lns:
+		kobject_del(&cache->kobj);
+		kobject_put(&cache->kobj);
+bad_kobj_add:
+		dm_put_device(ti, cache->device);
+bad_get_device:
+		kfree(cache);
+		wb_caches[cache_id_ptr] = NULL;
+		return r;
+	}
+
+	if (!strcasecmp(cmd, "free_cache")) {
+		struct wb_cache *cache = wb_caches[cache_id_ptr];
+
+		cache->on_terminate = true;
+
+		cancel_work_sync(&cache->flush_work);
+		destroy_workqueue(cache->flush_wq);
+
+		cancel_work_sync(&cache->barrier_deadline_work);
+
+		cancel_work_sync(&cache->migrate_work);
+		destroy_workqueue(cache->migrate_wq);
+		kfree(cache->dirtiness_snapshot);
+		vfree(cache->migrate_buffer);
+
+		kill_arr(cache->htable);
+		kill_arr(cache->segment_header_array);
+
+		free_rambuf_pool(cache);
+
+		kobject_uevent(&cache->kobj, KOBJ_REMOVE);
+		sysfs_remove_link(&cache->kobj, "device");
+		kobject_del(&cache->kobj);
+		kobject_put(&cache->kobj);
+
+		dm_put_device(ti, cache->device);
+		kfree(cache);
+
+		wb_caches[cache_id_ptr] = NULL;
+
+		return 0;
+	}
+
+	WBERR();
+	return -EINVAL;
+}
+
+static size_t calc_static_memory_consumption(struct wb_cache *cache)
+{
+	size_t seg = sizeof(struct segment_header) * cache->nr_segments;
+	size_t ht = sizeof(struct ht_head) * cache->htsize;
+	size_t rambuf_pool = NR_RAMBUF_POOL << (WB_SEGMENTSIZE_ORDER + 9);
+	size_t mig_buf = cache->nr_cur_batched_migration *
+			 (NR_CACHES_INSEG << 12);
+
+	return seg + ht + rambuf_pool + mig_buf;
+}
+
+static void writeboost_mgr_status(struct dm_target *ti, status_type_t type,
+				  unsigned flags, char *result,
+				  unsigned int maxlen)
+{
+	int i;
+	struct wb_cache *cache;
+	unsigned int sz = 0;
+
+	switch (type) {
+	case STATUSTYPE_INFO:
+		DMEMIT("\n");
+		DMEMIT("current cache_id_ptr: %u\n", cache_id_ptr);
+
+		if (cache_id_ptr == 0) {
+			DMEMIT("sizeof struct\n");
+			DMEMIT("metablock: %lu\n",
+			       sizeof(struct metablock));
+			DMEMIT("metablock_device: %lu\n",
+			       sizeof(struct metablock_device));
+			DMEMIT("segment_header: %lu\n",
+			       sizeof(struct segment_header));
+			DMEMIT("segment_header_device: %lu (<= 4096)",
+			       sizeof(struct segment_header_device));
+			break;
+		}
+
+		cache = wb_caches[cache_id_ptr];
+		if (!cache) {
+			WBERR("no cache for the cache_id_ptr %u",
+			      cache_id_ptr);
+			return;
+		}
+
+		DMEMIT("static RAM(approx.): %lu (byte)\n",
+		       calc_static_memory_consumption(cache));
+		DMEMIT("allow_migrate: %d\n", cache->allow_migrate);
+		DMEMIT("nr_segments: %lu\n", cache->nr_segments);
+		DMEMIT("last_migrated_segment_id: %lu\n",
+		       cache->last_migrated_segment_id);
+		DMEMIT("last_flushed_segment_id: %lu\n",
+		       cache->last_flushed_segment_id);
+		DMEMIT("current segment id: %lu\n",
+		       cache->current_seg->global_id);
+		DMEMIT("cursor: %u\n", cache->cursor);
+		DMEMIT("\n");
+		DMEMIT("write? hit? on_buffer? fullsize?\n");
+		for (i = 0; i < STATLEN; i++) {
+			atomic64_t *v;
+			if (i == (STATLEN-1))
+				break;
+
+			v = &cache->stat[i];
+			DMEMIT("%d %d %d %d %lu",
+				i & (1 << STAT_WRITE)      ? 1 : 0,
+				i & (1 << STAT_HIT)        ? 1 : 0,
+				i & (1 << STAT_ON_BUFFER)  ? 1 : 0,
+				i & (1 << STAT_FULLSIZE)   ? 1 : 0,
+				atomic64_read(v));
+			DMEMIT("\n");
+		}
+		break;
+
+	case STATUSTYPE_TABLE:
+		break;
+	}
+}
+
+static struct target_type writeboost_mgr_target = {
+	.name = "writeboost-mgr",
+	.version = {0, 1, 0},
+	.module = THIS_MODULE,
+	.map = writeboost_mgr_map,
+	.ctr = writeboost_mgr_ctr,
+	.dtr = writeboost_mgr_dtr,
+	.message = writeboost_mgr_message,
+	.status = writeboost_mgr_status,
+};
+
+static int __init writeboost_module_init(void)
+{
+	size_t i;
+	struct module *mod;
+	struct kobject *wb_kobj;
+	int r;
+
+	r = dm_register_target(&writeboost_target);
+	if (r < 0) {
+		WBERR("%d", r);
+		return r;
+	}
+
+	r = dm_register_target(&writeboost_mgr_target);
+	if (r < 0) {
+		WBERR("%d", r);
+		goto bad_register_mgr_target;
+	}
+
+	/*
+	 * /sys/module/dm_writeboost/devices
+	 *                          /caches
+	 */
+
+	mod = THIS_MODULE;
+	wb_kobj = &(mod->mkobj.kobj);
+
+	r = -ENOMEM;
+
+	devices_kobj = kobject_create_and_add("devices", wb_kobj);
+	if (!devices_kobj) {
+		WBERR();
+		goto bad_kobj_devices;
+	}
+
+	caches_kobj = kobject_create_and_add("caches", wb_kobj);
+	if (!caches_kobj) {
+		WBERR();
+		goto bad_kobj_caches;
+	}
+
+	safe_io_wq = alloc_workqueue("safeiowq",
+				     WQ_NON_REENTRANT | WQ_MEM_RECLAIM, 0);
+	if (!safe_io_wq) {
+		WBERR();
+		goto bad_wq;
+	}
+
+	wb_io_client = dm_io_client_create();
+	if (IS_ERR(wb_io_client)) {
+		WBERR();
+		r = PTR_ERR(wb_io_client);
+		goto bad_io_client;
+	}
+
+	cache_id_ptr = 0;
+
+	for (i = 0; i < WB_NR_SLOTS; i++)
+		wb_devices[i] = NULL;
+
+	for (i = 0; i < WB_NR_SLOTS; i++)
+		wb_caches[i] = NULL;
+
+	return 0;
+
+bad_io_client:
+	destroy_workqueue(safe_io_wq);
+bad_wq:
+	kobject_put(caches_kobj);
+bad_kobj_caches:
+	kobject_put(devices_kobj);
+bad_kobj_devices:
+	dm_unregister_target(&writeboost_mgr_target);
+bad_register_mgr_target:
+	dm_unregister_target(&writeboost_target);
+
+	return r;
+}
+
+static void __exit writeboost_module_exit(void)
+{
+	dm_io_client_destroy(wb_io_client);
+	destroy_workqueue(safe_io_wq);
+
+	kobject_put(caches_kobj);
+	kobject_put(devices_kobj);
+
+	dm_unregister_target(&writeboost_mgr_target);
+	dm_unregister_target(&writeboost_target);
+}
+
+module_init(writeboost_module_init);
+module_exit(writeboost_module_exit);
+
+MODULE_AUTHOR("Akira Hayakawa <ruby.wktk@gmail.com>");
+MODULE_DESCRIPTION(DM_NAME " writeboost target");
+MODULE_LICENSE("GPL");
diff --git a/drivers/staging/dm-writeboost/dm-writeboost.txt b/drivers/staging/dm-writeboost/dm-writeboost.txt
new file mode 100644
index 0000000..704e634
--- /dev/null
+++ b/drivers/staging/dm-writeboost/dm-writeboost.txt
@@ -0,0 +1,133 @@
+dm-writeboost
+=============
+dm-writeboost provides write-back log-structured caching.
+It batches random writes into a big sequential write.
+
+dm-writeboost is composed of two target types,
+named writeboost and writeboost-mgr.
+
+- The writeboost target is responsible for
+  creating logical volumes and controlling I/O.
+- The writeboost-mgr target is responsible for
+  formatting, initializing and destructing cache devices.
+
+1. Example
+==========
+This section shows how to create
+a logical volume named myLV
+backed by /dev/myBacking and
+using /dev/myCache as the cache device.
+
+myLV |-- (backing store) /dev/myBacking
+     |-- (cache device)  /dev/myCache
+
+dmsetup create mgr --table "0 1 writeboost-mgr"
+dmsetup message mgr 0 format_cache_device /dev/myCache
+dmsetup message mgr 0 switch_to 3
+dmsetup message mgr 0 resume_cache /dev/myCache
+dmsetup create myLV --table "0 10000 writeboost 5 /dev/myBacking 3"
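+
+The per-device sysfs entries described in section 3 are registered with
+the add_sysfs message, and the whole setup can be torn down again with
+remove_sysfs and free_cache. The following sequence is only a sketch
+based on the messages the current code implements; adjust the ids to
+your setup:
+
+dmsetup message myLV 0 add_sysfs
+...
+dmsetup message myLV 0 remove_sysfs
+dmsetup remove myLV
+dmsetup message mgr 0 switch_to 3
+dmsetup message mgr 0 free_cache
+dmsetup remove mgr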
+
+2. Userland Tools
+=================
+Using dm-writeboost through said kernel interfaces
+is possible but not recommended.
+Instead, dm-writeboost provides nice userland tools
+that make it safer to manage the kernel module.
+
+The userland tools are maintained on GitHub:
+https://github.com/akiradeveloper/dm-writeboost
+
+A quick-start script is also provided in the repo.
+For more information, please read the README there.
+
+3. Sysfs Interfaces
+===================
+dm-writeboost provides
+sysfs interfaces to control the module behavior.
+The sysfs tree is located under /sys/module/dm_writeboost.
+
+/sys/module/dm_writeboost
+|
+|-- devices
+|   `-- 5
+|       |-- cache_id
+|       |-- dev
+|       |-- device -> ../../../../devices/virtual/block/dm-0
+|       |-- device_no
+|       |-- migrate_threshold
+|       |-- nr_dirty_caches
+|
+|-- caches
+|   `-- 3
+|       |-- allow_migrate
+|       |-- barrier_deadline_ms
+|       |-- commit_super_block
+|       |-- commit_super_block_interval
+|       |-- device -> ../../../../devices/virtual/block/dm-1
+|       |-- flush_current_buffer
+|       |-- flush_current_buffer_interval
+|       |-- force_migrate
+|       |-- last_flushed_segment_id
+|       |-- last_migrated_segment_id
+|       |-- nr_max_batched_migration
+|       `-- update_interval
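+
+For example, the attributes can be inspected and tuned with ordinary
+shell commands; a sketch using the ids 5 and 3 from the example above:
+
+cat /sys/module/dm_writeboost/devices/5/nr_dirty_caches
+echo 80 > /sys/module/dm_writeboost/devices/5/migrate_threshold
+echo 1 > /sys/module/dm_writeboost/caches/3/allow_migrate
+echo 1 > /sys/module/dm_writeboost/caches/3/flush_current_buffer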
+
+4. Technical Features
+=====================
+Several technical features
+distinguish dm-writeboost from other caching software.
+
+4.1 RAM buffer and immediate completion
+dm-writeboost allocates RAM buffers of 64MB in total by default.
+All writes are first stored in one of these RAM buffers
+and completion is immediately reported to the upper layer,
+which is quite fast, taking only a few microseconds.
+
+4.2 Metadata durability
+After a RAM buffer gets full or a deadline expires,
+dm-writeboost creates a segment log
+that combines the RAM buffer and its metadata.
+The metadata holds information such as the relation between
+an address in the cache device and
+its counterpart in the backing store.
+Because the segment log is
+eventually written to the persistent cache device,
+no data will be lost due to machine failure.
+
+4.3 Asynchronous log flushing
+dm-writeboost has a background worker called the flush daemon.
+Flushing a segment log starts by simply queueing a flush task.
+The flush daemon in the background
+periodically checks whether the queue has tasks
+and executes them if any exist.
+Because the upper layer doesn't block while queueing the task,
+the write throughput is maximized:
+it is measured at 259MB/s random writes
+with a cache device of 266MB/s sequential write throughput, only a 3% loss,
+and theoretically 1.5GB/s with a fast enough cache such as PCIe SSDs.
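+
+A random-write fio job along the following lines is one way to reproduce
+this kind of measurement (an illustrative sketch, not the exact job used
+for the numbers above; the device path and sizes are placeholders):
+
+fio --name=randwrite --filename=/dev/mapper/myLV \
+    --rw=randwrite --bs=4k --size=1g \
+    --ioengine=libaio --iodepth=32 --direct=1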
+
+4.4 Deferred ack for REQ_FUA or REQ_FLUSH bios
+Some applications such as NFS, journaling filesystems
+and databases often submit SYNC writes, which
+arrive as bios flagged with REQ_FUA or REQ_FLUSH.
+Handling these unusual bios immediately, and thus synchronously,
+badly deteriorates the overall throughput.
+To address this issue, dm-writeboost handles acks for these bios
+lazily, in a deferred manner.
+Completion of these bios is not reported
+until they are written persistently to the cache device,
+so this strategy doesn't betray the semantics.
+In the worst case, a bio with one of these flags
+is completed within the deadline period, which is configurable
+via barrier_deadline_ms in the sysfs tree above.
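+
+The deadline is tunable through the sysfs attribute listed in section 3,
+for example (a sketch using cache id 3 from the example above):
+
+echo 10 > /sys/module/dm_writeboost/caches/3/barrier_deadline_ms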
+
+4.5 Asynchronous and autonomous migration
+Some time after a log segment is flushed to the cache device,
+it is migrated to the backing store.
+The migrate daemon is another background worker that
+periodically checks whether log segments to migrate exist.
+
+Migrating restlessly puts a heavy burden on the backing store,
+so migration is preferably executed when the backing store is idle.
+The writeboost-daemon in userland watches the load of the backing store
+and autonomously turns migration on and off according to the load.
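+
+Migration can also be switched on and off, and its batching tuned, by
+hand through the sysfs attributes listed in section 3 (a sketch using
+cache id 3):
+
+echo 1 > /sys/module/dm_writeboost/caches/3/allow_migrate
+echo 0 > /sys/module/dm_writeboost/caches/3/allow_migrate
+echo 32 > /sys/module/dm_writeboost/caches/3/nr_max_batched_migration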
-- 
1.8.3.4


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: staging: Add dm-writeboost
  2013-09-01 11:10 [PATCH] staging: Add dm-writeboost Akira Hayakawa
@ 2013-09-16 21:53 ` Mike Snitzer
  2013-09-16 22:49   ` Dan Carpenter
                     ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Mike Snitzer @ 2013-09-16 21:53 UTC (permalink / raw)
  To: Akira Hayakawa
  Cc: gregkh, devel, linux-kernel, dm-devel, cesarb, joe, akpm, agk, m.chehab

On Sun, Sep 01 2013 at  7:10am -0400,
Akira Hayakawa <ruby.wktk@gmail.com> wrote:

> This patch introduces dm-writeboost to staging tree.
> 
> dm-writeboost is a log-structured caching software.
> It batches in-coming random-writes to a big sequential write
> to a cache device.
> 
> Unlike other block caching softwares like dm-cache and bcache,
> dm-writeboost focuses on bursty writes.
> Since the implementation is optimized on writes,
> the benchmark using fio indicates that
> it performs 259MB/s random writes with
> SSD with 266MB/s sequential write throughput
> which is only 3% loss.
> 
> Furthermore,
> because it uses SSD cache device sequentially,
> the lifetime of the device is maximized.
> 
> The merit of putting this software in staging tree is
> to make it more possible to get feedback from users
> and thus polish the code.
> 
> Signed-off-by: Akira Hayakawa <ruby.wktk@gmail.com>
> ---
>  MAINTAINERS                                     |    7 +
>  drivers/staging/Kconfig                         |    2 +
>  drivers/staging/Makefile                        |    1 +
>  drivers/staging/dm-writeboost/Kconfig           |    8 +
>  drivers/staging/dm-writeboost/Makefile          |    1 +
>  drivers/staging/dm-writeboost/TODO              |   11 +
>  drivers/staging/dm-writeboost/dm-writeboost.c   | 3445 +++++++++++++++++++++++
>  drivers/staging/dm-writeboost/dm-writeboost.txt |  133 +
>  8 files changed, 3608 insertions(+)
>  create mode 100644 drivers/staging/dm-writeboost/Kconfig
>  create mode 100644 drivers/staging/dm-writeboost/Makefile
>  create mode 100644 drivers/staging/dm-writeboost/TODO
>  create mode 100644 drivers/staging/dm-writeboost/dm-writeboost.c
>  create mode 100644 drivers/staging/dm-writeboost/dm-writeboost.txt

Hi Akira,

Sorry for not getting back to you sooner.  I'll make an effort to be
more responsive from now on.

Here is a list of things I noticed during the first _partial_ pass in
reviewing the code:

- the various typedefs aren't needed (e.g. device_id, cache_id,
  cache_nr)

- variable names need to be a bit more verbose (arr => array)

- really not convinced we need WB{ERR,WARN,INFO} -- may have been useful
  for early development but production code shouldn't be emitting
  messages with line numbers

- all the code in one file is too cumbersome; would like to see it split
  into multiple files.. not clear on what that split would look like yet

- any chance the log-structured IO could be abstracted to a new class in
  drivers/md/persistent-data/ ?  At least factor out a library with the
  interface that drives the IO.

- really dislike the use of an intermediate "writeboost-mgr" target to
  administer the writeboost devices.  There is no need for this.  Just
  have a normal DM target whose .ctr takes care of validation and
  determines whether a device needs formatting, etc.  Otherwise I cannot
  see how you can properly stack DM devices on writeboost devices
  (suspend+resume become tediously different)

- shouldn't need to worry about managing your own sysfs hierarchy;
  when a dm-writeboost .ctr takes a reference on a backing or cache
  device it will establish a proper hierarchy (see: dm_get_device).  What
  advantages are you seeing from having/managing this sysfs tree?

- I haven't had time to review the migration daemon post you made today;
  but it concerns me that dm-writeboost ever required this extra service
  for normal function.  I'll take a closer look at what you're asking
  and respond tomorrow.

But in short this code really isn't even suitable for inclusion via
staging.  There are far too many things, on a fundamental interface
level, that we need to sort out.

Probably best for you to publish the dm-writeboost code a git repo on
github.com or the like.  I just don't see what benefit there is to
putting code like this in staging.  Users already need considerable
userspace tools and infrastructure will also be changing in the
near-term (e.g. the migration daemon).

Regards,
Mike

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: staging: Add dm-writeboost
  2013-09-16 21:53 ` Mike Snitzer
@ 2013-09-16 22:49   ` Dan Carpenter
  2013-09-17 12:41   ` Akira Hayakawa
  2013-09-17 12:43   ` Akira Hayakawa
  2 siblings, 0 replies; 25+ messages in thread
From: Dan Carpenter @ 2013-09-16 22:49 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Akira Hayakawa, devel, gregkh, linux-kernel, dm-devel, agk, joe,
	akpm, cesarb, m.chehab

On Mon, Sep 16, 2013 at 05:53:57PM -0400, Mike Snitzer wrote:
> - variable names need to be a bit more verbose (arr => array)

"struct array " is a horrible name.  :P  Please don't use either "arr"
or "array".

regards,
dan carpenter


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: staging: Add dm-writeboost
  2013-09-16 21:53 ` Mike Snitzer
  2013-09-16 22:49   ` Dan Carpenter
@ 2013-09-17 12:41   ` Akira Hayakawa
  2013-09-17 20:18     ` Mike Snitzer
  2013-09-17 12:43   ` Akira Hayakawa
  2 siblings, 1 reply; 25+ messages in thread
From: Akira Hayakawa @ 2013-09-17 12:41 UTC (permalink / raw)
  To: snitzer
  Cc: gregkh, devel, linux-kernel, dm-devel, cesarb, joe, akpm, agk, m.chehab

Mike,

First, thank you for your commenting.
I was looking forward to your comments.


I suppose you are sensing some "smell" in my design.
You are worried that dm-writeboost will not only confuse users
but may also fall into the worst situation of giving up backward compatibility
after merging into the tree.

The fact that dm-writeboost's design is too eccentric as a DM target makes you feel so.

That you said
>   determines whether a device needs formatting, etc.  Otherwise I cannot
>   see how you can properly stack DM devices on writeboost devices
>   (suspend+resume become tediously different)
is a proof of smell.

Alasdair also said
> I read a statement like that as an indication of an interface or
> architectural problem.  The device-mapper approach is to 'design out'
> problems, rather than relying on users not doing bad things.
> Study the existing interfaces used by other targets to understand
> some approaches that proved successful, then decide which ones
> come closest to your needs.

and

Mikulas said
> Another idea:
> 
> Make the interface of dm-lc (the arguments to constructor, messages and 
> the status line) the same as dm-cache, so that they can be driven by the 
> same userspace code.
Though I guess this is going too far,
since dm-writeboost and dm-cache are different things,
designing them similarly definitely makes sense.

are also signs of that smell.


I am afraid that is the case, and
I am thinking of re-designing dm-writeboost
at the fundamental architectural level.
The interfaces will be similar to those of dm-cache as a result.

This will be a really BIG change.

> Probably best for you to publish the dm-writeboost code a git repo on
> github.com or the like.  I just don't see what benefit there is to
> putting code like this in staging.  Users already need considerable
> userspace tools and infrastructure will also be changing in the
> near-term (e.g. the migration daemon).
Yes, I agree with that regarding the current implementation.
I withdraw the proposal for staging.
I am really sorry to Greg and others who have been caring about dm-writeboost.
But I will be back after re-designing.
Staging would mean a lot for getting 3rd-party users, for sure.


Since this will be a big change,
I want to agree on the design before going forward.
I will explain why the interfaces of dm-writeboost
are designed in such a complicated way.


Essentially,
it is because dm-writeboost supports "cache-sharing".
The functionality of sharing one cache among multiple devices is required in some cases.

If I remove this functionality, the design will be much simpler
and the code will be slightly faster.


What will be removed after re-designing follows,
and the list also explains why cache-sharing makes the design bad.

(1) writeboost-mgr (maybe)
Mike said
> - really dislike the use of an intermediate "writeboost-mgr" target to
>   administer the writeboost devices.  There is no need for this.

but I don't think having a singleton intermediate writeboost-mgr
is completely weird.
The dm-thin target also has a singleton thin-pool target
that creates and destroys thin devices.

Below is a description from Documentation/device-mapper/thin-provisioning.txt

  Using an existing pool device
  -----------------------------

      dmsetup create pool \
          --table "0 20971520 thin-pool $metadata_dev $data_dev \
                   $data_block_size $low_water_mark"



  i) Creating a new thinly-provisioned volume.

    To create a new thinly- provisioned volume you must send a message to an
    active pool device, /dev/mapper/pool in this example.

      dmsetup message /dev/mapper/pool 0 "create_thin 0"

    Here '0' is an identifier for the volume, a 24-bit number.  It's up
    to the caller to allocate and manage these identifiers.  If the
    identifier is already in use, the message will fail with -EEXIST.

But I do agree that having writeboost-mgr smells of over-engineering.
A target that does nothing at all but act as an admin looks a little bit weird
as a DM target design.
Maybe this should be removed.

(2) device_id and cache_id
To manage which backing devices are attached to a cache,
these IDs are needed, as in dm-thin.
But they are not needed if I give up cache-sharing, and

> - the various typedefs aren't needed (e.g. device_id, cache_id,
>   cache_nr)
will all be cleared.

(3) sysfs
>   device it will establish a proper hierarchy (see: dm_get_device).  What
>   advantages are you seeing from having/managing this sysfs tree?
One of the advantages is
that userland tools can see the relations between devices.
Some GUI application might want to draw them by referring to the sysfs tree.

If I get rid of cache-sharing,
this dimension of relations between devices
will no longer be needed and will disappear from the userland view,
and the alternative is to
set/get the tunable parameters through message and status,
like dm-cache.

In addition,
there actually is smelly code caused by this design.
The code below adds/removes the sysfs interfaces,
which should be done in .ctr but is separated out for
a very complicated reason
related to the minor behavior of dmsetup reload.

I fully agree on removing this sysfs anyway because
I don't think I will be able to maintain it forever,
and that is one of the reasons why I provide userland tools in Python
as an abstraction layer.

        if (!strcasecmp(cmd, "add_sysfs")) {
                struct kobject *dev_kobj;
                r = kobject_init_and_add(&wb->kobj, &device_ktype,
                                         devices_kobj, "%u", wb->id);
                if (r) {
                        WBERR();
                        return r;
                }

                dev_kobj = get_bdev_kobject(wb->device->bdev);
                r = sysfs_create_link(&wb->kobj, dev_kobj, "device");
                if (r) {
                        WBERR();
                        kobject_del(&wb->kobj);
                        kobject_put(&wb->kobj);
                        return r;
                }

                kobject_uevent(&wb->kobj, KOBJ_ADD);
                return 0;
        }

        if (!strcasecmp(cmd, "remove_sysfs")) {
                kobject_uevent(&wb->kobj, KOBJ_REMOVE);

                sysfs_remove_link(&wb->kobj, "device");
                kobject_del(&wb->kobj);
                kobject_put(&wb->kobj);

                wb_devices[wb->id] = NULL;
                return 0;
        }


Simplifying the design and
making the target easier to maintain
in the future is what I fully agree with.
Adhering to cache-sharing at the cost of
risking future maintainability doesn't pay.
Re-designing dm-writeboost to resemble dm-cache
is a leading candidate, of course.

I will ask for comment for the new design in the next reply.

Akira

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: staging: Add dm-writeboost
  2013-09-16 21:53 ` Mike Snitzer
  2013-09-16 22:49   ` Dan Carpenter
  2013-09-17 12:41   ` Akira Hayakawa
@ 2013-09-17 12:43   ` Akira Hayakawa
  2013-09-17 20:59     ` Mike Snitzer
  2 siblings, 1 reply; 25+ messages in thread
From: Akira Hayakawa @ 2013-09-17 12:43 UTC (permalink / raw)
  To: snitzer
  Cc: gregkh, devel, linux-kernel, dm-devel, cesarb, joe, akpm, agk, m.chehab

Hi, Mike

There are two designs in my mind
regarding formatting the cache.

You said
>   administer the writeboost devices.  There is no need for this.  Just
>   have a normal DM target whose .ctr takes care of validation and
>   determines whether a device needs formatting, etc.  
makes me wonder how I format the cache device.


There are two choices for formatting the cache and creating a writeboost device,
assuming that the writeboost-mgr in the current design is removed.
I will explain them in terms of how the interface would look.

(1) dmsetup create myDevice ... "... $backing_path $cache_path"
which returns an error if the superblock of the given cache device
is invalid and needs formatting.
The user then formats the cache device with some userland tool.

(2) dmsetup create myDevice ... "... $backing_path $cache_path $do_format"
which also returns an error if the superblock of the given cache device
is invalid and needs formatting when $do_format is 0.
The user then formats the cache device by setting $do_format to 1 and trying again.
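
To make the difference concrete, the two tables would look roughly like
this (only a sketch with placeholder sizes; $do_format is the hypothetical
extra flag from option (2)):

  # (1) formatting decided outside the kernel
  dmsetup create myDevice --table "0 10000 writeboost $backing_path $cache_path"

  # (2) formatting requested through the table
  dmsetup create myDevice --table "0 10000 writeboost $backing_path $cache_path 1"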

There are pros and cons to the design tradeoffs:
- (i)  (1) is simpler. The do_format parameter in (2) doesn't seem sane.
       (1) is like a filesystem interface, where dmsetup create is like mounting a filesystem.
- (ii) (2) can implement everything in the kernel. It can gather all the knowledge
       about the superblock format in one place, the kernel code.

Excuses for the current design:
- The reason I designed writeboost-mgr is mostly (ii) above.
  writeboost-mgr has a message "format_cache_device" and the
  writeboost-format-cache userland command sends that message to format the cache.

- writeboost-mgr also has a message "resume_cache"
  that validates and builds an in-memory structure for the cache bound to the given $cache_id,
  and the user later runs dmsetup create for the writeboost device with that $cache_id.
  However, that resuming the cache metadata should be done under .ctr, like dm-cache does,
  and that the LV to create should not be related to its cache by an external cache_id,
  is what I realized by looking at the code of dm-cache, which
  calls the dm_cache_metadata_open() routines under .ctr.
  I don't know exactly why I should not do it my way, but it is nicer to trust the DM guys at Red Hat on this point.

writeboost-mgr does have a smell of over-engineering, but
it is useful for simplifying the design for the above reasons.


Which do you think better?

Akira


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: staging: Add dm-writeboost
  2013-09-17 12:41   ` Akira Hayakawa
@ 2013-09-17 20:18     ` Mike Snitzer
  0 siblings, 0 replies; 25+ messages in thread
From: Mike Snitzer @ 2013-09-17 20:18 UTC (permalink / raw)
  To: Akira Hayakawa
  Cc: gregkh, devel, linux-kernel, dm-devel, cesarb, joe, akpm, agk, m.chehab

On Tue, Sep 17 2013 at  8:41am -0400,
Akira Hayakawa <ruby.wktk@gmail.com> wrote:

> Mike,
> 
> First, thank you for your commenting.
> I was looking forward to your comments.
> 
> 
> I suppose you are sensing some "smell" in my design.
> You are worrying that dm-writeboost will not only confuse users
> but also fall into worst situation of giving up backward-compatibility
> after merging into tree.
> 
> That dm-writeboost's design is too eccentric as a DM target makes you so.
> 
> That you said
> >   determines whether a device needs formatting, etc.  Otherwise I cannot
> >   see how you can properly stack DM devices on writeboost devices
> >   (suspend+resume become tediously different)
> is a proof of smell.
> 
> Alasdair also said
> > I read a statement like that as an indication of an interface or
> > architectural problem.  The device-mapper approach is to 'design out'
> > problems, rather than relying on users not doing bad things.
> > Study the existing interfaces used by other targets to understand
> > some approaches that proved successful, then decide which ones
> > come closest to your needs.
> 
> and
> 
> Mikulas said
> > Another idea:
> > 
> > Make the interface of dm-lc (the arguments to constructor, messages and 
> > the status line) the same as dm-cache, so that they can be driven by the 
> > same userspace code.
> Though I guess this is going too far
> since dm-writeboost and dm-cache are the different things
> designing them similar definitely makes sense.
> 
> are also sensing of smell.
> 
> 
> I am afraid so I am and
> I am thinking of re-designing dm-writeboost
> at the fundamental architectural level.
> The interfaces will be similar to that of dm-cache as a result.
> 
> This will be a really a BIG change.
> 
> > Probably best for you to publish the dm-writeboost code a git repo on
> > github.com or the like.  I just don't see what benefit there is to
> > putting code like this in staging.  Users already need considerable
> > userspace tools and infrastructure will also be changing in the
> > near-term (e.g. the migration daemon).
> Yes, I agree with that regarding the current implementation.
> I withdraw from the proposal for staging.
> I am really sorry for Greg and others caring about dm-writeboost.
> But I will be back after re-designing.

OK, appreciate your willingness to rework this.  

> staging means lot to get 3rd party users is for sure.

We don't need to go through staging.  If the dm-writeboost target is
designed well and provides a tangible benefit it doesn't need
wide-spread users as justification for going in.  The users will come if
it is implemented well.

> Simplify the design and
> make it more possible to maintain the target
> for the future is what I fully agree with.
> Being adhere to cache-sharing by
> risking the future maintainability doesn't pay.
> Re-designing the dm-writeboost resemble to dm-cache
> is a leading candidate of course.

Simplifying the code is certainly desirable.  So dropping the sharing
sounds like a step in the right direction.  Plus you can share the cache
by layering multiple linear devices on top of the dm-writeboost device.
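
For example, carving two shares out of one writeboost device could look
roughly like this (a sketch; the sizes and names are placeholders):

  dmsetup create share1 --table "0 4194304 linear /dev/mapper/wbdev 0"
  dmsetup create share2 --table "0 4194304 linear /dev/mapper/wbdev 4194304"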

Also managing dm-writeboost devices with lvm2 is a priority, so any
interface similarities dm-writeboost has with dm-cache will be
beneficial.

Mike

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: staging: Add dm-writeboost
  2013-09-17 12:43   ` Akira Hayakawa
@ 2013-09-17 20:59     ` Mike Snitzer
  2013-09-22  0:09       ` Reworking dm-writeboost [was: Re: staging: Add dm-writeboost] Akira Hayakawa
  0 siblings, 1 reply; 25+ messages in thread
From: Mike Snitzer @ 2013-09-17 20:59 UTC (permalink / raw)
  To: Akira Hayakawa
  Cc: gregkh, devel, linux-kernel, dm-devel, cesarb, joe, akpm, agk,
	m.chehab, ejt

On Tue, Sep 17 2013 at  8:43am -0400,
Akira Hayakawa <ruby.wktk@gmail.com> wrote:

> Hi, Mike
> 
> There are two designs in my mind
> regarding the formatting cache.
> 
> You said
> >   administer the writeboost devices.  There is no need for this.  Just
> >   have a normal DM target whose .ctr takes care of validation and
> >   determines whether a device needs formatting, etc.  
> makes me wonder how I format the cache device.
> 
> 
> There are two choices for formatting cache and create a writeboost device
> standing on the point of removing writeboost-mgr existing in the current design.
> I will explain them from how the interface will look like.
> 
> (1) dmsetup create myDevice ... "... $backing_path $cache_path"
> which will returns error if the superblock of the given cache device
> is invalid and needs formatting.
> And then the user formats the cache device by some userland tool.
> 
> (2) dmsetup create myDevice ... "... $backing_path $cache_path $do_format"
> which also returns error if the superblock of the given cache device
> is invalid and needs formatting when $do_format is 0.
> And then user formats the cache device by setting $do_format to 1 and try again.
> 
> There pros and cons about the design tradeoffs:
> - (i)  (1) is simpler. do_format parameter in (2) doesn't seem to be sane.
>        (1) is like the interfaces of filesystems where dmsetup create is like mounting a filesystem.
> - (ii) (2) can implement everything in kernel. It can gather all the information
>        about how the superblock in one place, kernel code.
> 
> Excuse for the current design:
> - The reason I design writeboost-mgr is almost regarding (ii) above.
>   writeboost-mgr has a message "format_cache_device" and
>   writeboost-format-cache userland command kicks the message to format cache.
> 
> - writeboost-mgr has also a message "resume_cache"
>   that validates and builds a in-memory structure according to the cache binding to given $cache_id
>   and user later dmsetup create the writeboost device with the $cache_id.
>   However, resuming the cache metadata should be done under .ctr like dm-cache does
>   and should not relate LV to create and cache by external cache_id
>   is what I realized by looking at the code of dm-cache which
>   calls dm_cache_metadata_open() routines under .ctr .

Right, any in-core structures should be allocated in .ctr()

> writeboost-mgr is something like smell of over-engineering but
> is useful for simplifying the design for above reasons.
> 
> 
> Which do you think better?

Have you looked at how both dm-cache and dm-thinp handle this?
Userspace takes care to write all zeroes to the start of the metadata
device before the first use in the kernel.

In the kernel, see __superblock_all_zeroes(), the superblock on the
metadata device is checked to see whether it is all 0s or not.  If it is
all 0s then the kernel code knows it needs to format (writing the
superblock, etc).

I see no reason why dm-writeboost couldn't use the same design.

Also, have you looked at forking dm-cache as a starting point for
dm-writeboost?  It is an option, not yet clear if it'd help you as there
is likely a fair amount of work picking through code that isn't
relevant.  But it'd be nice to have the writeboost code follow the same
basic design principles.

Like I mentioned before, especially if the log structured block code
could be factored out.  I haven't yet looked close enough at that aspect
of writeboost code to know if it could benefit from the existing
bio-prison code or persistent-data library at all.  writeboost would
obviously need a new space map type, etc.

Could be the log structured nature of writeboost is very different.
I'll review this closer tomorrow.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
  2013-09-17 20:59     ` Mike Snitzer
@ 2013-09-22  0:09       ` Akira Hayakawa
  2013-09-24 12:20         ` Akira Hayakawa
  0 siblings, 1 reply; 25+ messages in thread
From: Akira Hayakawa @ 2013-09-22  0:09 UTC (permalink / raw)
  To: snitzer
  Cc: gregkh, devel, linux-kernel, dm-devel, cesarb, joe, akpm, agk,
	m.chehab, ejt, ruby.wktk

Mike,

> We don't need to go through staging.  If the dm-writeboost target is
> designed well and provides a tangible benefit it doesn't need
> wide-spread users as justification for going in.  The users will come if
> it is implemented well.
OK.
The benefits of introducing writeboost will be documented. The points will be:
1. Reads often hit in the page cache;
   that is what the page cache is all about.
   A read cache only caches the part that the page cache couldn't.
2. A backing store in RAID mode is terribly slow at writes,
   especially if it is RAID-5.
There is no silver bullet among caching software,
but I believe writeboost fits many situations.


> Have you looked at how both dm-cache and dm-thinp handle this?
> Userspace takes care to write all zeroes to the start of the metadata
> device before the first use in the kernel.
Treating a zeroed first sector as a sign that formatting is needed
sounds nice for writeboost too.
It's simple and I like it.


> Could be the log structured nature of writeboost is very different.
> I'll review this closer tomorrow.
I should mention the big design difference
between writeboost and dm-cache
to help you understand the nature of writeboost.

Writeboost doesn't have a segregated metadata device like dm-cache does.
Data and metadata coexist on the same cache device;
that is what log-structured means.
Data and its relevant metadata are packed into a log segment
and written to the cache device atomically,
which makes writeboost reliable and fast.
So, 
> could be factored out.  I haven't yet looked close enough at that aspect
> of writeboost code to know if it could benefit from the existing
> bio-prison code or persistent-data library at all.  writeboost would
> obviously need a new space map type, etc.
what makes sense for dm-cache may not make sense for writeboost.
At first glance, they don't fit the design of writeboost,
but I will investigate this functionality further later.


> sounds like a step in the right direction.  Plus you can share the cache
> by layering multiple linear devices ontop of the dm-writeboost device.
They are theoretically different, but it is actually a trade-off,
and it is not a big problem compared to fitting into device-mapper.


> Also managing dm-writeboost devices with lvm2 is a priority, so any
> interface similarities dm-writeboost has with dm-cache will be
> beneficial.
It sounds really good to me.
Huge benefit.


Akira

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
  2013-09-22  0:09       ` Reworking dm-writeboost [was: Re: staging: Add dm-writeboost] Akira Hayakawa
@ 2013-09-24 12:20         ` Akira Hayakawa
  2013-09-25 17:37           ` Mike Snitzer
                             ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Akira Hayakawa @ 2013-09-24 12:20 UTC (permalink / raw)
  To: snitzer
  Cc: ruby.wktk, gregkh, devel, linux-kernel, dm-devel, cesarb, joe,
	akpm, agk, m.chehab, ejt, dan.carpenter

Hi, Mike

I am now working on the redesign and reimplementation
of dm-writeboost.

This is a progress report. 

Please run
git clone https://github.com/akiradeveloper/dm-writeboost.git 
to see the full set of the code.

* 1. Current Status
writeboost in new design passed my test.
Documentations are ongoing.

* 2. Big Changes 
- Cache-sharing purged.
- All sysfs code purged.
- All userland tools in Python purged.
-- dmsetup is the only user interface now.
- The userland daemon is ported to the kernel.
- On-disk metadata are in little endian.
- About 300 lines of code shed from the kernel part.
-- The Python scripts were 500 LOC, so about 800 LOC shed in total.
-- It is now about 3.2k LOC, all in kernel.
- Comments are added neatly.
- Code is reordered so that it reads better.

* 3. Documentation in Draft
This is the current draft of the document that will go under Documentation/device-mapper.

dm-writeboost
=============
writeboost target provides log-structured caching.
It batches random writes into a big sequential write to a cache device.

It is like dm-cache, but the difference is
that writeboost focuses on handling bursty writes and on the lifetime of the SSD cache device.

Auxiliary PDF documents and Quick-start scripts are available in
https://github.com/akiradeveloper/dm-writeboost

Design
======
There is a foreground path and there are six background daemons.

Foreground
----------
It accepts bios and puts the writes into a RAM buffer.
When the buffer is full, it creates a "flush job" and queues it.

Background
----------
* Flush Daemon
Pops a flush job from the queue and executes it.

* Deferring ACK for barrier writes
Bios with barrier flags such as REQ_FUA and REQ_FLUSH are handled lazily,
because handling them immediately slows writeboost down badly.
This daemon watches the bios with these flags and forcefully flushes them,
in the worst case within the `barrier_deadline_ms` period.

* Migration Daemon
It migrates, that is, writes back to the backing store,
the data on the cache device at segment granularity.

If `allow_migrate` is true, it migrates even when the situation is not impending.
An impending situation is one where there is no room left on the cache device
for writing further flush jobs.

Migration is batched, up to `nr_max_batched_migration` segments at a time.
Therefore, unlike an ordinary I/O scheduler,
two dirty writes that are distant in time can be merged.

* Migration Modulator
Migrating while the backing store is heavily loaded
grows the device queue and thus makes the situation even worse.
This daemon modulates migration by toggling `allow_migrate`.

* Superblock Recorder
The superblock record is the last sector of the first 1MB region of the cache device.
It records the id of the segment most recently migrated.
This daemon updates the record every `update_record_interval` seconds.

* Cache Synchronizer
This daemon forcefully makes all the dirty writes persistent
every `sync_interval` seconds.
Since writeboost correctly implements the bio semantics,
forcefully writing out the dirty data outside the main path is not strictly needed.
However, some users want to be on the safe side by enabling this.

Target Interface
================
All the operations are via dmsetup command.

Constructor
-----------
writeboost <backing dev> <cache dev>

backing dev : slow device holding original data blocks.
cache dev   : fast device holding cached data and its metadata.

Note that the cache device is re-formatted
if the first sector of the cache device is zeroed out.

Status
------
<#dirty caches> <#segments>
<id of the segment last migrated>
<id of the segment last flushed>
<id of the current segment>
<the position of the cursor>
<16 stat info (r/w) x (hit/miss) x (on buffer/not) x (fullsize/not)>
<# of kv pairs>
<kv pairs>

Messages
--------
You can tune writeboost via the message interface.

* barrier_deadline_ms (ms)
Default: 3
All the bios with barrier flags like REQ_FUA or REQ_FLUSH
are guaranteed to be acked within this deadline.

* allow_migrate (bool)
Default: 1
Set to 1 to start migration.

* enable_migration_modulator (bool) and
  migrate_threshold (%)
Default: 1
Set to 1 to run the migration modulator.
The migration modulator watches the load on the backing store
and allows migration only when the load is
lower than migrate_threshold.

* nr_max_batched_migration (int)
Default: 1
Number of segments to migrate simultaneously and atomically.
Set a higher value to fully exploit the capacity of the backing store.

* sync_interval (sec)
Default: 60
All the dirty writes are guaranteed to be persistent by this interval.

* update_record_interval (sec)
Default: 60
The superblock record is updated every update_record_interval seconds.

Example
=======
dd if=/dev/zero of=${CACHE} bs=512 count=1 oflag=direct
sz=`blockdev --getsize ${BACKING}`
dmsetup create writeboost-vol --table "0 ${sz} writeboost ${BACKING} ${CACHE}"

* 4. TODO
- rename struct arr
-- It is like flex_array but lighter, since it drops resizability.
   Maybe bigarray is the next candidate, but I have not decided on this.
   I want to reach agreement on this renaming before doing it.
- resume, preresume and postsuspend possibly have to be implemented.
-- But I have no idea how yet.
-- Maybe I should research how other targets implement these methods.
- dmsetup status is like that of dm-cache
-- Please look at the example in the reference below.
-- It is hard for a human to read, and moreover inflexible to changes.
-- If I may not change the output format in the future,
   I think we should agree on the format now.
- Splitting the code is desirable.
-- Should I show you a splitting plan immediately?
-- If so, I will start it immediately.
- Porting the current implementation to linux-next
-- I am currently working on my portable kernel module, which has version switches.
-- I want to make an agreement on the basic design with maintainers
   before going to the next step.
-- WB* macros will be purged for sure.

* 5. References 
- Example of `dmsetup status`
-- The number 7 before barrier_deadline_ms is the number of K-V pairs,
   but unlike dm-cache they are fixed in number in dm-writeboost,
   so I am thinking of removing it.
   Even the keys such as barrier_deadline_ms and allow_migrate are meaningless
   for the same reason.
# root@Hercules:~/dm-writeboost/testing/1# dmsetup status perflv
0 6291456 writeboost 0 3 669 669 670 0 21 6401 24 519 0 0 13 7051 1849 63278 29 11 0 0 6 7 barrier_deadline_ms 3 allow_migrate 1 enable_migration_modulator 1 migrate_threshold 70 nr_cur_batched_migration 1 sync_interval 3 update_record_interval 2


Thanks,
Akira

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
  2013-09-24 12:20         ` Akira Hayakawa
@ 2013-09-25 17:37           ` Mike Snitzer
  2013-09-26  1:42             ` Akira Hayakawa
  2013-09-26  1:47             ` Akira Hayakawa
  2013-09-25 23:03           ` Greg KH
  2013-09-26  3:43           ` Dave Chinner
  2 siblings, 2 replies; 25+ messages in thread
From: Mike Snitzer @ 2013-09-25 17:37 UTC (permalink / raw)
  To: Akira Hayakawa
  Cc: gregkh, devel, linux-kernel, dm-devel, cesarb, joe, akpm, agk,
	m.chehab, ejt, dan.carpenter

On Tue, Sep 24 2013 at  8:20am -0400,
Akira Hayakawa <ruby.wktk@gmail.com> wrote:

> Hi, Mike
> 
> I am now working on redesigning and implementation
> of dm-writeboost.
> 
> This is a progress report. 
> 
> Please run
> git clone https://github.com/akiradeveloper/dm-writeboost.git 
> to see full set of the code.

I likely won't be able to look closely at the code until Monday (9/30);
I have some higher priority reviews and issues to take care of this
week.

But I'm very encouraged by what you've shared below; looks like things
are moving in the right direction.  Great job.

> * 1. Current Status
> writeboost in new design passed my test.
> Documentations are ongoing.
> 
> * 2. Big Changes 
> - Cache-sharing purged
> - All Sysfs purged.
> - All Userland tools in Python purged.
> -- dmsetup is the only user interface now.
> - The daemon in userland is ported to kernel.
> - On-disk metadata are in little endian.
> - 300 lines of codes shed in kernel
> -- Python scripts were 500 LOC so 800 LOC in total.
> -- It is now about 3.2k LOC all in kernel.
> - Comments are added neatly.
> - Reorder the codes so that it gets more readable.
> 
> * 3. Documentation in Draft
> This is a current document that will be under Documentation/device-mapper
> 
> dm-writeboost
> =============
> writeboost target provides log-structured caching.
> It batches random writes into a big sequential write to a cache device.
> 
> It is like dm-cache but the difference is
> that writeboost focuses on handling bursty writes and lifetime of SSD cache device.
> 
> Auxiliary PDF documents and Quick-start scripts are available in
> https://github.com/akiradeveloper/dm-writeboost
> 
> Design
> ======
> There are foreground path and 6 background daemons.
> 
> Foreground
> ----------
> It accepts bios and put writes to RAM buffer.
> When the buffer is full, it creates a "flush job" and queues it.
> 
> Background
> ----------
> * Flush Daemon
> Pop a flush job from the queue and executes it.
> 
> * Deferring ACK for barrier writes
> Barrier flags such as REQ_FUA and REQ_FLUSH are handled lazily.
> Immediately handling these bios badly slows down writeboost.
> It surveils the bios with these flags and forcefully flushes them
> at worst case within `barrier_deadline_ms` period.

OK, but the thing is upper level consumers in the IO stack, like ext4,
expect that when the REQ_FLUSH completes the device has in fact
flushed any transient state in memory.  So I'm not seeing how handling
these lazily is an option.  Though I do appreciate that dm-cache (and
dm-thin) do take similar approaches.  Would like to get Joe Thornber's
insight here.

> * Migration Daemon
> It migrates, writes back cache data to backing store,
> the data on the cache device in segment granurality.
> 
> If `allow_migrate` is true, it migrates without impending situation.
> Being in impending situation is that there are no room in cache device
> for writing further flush jobs.
> 
> Migration at a time is done batching `nr_max_batched_migration` segments at maximum.
> Therefore, unlike existing I/O scheduler,
> two dirty writes distant in time space can be merged.
> 
> * Migration Modulator
> Migration while the backing store is heavily loaded
> grows the device queue and thus makes the situation ever worse.
> This daemon modulates the migration by switching `allow_migrate`.
> 
> * Superblock Recorder
> Superblock record is a last sector of first 1MB region in cache device.
> It contains what id of the segment lastly migrated.
> This daemon periodically update the region every `update_record_interval` seconds.
> 
> * Cache Synchronizer
> This daemon forcefully makes all the dirty writes persistent
> every `sync_interval` seconds.
> Since writeboost correctly implements the bio semantics
> writing the dirties out forcefully out of the main path is needless.
> However, some user want to be on the safe side by enabling this.

These seem reasonable to me.  Will need to have a look at thread naming
to make sure the names reflect they are part of a dm-writeboost service.

> Target Interface
> ================
> All the operations are via dmsetup command.
> 
> Constructor
> -----------
> writeboost <backing dev> <cache dev>
> 
> backing dev : slow device holding original data blocks.
> cache dev   : fast device holding cached data and its metadata.

You don't allow the user to specify the "segment size"?  I'd expect that
tuning it could be important based on the underlying storage capabilities
(e.g. having the segment size match the SSD's erase block size or
the backing device's full stripe width).  So, something similar
to what we have in dm-cache's blocksize.

> Note that cache device is re-formatted
> if the first sector of the cache device is zeroed out.

I'll look at the code but it strikes me as odd that the first sector of
the cache device is checked yet the last sector of the first MB of the
cache is where the superblock resides.  I'd think you'd want the
check on whether to format or not to be at the same location as the
superblock?

> Status
> ------
> <#dirty caches> <#segments>
> <id of the segment lastly migrated>
> <id of the segment lastly flushed>
> <id of the current segment>
> <the position of the cursor>
> <16 stat info (r/w) x (hit/miss) x (on buffer/not) x (fullsize/not)>
> <# of kv pairs>
> <kv pairs>

So this "<16 stat info (r/w)", is that like /proc/diskstats ?  Are you
aware that dm-stats exists now and can be used instead of needing to
track these stats in dm-writeboost?
 
> Messages
> --------
> You can tune up writeboost via message interface.
> 
> * barrier_deadline_ms (ms)
> Default: 3
> All the bios with barrier flags like REQ_FUA or REQ_FLUSH
> are guaranteed to be acked within this deadline.
> 
> * allow_migrate (bool)
> Default: 1
> Set to 1 to start migration.
> 
> * enable_migration_modulator (bool) and
>   migrate_threshold (%)
> Default: 1
> Set to 1 to run migration modulator.
> Migration modulator surveils the load of backing store
> and set the migration started when the load is
> lower than the migrate_threshold.
> 
> * nr_max_batched_migration (int)
> Default: 1
> Number of segments to migrate simultaneously and atomically.
> Set higher value to fully exploit the capacily of the backing store.
> 
> * sync_interval (sec)
> Default: 60
> All the dirty writes are guaranteed to be persistent by this interval.
> 
> * update_record_interval (sec)
> Default: 60
> The superblock record is updated every update_record_interval seconds.

OK to the above.
 
> Example
> =======
> dd if=/dev/zero of=${CACHE} bs=512 count=1 oflag=direct
> sz=`blockdev --getsize ${BACKING}`
> dmsetup create writeboost-vol --table "0 ${sz} writeboost ${BACKING} {CACHE}"
> 
> * 4. TODO
> - rename struct arr
> -- It is like flex_array but lighter by eliminating the resizableness.
>    Maybe, bigarray is a next candidate but I don't have a judge on this.
>    I want to make an agreement on this renaming issue before doing it.

Whatever name you come up with, please add a "dm_" prefix.

> - resume, preresume and postsuspend possibly have to be implemented.
> -- But I have no idea at all.
> -- Maybe, I should make a research on other target implementing these methods.

Yes, these will be important to make sure state is synchronized at
proper places.  Big rule of thumb is you don't want to do any
memory allocations outside of the .ctr method.  Otherwise you can run
into theoretical deadlocks when DM devices are stacked.
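
For example (a sketch only, not existing writeboost code; the helper name
is made up and the pool size is arbitrary), reserving flush jobs up front
from .ctr means the mapping path never has to allocate under I/O:

#include <linux/mempool.h>

/* Sketch: called from .ctr; struct flush_job is the one declared in
 * writeboost.h later in this thread.  The mempool guarantees forward
 * progress when jobs are taken from it later, without new allocation. */
static mempool_t *alloc_flush_job_pool(void)
{
	return mempool_create_kmalloc_pool(16, sizeof(struct flush_job));
}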

But my review will focus on these aspects of dm-writeboost (safety
relative to suspend and resume) rather than the other dm-writeboost
specific implementation.  So we'll sort it out if something needs
fixing.

> - dmsetup status is like that of dm-cache
> -- Please look at the example in the reference below.
> -- It is far less understandable. Moreover inflexible to changes. 
> -- If I may not change the output format in the future 
>    I think I should make an agreement on the format.

Yes, that is one major drawback but generally speaking it is for upper
level tools to consume this info (e.g. lvm2).  So human readability
isn't of primary concern.

> - Splitting the code is desireble.
> -- Should I show you a plan of splitting immediately?
> -- If so, I will start it immediately.

Yes, please share your plan.  Anything that can simplify the code layout
is best done earlier to simplify code review.

> - Porting the current implementation to linux-next
> -- I am working on my portable kernel with version switches.
> -- I want to make an agreement on the basic design with maintainers
>    before going to the next step.
> -- WB* macros will be purged for sure.

I'd prefer you focus on getting the code working on a stable baseline of
your choosing.  For instance you could build on the linux-dm.git
"for-linus" branch, see:
http://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git

But you're welcome to use the latest released final kernel instead
(currently v3.11).  Given you don't seem to be needing to modify DM core
it isn't a big deal which kernel you develop against (provided it is
pretty recent).  Whatever is easiest for you.

> * 5. References 
> - Example of `dmsetup status`
> -- the number 7 before the barrier_deadline_ms is a number of K-V pairs 
>    but they are of fixed number in dm-writeboost unlike dm-cache.
>    I am thinking of removing it.

But will this always be the case?  Couldn't it be that you add another
K-V pair sometime in the future?

>    Even K such as barrier_deadline_ms and allow_migrate are also meaningless
>    for the same reason.

I'm not following why you feel including the key name in the status is
meaningless.

> # root@Hercules:~/dm-writeboost/testing/1# dmsetup status perflv
> 0 6291456 writeboost 0 3 669 669 670 0 21 6401 24 519 0 0 13 7051 1849 63278 29 11 0 0 6 7 barrier_deadline_ms 3 allow_migrate 1 enable_migration_modulator 1 migrate_threshold 70 nr_cur_batched_migration 1 sync_interval 3 update_record_interval 2

Yeah, it certainly isn't easy for a human to zero in on what is
happening here.. but like I said above, that isn't a goal of dmsetup
status output ;)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
  2013-09-24 12:20         ` Akira Hayakawa
  2013-09-25 17:37           ` Mike Snitzer
@ 2013-09-25 23:03           ` Greg KH
  2013-09-26  3:43           ` Dave Chinner
  2 siblings, 0 replies; 25+ messages in thread
From: Greg KH @ 2013-09-25 23:03 UTC (permalink / raw)
  To: Akira Hayakawa
  Cc: snitzer, devel, linux-kernel, dm-devel, agk, joe, akpm,
	dan.carpenter, ejt, cesarb, m.chehab

On Tue, Sep 24, 2013 at 09:20:50PM +0900, Akira Hayakawa wrote:
> Hi, Mike
> 
> I am now working on redesigning and implementation
> of dm-writeboost.

Ok, I'm dropping your original patch, please resend when you have
something you want merged into drivers/staging/

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
  2013-09-25 17:37           ` Mike Snitzer
@ 2013-09-26  1:42             ` Akira Hayakawa
  2013-09-26  1:47             ` Akira Hayakawa
  1 sibling, 0 replies; 25+ messages in thread
From: Akira Hayakawa @ 2013-09-26  1:42 UTC (permalink / raw)
  To: snitzer
  Cc: gregkh, devel, linux-kernel, dm-devel, cesarb, joe, akpm, agk,
	m.chehab, ejt, dan.carpenter, ruby.wktk

Hi, Mike

I made more progress yesterday:
splitting the monolithic source code into
meaningful pieces is done.
The result will follow in the next mail.

> Yes, please share your plan.  Anything that can simplify the code layout
> is best done earlier to simplfy code review.
Sorry, this should have been done at an earlier stage.

First, I reply to each of your comments.

> OK, but the thing is upper level consumers in the IO stack, like ext4,
> expect that when the REQ_FLUSH completes that the device has in fact
> flushed any transient state in memory.  So I'm not seeing how handling
> these lazily is an option.  Though I do appreciate that dm-cache (and
> dm-thin) do take similar approaches.  Would like to get Joe Thornber's
> insight here.
When an upper level consumer receives
the completion of a bio sent with REQ_FLUSH,
all the transient state is persistent.
writeboost does four steps to accomplish this:
1. Queue a flush job with the current transient state (the RAM buffer).
2. Wait for the flush job to complete, i.e. to be written to the cache device.
3. blkdev_issue_flush() to the cache device to make all the writes persistent.
4. bio_endio() on the flagged bios.

If the implementation isn't wrong,
I believe it works as the consumers expect.
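
A minimal sketch of those four steps (not the actual writeboost code; I
assume here that flush_current_buffer(), declared in queue-flush-job.h
later in this thread, returns only after step 2 completes):

/* Sketch only: ack the deferred REQ_FUA/REQ_FLUSH bios once the RAM
 * buffer has been flushed (steps 1+2) and the cache device itself has
 * been flushed (step 3). */
static void ack_deferred_barriers(struct wb_cache *cache,
				  struct bio_list *barriers)
{
	struct bio *bio;

	flush_current_buffer(cache);			/* steps 1 and 2 */
	blkdev_issue_flush(cache->device->bdev,		/* step 3 */
			   GFP_NOIO, NULL);
	while ((bio = bio_list_pop(barriers)))		/* step 4 */
		bio_endio(bio, 0);
}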


> These seem reasonable to me.  Will need to have a look at thread naming
> to make sure the names reflect they are part of a dm-writeboost service.
I changed the former "Cache Synchronizer" to "Dirty Synchronizer",
but it still sounds a little odd.
Naming is truly difficult.


> You don't allow user to specify the "segment size"?  I'd expect tuning
> that could be important based on the underlying storage capabilities
> (e.g. having the segment size match that of the SSD's erase block or
> matching the backing device's full stripe width?).  SO something similar
> to what we have in dm-cache's blocksize.
For the current implementation, no.
The segment size is hard-coded in the source code and
one has to re-compile the module to change it.

But hard-coding the size has a reasonable rationale,
for performance and simplicity.

Please look at the code fragment from the .map method, which does the following:
(1) writeboost first checks hit/miss and gets the metablock (mb).
(2) It then has to get the segment_header that "logically" contains the metablock.

        mb = ht_lookup(cache, head, &key); // (1)
        if (mb) {
                seg = ((void *) mb) - (mb->idx % NR_CACHES_INSEG) * // (2)
                                      sizeof(struct metablock);
                atomic_inc(&seg->nr_inflight_ios);
        }

#define NR_CACHES_INSEG ((1 << (WB_SEGMENTSIZE_ORDER - 3)) - 1)

(3)
struct segment_header {
        struct metablock mb_array[NR_CACHES_INSEG];

In the current implementation
I place the metablocks "physically" inside the segment header (3),
so finding the segment header that contains a metablock
is just a simple address calculation, which performs well.
Since writeboost focuses on peak write performance,
this lightweight lookup is its lifeline.

If I redesign writeboost to accept the segment size in .ctr
this technique becomes impossible,
since NR_CACHES_INSEG cannot be known at compile time.

It is just a matter of tradeoff.

But having purged cache-sharing probably gives me another chance at
a fancy technique that does the same thing with reasonable overhead
and code complexity. I will try to think of it.
I know that forcing ordinary users to recompile the module
sounds harsh.
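
For what it is worth, a rough sketch of what step (2) might become with a
runtime segment size (nr_caches_inseg here is a hypothetical per-cache
field, not in the current code): the address arithmetic is replaced by a
division plus a bigarray_at() call, which is exactly the extra cost I am
worried about.

        mb = ht_lookup(cache, head, &key);
        if (mb) {
                /* Sketch only: hypothetical runtime-sized variant of (2) */
                seg = bigarray_at(cache->segment_header_array,
                                  mb->idx / cache->nr_caches_inseg);
                atomic_inc(&seg->nr_inflight_ios);
        }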


> I'll look at the code but it strikes me as odd that the first sector of
> the cache device is checked yet the last sector of the first MB of the
> cache is wher ethe superblock resides.  I'd think you'd want to have the
> check on whether to format or not to be the same location as the
> superblock?
The first sector of the first 1MB is called the Superblock Header and
the last sector of the first 1MB is called the Superblock Record.
The former contains information fixed at format time and
the latter contains information updated at runtime
by the Superblock Recorder daemon.

The latter is also checked at initialization;
the logic is in recover_cache().
If it contains an up-to-date `last_migrated_segment_id`,
recover_cache() takes less time.
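
For concreteness (assuming 512-byte sectors, so the first 1MB is 2048
sectors), the two locations described above are, as a sketch with
illustrative names:

/* Sketch: offsets implied by the layout above (512B sectors assumed). */
#define SB_HEADER_SECTOR	0	/* first sector of the first 1MB */
#define SB_RECORD_SECTOR	2047	/* last sector of the first 1MB  */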


> So this "<16 stat info (r/w)", is that like /proc/diskstats ?  Are you
> aware that dm-stats exists now and can be used instead of needing to
> tracking these stats in dm-writeboost?
Sort of.
But the difference is that
this information records
how a bio went through the writeboost code paths.
The counters are like "read hits", "read misses" ... in dm-cache status,
so I don't think I need to discard them.

I read through the statistics document
https://lwn.net/Articles/566273/
and I understand that dm-stats only observes
external I/O statistics,
not the internal conditional branches in detail.
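
For example, the 16 counters could be indexed by packing the four binary
attributes into a 4-bit value, roughly like this (a sketch based on the
STATFLAG enum in writeboost.h later in this thread, not necessarily how
the real code composes the index):

/* Sketch only: one counter per (write, hit, on-buffer, fullsize)
 * combination gives the 16 slots reported by dmsetup status. */
static void count_stat(struct wb_cache *cache,
		       bool write, bool hit, bool on_buf, bool fullsize)
{
	int i = (write    << STAT_WRITE)     |
		(hit      << STAT_HIT)       |
		(on_buf   << STAT_ON_BUFFER) |
		(fullsize << STAT_FULLSIZE);

	atomic64_inc(&cache->stat[i]);
}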


> Whatever name you come up with, please add a "dm_" prefix.
Should I add the dm_ prefix only to the struct, or
to all filenames and function names as well?
If the latter, it needs a really big fix.


> Yes, these will be important to make sure state is synchronized at
> proper places.  Big rule of thumb is you don't want to do any
> memory allocations outside of the .ctr method.  Otherwise you can run
> into theoretical deadlocks when DM devices are stacked.
> 
> But my review will focus on these aspects of dm-writeboost (safety
> relative to suspend and resume) rather than the other dm-writeboost
> specific implementation.  So we'll sort it out if something needs
> fixing.
I will be waiting for your review.

For your information,
writeboost can
make all the transient state persistent and
stop all the daemons, such as the migration daemon,
in postsuspend to "freeze" the logical device.
But I don't know exactly what state
the logical device should be in after suspending.
I couldn't work that out from a quick look at
how the other targets implement these methods.


> But you're welcome to use the latest released final kernel instead
> (currently v3.11).  Given you don't seem to be needing to modify DM core
> it isn't a big deal which kernel you develop against (provided it is
> pretty recent).  Whatever is easiest for you.
I am using v3.11.



> But will this always be the case?  Couldn't it be that you add another
> K-V pair sometime in the future?

> I'm not following why you feel including the key name in the status is
> meaningless.

I understand.
I forgot the possibility of adding another tunable daemon.
However, I don't see a reason not to
add a "read-miss" key to the #read-miss value in dm-cache status, for example.
Is it an implicit design rule that only tunable parameters are in K-V pair format?

Akira

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
  2013-09-25 17:37           ` Mike Snitzer
  2013-09-26  1:42             ` Akira Hayakawa
@ 2013-09-26  1:47             ` Akira Hayakawa
  2013-09-27 18:35               ` Mike Snitzer
  1 sibling, 1 reply; 25+ messages in thread
From: Akira Hayakawa @ 2013-09-26  1:47 UTC (permalink / raw)
  To: snitzer
  Cc: gregkh, devel, linux-kernel, dm-devel, cesarb, joe, akpm, agk,
	m.chehab, ejt, dan.carpenter, ruby.wktk

Hi, Mike

The monolithic source code (3.2k LOC)
is now nicely split into almost 20 *.c files
according to functionality and
data structures, in OOP style.

The aim of this posting
is to share what the splitting looks like.

I believe that
at least reading the *.h files
can convince you that the splitting is clear.

The code is currently cluttered with
almost 20 version-switch macros
and WB* debug macros,
but I will clean them up
before sending the patch.

Again,
the latest code can be cloned by
git clone https://github.com/akiradeveloper/dm-writeboost.git

I will make a few updates to the source code this weekend,
so please track the repository to follow the latest version.
Below is only a snapshot.

Akira

---------- Summary ----------
33 Makefile
10 bigarray.h
19 cache-alloc.h
10 defer-barrier.h
8 dirty-sync.h
8 flush-daemon.h
10 format-cache.h
24 handle-io.h
16 hashtable.h
18 migrate-daemon.h
7 migrate-modulator.h
12 queue-flush-job.h
8 rambuf.h
13 recover.h
18 segment.h
8 superblock-recorder.h
9 target.h
30 util.h
384 writeboost.h
99 bigarray.c
192 cache-alloc.c
36 defer-barrier.c
33 dirty-sync.c
85 flush-daemon.c
234 format-cache.c
553 handle-io.c
109 hashtable.c
345 migrate-daemon.c
41 migrate-modulator.c
169 queue-flush-job.c
52 rambuf.c
308 recover.c
118 segment.c
61 superblock-recorder.c
376 target.c
126 util.c

---------- Makefile ----------
KERNEL_TREE := /lib/modules/$(shell uname -r)/build
# KERNEL_TREE := $(HOME)/linux-$(KERN_VERSION)

PWD := $(shell pwd)

# EXTRA_CFLAGS += -O0 -DCONFIG_DM_DEBUG -fno-inline #-Wall
# EXTRA_CFLAGS += -O2 -UCONFIG_DM_DEBUG

obj-m := dm-writeboost.o
dm-writeboost-objs := \
	target.o \
	handle-io.o \
	queue-flush-job.o \
	flush-daemon.o \
	migrate-daemon.o \
	migrate-modulator.o \
	defer-barrier.o \
	superblock-recorder.o \
	dirty-sync.o \
	bigarray.o \
	segment.o \
	hashtable.o \
	cache-alloc.o \
	format-cache.o \
	recover.o \
	rambuf.o \
	util.o

all:
	$(MAKE) -C $(KERNEL_TREE) M=$(PWD) modules

clean:
	$(MAKE) -C $(KERNEL_TREE) M=$(PWD) clean

---------- bigarray.h ----------
#ifndef WRITEBOOST_BIGARRAY_H
#define WRITEBOOST_BIGARRAY_H

#include "writeboost.h"

struct bigarray;
struct bigarray *make_bigarray(size_t elemsize, size_t nr_elems);
void kill_bigarray(struct bigarray *);
void *bigarray_at(struct bigarray *, size_t i);
#endif

---------- cache-alloc.h ----------
#ifndef WRITEBOOST_CACHE_ALLOC_H
#define WRITEBOOST_CACHE_ALLOC_H

#include "writeboost.h"
#include "segment.h"
#include "flush-daemon.h"
#include "migrate-daemon.h"
#include "migrate-modulator.h"
#include "rambuf.h"
#include "hashtable.h"
#include "superblock-recorder.h"
#include "dirty-sync.h"
#include "recover.h"
#include "defer-barrier.h"
#include "handle-io.h"

int __must_check resume_cache(struct wb_cache *, struct dm_dev *);
void free_cache(struct wb_cache *);
#endif

---------- defer-barrier.h ----------
#ifndef WRITEBOOST_DEFER_BARRIER_H
#define WRITEBOOST_DEFER_BARRIER_H

#include "writeboost.h"
#include "queue-flush-job.h"

void queue_barrier_io(struct wb_cache *, struct bio *);
void flush_barrier_ios(struct work_struct *);
void barrier_deadline_proc(unsigned long data);
#endif

---------- dirty-sync.h ----------
#ifndef WRITEBOOST_DIRTY_SYNC_H
#define WRITEBOOST_DIRTY_SYNC_H

#include "writeboost.h"
#include "queue-flush-job.h"

void sync_proc(struct work_struct *);
#endif

---------- flush-daemon.h ----------
#ifndef WRITEBOOST_FLUSH_DAEMON_H
#define WRITEBOOST_FLUSH_DAEMON_H

#include "writeboost.h"
#include "util.h"

void flush_proc(struct work_struct *);
#endif

---------- format-cache.h ----------
#ifndef WRITEBOOST_FORMAT_CACHE_H
#define WRITEBOOST_FORMAT_CACHE_H

#include "writeboost.h"
#include "util.h"
#include "segment.h"

int __must_check audit_cache_device(struct dm_dev *, bool *cache_valid);
int __must_check format_cache_device(struct dm_dev *);
#endif

---------- handle-io.h ----------
#ifndef WRITEBOOST_HANDLE_IO_H
#define WRITEBOOST_HANDLE_IO_H

#include "writeboost.h"
#include "bigarray.h"
#include "util.h"
#include "defer-barrier.h"
#include "hashtable.h"
#include "segment.h"
#include "queue-flush-job.h"

int writeboost_map(struct dm_target *, struct bio *
#if LINUX_VERSION_CODE < PER_BIO_VERSION
		 , union map_info *
#endif
		  );
int writeboost_end_io(struct dm_target *, struct bio *, int error
#if LINUX_VERSION_CODE < PER_BIO_VERSION
		    , union map_info *
#endif
		     );
void inc_nr_dirty_caches(struct wb_device *);
void clear_stat(struct wb_cache *);
#endif

---------- hashtable.h ----------
#ifndef WRITEBOOST_HASHTABLE_H
#define WRITEBOOST_HASHTABLE_H

#include "writeboost.h"
#include "segment.h"

int __must_check ht_empty_init(struct wb_cache *);
cache_nr ht_hash(struct wb_cache *, struct lookup_key *);
struct metablock *ht_lookup(struct wb_cache *,
			    struct ht_head *, struct lookup_key *);
void ht_register(struct wb_cache *, struct ht_head *,
		 struct lookup_key *, struct metablock *);
void ht_del(struct wb_cache *, struct metablock *);
void discard_caches_inseg(struct wb_cache *,
			  struct segment_header *);
#endif

---------- migrate-daemon.h ----------
#ifndef WRITEBOOST_MIGRATE_DAEMON_H
#define WRITEBOOST_MIGRATE_DAEMON_H

#include "writeboost.h"
#include "util.h"
#include "segment.h"

u8 atomic_read_mb_dirtiness(struct segment_header *,
			    struct metablock *);

void cleanup_mb_if_dirty(struct wb_cache *,
			 struct segment_header *,
			 struct metablock *);

void migrate_proc(struct work_struct *);

void wait_for_migration(struct wb_cache *, size_t id);
#endif

---------- migrate-modulator.h ----------
#ifndef WRITEBOOST_MIGRATE_MODULATOR_H
#define WRITEBOOST_MIGRATE_MODULATOR_H

#include "writeboost.h"

void modulator_proc(struct work_struct *);
#endif

---------- queue-flush-job.h ----------
#ifndef WRITEBOOST_QUEUE_FLUSH_JOB
#define WRITEBOOST_QUEUE_FLUSH_JOB

#include "writeboost.h"
#include "segment.h"
#include "hashtable.h"
#include "util.h"
#include "migrate-daemon.h"

void queue_current_buffer(struct wb_cache *);
void flush_current_buffer(struct wb_cache *);
#endif

---------- rambuf.h ----------
#ifndef WRITEBOOST_RAMBUF_H
#define WRITEBOOST_RAMBUF_H

#include "writeboost.h"

int __must_check init_rambuf_pool(struct wb_cache *);
void free_rambuf_pool(struct wb_cache *);
#endif

---------- recover.h ----------
#ifndef WRITEBOOST_RECOVER_H
#define WRITEBOOST_RECOVER_H

#include "writeboost.h"
#include "util.h"
#include "segment.h"
#include "bigarray.h"
#include "hashtable.h"
#include "migrate-daemon.h"
#include "handle-io.h"

int __must_check recover_cache(struct wb_cache *);
#endif

---------- segment.h ----------
#ifndef WRITEBOOST_SEGMENT_H
#define WRITEBOOST_SEGMENT_H

#include "writeboost.h"
#include "segment.h"
#include "bigarray.h"
#include "util.h"

int __must_check init_segment_header_array(struct wb_cache *);
u64 calc_nr_segments(struct dm_dev *);
struct segment_header *get_segment_header_by_id(struct wb_cache *,
					        size_t segment_id);
sector_t calc_segment_header_start(size_t segment_idx);
sector_t calc_mb_start_sector(struct segment_header *, cache_nr mb_idx);
u32 calc_segment_lap(struct wb_cache *, size_t segment_id);
struct metablock *mb_at(struct wb_cache *, cache_nr idx);
bool is_on_buffer(struct wb_cache *, cache_nr mb_idx);
#endif

---------- superblock-recorder.h ----------
#ifndef WRITEBOOST_SUPERBLOCK_RECORDER_H
#define WRITEBOOST_SUPERBLOCK_RECORDER_H

#include "writeboost.h"
#include "util.h"

void recorder_proc(struct work_struct *);
#endif

---------- target.h ----------
#ifndef WRITEBOOST_TARGET_H
#define WRITEBOOST_TARGET_H

#include "writeboost.h"
#include "format-cache.h"
#include "cache-alloc.h"
#include "handle-io.h"
#include "util.h"
#endif

---------- util.h ----------
#ifndef WRITEBOOST_UTIL_H
#define WRITEBOOST_UTIL_H

#include "writeboost.h"

extern struct workqueue_struct *safe_io_wq;
extern struct dm_io_client *wb_io_client;

void *do_kmalloc_retry(size_t size, gfp_t flags, int lineno);
#define kmalloc_retry(size, flags) \
	do_kmalloc_retry((size), (flags), __LINE__)

int dm_safe_io_internal(
		struct dm_io_request *,
		unsigned num_regions, struct dm_io_region *,
		unsigned long *err_bits, bool thread, int lineno);
#define dm_safe_io(io_req, num_regions, regions, err_bits, thread) \
	dm_safe_io_internal((io_req), (num_regions), (regions), \
			    (err_bits), (thread), __LINE__)

void dm_safe_io_retry_internal(
		struct dm_io_request *,
		unsigned num_regions, struct dm_io_region *,
		bool thread, int lineno);
#define dm_safe_io_retry(io_req, num_regions, regions, thread) \
	dm_safe_io_retry_internal((io_req), (num_regions), (regions), \
				  (thread), __LINE__)

sector_t dm_devsize(struct dm_dev *);
#endif

---------- writeboost.h ----------
/*
 * Copyright (C) 2012-2013 Akira Hayakawa <ruby.wktk@gmail.com>
 *
 * This file is released under the GPL.
 */

#ifndef DM_WRITEBOOST_H
#define DM_WRITEBOOST_H

#define DM_MSG_PREFIX "writeboost"

#include <linux/module.h>
#include <linux/version.h>
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/mutex.h>
#include <linux/sched.h>
#include <linux/timer.h>
#include <linux/device-mapper.h>
#include <linux/dm-io.h>

#define WBERR(f, args...) \
	DMERR("err@%d " f, __LINE__, ## args)
#define WBWARN(f, args...) \
	DMWARN("warn@%d " f, __LINE__, ## args)
#define WBINFO(f, args...) \
	DMINFO("info@%d " f, __LINE__, ## args)


/*
 * Segment size is (1 << x) sectors, with 4 <= x <= 11;
 * dm-writeboost supports a segment size of up to 1MB.
 *
 * All the comments below assume
 * the segment size is the maximum, 1MB.
 */
#define WB_SEGMENTSIZE_ORDER 11

/*
 * By default,
 * we allocate 64 * 1MB RAM buffers statically.
 */
#define NR_RAMBUF_POOL 64

/*
 * The first 4KB (1<<3 sectors) in segment
 * is for metadata.
 */
#define NR_CACHES_INSEG ((1 << (WB_SEGMENTSIZE_ORDER - 3)) - 1)

/*
 * The Detail of the Disk Format
 *
 * Whole:
 * Superblock(1MB) Segment(1MB) Segment(1MB) ...
 * We reserve the first segment (1MB) as the superblock.
 *
 * Superblock(1MB):
 * head <----                               ----> tail
 * superblock header(512B) ... superblock record(512B)
 *
 * Segment(1MB):
 * segment_header_device(4KB) metablock_device(4KB) * NR_CACHES_INSEG
 */

/*
 * Superblock Header
 * The first sector of the superblock region.
 * Its contents are fixed once formatted.
 */

 /*
  * Magic Number
  * "WBst"
  */
#define WRITEBOOST_MAGIC 0x57427374
struct superblock_header_device {
	__le32 magic;
} __packed;

/*
 * Superblock Record (Mutable)
 * The last sector of the superblock region.
 * Records the current cache status as needed.
 */
struct superblock_record_device {
	__le64 last_migrated_segment_id;
} __packed;

/*
 * Cache line index.
 *
 * dm-writeboost can support a cache device
 * with size less than 4KB * (1 << 32),
 * that is, 16TB.
 */
typedef u32 cache_nr;

/*
 * Metadata of a 4KB cache line
 *
 * Dirtiness is defined for each sector
 * in this cache line.
 */
struct metablock {
	sector_t sector; /* key */

	cache_nr idx; /* Const */

	struct hlist_node ht_list;

	/*
	 * 8-bit dirtiness flag,
	 * one bit for each sector in the cache line.
	 *
	 * The current implementation
	 * only recovers dirty caches.
	 * Recovering clean caches complicates the code
	 * but wouldn't be effective,
	 * since only a few of the caches are clean.
	 */
	u8 dirty_bits;
};

/*
 * On-disk metablock
 */
struct metablock_device {
	__le64 sector;

	u8 dirty_bits;

	__le32 lap;
} __packed;

#define SZ_MAX (~(size_t)0)
struct segment_header {
	struct metablock mb_array[NR_CACHES_INSEG];

	/*
	 * ID uniformly increases.
	 * ID 0 is used to tell that the segment is invalid
	 * and valid id >= 1.
	 */
	u64 global_id;

	/*
	 * A segment can be flushed half-done;
	 * length is the number of
	 * metablocks that must be counted
	 * in resuming.
	 */
	u8 length;

	cache_nr start_idx; /* Const */
	sector_t start_sector; /* Const */

	struct list_head migrate_list;

	/*
	 * This segment can not be migrated
	 * to the backing store
	 * until it is flushed.
	 * A flushed segment is on the cache device.
	 */
	struct completion flush_done;

	/*
	 * This segment can not be overwritten
	 * until migrated.
	 */
	struct completion migrate_done;

	spinlock_t lock;

	atomic_t nr_inflight_ios;
};

/*
 * (Locking)
 * Locking metablocks at their own granularity
 * would need too much memory for lock structures.
 * We only lock a metablock by locking the parent segment
 * that includes it.
 */
#define lockseg(seg, flags) spin_lock_irqsave(&(seg)->lock, flags)
#define unlockseg(seg, flags) spin_unlock_irqrestore(&(seg)->lock, flags)

/*
 * On-disk segment header.
 *
 * Must be at most 4KB large.
 */
struct segment_header_device {
	/* - FROM - At most 512 bytes, for atomicity. --- */
	__le64 global_id;
	/*
	 * How many cache lines in this segment
	 * should be counted in resuming.
	 */
	u8 length;
	/*
	 * The lap count of rotation over the cache device,
	 * used to find the head and tail among the
	 * segments on the cache device.
	 */
	__le32 lap;
	/* - TO -------------------------------------- */
	/* This array must locate at the tail */
	struct metablock_device mbarr[NR_CACHES_INSEG];
} __packed;

struct rambuffer {
	void *data;
	struct completion done;
};

enum STATFLAG {
	STAT_WRITE = 0,
	STAT_HIT,
	STAT_ON_BUFFER,
	STAT_FULLSIZE,
};
#define STATLEN (1 << 4)

struct lookup_key {
	sector_t sector;
};

struct ht_head {
	struct hlist_head ht_list;
};

struct wb_device;
struct wb_cache {
	struct wb_device *wb;

	struct dm_dev *device;
	struct mutex io_lock;
	cache_nr nr_caches; /* Const */
	u64 nr_segments; /* Const */
	struct bigarray *segment_header_array;

	/*
	 * Chained hashtable
	 *
	 * Writeboost uses a chained hashtable
	 * for cache lookup.
	 * Cache discarding happens often
	 * and this structure fits our needs.
	 */
	struct bigarray *htable;
	size_t htsize;
	struct ht_head *null_head;

	cache_nr cursor; /* Index of the cache line written most recently */
	struct segment_header *current_seg;
	struct rambuffer *current_rambuf;
	struct rambuffer *rambuf_pool;

	u64 last_migrated_segment_id;
	u64 last_flushed_segment_id;
	u64 reserving_segment_id;

	/*
	 * Flush daemon
	 *
	 * Writeboost first queues a segment to flush
	 * and the flush daemon asynchronously
	 * flushes it to the cache device.
	 */
	struct work_struct flush_work;
	struct workqueue_struct *flush_wq;
	spinlock_t flush_queue_lock;
	struct list_head flush_queue;
	wait_queue_head_t flush_wait_queue;

	/*
	 * Deferred ACK for barriers.
	 */
	struct work_struct barrier_deadline_work;
	struct timer_list barrier_deadline_timer;
	struct bio_list barrier_ios;
	unsigned long barrier_deadline_ms; /* param */

	/*
	 * Migration daemon
	 *
	 * Migration also works in the background.
	 *
	 * If allow_migrate is true,
	 * the migrate daemon starts migration
	 * whenever there are segments to migrate.
	 */
	struct work_struct migrate_work;
	struct workqueue_struct *migrate_wq;
	bool allow_migrate; /* param */

	/*
	 * Batched Migration
	 *
	 * Migration is done atomically
	 * with a number of segments batched.
	 */
	wait_queue_head_t migrate_wait_queue;
	atomic_t migrate_fail_count;
	atomic_t migrate_io_count;
	struct list_head migrate_list;
	u8 *dirtiness_snapshot;
	void *migrate_buffer;
	size_t nr_cur_batched_migration;
	size_t nr_max_batched_migration; /* param */

	/*
	 * Migration modulator
	 *
	 * This daemon turns migration on and off
	 * according to the load of the backing store.
	 */
	struct work_struct modulator_work;
	bool enable_migration_modulator; /* param */

	/*
	 * Superblock Recorder
	 *
	 * Update the superblock record
	 * periodically.
	 */
	struct work_struct recorder_work;
	unsigned long update_record_interval; /* param */

	/*
	 * Cache Synchronizer
	 *
	 * Sync the dirty writes
	 * periodically.
	 */
	struct work_struct sync_work;
	unsigned long sync_interval; /* param */

	/*
	 * on_terminate is set to true
	 * to notify all the background daemons
	 * to stop their operations.
	 */
	bool on_terminate;

	atomic64_t stat[STATLEN];
};

struct wb_device {
	struct dm_target *ti;

	struct dm_dev *device;

	struct wb_cache *cache;

	u8 migrate_threshold;

	atomic64_t nr_dirty_caches;
};

struct flush_job {
	struct list_head flush_queue;
	struct segment_header *seg;
	/*
	 * The data to flush to cache device.
	 */
	struct rambuffer *rambuf;
	/*
	 * List of bios with barrier flags.
	 */
	struct bio_list barrier_ios;
};

#define PER_BIO_VERSION KERNEL_VERSION(3, 8, 0)
#if LINUX_VERSION_CODE >= PER_BIO_VERSION
struct per_bio_data {
	void *ptr;
};
#endif
#endif

---------- bigarray.c ----------
/*
 * Copyright (C) 2012-2013 Akira Hayakawa <ruby.wktk@gmail.com>
 *
 * This file is released under the GPL.
 */

/*
 * An array-like structure
 * that can contain millions of elements.
 * The aim of this class is the same as
 * flex_array.
 * The reason we don't use flex_array is
 * that it trades performance
 * for resizability.
 * struct bigarray is fast and lightweight.
 */

#include "bigarray.h"

struct part {
	void *memory;
};

struct bigarray {
	struct part *parts;
	size_t nr_elems;
	size_t elemsize;
};

#define ALLOC_SIZE (1 << 16)
static size_t nr_elems_in_part(struct bigarray *arr)
{
	return ALLOC_SIZE / arr->elemsize;
}

static size_t nr_parts(struct bigarray *arr)
{
	return dm_div_up(arr->nr_elems, nr_elems_in_part(arr));
}

struct bigarray *make_bigarray(size_t elemsize, size_t nr_elems)
{
	size_t i, j;
	struct part *part;

	struct bigarray *arr = kmalloc(sizeof(*arr), GFP_KERNEL);
	if (!arr) {
		WBERR();
		return NULL;
	}

	arr->elemsize = elemsize;
	arr->nr_elems = nr_elems;
	arr->parts = kmalloc(sizeof(struct part) * nr_parts(arr), GFP_KERNEL);
	if (!arr->parts) {
		WBERR();
		goto bad_alloc_parts;
	}

	for (i = 0; i < nr_parts(arr); i++) {
		part = arr->parts + i;
		part->memory = kmalloc(ALLOC_SIZE, GFP_KERNEL);
		if (!part->memory) {
			WBERR();
			for (j = 0; j < i; j++) {
				part = arr->parts + j;
				kfree(part->memory);
			}
			goto bad_alloc_parts_memory;
		}
	}
	return arr;

bad_alloc_parts_memory:
	kfree(arr->parts);
bad_alloc_parts:
	kfree(arr);
	return NULL;
}

void kill_bigarray(struct bigarray *arr)
{
	size_t i;
	for (i = 0; i < nr_parts(arr); i++) {
		struct part *part = arr->parts + i;
		kfree(part->memory);
	}
	kfree(arr->parts);
	kfree(arr);
}

void *bigarray_at(struct bigarray *arr, size_t i)
{
	size_t n = nr_elems_in_part(arr);
	size_t j = i / n;
	size_t k = i % n;
	struct part *part = arr->parts + j;
	return part->memory + (arr->elemsize * k);
}
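
/*
 * Usage sketch (illustration only, not part of the driver):
 * the function name and the numbers below are made up just to show
 * the make_bigarray()/bigarray_at()/kill_bigarray() API.
 */
#if 0
static int example_bigarray_usage(void)
{
	u64 *p;
	struct bigarray *counters = make_bigarray(sizeof(u64), 1000000);
	if (!counters)
		return -ENOMEM;

	/* Elements are addressed by plain index, as in a flat array. */
	p = bigarray_at(counters, 123456);
	*p = 42;

	kill_bigarray(counters);
	return 0;
}
#endif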

---------- cache-alloc.c ----------
/*
 * Copyright (C) 2012-2013 Akira Hayakawa <ruby.wktk@gmail.com>
 *
 * This file is released under the GPL.
 */

/*
 * Cache resume/free operations are provided.
 * Resuming a cache constructs the in-core
 * metadata structures from the metadata
 * region on the cache device.
 */

#include "cache-alloc.h"

int __must_check resume_cache(struct wb_cache *cache, struct dm_dev *dev)
{
	int r = 0;

	cache->device = dev;
	cache->nr_segments = calc_nr_segments(cache->device);
	cache->nr_caches = cache->nr_segments * NR_CACHES_INSEG;
	cache->on_terminate = false;
	cache->allow_migrate = true;
	cache->reserving_segment_id = 0;
	mutex_init(&cache->io_lock);

	cache->enable_migration_modulator = true;
	cache->update_record_interval = 60;
	cache->sync_interval = 60;

	r = init_rambuf_pool(cache);
	if (r) {
		WBERR();
		goto bad_init_rambuf_pool;
	}

	/*
	 * Select an arbitrary one as the initial RAM buffer.
	 */
	cache->current_rambuf = cache->rambuf_pool + 0;

	r = init_segment_header_array(cache);
	if (r) {
		WBERR();
		goto bad_alloc_segment_header_array;
	}

	r = ht_empty_init(cache);
	if (r) {
		WBERR();
		goto bad_alloc_ht;
	}

	/*
	 * All in-core structures are allocated and
	 * initialized.
	 * Next, read metadata from the cache device.
	 */

	r = recover_cache(cache);
	if (r) {
		WBERR();
		goto bad_recover;
	}


	/* Data structures for Migration */
	cache->migrate_buffer = vmalloc(NR_CACHES_INSEG << 12);
	if (!cache->migrate_buffer) {
		WBERR();
		r = -ENOMEM;
		goto bad_alloc_migrate_buffer;
	}

	cache->dirtiness_snapshot = kmalloc(
			NR_CACHES_INSEG,
			GFP_KERNEL);
	if (!cache->dirtiness_snapshot) {
		WBERR();
		r = -ENOMEM;
		goto bad_alloc_dirtiness_snapshot;
	}

	cache->migrate_wq = create_singlethread_workqueue("migratewq");
	if (!cache->migrate_wq) {
		WBERR();
		r = -ENOMEM;
		goto bad_migratewq;
	}

	cache->flush_wq = create_singlethread_workqueue("flushwq");
	if (!cache->flush_wq) {
		WBERR();
		r = -ENOMEM;
		goto bad_flushwq;
	}


	/* Migration Daemon */
	INIT_WORK(&cache->migrate_work, migrate_proc);
	init_waitqueue_head(&cache->migrate_wait_queue);
	INIT_LIST_HEAD(&cache->migrate_list);
	atomic_set(&cache->migrate_fail_count, 0);
	atomic_set(&cache->migrate_io_count, 0);
	cache->nr_max_batched_migration = 1;
	cache->nr_cur_batched_migration = 1;
	queue_work(cache->migrate_wq, &cache->migrate_work);


	/* Deferred ACK for barrier writes */
	setup_timer(&cache->barrier_deadline_timer,
		    barrier_deadline_proc, (unsigned long) cache);
	bio_list_init(&cache->barrier_ios);
	/*
	 * The deadline is 3 ms by default.
	 * Processing one bio takes about 2.5 us,
	 * so 3 ms is long enough to process 255 bios.
	 * If the buffer doesn't get full within 3 ms,
	 * we suspect that writes are starving
	 * while waiting for a formerly submitted barrier to complete.
	 */
	cache->barrier_deadline_ms = 3;
	INIT_WORK(&cache->barrier_deadline_work, flush_barrier_ios);


	/* Flush Daemon */
	INIT_WORK(&cache->flush_work, flush_proc);
	spin_lock_init(&cache->flush_queue_lock);
	INIT_LIST_HEAD(&cache->flush_queue);
	init_waitqueue_head(&cache->flush_wait_queue);
	queue_work(cache->flush_wq, &cache->flush_work);


	/* Migration Modulator */
	INIT_WORK(&cache->modulator_work, modulator_proc);
	schedule_work(&cache->modulator_work);


	/* Superblock Recorder */
	INIT_WORK(&cache->recorder_work, recorder_proc);
	schedule_work(&cache->recorder_work);


	/* Dirty Synchronizer */
	INIT_WORK(&cache->sync_work, sync_proc);
	schedule_work(&cache->sync_work);


	clear_stat(cache);

	return 0;

bad_flushwq:
	destroy_workqueue(cache->migrate_wq);
bad_migratewq:
	kfree(cache->dirtiness_snapshot);
bad_alloc_dirtiness_snapshot:
	vfree(cache->migrate_buffer);
bad_alloc_migrate_buffer:
bad_recover:
	kill_bigarray(cache->htable);
bad_alloc_ht:
	kill_bigarray(cache->segment_header_array);
bad_alloc_segment_header_array:
	free_rambuf_pool(cache);
bad_init_rambuf_pool:
	/* The caller allocated the cache struct and frees it on failure. */
	return r;
}

void free_cache(struct wb_cache *cache)
{
	cache->on_terminate = true;

	/* Kill in-kernel daemons */
	cancel_work_sync(&cache->sync_work);
	cancel_work_sync(&cache->recorder_work);
	cancel_work_sync(&cache->modulator_work);

	cancel_work_sync(&cache->flush_work);
	destroy_workqueue(cache->flush_wq);

	cancel_work_sync(&cache->barrier_deadline_work);

	cancel_work_sync(&cache->migrate_work);
	destroy_workqueue(cache->migrate_wq);
	kfree(cache->dirtiness_snapshot);
	vfree(cache->migrate_buffer);

	/* Destroy in-core structures */
	kill_bigarray(cache->htable);
	kill_bigarray(cache->segment_header_array);

	free_rambuf_pool(cache);
}

---------- defer-barrier.c ----------
/*
 * Copyright (C) 2012-2013 Akira Hayakawa <ruby.wktk@gmail.com>
 *
 * This file is released under the GPL.
 */

#include "defer-barrier.h"

void queue_barrier_io(struct wb_cache *cache, struct bio *bio)
{
	mutex_lock(&cache->io_lock);
	bio_list_add(&cache->barrier_ios, bio);
	mutex_unlock(&cache->io_lock);

	if (!timer_pending(&cache->barrier_deadline_timer))
		mod_timer(&cache->barrier_deadline_timer,
			  msecs_to_jiffies(cache->barrier_deadline_ms));
}

void barrier_deadline_proc(unsigned long data)
{
	struct wb_cache *cache = (struct wb_cache *) data;
	schedule_work(&cache->barrier_deadline_work);
}

void flush_barrier_ios(struct work_struct *work)
{
	struct wb_cache *cache =
		container_of(work, struct wb_cache,
			     barrier_deadline_work);

	if (bio_list_empty(&cache->barrier_ios))
		return;

	flush_current_buffer(cache);
}

---------- dirty-sync.c ----------
/*
 * Copyright (C) 2012-2013 Akira Hayakawa <ruby.wktk@gmail.com>
 *
 * This file is released under the GPL.
 */

#include "dirty-sync.h"

void sync_proc(struct work_struct *work)
{
	struct wb_cache *cache =
		container_of(work, struct wb_cache, sync_work);
	unsigned long intvl;

	while (true) {
		if (cache->on_terminate)
			return;

		/* sec -> ms */
		intvl = cache->sync_interval * 1000;

		if (!intvl) {
			schedule_timeout_interruptible(msecs_to_jiffies(1000));
			continue;
		}

		WBINFO();
		flush_current_buffer(cache);
		blkdev_issue_flush(cache->device->bdev, GFP_NOIO, NULL);

		schedule_timeout_interruptible(msecs_to_jiffies(intvl));
	}
}

---------- flush-daemon.c ----------
/*
 * Copyright (C) 2012-2013 Akira Hayakawa <ruby.wktk@gmail.com>
 *
 * This file is released under the GPL.
 */

#include "flush-daemon.h"

void flush_proc(struct work_struct *work)
{
	unsigned long flags;

	struct wb_cache *cache =
		container_of(work, struct wb_cache, flush_work);

	while (true) {
		struct flush_job *job;
		struct segment_header *seg;
		struct dm_io_request io_req;
		struct dm_io_region region;

		WBINFO();

		spin_lock_irqsave(&cache->flush_queue_lock, flags);
		while (list_empty(&cache->flush_queue)) {
			spin_unlock_irqrestore(&cache->flush_queue_lock, flags);
			wait_event_interruptible_timeout(
				cache->flush_wait_queue,
				(!list_empty(&cache->flush_queue)),
				msecs_to_jiffies(100));
			spin_lock_irqsave(&cache->flush_queue_lock, flags);

			if (cache->on_terminate) {
				spin_unlock_irqrestore(
					&cache->flush_queue_lock, flags);
				return;
			}
		}

		/*
		 * Pop a flush job from the queue
		 * and flush it.
		 */
		job = list_first_entry(
			&cache->flush_queue, struct flush_job, flush_queue);
		list_del(&job->flush_queue);
		spin_unlock_irqrestore(&cache->flush_queue_lock, flags);

		seg = job->seg;

		io_req = (struct dm_io_request) {
			.client = wb_io_client,
			.bi_rw = WRITE,
			.notify.fn = NULL,
			.mem.type = DM_IO_KMEM,
			.mem.ptr.addr = job->rambuf->data,
		};

		region = (struct dm_io_region) {
			.bdev = cache->device->bdev,
			.sector = seg->start_sector,
			.count = (seg->length + 1) << 3,
		};

		dm_safe_io_retry(&io_req, 1, &region, false);

		cache->last_flushed_segment_id = seg->global_id;

		complete_all(&seg->flush_done);

		complete_all(&job->rambuf->done);

		/*
		 * Deferred ACK
		 */
		if (!bio_list_empty(&job->barrier_ios)) {
			struct bio *bio;
			blkdev_issue_flush(cache->device->bdev, GFP_NOIO, NULL);
			while ((bio = bio_list_pop(&job->barrier_ios)))
				bio_endio(bio, 0);

			mod_timer(&cache->barrier_deadline_timer,
				  msecs_to_jiffies(cache->barrier_deadline_ms));
		}

		kfree(job);
	}
}

---------- format-cache.c ----------
/*
 * Copyright (C) 2012-2013 Akira Hayakawa <ruby.wktk@gmail.com>
 *
 * This file is released under the GPL.
 */

#include "format-cache.h"

static int read_superblock_header(struct superblock_header_device *sup,
				  struct dm_dev *dev)
{
	int r = 0;
	struct dm_io_request io_req_sup;
	struct dm_io_region region_sup;

	void *buf = kmalloc(1 << SECTOR_SHIFT, GFP_KERNEL);
	if (!buf) {
		WBERR();
		return -ENOMEM;
	}

	io_req_sup = (struct dm_io_request) {
		.client = wb_io_client,
		.bi_rw = READ,
		.notify.fn = NULL,
		.mem.type = DM_IO_KMEM,
		.mem.ptr.addr = buf,
	};
	region_sup = (struct dm_io_region) {
		.bdev = dev->bdev,
		.sector = 0,
		.count = 1,
	};
	r = dm_safe_io(&io_req_sup, 1, &region_sup, NULL, false);
	if (r) {
		WBERR();
		kfree(buf);
		return r;
	}

	/*
	 * Copy the result out before freeing the I/O buffer;
	 * reading buf after kfree() would be a use-after-free.
	 */
	memcpy(sup, buf, sizeof(*sup));
	kfree(buf);

	return 0;
}

static int audit_superblock_header(struct superblock_header_device *sup)
{
	u32 magic = le32_to_cpu(sup->magic);

	if (magic != WRITEBOOST_MAGIC) {
		WBERR();
		return -EINVAL;
	}

	return 0;
}

/*
 * Check if the cache device is already formatted.
 * Returns 0 iff this routine runs without failure.
 * cache_valid is set to true iff the cache device
 * is formatted and need not be re-formatted.
 */
int __must_check audit_cache_device(struct dm_dev *dev,
				    bool *cache_valid)
{
	int r = 0;
	struct superblock_header_device sup;
	r = read_superblock_header(&sup, dev);
	if (r)
		return r;

	*cache_valid = audit_superblock_header(&sup) ? false : true;
	return r;
}

static int format_superblock_header(struct dm_dev *dev)
{
	int r = 0;
	struct dm_io_request io_req_sup;
	struct dm_io_region region_sup;

	struct superblock_header_device sup = {
		.magic = cpu_to_le32(WRITEBOOST_MAGIC),
	};

	void *buf = kzalloc(1 << SECTOR_SHIFT, GFP_KERNEL);
	if (!buf) {
		WBERR();
		return -ENOMEM;
	}

	memcpy(buf, &sup, sizeof(sup));

	io_req_sup = (struct dm_io_request) {
		.client = wb_io_client,
		.bi_rw = WRITE_FUA,
		.notify.fn = NULL,
		.mem.type = DM_IO_KMEM,
		.mem.ptr.addr = buf,
	};
	region_sup = (struct dm_io_region) {
		.bdev = dev->bdev,
		.sector = 0,
		.count = 1,
	};
	r = dm_safe_io(&io_req_sup, 1, &region_sup, NULL, false);
	kfree(buf);

	if (r) {
		WBERR();
		return r;
	}

	return 0;
}

struct format_segmd_context {
	int err;
	atomic64_t count;
};

static void format_segmd_endio(unsigned long error, void *__context)
{
	struct format_segmd_context *context = __context;
	if (error)
		context->err = 1;
	atomic64_dec(&context->count);
}

/*
 * Format superblock header and
 * all the metadata regions over the cache device.
 */
int __must_check format_cache_device(struct dm_dev *dev)
{
	u64 i, nr_segments = calc_nr_segments(dev);
	struct format_segmd_context context;
	struct dm_io_request io_req_sup;
	struct dm_io_region region_sup;
	void *buf;

	int r = 0;

	/*
	 * Zeroing the full superblock
	 */
	buf = kzalloc(1 << 20, GFP_KERNEL);
	if (!buf) {
		WBERR();
		return -ENOMEM;
	}

	io_req_sup = (struct dm_io_request) {
		.client = wb_io_client,
		.bi_rw = WRITE_FUA,
		.notify.fn = NULL,
		.mem.type = DM_IO_KMEM,
		.mem.ptr.addr = buf,
	};
	region_sup = (struct dm_io_region) {
		.bdev = dev->bdev,
		.sector = 0,
		.count = (1 << 11),
	};
	r = dm_safe_io(&io_req_sup, 1, &region_sup, NULL, false);
	kfree(buf);

	if (r) {
		WBERR();
		return r;
	}

	r = format_superblock_header(dev);
	if (r) {
		WBERR();
		return r;
	}

	/* Format the metadata regions */

	/*
	 * Initialize the counter to the number of segments;
	 * each completed metadata write decrements it.
	 */
	atomic64_set(&context.count, nr_segments);
	context.err = 0;

	buf = kzalloc(1 << 12, GFP_KERNEL);
	if (!buf) {
		WBERR();
		return -ENOMEM;
	}

	/*
	 * Submit all the writes asynchronously.
	 */
	for (i = 0; i < nr_segments; i++) {
		struct dm_io_request io_req_seg = {
			.client = wb_io_client,
			.bi_rw = WRITE,
			.notify.fn = format_segmd_endio,
			.notify.context = &context,
			.mem.type = DM_IO_KMEM,
			.mem.ptr.addr = buf,
		};
		struct dm_io_region region_seg = {
			.bdev = dev->bdev,
			.sector = calc_segment_header_start(i),
			.count = (1 << 3),
		};
		r = dm_safe_io(&io_req_seg, 1, &region_seg, NULL, false);
		if (r) {
			WBERR();
			break;
		}
	}
	kfree(buf);

	if (r) {
		WBERR();
		return r;
	}

	/*
	 * Wait for all the writes to complete.
	 */
	while (atomic64_read(&context.count))
		schedule_timeout_interruptible(msecs_to_jiffies(100));

	if (context.err) {
		WBERR();
		return -EIO;
	}

	return blkdev_issue_flush(dev->bdev, GFP_KERNEL, NULL);
}

---------- handle-io.c ----------
/*
 * Copyright (C) 2012-2013 Akira Hayakawa <ruby.wktk@gmail.com>
 *
 * This file is released under the GPL.
 */

#include "handle-io.h"

void inc_nr_dirty_caches(struct wb_device *wb)
{
	BUG_ON(!wb);
	atomic64_inc(&wb->nr_dirty_caches);
}

static void dec_nr_dirty_caches(struct wb_device *wb)
{
	BUG_ON(!wb);
	atomic64_dec(&wb->nr_dirty_caches);
}

void cleanup_mb_if_dirty(struct wb_cache *cache,
			 struct segment_header *seg,
			 struct metablock *mb)
{
	unsigned long flags;

	bool b = false;
	lockseg(seg, flags);
	if (mb->dirty_bits) {
		mb->dirty_bits = 0;
		b = true;
	}
	unlockseg(seg, flags);

	if (b)
		dec_nr_dirty_caches(cache->wb);
}

u8 atomic_read_mb_dirtiness(struct segment_header *seg,
			    struct metablock *mb)
{
	unsigned long flags;
	u8 r;

	lockseg(seg, flags);
	r = mb->dirty_bits;
	unlockseg(seg, flags);

	return r;
}

static void inc_stat(struct wb_cache *cache,
		     int rw, bool found, bool on_buffer, bool fullsize)
{
	atomic64_t *v;

	int i = 0;
	if (rw)
		i |= (1 << STAT_WRITE);
	if (found)
		i |= (1 << STAT_HIT);
	if (on_buffer)
		i |= (1 << STAT_ON_BUFFER);
	if (fullsize)
		i |= (1 << STAT_FULLSIZE);

	v = &cache->stat[i];
	atomic64_inc(v);
}

void clear_stat(struct wb_cache *cache)
{
	int i;
	for (i = 0; i < STATLEN; i++) {
		atomic64_t *v = &cache->stat[i];
		atomic64_set(v, 0);
	}
}

/*
 * Migrate the data of a cache line on the cache device
 * to the backing store.
 */
static void migrate_mb(struct wb_cache *cache, struct segment_header *seg,
		       struct metablock *mb, u8 dirty_bits, bool thread)
{
	struct wb_device *wb = cache->wb;

	if (!dirty_bits)
		return;

	if (dirty_bits == 255) {
		void *buf = kmalloc_retry(1 << 12, GFP_NOIO);
		struct dm_io_request io_req_r, io_req_w;
		struct dm_io_region region_r, region_w;

		io_req_r = (struct dm_io_request) {
			.client = wb_io_client,
			.bi_rw = READ,
			.notify.fn = NULL,
			.mem.type = DM_IO_KMEM,
			.mem.ptr.addr = buf,
		};
		region_r = (struct dm_io_region) {
			.bdev = cache->device->bdev,
			.sector = calc_mb_start_sector(seg, mb->idx),
			.count = (1 << 3),
		};

		dm_safe_io_retry(&io_req_r, 1, &region_r, thread);

		io_req_w = (struct dm_io_request) {
			.client = wb_io_client,
			.bi_rw = WRITE_FUA,
			.notify.fn = NULL,
			.mem.type = DM_IO_KMEM,
			.mem.ptr.addr = buf,
		};
		region_w = (struct dm_io_region) {
			.bdev = wb->device->bdev,
			.sector = mb->sector,
			.count = (1 << 3),
		};
		dm_safe_io_retry(&io_req_w, 1, &region_w, thread);

		kfree(buf);
	} else {
		void *buf = kmalloc_retry(1 << SECTOR_SHIFT, GFP_NOIO);
		size_t i;
		for (i = 0; i < 8; i++) {
			bool bit_on = dirty_bits & (1 << i);
			struct dm_io_request io_req_r, io_req_w;
			struct dm_io_region region_r, region_w;
			sector_t src;

			if (!bit_on)
				continue;

			io_req_r = (struct dm_io_request) {
				.client = wb_io_client,
				.bi_rw = READ,
				.notify.fn = NULL,
				.mem.type = DM_IO_KMEM,
				.mem.ptr.addr = buf,
			};
			/* A tmp variable just to avoid 80 cols rule */
			src = calc_mb_start_sector(seg, mb->idx) + i;
			region_r = (struct dm_io_region) {
				.bdev = cache->device->bdev,
				.sector = src,
				.count = 1,
			};
			dm_safe_io_retry(&io_req_r, 1, &region_r, thread);

			io_req_w = (struct dm_io_request) {
				.client = wb_io_client,
				.bi_rw = WRITE,
				.notify.fn = NULL,
				.mem.type = DM_IO_KMEM,
				.mem.ptr.addr = buf,
			};
			region_w = (struct dm_io_region) {
				.bdev = wb->device->bdev,
				.sector = mb->sector + 1 * i,
				.count = 1,
			};
			dm_safe_io_retry(&io_req_w, 1, &region_w, thread);
		}
		kfree(buf);
	}
}

/*
 * Migrate a cache line that is still on the RAM buffer.
 * Calling this function is really rare.
 */
static void migrate_buffered_mb(struct wb_cache *cache,
				struct metablock *mb, u8 dirty_bits)
{
	struct wb_device *wb = cache->wb;

	u8 i, k = 1 + (mb->idx % NR_CACHES_INSEG);
	sector_t offset = (k << 3);

	void *buf = kmalloc_retry(1 << SECTOR_SHIFT, GFP_NOIO);
	for (i = 0; i < 8; i++) {
		struct dm_io_request io_req;
		struct dm_io_region region;
		void *src;
		sector_t dest;

		bool bit_on = dirty_bits & (1 << i);
		if (!bit_on)
			continue;

		src = cache->current_rambuf->data +
		      ((offset + i) << SECTOR_SHIFT);
		memcpy(buf, src, 1 << SECTOR_SHIFT);

		io_req = (struct dm_io_request) {
			.client = wb_io_client,
			.bi_rw = WRITE_FUA,
			.notify.fn = NULL,
			.mem.type = DM_IO_KMEM,
			.mem.ptr.addr = buf,
		};

		dest = mb->sector + 1 * i;
		region = (struct dm_io_region) {
			.bdev = wb->device->bdev,
			.sector = dest,
			.count = 1,
		};

		dm_safe_io_retry(&io_req, 1, &region, true);
	}
	kfree(buf);
}

static void bio_remap(struct bio *bio, struct dm_dev *dev, sector_t sector)
{
	bio->bi_bdev = dev->bdev;
	bio->bi_sector = sector;
}

static sector_t calc_cache_alignment(struct wb_cache *cache,
				     sector_t bio_sector)
{
	return (bio_sector / (1 << 3)) * (1 << 3);
}

int writeboost_map(struct dm_target *ti, struct bio *bio
#if LINUX_VERSION_CODE < PER_BIO_VERSION
		 , union map_info *map_context
#endif
		  )
{
	unsigned long flags;
	struct segment_header *uninitialized_var(seg);
	struct metablock *mb, *new_mb;
#if LINUX_VERSION_CODE >= PER_BIO_VERSION
	struct per_bio_data *map_context;
#endif
	sector_t bio_count, bio_offset, s;
	bool bio_fullsize, found, on_buffer,
	     refresh_segment, b;
	int rw;
	struct lookup_key key;
	struct ht_head *head;
	cache_nr update_mb_idx, idx_inseg, k;
	size_t start;
	void *data;

	struct wb_device *wb = ti->private;
	struct wb_cache *cache = wb->cache;
	struct dm_dev *orig = wb->device;

#if LINUX_VERSION_CODE >= PER_BIO_VERSION
	map_context = dm_per_bio_data(bio, ti->per_bio_data_size);
#endif
	map_context->ptr = NULL;

	/*
	 * We discard only on the backing store because
	 * blocks on the cache device are unlikely to be discarded.
	 *
	 * A discard usually arrives long after the write,
	 * so the block is likely to have been migrated already.
	 * Moreover,
	 * we discard the segment at the end of migration
	 * and that's enough for discarding blocks.
	 */
	if (bio->bi_rw & REQ_DISCARD) {
		bio_remap(bio, orig, bio->bi_sector);
		return DM_MAPIO_REMAPPED;
	}

	/*
	 * Deferred ACK for barrier writes
	 *
	 * A bio with REQ_FLUSH is guaranteed
	 * to have no data.
	 * So, simply queue it and return.
	 */
	if (bio->bi_rw & REQ_FLUSH) {
		BUG_ON(bio->bi_size);
		queue_barrier_io(cache, bio);
		return DM_MAPIO_SUBMITTED;
	}

	bio_count = bio->bi_size >> SECTOR_SHIFT;
	bio_fullsize = (bio_count == (1 << 3));
	bio_offset = bio->bi_sector % (1 << 3);

	rw = bio_data_dir(bio);

	key = (struct lookup_key) {
		.sector = calc_cache_alignment(cache, bio->bi_sector),
	};

	k = ht_hash(cache, &key);
	head = bigarray_at(cache->htable, k);

	/*
	 * (Locking)
	 * Why a mutex?
	 *
	 * The reason we use a mutex instead of an rw_semaphore,
	 * which would allow truly concurrent read access,
	 * is that a mutex is even lighter than an rw_semaphore.
	 * Since dm-writeboost is performance-centric software,
	 * the overhead of an rw_semaphore matters.
	 * All in all,
	 * since the exclusive region in the read path is small
	 * and cheap, using an rw_semaphore to let reads
	 * execute concurrently won't improve the performance
	 * as much as one might expect.
	 */
	mutex_lock(&cache->io_lock);
	mb = ht_lookup(cache, head, &key);
	if (mb) {
		seg = ((void *) mb) - (mb->idx % NR_CACHES_INSEG) *
				      sizeof(struct metablock);
		atomic_inc(&seg->nr_inflight_ios);
	}

	found = (mb != NULL);
	on_buffer = false;
	if (found)
		on_buffer = is_on_buffer(cache, mb->idx);

	inc_stat(cache, rw, found, on_buffer, bio_fullsize);

	if (!rw) {
		u8 dirty_bits;

		mutex_unlock(&cache->io_lock);

		if (!found) {
			bio_remap(bio, orig, bio->bi_sector);
			return DM_MAPIO_REMAPPED;
		}

		dirty_bits = atomic_read_mb_dirtiness(seg, mb);

		if (unlikely(on_buffer)) {

			if (dirty_bits)
				migrate_buffered_mb(cache, mb, dirty_bits);

			/*
			 * Cache classes:
			 * Live and Stable
			 *
			 * Live:
			 * The cache is on the RAM buffer.
			 *
			 * Stable:
			 * The cache is not on the RAM buffer
			 * but at least queued in flush_queue.
			 */

			/*
			 * (Locking)
			 * Dirtiness of a live cache
			 *
			 * We can assume the dirtiness of a cache
			 * only increases while it is on the buffer;
			 * we call such a cache live.
			 * This eases the locking because
			 * we don't need to worry that the dirtiness
			 * of a live cache fluctuates.
			 */

			atomic_dec(&seg->nr_inflight_ios);
			bio_remap(bio, orig, bio->bi_sector);
			return DM_MAPIO_REMAPPED;
		}

		wait_for_completion(&seg->flush_done);
		if (likely(dirty_bits == 255)) {
			bio_remap(bio,
				  cache->device,
				  calc_mb_start_sector(seg, mb->idx)
				  + bio_offset);
			map_context->ptr = seg;
		} else {

			/*
			 * (Locking)
			 * Dirtiness of a stable cache
			 *
			 * Unlike live caches, whose dirtiness
			 * doesn't fluctuate,
			 * stable caches, which are not on the buffer
			 * but on the cache device,
			 * may have their dirtiness decreased
			 * by processes other than the migrate daemon.
			 * This works fine
			 * because migrating the same cache twice
			 * doesn't break cache consistency.
			 */

			migrate_mb(cache, seg, mb, dirty_bits, true);
			cleanup_mb_if_dirty(cache, seg, mb);

			atomic_dec(&seg->nr_inflight_ios);
			bio_remap(bio, orig, bio->bi_sector);
		}
		return DM_MAPIO_REMAPPED;
	}

	if (found) {

		if (unlikely(on_buffer)) {
			mutex_unlock(&cache->io_lock);

			update_mb_idx = mb->idx;
			goto write_on_buffer;
		} else {
			u8 dirty_bits = atomic_read_mb_dirtiness(seg, mb);

			/*
			 * First clean up the previous cache
			 * and migrate the cache if needed.
			 */
			bool needs_cleanup_prev_cache =
				!bio_fullsize || !(dirty_bits == 255);

			if (unlikely(needs_cleanup_prev_cache)) {
				wait_for_completion(&seg->flush_done);
				migrate_mb(cache, seg, mb, dirty_bits, true);
			}

			/*
			 * Fullsize dirty cache
			 * can be discarded without migration.
			 */
			cleanup_mb_if_dirty(cache, seg, mb);

			ht_del(cache, mb);

			atomic_dec(&seg->nr_inflight_ios);
			goto write_not_found;
		}
	}

write_not_found:
	;

	/*
	 * If cache->cursor is 254, 509, ...
	 * it points at the last cache line in the segment.
	 * We must flush the current segment and
	 * get a new one.
	 */
	refresh_segment = !((cache->cursor + 1) % NR_CACHES_INSEG);

	if (refresh_segment)
		queue_current_buffer(cache);

	cache->cursor = (cache->cursor + 1) % cache->nr_caches;

	/*
	 * update_mb_idx is the cache line index to update.
	 */
	update_mb_idx = cache->cursor;

	seg = cache->current_seg;
	atomic_inc(&seg->nr_inflight_ios);

	new_mb = seg->mb_array + (update_mb_idx % NR_CACHES_INSEG);
	new_mb->dirty_bits = 0;
	ht_register(cache, head, &key, new_mb);
	mutex_unlock(&cache->io_lock);

	mb = new_mb;

write_on_buffer:
	;
	idx_inseg = update_mb_idx % NR_CACHES_INSEG;
	s = (idx_inseg + 1) << 3;

	b = false;
	lockseg(seg, flags);
	if (!mb->dirty_bits) {
		seg->length++;
		BUG_ON(seg->length >  NR_CACHES_INSEG);
		b = true;
	}

	if (likely(bio_fullsize)) {
		mb->dirty_bits = 255;
	} else {
		u8 i;
		u8 acc_bits = 0;
		s += bio_offset;
		for (i = bio_offset; i < (bio_offset+bio_count); i++)
			acc_bits += (1 << i);

		mb->dirty_bits |= acc_bits;
	}

	BUG_ON(!mb->dirty_bits);

	unlockseg(seg, flags);

	if (b)
		inc_nr_dirty_caches(wb);

	start = s << SECTOR_SHIFT;
	data = bio_data(bio);

	memcpy(cache->current_rambuf->data + start, data, bio->bi_size);
	atomic_dec(&seg->nr_inflight_ios);

	/*
	 * Deferred ACK for barrier writes
	 *
	 * A bio with the REQ_FUA flag carries data,
	 * so it runs through the path for
	 * ordinary bios and its data is
	 * now stored in the RAM buffer.
	 * After that, queue it and return
	 * to defer completion.
	 */
	if (bio->bi_rw & REQ_FUA) {
		queue_barrier_io(cache, bio);
		return DM_MAPIO_SUBMITTED;
	}

	bio_endio(bio, 0);
	return DM_MAPIO_SUBMITTED;
}

int writeboost_end_io(struct dm_target *ti, struct bio *bio, int error
#if LINUX_VERSION_CODE < PER_BIO_VERSION
		    , union map_info *map_context
#endif
		     )
{
	struct segment_header *seg;
#if LINUX_VERSION_CODE >= PER_BIO_VERSION
	struct per_bio_data *map_context =
		dm_per_bio_data(bio, ti->per_bio_data_size);
#endif
	if (!map_context->ptr)
		return 0;

	seg = map_context->ptr;
	atomic_dec(&seg->nr_inflight_ios);

	return 0;
}

---------- hashtable.c ----------
/*
 * Copyright (C) 2012-2013 Akira Hayakawa <ruby.wktk@gmail.com>
 *
 * This file is released under the GPL.
 */

#include "hashtable.h"

/*
 * Initialize the Hash Table.
 */
int __must_check ht_empty_init(struct wb_cache *cache)
{
	cache_nr idx;
	size_t i;
	size_t nr_heads;
	struct bigarray *arr;

	cache->htsize = cache->nr_caches;
	nr_heads = cache->htsize + 1;
	arr = make_bigarray(sizeof(struct ht_head), nr_heads);
	if (!arr) {
		WBERR();
		return -ENOMEM;
	}

	cache->htable = arr;

	for (i = 0; i < nr_heads; i++) {
		struct ht_head *hd = bigarray_at(arr, i);
		INIT_HLIST_HEAD(&hd->ht_list);
	}

	/*
	 * Our hashtable has one special bucket called null head.
	 * Orphan metablocks are linked to the null head.
	 */
	cache->null_head = bigarray_at(cache->htable, cache->htsize);

	for (idx = 0; idx < cache->nr_caches; idx++) {
		struct metablock *mb = mb_at(cache, idx);
		hlist_add_head(&mb->ht_list, &cache->null_head->ht_list);
	}

	return 0;
}

cache_nr ht_hash(struct wb_cache *cache, struct lookup_key *key)
{
	return key->sector % cache->htsize;
}

static bool mb_hit(struct metablock *mb, struct lookup_key *key)
{
	return mb->sector == key->sector;
}

void ht_del(struct wb_cache *cache, struct metablock *mb)
{
	struct ht_head *null_head;

	hlist_del(&mb->ht_list);

	null_head = cache->null_head;
	hlist_add_head(&mb->ht_list, &null_head->ht_list);
}

void ht_register(struct wb_cache *cache, struct ht_head *head,
		 struct lookup_key *key, struct metablock *mb)
{
	hlist_del(&mb->ht_list);
	hlist_add_head(&mb->ht_list, &head->ht_list);

	mb->sector = key->sector;
}

struct metablock *ht_lookup(struct wb_cache *cache,
			    struct ht_head *head,
			    struct lookup_key *key)
{
	struct metablock *mb, *found = NULL;

#if LINUX_VERSION_CODE >= KERNEL_VERSION(3, 9, 0)
	hlist_for_each_entry(mb, &head->ht_list, ht_list)
#else
	struct hlist_node *pos;
	hlist_for_each_entry(mb, pos, &head->ht_list, ht_list)
#endif
	{
		if (mb_hit(mb, key)) {
			found = mb;
			break;
		}
	}
	return found;
}

/*
 * Discard all the metablocks in a segment.
 */
void discard_caches_inseg(struct wb_cache *cache,
			  struct segment_header *seg)
{
	u8 i;
	for (i = 0; i < NR_CACHES_INSEG; i++) {
		struct metablock *mb = seg->mb_array + i;
		ht_del(cache, mb);
	}
}
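
/*
 * Lookup sketch (illustration only, not part of the driver):
 * roughly how writeboost_map() consults the hash table under
 * cache->io_lock.  example_lookup and aligned_sector are
 * hypothetical names used only for this sketch.
 */
#if 0
static struct metablock *example_lookup(struct wb_cache *cache,
					sector_t aligned_sector)
{
	struct lookup_key key = { .sector = aligned_sector };
	struct ht_head *head =
		bigarray_at(cache->htable, ht_hash(cache, &key));

	/* Returns NULL on a cache miss. */
	return ht_lookup(cache, head, &key);
}
#endif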

---------- migrate-daemon.c ----------
/*
 * Copyright (C) 2012-2013 Akira Hayakawa <ruby.wktk@gmail.com>
 *
 * This file is released under the GPL.
 */

#include "migrate-daemon.h"

static void migrate_endio(unsigned long error, void *context)
{
	struct wb_cache *cache = context;

	if (error)
		atomic_inc(&cache->migrate_fail_count);

	if (atomic_dec_and_test(&cache->migrate_io_count))
		wake_up_interruptible(&cache->migrate_wait_queue);
}

/*
 * Submit the segment data at position k
 * in the migrate buffer.
 * Batched migration first gathers all the segments
 * to migrate into the migrate buffer,
 * so the buffer holds the data of
 * a number of segments.
 * This function submits the one at position k.
 */
static void submit_migrate_io(struct wb_cache *cache,
			      struct segment_header *seg, size_t k)
{
	u8 i, j;
	size_t a = NR_CACHES_INSEG * k;
	void *p = cache->migrate_buffer + (NR_CACHES_INSEG << 12) * k;

	for (i = 0; i < seg->length; i++) {
		struct metablock *mb = seg->mb_array + i;

		struct wb_device *wb = cache->wb;
		u8 dirty_bits = *(cache->dirtiness_snapshot + (a + i));

		unsigned long offset;
		void *base, *addr;

		struct dm_io_request io_req_w;
		struct dm_io_region region_w;

		if (!dirty_bits)
			continue;

		offset = i << 12;
		base = p + offset;

		if (dirty_bits == 255) {
			addr = base;
			io_req_w = (struct dm_io_request) {
				.client = wb_io_client,
				.bi_rw = WRITE,
				.notify.fn = migrate_endio,
				.notify.context = cache,
				.mem.type = DM_IO_VMA,
				.mem.ptr.vma = addr,
			};
			region_w = (struct dm_io_region) {
				.bdev = wb->device->bdev,
				.sector = mb->sector,
				.count = (1 << 3),
			};
			dm_safe_io_retry(&io_req_w, 1, &region_w, false);
		} else {
			for (j = 0; j < 8; j++) {
				bool b = dirty_bits & (1 << j);
				if (!b)
					continue;

				addr = base + (j << SECTOR_SHIFT);
				io_req_w = (struct dm_io_request) {
					.client = wb_io_client,
					.bi_rw = WRITE,
					.notify.fn = migrate_endio,
					.notify.context = cache,
					.mem.type = DM_IO_VMA,
					.mem.ptr.vma = addr,
				};
				region_w = (struct dm_io_region) {
					.bdev = wb->device->bdev,
					.sector = mb->sector + j,
					.count = 1,
				};
				dm_safe_io_retry(
					&io_req_w, 1, &region_w, false);
			}
		}
	}
}

static void memorize_dirty_state(struct wb_cache *cache,
				 struct segment_header *seg, size_t k,
				 size_t *migrate_io_count)
{
	u8 i, j;
	size_t a = NR_CACHES_INSEG * k;
	void *p = cache->migrate_buffer + (NR_CACHES_INSEG << 12) * k;
	struct metablock *mb;

	struct dm_io_request io_req_r = {
		.client = wb_io_client,
		.bi_rw = READ,
		.notify.fn = NULL,
		.mem.type = DM_IO_VMA,
		.mem.ptr.vma = p,
	};
	struct dm_io_region region_r = {
		.bdev = cache->device->bdev,
		.sector = seg->start_sector + (1 << 3),
		.count = seg->length << 3,
	};
	dm_safe_io_retry(&io_req_r, 1, &region_r, false);

	/*
	 * We take a snapshot of the dirtiness in the segments.
	 * The snapshot is at least as dirty as
	 * the segments at any future moment,
	 * so migrating this possibly dirtier state
	 * never loses any dirty data that was acknowledged.
	 */
	for (i = 0; i < seg->length; i++) {
		mb = seg->mb_array + i;
		*(cache->dirtiness_snapshot + (a + i)) =
			atomic_read_mb_dirtiness(seg, mb);
	}

	for (i = 0; i < seg->length; i++) {
		u8 dirty_bits;

		mb = seg->mb_array + i;

		dirty_bits = *(cache->dirtiness_snapshot + (a + i));

		if (!dirty_bits)
			continue;

		if (dirty_bits == 255) {
			(*migrate_io_count)++;
		} else {
			for (j = 0; j < 8; j++) {
				if (dirty_bits & (1 << j))
					(*migrate_io_count)++;
			}
		}
	}
}

static void cleanup_segment(struct wb_cache *cache, struct segment_header *seg)
{
	u8 i;
	for (i = 0; i < seg->length; i++) {
		struct metablock *mb = seg->mb_array + i;
		cleanup_mb_if_dirty(cache, seg, mb);
	}
}

static void migrate_linked_segments(struct wb_cache *cache)
{
	struct segment_header *seg;
	size_t k, migrate_io_count = 0;

	/*
	 * Memorize the dirty state to migrate before going in:
	 * - how many migration writes should be submitted atomically,
	 * - which cache lines are dirty to migrate,
	 * - etc.
	 */
	k = 0;
	list_for_each_entry(seg, &cache->migrate_list, migrate_list) {
		memorize_dirty_state(cache, seg, k, &migrate_io_count);
		k++;
	}

migrate_write:
	atomic_set(&cache->migrate_io_count, migrate_io_count);
	atomic_set(&cache->migrate_fail_count, 0);

	k = 0;
	list_for_each_entry(seg, &cache->migrate_list, migrate_list) {
		submit_migrate_io(cache, seg, k);
		k++;
	}

	wait_event_interruptible(cache->migrate_wait_queue,
				 atomic_read(&cache->migrate_io_count) == 0);

	if (atomic_read(&cache->migrate_fail_count)) {
		WBWARN("%u writebacks failed. retry.",
		       atomic_read(&cache->migrate_fail_count));
		goto migrate_write;
	}

	BUG_ON(atomic_read(&cache->migrate_io_count));

	list_for_each_entry(seg, &cache->migrate_list, migrate_list) {
		cleanup_segment(cache, seg);
	}

	/*
	 * The segments may contain blocks
	 * for which an ACK for a persistent write
	 * was already returned from the cache device.
	 * Migrating them in a non-persistent way
	 * betrays the client
	 * who received the ACK and
	 * expects the data to be persistent.
	 * Since it is difficult to know
	 * whether a cache line in a segment
	 * is in that state,
	 * we stay on the safe side on this issue
	 * by always migrating the data persistently.
	 */
	blkdev_issue_flush(cache->wb->device->bdev, GFP_NOIO, NULL);

	/*
	 * Discarding the migrated regions
	 * can avoid unnecessary wear amplification in the future.
	 *
	 * But note that we should not discard
	 * the metablock region because
	 * whether a discarded block reads back
	 * as a certain value depends on the vendor,
	 * and unexpected metablock data
	 * would corrupt the cache.
	 */
	list_for_each_entry(seg, &cache->migrate_list, migrate_list) {
		blkdev_issue_discard(cache->device->bdev,
				     seg->start_sector + (1 << 3),
				     seg->length << 3,
				     GFP_NOIO, 0);
	}
}

void migrate_proc(struct work_struct *work)
{
	struct wb_cache *cache =
		container_of(work, struct wb_cache, migrate_work);

	while (true) {
		bool allow_migrate;
		size_t i, nr_mig_candidates, nr_mig;
		struct segment_header *seg, *tmp;

		WBINFO();

		if (cache->on_terminate)
			return;

		/*
		 * If reserving_segment_id > 0,
		 * migration should be immediate.
		 */
		allow_migrate = cache->reserving_segment_id ||
				cache->allow_migrate;

		if (!allow_migrate) {
			schedule_timeout_interruptible(msecs_to_jiffies(1000));
			continue;
		}

		nr_mig_candidates = cache->last_flushed_segment_id -
				    cache->last_migrated_segment_id;

		if (!nr_mig_candidates) {
			schedule_timeout_interruptible(msecs_to_jiffies(1000));
			continue;
		}

		if (cache->nr_cur_batched_migration !=
		    cache->nr_max_batched_migration){
			vfree(cache->migrate_buffer);
			kfree(cache->dirtiness_snapshot);
			cache->nr_cur_batched_migration =
				cache->nr_max_batched_migration;
			cache->migrate_buffer =
				vmalloc(cache->nr_cur_batched_migration *
					(NR_CACHES_INSEG << 12));
			cache->dirtiness_snapshot =
				kmalloc_retry(cache->nr_cur_batched_migration *
					      NR_CACHES_INSEG,
					      GFP_NOIO);

			BUG_ON(!cache->migrate_buffer);
			BUG_ON(!cache->dirtiness_snapshot);
		}

		/*
		 * Batched Migration:
		 * We will migrate at most nr_max_batched_migration
		 * segments at a time.
		 */
		nr_mig = min(nr_mig_candidates,
			     cache->nr_cur_batched_migration);

		/*
		 * Add segments to migrate atomically.
		 */
		for (i = 1; i <= nr_mig; i++) {
			seg = get_segment_header_by_id(
					cache,
					cache->last_migrated_segment_id + i);
			list_add_tail(&seg->migrate_list, &cache->migrate_list);
		}

		migrate_linked_segments(cache);

		/*
		 * (Locking)
		 * This is the only line of code that changes
		 * last_migrated_segment_id at runtime.
		 */
		cache->last_migrated_segment_id += nr_mig;

		list_for_each_entry_safe(seg, tmp,
					 &cache->migrate_list,
					 migrate_list) {
			complete_all(&seg->migrate_done);
			list_del(&seg->migrate_list);
		}
	}
}

void wait_for_migration(struct wb_cache *cache, size_t id)
{
	struct segment_header *seg = get_segment_header_by_id(cache, id);

	/*
	 * Set reserving_segment_id to non-zero
	 * to force the migration daemon
	 * to complete the migration of this segment
	 * immediately.
	 */
	cache->reserving_segment_id = id;
	wait_for_completion(&seg->migrate_done);
	cache->reserving_segment_id = 0;
}

---------- migrate-modulator.c ----------
/*
 * Copyright (C) 2012-2013 Akira Hayakawa <ruby.wktk@gmail.com>
 *
 * This file is released under the GPL.
 */

#include "migrate-modulator.h"

void modulator_proc(struct work_struct *work)
{
	struct wb_cache *cache =
		container_of(work, struct wb_cache, modulator_work);
	struct wb_device *wb = cache->wb;

	struct hd_struct *hd = wb->device->bdev->bd_part;
	unsigned long old = 0, new, util;
	unsigned long intvl = 1000;

	while (true) {
		if (cache->on_terminate)
			return;

		new = jiffies_to_msecs(part_stat_read(hd, io_ticks));

		if (!cache->enable_migration_modulator)
			goto modulator_update;

		util = (100 * (new - old)) / 1000;
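
		/*
		 * Example (illustration only): if the backing store was
		 * busy for 700 ms of the 1000 ms interval, the io_ticks
		 * delta is 700 and util evaluates to 70 (percent), which
		 * is at the default migrate_threshold and thus disables
		 * migration.
		 */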

		WBINFO("%u", (unsigned) util);
		if (util < wb->migrate_threshold)
			cache->allow_migrate = true;
		else
			cache->allow_migrate = false;

modulator_update:
		old = new;

		schedule_timeout_interruptible(msecs_to_jiffies(intvl));
	}
}

---------- queue-flush-job.c ----------
/*
 * Copyright (C) 2012-2013 Akira Hayakawa <ruby.wktk@gmail.com>
 *
 * This file is released under the GPL.
 */

#include "queue-flush-job.h"

static u8 count_dirty_caches_remained(struct segment_header *seg)
{
	u8 i, count = 0;

	struct metablock *mb;
	for (i = 0; i < seg->length; i++) {
		mb = seg->mb_array + i;
		if (mb->dirty_bits)
			count++;
	}
	return count;
}

/*
 * Compose the metadata part of the segment data to flush.
 * @dest the metadata part of the segment to flush
 */
static void prepare_segment_header_device(struct segment_header_device *dest,
					  struct wb_cache *cache,
					  struct segment_header *src)
{
	cache_nr i;
	u8 left, right;

	dest->global_id = cpu_to_le64(src->global_id);
	dest->length = src->length;
	dest->lap = cpu_to_le32(calc_segment_lap(cache, src->global_id));

	left = src->length - 1;
	right = (cache->cursor) % NR_CACHES_INSEG;
	BUG_ON(left != right);

	for (i = 0; i < src->length; i++) {
		struct metablock *mb = src->mb_array + i;
		struct metablock_device *mbdev = &dest->mbarr[i];
		mbdev->sector = cpu_to_le64(mb->sector);
		mbdev->dirty_bits = mb->dirty_bits;
		mbdev->lap = cpu_to_le32(dest->lap);
	}
}

static void prepare_meta_rambuffer(void *rambuffer,
				   struct wb_cache *cache,
				   struct segment_header *seg)
{
	prepare_segment_header_device(rambuffer, cache, seg);
}

/*
 * Queue the current segment for flushing
 * and prepare a new segment.
 */
static void queue_flushing(struct wb_cache *cache)
{
	unsigned long flags;
	struct segment_header *current_seg = cache->current_seg, *new_seg;
	struct flush_job *job;
	bool empty;
	struct rambuffer *next_rambuf;
	size_t n1 = 0, n2 = 0;
	u64 next_id;

	while (atomic_read(&current_seg->nr_inflight_ios)) {
		n1++;
		if (n1 == 100)
			WBWARN();
		schedule_timeout_interruptible(msecs_to_jiffies(1));
	}

	prepare_meta_rambuffer(cache->current_rambuf->data, cache,
			       cache->current_seg);

	INIT_COMPLETION(current_seg->migrate_done);
	INIT_COMPLETION(current_seg->flush_done);

	job = kmalloc_retry(sizeof(*job), GFP_NOIO);
	INIT_LIST_HEAD(&job->flush_queue);
	job->seg = current_seg;
	job->rambuf = cache->current_rambuf;

	bio_list_init(&job->barrier_ios);
	bio_list_merge(&job->barrier_ios, &cache->barrier_ios);
	bio_list_init(&cache->barrier_ios);

	spin_lock_irqsave(&cache->flush_queue_lock, flags);
	empty = list_empty(&cache->flush_queue);
	list_add_tail(&job->flush_queue, &cache->flush_queue);
	spin_unlock_irqrestore(&cache->flush_queue_lock, flags);
	if (empty)
		wake_up_interruptible(&cache->flush_wait_queue);

	next_id = current_seg->global_id + 1;
	new_seg = get_segment_header_by_id(cache, next_id);
	new_seg->global_id = next_id;

	while (atomic_read(&new_seg->nr_inflight_ios)) {
		n2++;
		if (n2 == 100)
			WBWARN();
		schedule_timeout_interruptible(msecs_to_jiffies(1));
	}

	BUG_ON(count_dirty_caches_remained(new_seg));

	discard_caches_inseg(cache, new_seg);

	/*
	 * Set the cursor to the last of the flushed segment.
	 */
	cache->cursor = current_seg->start_idx + (NR_CACHES_INSEG - 1);
	new_seg->length = 0;

	next_rambuf = cache->rambuf_pool + (next_id % NR_RAMBUF_POOL);
	wait_for_completion(&next_rambuf->done);
	INIT_COMPLETION(next_rambuf->done);

	cache->current_rambuf = next_rambuf;

	cache->current_seg = new_seg;
}

void queue_current_buffer(struct wb_cache *cache)
{
	/*
	 * Before we can use the next segment
	 * we must wait until it is all clean.
	 * A clean segment has no
	 * log to flush and no dirties to migrate.
	 */
	u64 next_id = cache->current_seg->global_id + 1;

	struct segment_header *next_seg =
		get_segment_header_by_id(cache, next_id);

	wait_for_completion(&next_seg->flush_done);

	wait_for_migration(cache, next_id);

	queue_flushing(cache);
}

/*
 * Flush all the dirty data at this moment
 * but NOT persistently.
 * Cleaning up the writes before termination
 * is an example use case.
 */
void flush_current_buffer(struct wb_cache *cache)
{
	struct segment_header *old_seg;

	mutex_lock(&cache->io_lock);
	old_seg = cache->current_seg;

	queue_current_buffer(cache);
	cache->cursor = (cache->cursor + 1) % cache->nr_caches;
	cache->current_seg->length = 1;
	mutex_unlock(&cache->io_lock);

	wait_for_completion(&old_seg->flush_done);
}

---------- rambuf.c ----------
/*
 * Copyright (C) 2012-2013 Akira Hayakawa <ruby.wktk@gmail.com>
 *
 * This file is released under the GPL.
 */

#include "rambuf.h"

int __must_check init_rambuf_pool(struct wb_cache *cache)
{
	size_t i, j;
	struct rambuffer *rambuf;

	cache->rambuf_pool = kmalloc(sizeof(struct rambuffer) * NR_RAMBUF_POOL,
				     GFP_KERNEL);
	if (!cache->rambuf_pool) {
		WBERR();
		return -ENOMEM;
	}

	for (i = 0; i < NR_RAMBUF_POOL; i++) {
		rambuf = cache->rambuf_pool + i;
		init_completion(&rambuf->done);
		complete_all(&rambuf->done);

		rambuf->data = kmalloc(
			1 << (WB_SEGMENTSIZE_ORDER + SECTOR_SHIFT),
			GFP_KERNEL);
		if (!rambuf->data) {
			WBERR();
			for (j = 0; j < i; j++) {
				rambuf = cache->rambuf_pool + j;
				kfree(rambuf->data);
			}
			kfree(cache->rambuf_pool);
			return -ENOMEM;
		}
	}

	return 0;
}

void free_rambuf_pool(struct wb_cache *cache)
{
	struct rambuffer *rambuf;
	size_t i;
	for (i = 0; i < NR_RAMBUF_POOL; i++) {
		rambuf = cache->rambuf_pool + i;
		kfree(rambuf->data);
	}
	kfree(cache->rambuf_pool);
}

---------- recover.c ----------
/*
 * Copyright (C) 2012-2013 Akira Hayakawa <ruby.wktk@gmail.com>
 *
 * This file is released under the GPL.
 */

#include "recover.h"

static int __must_check
read_superblock_record(struct superblock_record_device *record,
		       struct wb_cache *cache)
{
	int r = 0;
	struct dm_io_request io_req;
	struct dm_io_region region;

	void *buf = kmalloc(1 << SECTOR_SHIFT, GFP_KERNEL);
	if (!buf) {
		WBERR();
		return -ENOMEM;
	}

	io_req = (struct dm_io_request) {
		.client = wb_io_client,
		.bi_rw = READ,
		.notify.fn = NULL,
		.mem.type = DM_IO_KMEM,
		.mem.ptr.addr = buf,
	};
	region = (struct dm_io_region) {
		.bdev = cache->device->bdev,
		.sector = (1 << 11) - 1,
		.count = 1,
	};
	r = dm_safe_io(&io_req, 1, &region, NULL, true);
	if (r) {
		WBERR();
		kfree(buf);
		return r;
	}

	/* Copy out before freeing the I/O buffer (use-after-free otherwise). */
	memcpy(record, buf, sizeof(*record));
	kfree(buf);

	return r;
}

static int __must_check
read_segment_header_device(struct segment_header_device *dest,
			   struct wb_cache *cache, size_t segment_idx)
{
	int r = 0;
	struct dm_io_request io_req;
	struct dm_io_region region;
	void *buf = kmalloc(1 << 12, GFP_KERNEL);
	if (!buf) {
		WBERR();
		return -ENOMEM;
	}

	io_req = (struct dm_io_request) {
		.client = wb_io_client,
		.bi_rw = READ,
		.notify.fn = NULL,
		.mem.type = DM_IO_KMEM,
		.mem.ptr.addr = buf,
	};
	region = (struct dm_io_region) {
		.bdev = cache->device->bdev,
		.sector = calc_segment_header_start(segment_idx),
		.count = (1 << 3),
	};
	r = dm_safe_io(&io_req, 1, &region, NULL, false);
	if (r) {
		WBERR();
		kfree(buf);
		return r;
	}

	/* Copy out before freeing the I/O buffer (use-after-free otherwise). */
	memcpy(dest, buf, sizeof(*dest));
	kfree(buf);

	return r;
}

/*
 * Read the on-disk metadata of the segment
 * and update the in-core cache metadata structures,
 * such as the hash table.
 */
static void update_by_segment_header_device(struct wb_cache *cache,
					    struct segment_header_device *src)
{
	cache_nr i;
	struct segment_header *seg =
		get_segment_header_by_id(cache, src->global_id);
	seg->length = src->length;

	INIT_COMPLETION(seg->migrate_done);

	for (i = 0 ; i < src->length; i++) {
		cache_nr k;
		struct lookup_key key;
		struct ht_head *head;
		struct metablock *found, *mb = seg->mb_array + i;
		struct metablock_device *mbdev = &src->mbarr[i];

		if (!mbdev->dirty_bits)
			continue;

		mb->sector = le64_to_cpu(mbdev->sector);
		mb->dirty_bits = mbdev->dirty_bits;

		inc_nr_dirty_caches(cache->wb);

		key = (struct lookup_key) {
			.sector = mb->sector,
		};

		k = ht_hash(cache, &key);
		head = bigarray_at(cache->htable, k);

		found = ht_lookup(cache, head, &key);
		if (found)
			ht_del(cache, found);
		ht_register(cache, head, &key, mb);
	}
}

/*
 * Only if the lap attributes are the same
 * between the header and all the metablocks
 * is the segment judged to have been flushed correctly,
 * and only then is it merged into the runtime structures.
 * Otherwise, it is ignored.
 */
static bool checkup_atomicity(struct segment_header_device *header)
{
	u8 i;
	u32 a = le32_to_cpu(header->lap), b;
	for (i = 0; i < header->length; i++) {
		struct metablock_device *o;
		o = header->mbarr + i;
		b = le32_to_cpu(o->lap);
		if (a != b)
			return false;
	}
	return true;
}

int __must_check recover_cache(struct wb_cache *cache)
{
	int r = 0;
	struct segment_header_device *header;
	struct segment_header *seg;
	u64 i, j,
	    max_id, oldest_id, last_flushed_id, init_segment_id,
	    oldest_idx, nr_segments = cache->nr_segments,
	    header_id, record_id;

	struct superblock_record_device uninitialized_var(record);
	r = read_superblock_record(&record, cache);
	if (r) {
		WBERR();
		return r;
	}
	WBINFO("%llu", record.last_migrated_segment_id);
	record_id = le64_to_cpu(record.last_migrated_segment_id);
	WBINFO("%llu", record_id);

	header = kmalloc(sizeof(*header), GFP_KERNEL);
	if (!header) {
		WBERR();
		return -ENOMEM;
	}

	/*
	 * Finding the oldest, non-zero id and its index.
	 */

	max_id = SZ_MAX;
	oldest_id = max_id;
	oldest_idx = 0;
	for (i = 0; i < nr_segments; i++) {
		r = read_segment_header_device(header, cache, i);
		if (r) {
			WBERR();
			kfree(header);
			return r;
		}
		header_id = le64_to_cpu(header->global_id);

		if (header_id < 1)
			continue;

		if (header_id < oldest_id) {
			oldest_idx = i;
			oldest_id = header_id;
		}
	}

	last_flushed_id = 0;

	/*
	 * This is an invariant.
	 * We always start from the segment
	 * right after the one with last_flushed_id.
	 */
	init_segment_id = last_flushed_id + 1;

	/*
	 * If no segment was flushed
	 * then there is nothing to recover.
	 */
	if (oldest_id == max_id)
		goto setup_init_segment;

	/*
	 * What we have to do in the next loop is to
	 * revive the segments that are
	 * flushed but not yet migrated.
	 */

	/*
	 * Example:
	 * There are only 5 segments.
	 * The segments we will consider are of id k+2 and k+3
	 * because they are dirty but not migrated.
	 *
	 * id: [     k+3    ][  k+4   ][   k    ][     k+1     ][  k+2  ]
	 *      last_flushed  init_seg  migrated  last_migrated  flushed
	 */
	for (i = oldest_idx; i < (nr_segments + oldest_idx); i++) {
		j = i % nr_segments;
		r = read_segment_header_device(header, cache, j);
		if (r) {
			WBERR();
			kfree(header);
			return r;
		}
		header_id = le64_to_cpu(header->global_id);

		/*
		 * A valid global_id is > 0.
		 * Once we encounter a header whose global_id is
		 * not newer than last_flushed_id
		 * (e.g. an unused segment with global_id 0),
		 * we can consider it and all the following
		 * segments invalid.
		 */
		if (header_id <= last_flushed_id)
			break;

		if (!checkup_atomicity(header)) {
			WBWARN("header atomicity broken id %llu",
			       header_id);
			break;
		}

		/*
		 * Now the header is proven valid.
		 */

		last_flushed_id = header_id;
		init_segment_id = last_flushed_id + 1;

		/*
		 * If the data is already on the backing store,
		 * we ignore the segment.
		 */
		if (header_id <= record_id)
			continue;

		update_by_segment_header_device(cache, header);
	}

setup_init_segment:
	kfree(header);

	seg = get_segment_header_by_id(cache, init_segment_id);
	seg->global_id = init_segment_id;
	atomic_set(&seg->nr_inflight_ios, 0);

	cache->last_flushed_segment_id = seg->global_id - 1;

	cache->last_migrated_segment_id =
		cache->last_flushed_segment_id > cache->nr_segments ?
		cache->last_flushed_segment_id - cache->nr_segments : 0;

	if (record_id > cache->last_migrated_segment_id)
		cache->last_migrated_segment_id = record_id;

	WBINFO("%llu", cache->last_migrated_segment_id);
	wait_for_migration(cache, seg->global_id);

	discard_caches_inseg(cache, seg);

	/*
	 * cursor is set to the first element of the segment.
	 * This means that we will not use the element.
	 */
	cache->cursor = seg->start_idx;
	seg->length = 1;

	cache->current_seg = seg;

	return 0;
}

---------- segment.c ----------
/*
 * Copyright (C) 2012-2013 Akira Hayakawa <ruby.wktk@gmail.com>
 *
 * This file is released under the GPL.
 */

#include "segment.h"

/*
 * Get the in-core metablock of the given index.
 */
struct metablock *mb_at(struct wb_cache *cache, cache_nr idx)
{
	u64 seg_idx = idx / NR_CACHES_INSEG;
	struct segment_header *seg =
		bigarray_at(cache->segment_header_array, seg_idx);
	cache_nr idx_inseg = idx % NR_CACHES_INSEG;
	return seg->mb_array + idx_inseg;
}

static void mb_array_empty_init(struct wb_cache *cache)
{
	size_t i;
	for (i = 0; i < cache->nr_caches; i++) {
		struct metablock *mb = mb_at(cache, i);
		INIT_HLIST_NODE(&mb->ht_list);

		mb->idx = i;
		mb->dirty_bits = 0;
	}
}

int __must_check init_segment_header_array(struct wb_cache *cache)
{
	u64 segment_idx, nr_segments = cache->nr_segments;
	cache->segment_header_array =
		make_bigarray(sizeof(struct segment_header), nr_segments);
	if (!cache->segment_header_array) {
		WBERR();
		return -ENOMEM;
	}

	for (segment_idx = 0; segment_idx < nr_segments; segment_idx++) {
		struct segment_header *seg =
			bigarray_at(cache->segment_header_array, segment_idx);
		seg->start_idx = NR_CACHES_INSEG * segment_idx;
		seg->start_sector =
			((segment_idx % nr_segments) + 1) *
			(1 << WB_SEGMENTSIZE_ORDER);

		seg->length = 0;

		atomic_set(&seg->nr_inflight_ios, 0);

		spin_lock_init(&seg->lock);

		INIT_LIST_HEAD(&seg->migrate_list);

		init_completion(&seg->flush_done);
		complete_all(&seg->flush_done);

		init_completion(&seg->migrate_done);
		complete_all(&seg->migrate_done);
	}

	mb_array_empty_init(cache);

	return 0;
}

/*
 * Get the segment by its segment id.
 * The index of the segment is calculated from the segment id.
 */
struct segment_header *get_segment_header_by_id(struct wb_cache *cache,
						       size_t segment_id)
{
	struct segment_header *r =
		bigarray_at(cache->segment_header_array,
		       (segment_id - 1) % cache->nr_segments);
	return r;
}

u32 calc_segment_lap(struct wb_cache *cache, size_t segment_id)
{
	u32 a = (segment_id - 1) / cache->nr_segments;
	return a + 1;
}

sector_t calc_mb_start_sector(struct segment_header *seg,
				     cache_nr mb_idx)
{
	size_t k = 1 + (mb_idx % NR_CACHES_INSEG);
	return seg->start_sector + (k << 3);
}

sector_t calc_segment_header_start(size_t segment_idx)
{
	return (1 << WB_SEGMENTSIZE_ORDER) * (segment_idx + 1);
}

u64 calc_nr_segments(struct dm_dev *dev)
{
	sector_t devsize = dm_devsize(dev);
	return devsize / (1 << WB_SEGMENTSIZE_ORDER) - 1;
}
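
/*
 * Worked example (illustration only, assuming 1MB segments):
 * a 128MB cache device is 262144 sectors, so calc_nr_segments()
 * returns 262144 / 2048 - 1 = 127; one 1MB region is reserved
 * for the superblock and the remaining 127 hold segments.
 */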

bool is_on_buffer(struct wb_cache *cache, cache_nr mb_idx)
{
	cache_nr start = cache->current_seg->start_idx;
	if (mb_idx < start)
		return false;

	if (mb_idx >= (start + NR_CACHES_INSEG))
		return false;

	return true;
}

---------- superblock-recorder.c ----------
/*
 * Copyright (C) 2012-2013 Akira Hayakawa <ruby.wktk@gmail.com>
 *
 * This file is released under the GPL.
 */

#include "superblock-recorder.h"

static void update_superblock_record(struct wb_cache *cache)
{
	struct superblock_record_device o;
	void *buf;
	struct dm_io_request io_req;
	struct dm_io_region region;

	o.last_migrated_segment_id =
		cpu_to_le64(cache->last_migrated_segment_id);

	buf = kmalloc_retry(1 << SECTOR_SHIFT, GFP_NOIO | __GFP_ZERO);
	memcpy(buf, &o, sizeof(o));

	io_req = (struct dm_io_request) {
		.client = wb_io_client,
		.bi_rw = WRITE_FUA,
		.notify.fn = NULL,
		.mem.type = DM_IO_KMEM,
		.mem.ptr.addr = buf,
	};
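	/*
	 * The superblock record lives in the last 512B sector of the
	 * 1MB superblock region, i.e. sector (1 << 11) - 1.
	 */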
	region = (struct dm_io_region) {
		.bdev = cache->device->bdev,
		.sector = (1 << 11) - 1,
		.count = 1,
	};
	dm_safe_io_retry(&io_req, 1, &region, true);
	kfree(buf);
}

void recorder_proc(struct work_struct *work)
{
	struct wb_cache *cache =
		container_of(work, struct wb_cache, recorder_work);
	unsigned long intvl;

	while (true) {
		if (cache->on_terminate)
			return;

		/* sec -> ms */
		intvl = cache->update_record_interval * 1000;

		if (!intvl) {
			schedule_timeout_interruptible(msecs_to_jiffies(1000));
			continue;
		}

		WBINFO();
		update_superblock_record(cache);

		schedule_timeout_interruptible(msecs_to_jiffies(intvl));
	}
}

---------- target.c ----------
/*
 * writeboost
 * Log-structured Caching for Linux
 *
 * Copyright (C) 2012-2013 Akira Hayakawa <ruby.wktk@gmail.com>
 *
 * This file is released under the GPL.
 */

#include "target.h"

/*
 * <backing dev> <cache dev>
 */
static int writeboost_ctr(struct dm_target *ti, unsigned int argc, char **argv)
{
	int r = 0;
	bool cache_valid;
	struct wb_device *wb;
	struct wb_cache *cache;
	struct dm_dev *origdev, *cachedev;

#if LINUX_VERSION_CODE >= KERNEL_VERSION(3, 6, 0)
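	/* Split incoming I/O at 8-sector (4KB) boundaries, matching the 4KB cache line size. */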
	r = dm_set_target_max_io_len(ti, (1 << 3));
	if (r) {
		WBERR();
		return r;
	}
#else
	ti->split_io = (1 << 3);
#endif

	wb = kzalloc(sizeof(*wb), GFP_KERNEL);
	if (!wb) {
		WBERR();
		return -ENOMEM;
	}

	/*
	 * EMC's textbook on storage systems says a storage device
	 * should keep its disk utilization below 70%.
	 */
	wb->migrate_threshold = 70;

	atomic64_set(&wb->nr_dirty_caches, 0);

	r = dm_get_device(ti, argv[0], dm_table_get_mode(ti->table),
			  &origdev);
	if (r) {
		WBERR("%d", r);
		goto bad_get_device_orig;
	}
	wb->device = origdev;

	wb->cache = NULL;

	r = dm_get_device(ti, argv[1], dm_table_get_mode(ti->table),
			  &cachedev);
	if (r) {
		WBERR("%d", r);
		goto bad_get_device_cache;
	}

	r = audit_cache_device(cachedev, &cache_valid);
	if (r) {
		WBERR("%d", r);
		/*
		 * If something goes wrong while auditing the cache,
		 * such as a read I/O error, both formatting the cache
		 * and resuming while trusting that it is valid are
		 * dangerous. So we quit.
		 */
		goto bad_audit_cache;
	}

	if (!cache_valid) {
		r = format_cache_device(cachedev);
		if (r) {
			WBERR("%d", r);
			goto bad_format_cache;
		}
	}

	cache = kzalloc(sizeof(*cache), GFP_KERNEL);
	if (!cache) {
		r = -ENOMEM;
		WBERR();
		goto bad_alloc_cache;
	}

	wb->cache = cache;
	wb->cache->wb = wb;

	r = resume_cache(cache, cachedev);
	if (r) {
		WBERR("%d", r);
		goto bad_resume_cache;
	}

	wb->ti = ti;
	ti->private = wb;

#if LINUX_VERSION_CODE >= PER_BIO_VERSION
	ti->per_bio_data_size = sizeof(struct per_bio_data);
#endif

#if LINUX_VERSION_CODE >= KERNEL_VERSION(3, 9, 0)
	ti->num_flush_bios = 1;
	ti->num_discard_bios = 1;
#else
	ti->num_flush_requests = 1;
	ti->num_discard_requests = 1;
#endif

	ti->discard_zeroes_data_unsupported = true;

	return 0;

bad_resume_cache:
	kfree(cache);
bad_alloc_cache:
bad_format_cache:
bad_audit_cache:
	dm_put_device(ti, cachedev);
bad_get_device_cache:
	dm_put_device(ti, origdev);
bad_get_device_orig:
	kfree(wb);
	return r;
}

static void writeboost_dtr(struct dm_target *ti)
{
	struct wb_device *wb = ti->private;
	struct wb_cache *cache = wb->cache;

	/*
	 * Synchronize all the dirty writes
	 * before termination.
	 */
	cache->sync_interval = 1;

	free_cache(cache);
	dm_put_device(ti, cache->device);
	kfree(cache);

	dm_put_device(ti, wb->device);

	ti->private = NULL;
	kfree(wb);
}

static int writeboost_message(struct dm_target *ti, unsigned argc, char **argv)
{
	struct wb_device *wb = ti->private;
	struct wb_cache *cache = wb->cache;

	char *cmd = argv[0];
	unsigned long tmp;

	if (!strcasecmp(cmd, "clear_stat")) {
		struct wb_cache *cache = wb->cache;
		clear_stat(cache);
		return 0;
	}

	if (kstrtoul(argv[1], 10, &tmp))
		return -EINVAL;

	if (!strcasecmp(cmd, "allow_migrate")) {
		if (tmp > 1)
			return -EINVAL;
		cache->allow_migrate = tmp;
		return 0;
	}

	if (!strcasecmp(cmd, "enable_migration_modulator")) {
		if (tmp > 1)
			return -EINVAL;
		cache->enable_migration_modulator = tmp;
		return 0;
	}

	if (!strcasecmp(cmd, "barrier_deadline_ms")) {
		if (tmp < 1)
			return -EINVAL;
		cache->barrier_deadline_ms = tmp;
		return 0;
	}

	if (!strcasecmp(cmd, "nr_max_batched_migration")) {
		if (tmp < 1)
			return -EINVAL;
		cache->nr_max_batched_migration = tmp;
		return 0;
	}

	if (!strcasecmp(cmd, "migrate_threshold")) {
		wb->migrate_threshold = tmp;
		return 0;
	}

	if (!strcasecmp(cmd, "update_record_interval")) {
		cache->update_record_interval = tmp;
		return 0;
	}

	if (!strcasecmp(cmd, "sync_interval")) {
		cache->sync_interval = tmp;
		return 0;
	}

	return -EINVAL;
}

static int writeboost_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
			    struct bio_vec *biovec, int max_size)
{
	struct wb_device *wb = ti->private;
	struct dm_dev *device = wb->device;
	struct request_queue *q = bdev_get_queue(device->bdev);

	if (!q->merge_bvec_fn)
		return max_size;

	bvm->bi_bdev = device->bdev;
	return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
}

static int writeboost_iterate_devices(struct dm_target *ti,
				      iterate_devices_callout_fn fn, void *data)
{
	struct wb_device *wb = ti->private;
	struct dm_dev *orig = wb->device;
	sector_t start = 0;
	sector_t len = dm_devsize(orig);
	return fn(ti, orig, start, len, data);
}

static void writeboost_io_hints(struct dm_target *ti,
				struct queue_limits *limits)
{
	blk_limits_io_min(limits, 512);
	blk_limits_io_opt(limits, 4096);
}

static
#if LINUX_VERSION_CODE >= KERNEL_VERSION(3, 8, 0)
void
#else
int
#endif
writeboost_status(
		struct dm_target *ti, status_type_t type,
#if LINUX_VERSION_CODE >= KERNEL_VERSION(3, 6, 0)
		unsigned flags,
#endif
		char *result,
		unsigned maxlen)
{
	unsigned int sz = 0;
	struct wb_device *wb = ti->private;
	struct wb_cache *cache = wb->cache;
	size_t i;

	switch (type) {
	case STATUSTYPE_INFO:
		DMEMIT("%llu %llu %llu %llu %llu %u ",
		       (long long unsigned int)
		       atomic64_read(&wb->nr_dirty_caches),
		       (long long unsigned int) cache->nr_segments,
		       (long long unsigned int) cache->last_migrated_segment_id,
		       (long long unsigned int) cache->last_flushed_segment_id,
		       (long long unsigned int) cache->current_seg->global_id,
		       (unsigned int) cache->cursor);

		for (i = 0; i < (STATLEN - 1); i++) {
			atomic64_t *v = &cache->stat[i];
			DMEMIT("%lu ", atomic64_read(v));
		}

		DMEMIT("%d ", 7);
		DMEMIT("barrier_deadline_ms %lu ",
		       cache->barrier_deadline_ms);
		DMEMIT("allow_migrate %d ",
		       cache->allow_migrate ? 1 : 0);
		DMEMIT("enable_migration_modulator %d ",
		       cache->enable_migration_modulator ? 1 : 0);
		DMEMIT("migrate_threshold %d ", wb->migrate_threshold);
		DMEMIT("nr_cur_batched_migration %lu ",
		       cache->nr_cur_batched_migration);
		DMEMIT("sync_interval %lu ",
		       cache->sync_interval);
		DMEMIT("update_record_interval %lu",
		       cache->update_record_interval);
		break;

	case STATUSTYPE_TABLE:
		DMEMIT("%s %s", wb->device->name, wb->cache->device->name);
		break;
	}
#if LINUX_VERSION_CODE < KERNEL_VERSION(3, 8, 0)
	return 0;
#endif
}

static struct target_type writeboost_target = {
	.name = "writeboost",
	.version = {0, 1, 0},
	.module = THIS_MODULE,
	.map = writeboost_map,
	.ctr = writeboost_ctr,
	.dtr = writeboost_dtr,
	.end_io = writeboost_end_io,
	.merge = writeboost_merge,
	.message = writeboost_message,
	.status = writeboost_status,
	.io_hints = writeboost_io_hints,
	.iterate_devices = writeboost_iterate_devices,
};

struct dm_io_client *wb_io_client;
struct workqueue_struct *safe_io_wq;
static int __init writeboost_module_init(void)
{
	int r = 0;

	r = dm_register_target(&writeboost_target);
	if (r < 0) {
		WBERR("%d", r);
		return r;
	}

	r = -ENOMEM;

	safe_io_wq = alloc_workqueue("safeiowq",
				     WQ_NON_REENTRANT | WQ_MEM_RECLAIM, 0);
	if (!safe_io_wq) {
		WBERR();
		goto bad_wq;
	}

	wb_io_client = dm_io_client_create();
	if (IS_ERR(wb_io_client)) {
		WBERR();
		r = PTR_ERR(wb_io_client);
		goto bad_io_client;
	}

	return 0;

bad_io_client:
	destroy_workqueue(safe_io_wq);
bad_wq:
	dm_unregister_target(&writeboost_target);

	return r;
}

static void __exit writeboost_module_exit(void)
{
	dm_io_client_destroy(wb_io_client);
	destroy_workqueue(safe_io_wq);

	dm_unregister_target(&writeboost_target);
}

module_init(writeboost_module_init);
module_exit(writeboost_module_exit);

MODULE_AUTHOR("Akira Hayakawa <ruby.wktk@gmail.com>");
MODULE_DESCRIPTION(DM_NAME " writeboost target");
MODULE_LICENSE("GPL");

---------- util.c ----------
/*
 * Copyright (C) 2012-2013 Akira Hayakawa <ruby.wktk@gmail.com>
 *
 * This file is released under the GPL.
 */

#include "util.h"

void *do_kmalloc_retry(size_t size, gfp_t flags, int lineno)
{
	size_t count = 0;
	void *p;

retry_alloc:
	p = kmalloc(size, flags);
	if (!p) {
		count++;
		WBWARN("L%d size:%lu, count:%lu",
		       lineno, size, count);
		schedule_timeout_interruptible(msecs_to_jiffies(1));
		goto retry_alloc;
	}
	return p;
}

struct safe_io {
	struct work_struct work;
	int err;
	unsigned long err_bits;
	struct dm_io_request *io_req;
	unsigned num_regions;
	struct dm_io_region *regions;
};

static void safe_io_proc(struct work_struct *work)
{
	struct safe_io *io = container_of(work, struct safe_io, work);
	io->err_bits = 0;
	io->err = dm_io(io->io_req, io->num_regions, io->regions,
			&io->err_bits);
}

/*
 * dm_io wrapper.
 * @thread: run this operation in another thread to avoid deadlock.
 */
int dm_safe_io_internal(
		struct dm_io_request *io_req,
		unsigned num_regions, struct dm_io_region *regions,
		unsigned long *err_bits, bool thread, int lineno)
{
	int err;
	dev_t dev;

	if (thread) {
		struct safe_io io = {
			.io_req = io_req,
			.regions = regions,
			.num_regions = num_regions,
		};

		INIT_WORK_ONSTACK(&io.work, safe_io_proc);
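		/*
		 * The work item lives on this stack frame; flush_work()
		 * below waits for it, so the frame stays valid for the worker.
		 */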

		queue_work(safe_io_wq, &io.work);
		flush_work(&io.work);

		err = io.err;
		if (err_bits)
			*err_bits = io.err_bits;
	} else {
		err = dm_io(io_req, num_regions, regions, err_bits);
	}

	dev = regions->bdev->bd_dev;

	/* The dm_io routines permit NULL for the err_bits pointer. */
	if (err || (err_bits && *err_bits)) {
		unsigned long eb;
		if (!err_bits)
			eb = (~(unsigned long)0);
		else
			eb = *err_bits;
		WBERR("L%d err(%d, %lu), rw(%d), sector(%lu), dev(%u:%u)",
		      lineno, err, eb,
		      io_req->bi_rw, regions->sector,
		      MAJOR(dev), MINOR(dev));
	}

	return err;
}

void dm_safe_io_retry_internal(
		struct dm_io_request *io_req,
		unsigned num_regions, struct dm_io_region *regions,
		bool thread, int lineno)
{
	int err, count = 0;
	unsigned long err_bits;
	dev_t dev;

retry_io:
	err_bits = 0;
	err = dm_safe_io_internal(io_req, num_regions, regions, &err_bits,
				  thread, lineno);

	dev = regions->bdev->bd_dev;
	if (err || err_bits) {
		count++;
		WBWARN("L%d count(%d)", lineno, count);

		schedule_timeout_interruptible(msecs_to_jiffies(1000));
		goto retry_io;
	}

	if (count) {
		WBWARN("L%d rw(%d), sector(%lu), dev(%u:%u)",
		       lineno,
		       io_req->bi_rw, regions->sector,
		       MAJOR(dev), MINOR(dev));
	}
}

sector_t dm_devsize(struct dm_dev *dev)
{
	return i_size_read(dev->bdev->bd_inode) >> SECTOR_SHIFT;
}



* Re: Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
  2013-09-24 12:20         ` Akira Hayakawa
  2013-09-25 17:37           ` Mike Snitzer
  2013-09-25 23:03           ` Greg KH
@ 2013-09-26  3:43           ` Dave Chinner
  2013-10-01  8:26             ` Joe Thornber
  2 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2013-09-26  3:43 UTC (permalink / raw)
  To: Akira Hayakawa
  Cc: snitzer, gregkh, devel, linux-kernel, dm-devel, cesarb, joe,
	akpm, agk, m.chehab, ejt, dan.carpenter

On Tue, Sep 24, 2013 at 09:20:50PM +0900, Akira Hayakawa wrote:
> * Deferring ACK for barrier writes
> Barrier flags such as REQ_FUA and REQ_FLUSH are handled lazily.
> Immediately handling these bios badly slows down writeboost.
> It surveils the bios with these flags and forcefully flushes them
> at worst case within `barrier_deadline_ms` period.

That rings alarm bells.

If the filesystem is using REQ_FUA/REQ_FLUSH for ordering reasons,
delaying them to allow other IOs to be submitted and dispatched may
very well violate the IO ordering constraints the filesystem is
trying to achieve.

Alternatively, delaying them will stall the filesystem because it's
waiting for said REQ_FUA IO to complete. For example, journal writes
in XFS are extremely IO latency sensitive in workloads that have a
significant number of ordering constraints (e.g. O_SYNC writes,
fsync, etc) and delaying even one REQ_FUA/REQ_FLUSH can stall the
filesystem for the majority of that barrier_deadline_ms.

i.e. this says to me that the best performance you can get from such
workloads is one synchronous operation per process per
barrier_deadline_ms, even when the storage and filesystem might be
capable of executing hundreds of synchronous operations per
barrier_deadline_ms..

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
  2013-09-26  1:47             ` Akira Hayakawa
@ 2013-09-27 18:35               ` Mike Snitzer
  2013-09-28 11:29                 ` Akira Hayakawa
  0 siblings, 1 reply; 25+ messages in thread
From: Mike Snitzer @ 2013-09-27 18:35 UTC (permalink / raw)
  To: Akira Hayakawa
  Cc: gregkh, devel, linux-kernel, dm-devel, cesarb, joe, akpm, agk,
	m.chehab, ejt, dan.carpenter

On Wed, Sep 25 2013 at  9:47pm -0400,
Akira Hayakawa <ruby.wktk@gmail.com> wrote:

> Hi, Mike
> 
> The monolithic source code (3.2k)
> is nicely split into almost 20 *.c files
> according to the functionality and
> data structures in OOP style.
> 
> The aim of this posting
> is to share how the splitting looks like.
> 
> I believe that
> at least reading the *.h files
> can convince you that the splitting is clear.
> 
> The code is now tainted with
> almost 20 version switch macros
> and WB* debug macros
> but I will clean them up
> before sending the patch.
> 
> Again,
> the latest code can be cloned by
> git clone https://github.com/akiradeveloper/dm-writeboost.git
> 
> I will make a few updates to the source code this weekend,
> so please track it to follow the latest version.
> Below is only a snapshot.
> 
> Akira
> 
> ---------- Summary ----------
> 33 Makefile
> 10 bigarray.h
> 19 cache-alloc.h
> 10 defer-barrier.h
> 8 dirty-sync.h
> 8 flush-daemon.h
> 10 format-cache.h
> 24 handle-io.h
> 16 hashtable.h
> 18 migrate-daemon.h
> 7 migrate-modulator.h
> 12 queue-flush-job.h
> 8 rambuf.h
> 13 recover.h
> 18 segment.h
> 8 superblock-recorder.h
> 9 target.h
> 30 util.h
> 384 writeboost.h
> 99 bigarray.c
> 192 cache-alloc.c
> 36 defer-barrier.c
> 33 dirty-sync.c
> 85 flush-daemon.c
> 234 format-cache.c
> 553 handle-io.c
> 109 hashtable.c
> 345 migrate-daemon.c
> 41 migrate-modulator.c
> 169 queue-flush-job.c
> 52 rambuf.c
> 308 recover.c
> 118 segment.c
> 61 superblock-recorder.c
> 376 target.c
> 126 util.c

Unfortunately I think you went too far with all these different small
files, I was hoping to see 2 or 3 .c files and a couple .h files.

Maybe fold all the daemon code into a 1 .c and 1 .h ?

The core of the writeboost target in dm-writeboost-target.c ?

And fold all the other data structures into a 1 .c and 1 .h ?

When folding these files together feel free to use dividers in the code
like dm-thin.c and dm-cache-target.c do, e.g.:

/*-----------------------------------------------------------------*/

Mike


* Re: Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
  2013-09-27 18:35               ` Mike Snitzer
@ 2013-09-28 11:29                 ` Akira Hayakawa
  0 siblings, 0 replies; 25+ messages in thread
From: Akira Hayakawa @ 2013-09-28 11:29 UTC (permalink / raw)
  To: snitzer
  Cc: gregkh, devel, linux-kernel, dm-devel, cesarb, joe, akpm, agk,
	m.chehab, ejt, dan.carpenter, ruby.wktk

Hi,

Two major items of progress:
1) .ctr now accepts the segment size, so it takes 3 arguments: <backing dev> <cache dev> <segment size order>.
2) Folded the small split files that I presented in the previous progress report.

For 1)
I use a zero-length array to accept the segment size dynamically.
Previously, writeboost had the parameter embedded and one had to re-compile
the code to change it, which badly hurt usability.
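
A minimal userspace sketch of the zero-length-array idea (illustrative
only; the struct and names below are hypothetical, not the actual
writeboost definitions):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-core header whose trailing array length is decided
 * at runtime from a segment-size parameter. */
struct seg_hdr {
	unsigned nr_blocks;
	int blocks[0];	/* zero-length (flexible) array, GNU C style as in the kernel */
};

static struct seg_hdr *alloc_seg_hdr(unsigned nr_blocks)
{
	/* One allocation covers the header plus nr_blocks trailing elements. */
	struct seg_hdr *h = malloc(sizeof(*h) + nr_blocks * sizeof(int));
	if (!h)
		return NULL;
	h->nr_blocks = nr_blocks;
	memset(h->blocks, 0, nr_blocks * sizeof(int));
	return h;
}

int main(void)
{
	struct seg_hdr *h = alloc_seg_hdr(1 << 7);
	if (!h)
		return 1;
	printf("%u trailing elements allocated\n", h->nr_blocks);
	free(h);
	return 0;
}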

For 2)
> Unfortunately I think you went too far with all these different small
> files, I was hoping to see 2 or 3 .c files and a couple .h files.
> 
> Maybe fold all the daemon code into a 1 .c and 1 .h ?
> 
> The core of the writeboost target in dm-writeboost-target.c ?
> 
> And fold all the other data structures into a 1 .c and 1 .h ?
> 
> When folding these files together feel free to use dividers in the code
> like dm-thin.c and dm-cache-target.c do, e.g.:
> 
> /*-----------------------------------------------------------------*/
As Mike pointed out, splitting into almost 20 files went too far.
I aggregated these files into 3 .c files and 3 .h files in total, which are shown below.

---------- Summary ----------
39 dm-writeboost-daemon.h
46 dm-writeboost-metadata.h
413 dm-writeboost.h
577 dm-writeboost-daemon.c
1129 dm-writeboost-metadata.c
1212 dm-writeboost-target.c
81 dm-writeboost.mod.c

The responsibility of each .c file
is the policy behind this split.

a) dm-writeboost-metadata.c
This file knows how the metadata is laid out on the cache device.
It can audit/format the cache device metadata
and resume/free the in-core cache metadata from what is on the cache device.
It also provides accessors to the resumed in-core metadata.

b) dm-writeboost-target.c
This file contains all the methods that define the target type.
In terms of I/O processing, this file only covers the path
from when a bio is accepted to when a flush job is queued,
which is described as "foreground processing" in the document.
What happens after the job is queued is defined in the -daemon.c file.

c) dm-writeboost-daemon.c
This file contains all the daemons, as Mike suggested.
Maybe superblock_recorder should be in the -metadata.c file,
but I chose to put it in this file for unity.

Thanks,
Akira


followed by the current .h files.

---------- dm-writeboost-daemon.h ----------
/*
 * Copyright (C) 2012-2013 Akira Hayakawa <ruby.wktk@gmail.com>
 *
 * This file is released under the GPL.
 */

#ifndef DM_WRITEBOOST_DAEMON_H
#define DM_WRITEBOOST_DAEMON_H

/*----------------------------------------------------------------*/

void flush_proc(struct work_struct *);

/*----------------------------------------------------------------*/

void queue_barrier_io(struct wb_cache *, struct bio *);
void barrier_deadline_proc(unsigned long data);
void flush_barrier_ios(struct work_struct *);

/*----------------------------------------------------------------*/

void migrate_proc(struct work_struct *);
void wait_for_migration(struct wb_cache *, u64 id);

/*----------------------------------------------------------------*/

void modulator_proc(struct work_struct *);

/*----------------------------------------------------------------*/

void sync_proc(struct work_struct *);

/*----------------------------------------------------------------*/

void recorder_proc(struct work_struct *);

/*----------------------------------------------------------------*/

#endif

---------- dm-writeboost-metadata.h ----------
/*
 * Copyright (C) 2012-2013 Akira Hayakawa <ruby.wktk@gmail.com>
 *
 * This file is released under the GPL.
 */

#ifndef DM_WRITEBOOST_METADATA_H
#define DM_WRITEBOOST_METADATA_H

/*----------------------------------------------------------------*/

struct segment_header *get_segment_header_by_id(struct wb_cache *, u64 segment_id);
sector_t calc_mb_start_sector(struct wb_cache *, struct segment_header *, cache_nr mb_idx);
bool is_on_buffer(struct wb_cache *, cache_nr mb_idx);

/*----------------------------------------------------------------*/

struct ht_head *ht_get_head(struct wb_cache *, struct lookup_key *);
struct metablock *ht_lookup(struct wb_cache *,
			    struct ht_head *, struct lookup_key *);
void ht_register(struct wb_cache *, struct ht_head *,
		 struct lookup_key *, struct metablock *);
void ht_del(struct wb_cache *, struct metablock *);
void discard_caches_inseg(struct wb_cache *,
			  struct segment_header *);

/*----------------------------------------------------------------*/

int __must_check audit_cache_device(struct dm_dev *, struct wb_cache *,
				    bool *need_format, bool *allow_format);
int __must_check format_cache_device(struct dm_dev *, struct wb_cache *);

/*----------------------------------------------------------------*/

void prepare_segment_header_device(struct segment_header_device *dest,
				   struct wb_cache *,
				   struct segment_header *src);

/*----------------------------------------------------------------*/

int __must_check resume_cache(struct wb_cache *cache, struct dm_dev *dev);
void free_cache(struct wb_cache *cache);

/*----------------------------------------------------------------*/

#endif

---------- dm-writeboost.h ----------
/*
 * Copyright (C) 2012-2013 Akira Hayakawa <ruby.wktk@gmail.com>
 *
 * This file is released under the GPL.
 */

#ifndef DM_WRITEBOOST_H
#define DM_WRITEBOOST_H

/*----------------------------------------------------------------*/

#define DM_MSG_PREFIX "writeboost"

#include <linux/module.h>
#include <linux/version.h>
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/mutex.h>
#include <linux/sched.h>
#include <linux/timer.h>
#include <linux/device-mapper.h>
#include <linux/dm-io.h>

#define WBERR(f, args...) \
	DMERR("err@%d " f, __LINE__, ## args)
#define WBWARN(f, args...) \
	DMWARN("warn@%d " f, __LINE__, ## args)
#define WBINFO(f, args...) \
	DMINFO("info@%d " f, __LINE__, ## args)

/*
 * The amount of RAM buffer pool to pre-allocate.
 */
#define RAMBUF_POOL_ALLOCATED 64 /* MB */

/*
 * The Detail of the Disk Format
 *
 * Whole:
 * Superblock (1MB) + Segment + Segment ...
 * We reserve the first 1MB as the superblock.
 *
 * Superblock:
 * head <----                                     ----> tail
 * superblock header (512B) + ... + superblock record (512B)
 *
 * Segment:
 * segment_header_device +
 * metablock_device * nr_caches_inseg +
 * (aligned first 4KB region)
 * data[0] (4KB) + data[1] + ... + data[nr_caches_inseg - 1]
 */

/*
 * Superblock Header
 * The first sector of the superblock region.
 * The values are fixed once the device is formatted.
 */

 /*
  * Magic Number
  * "WBst"
  */
#define WRITEBOOST_MAGIC 0x57427374
struct superblock_header_device {
	__le32 magic;
	u8 segment_size_order;
} __packed;

/*
 * Superblock Record (Mutable)
 * The last sector of the superblock region.
 * Records the current cache status as needed.
 */
struct superblock_record_device {
	__le64 last_migrated_segment_id;
} __packed;

/*
 * Cache line index.
 *
 * dm-writeboost can support a cache device
 * of size less than 4KB * (1 << 32),
 * that is, 16TB.
 */
typedef u32 cache_nr;

/*
 * Metadata of a 4KB cache line
 *
 * Dirtiness is defined for each sector
 * in this cache line.
 */
struct metablock {
	sector_t sector; /* key */

	cache_nr idx; /* Const */

	struct hlist_node ht_list;

	/*
	 * 8-bit dirtiness flag,
	 * one bit per sector in the cache line.
	 *
	 * The current implementation
	 * only recovers dirty caches.
	 * Recovering clean caches would complicate the code
	 * but would not be effective,
	 * since only a few of the caches are clean.
	 */
	u8 dirty_bits;
};

/*
 * On-disk metablock
 */
struct metablock_device {
	__le64 sector;

	u8 dirty_bits;

	__le32 lap;
} __packed;

#define SZ_MAX (~(size_t)0)
struct segment_header {
	/*
	 * The ID increases monotonically.
	 * ID 0 is used to tell that the segment is invalid;
	 * valid IDs are >= 1.
	 */
	u64 global_id;

	/*
	 * A segment can be flushed half-done.
	 * length is the number of
	 * metablocks that must be counted
	 * when resuming.
	 */
	u8 length;

	cache_nr start_idx; /* Const */
	sector_t start_sector; /* Const */

	struct list_head migrate_list;

	/*
	 * This segment cannot be migrated
	 * to the backing store
	 * until it is flushed.
	 * A flushed segment resides on the cache device.
	 */
	struct completion flush_done;

	/*
	 * This segment cannot be overwritten
	 * until it is migrated.
	 */
	struct completion migrate_done;

	spinlock_t lock;

	atomic_t nr_inflight_ios;

	struct metablock mb_array[0];
};

/*
 * (Locking)
 * Locking metablocks at their own granularity
 * would need too much memory for lock structures.
 * A metablock is locked only by locking the parent segment
 * that includes it.
 */
#define lockseg(seg, flags) spin_lock_irqsave(&(seg)->lock, flags)
#define unlockseg(seg, flags) spin_unlock_irqrestore(&(seg)->lock, flags)

/*
 * On-disk segment header.
 *
 * Must be at most 4KB large.
 */
struct segment_header_device {
	/* - FROM - At most 512 bytes, for atomicity. --- */
	__le64 global_id;
	/*
	 * How many cache lines in this segment
	 * should be counted when resuming.
	 */
	u8 length;
	/*
	 * On what lap of rotation over the cache device;
	 * used to find the head and tail among the
	 * segments on the cache device.
	 */
	__le32 lap;
	/* - TO -------------------------------------- */
	/* This array must locate at the tail */
	struct metablock_device mbarr[0];
} __packed;

struct rambuffer {
	void *data;
	struct completion done;
};

enum STATFLAG {
	STAT_WRITE = 0,
	STAT_HIT,
	STAT_ON_BUFFER,
	STAT_FULLSIZE,
};
#define STATLEN (1 << 4)

struct lookup_key {
	sector_t sector;
};

struct ht_head {
	struct hlist_head ht_list;
};

struct wb_device;
struct wb_cache {
	struct wb_device *wb;

	struct dm_dev *device;
	struct mutex io_lock;
	cache_nr nr_caches; /* Const */
	u64 nr_segments; /* Const */
	u8 segment_size_order; /* Const */
	u8 nr_caches_inseg; /* Const */
	struct bigarray *segment_header_array;

	/*
	 * Chained hashtable
	 *
	 * Writeboost uses a chained hashtable
	 * for cache lookup.
	 * Cache discarding happens often, and
	 * this structure fits that need.
	 */
	struct bigarray *htable;
	size_t htsize;
	struct ht_head *null_head;

	cache_nr cursor; /* Index that was written most recently */
	struct segment_header *current_seg;

	struct rambuffer *current_rambuf;
	size_t nr_rambuf_pool; /* Const */
	struct rambuffer *rambuf_pool;

	u64 last_migrated_segment_id;
	u64 last_flushed_segment_id;
	u64 reserving_segment_id;

	/*
	 * Flush daemon
	 *
	 * Writeboost first queues the segment to flush
	 * and the flush daemon asynchronously
	 * flushes it to the cache device.
	 */
	struct work_struct flush_work;
	struct workqueue_struct *flush_wq;
	spinlock_t flush_queue_lock;
	struct list_head flush_queue;
	wait_queue_head_t flush_wait_queue;

	/*
	 * Deferred ACK for barriers.
	 */
	struct work_struct barrier_deadline_work;
	struct timer_list barrier_deadline_timer;
	struct bio_list barrier_ios;
	unsigned long barrier_deadline_ms; /* param */

	/*
	 * Migration daemon
	 *
	 * Migration also works in the background.
	 *
	 * If allow_migrate is true,
	 * the migrate daemon starts migrating
	 * if there are segments to migrate.
	 */
	struct work_struct migrate_work;
	struct workqueue_struct *migrate_wq;
	bool allow_migrate; /* param */

	/*
	 * Batched Migration
	 *
	 * Migration is done atomically,
	 * with a number of segments batched together.
	 */
	wait_queue_head_t migrate_wait_queue;
	atomic_t migrate_fail_count;
	atomic_t migrate_io_count;
	struct list_head migrate_list;
	u8 *dirtiness_snapshot;
	void *migrate_buffer;
	size_t nr_cur_batched_migration;
	size_t nr_max_batched_migration; /* param */

	/*
	 * Migration modulator
	 *
	 * This daemon turns migration on and off
	 * according to the load of the backing store.
	 */
	struct work_struct modulator_work;
	bool enable_migration_modulator; /* param */

	/*
	 * Superblock Recorder
	 *
	 * Update the superblock record
	 * periodically.
	 */
	struct work_struct recorder_work;
	unsigned long update_record_interval; /* param */

	/*
	 * Cache Synchronizer
	 *
	 * Sync the dirty writes
	 * periodically.
	 */
	struct work_struct sync_work;
	unsigned long sync_interval; /* param */

	/*
	 * on_terminate is set to true
	 * to notify all the background daemons to
	 * stop their operations.
	 */
	bool on_terminate;

	atomic64_t stat[STATLEN];
};

struct wb_device {
	struct dm_target *ti;

	struct dm_dev *device;

	struct wb_cache *cache;

	u8 migrate_threshold;

	atomic64_t nr_dirty_caches;
};

struct flush_job {
	struct list_head flush_queue;
	struct segment_header *seg;
	/*
	 * The data to flush to cache device.
	 */
	struct rambuffer *rambuf;
	/*
	 * List of bios with barrier flags.
	 */
	struct bio_list barrier_ios;
};

#define PER_BIO_VERSION KERNEL_VERSION(3, 8, 0)
#if LINUX_VERSION_CODE >= PER_BIO_VERSION
struct per_bio_data {
	void *ptr;
};
#endif

/*----------------------------------------------------------------*/

void flush_current_buffer(struct wb_cache *);
void inc_nr_dirty_caches(struct wb_device *);
void cleanup_mb_if_dirty(struct wb_cache *,
			 struct segment_header *,
			 struct metablock *);
u8 atomic_read_mb_dirtiness(struct segment_header *,
			    struct metablock *);

/*----------------------------------------------------------------*/

extern struct workqueue_struct *safe_io_wq;
extern struct dm_io_client *wb_io_client;

void *do_kmalloc_retry(size_t size, gfp_t flags, int lineno);
#define kmalloc_retry(size, flags) \
	do_kmalloc_retry((size), (flags), __LINE__)

int dm_safe_io_internal(
		struct dm_io_request *,
		unsigned num_regions, struct dm_io_region *,
		unsigned long *err_bits, bool thread, int lineno);
#define dm_safe_io(io_req, num_regions, regions, err_bits, thread) \
	dm_safe_io_internal((io_req), (num_regions), (regions), \
			    (err_bits), (thread), __LINE__)

void dm_safe_io_retry_internal(
		struct dm_io_request *,
		unsigned num_regions, struct dm_io_region *,
		bool thread, int lineno);
#define dm_safe_io_retry(io_req, num_regions, regions, thread) \
	dm_safe_io_retry_internal((io_req), (num_regions), (regions), \
				  (thread), __LINE__)

sector_t dm_devsize(struct dm_dev *);

#endif


* Re: Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
  2013-09-26  3:43           ` Dave Chinner
@ 2013-10-01  8:26             ` Joe Thornber
  2013-10-03  0:01               ` [dm-devel] " Mikulas Patocka
  0 siblings, 1 reply; 25+ messages in thread
From: Joe Thornber @ 2013-10-01  8:26 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Akira Hayakawa, snitzer, gregkh, devel, linux-kernel, dm-devel,
	cesarb, joe, akpm, agk, m.chehab, ejt, dan.carpenter

On Thu, Sep 26, 2013 at 01:43:25PM +1000, Dave Chinner wrote:
> On Tue, Sep 24, 2013 at 09:20:50PM +0900, Akira Hayakawa wrote:
> > * Deferring ACK for barrier writes
> > Barrier flags such as REQ_FUA and REQ_FLUSH are handled lazily.
> > Immediately handling these bios badly slows down writeboost.
> > It surveils the bios with these flags and forcefully flushes them
> > at worst case within `barrier_deadline_ms` period.
> 
> That rings alarm bells.
> 
> If the filesystem is using REQ_FUA/REQ_FLUSH for ordering reasons,
> delaying them to allow other IOs to be submitted and dispatched may
> very well violate the IO ordering constraints the filesystem is
> trying to achieve.

If the fs is using REQ_FUA for ordering they need to wait for
completion of that bio before issuing any subsequent bio that needs to
be strictly ordered.  So I don't think there is any issue here.

> Alternatively, delaying them will stall the filesystem because it's
> waiting for said REQ_FUA IO to complete. For example, journal writes
> in XFS are extremely IO latency sensitive in workloads that have a
> significant number of ordering constraints (e.g. O_SYNC writes,
> fsync, etc) and delaying even one REQ_FUA/REQ_FLUSH can stall the
> filesystem for the majority of that barrier_deadline_ms.

Yes, this is a valid concern, but I assume Akira has benchmarked.
With dm-thin, I delay the REQ_FUA/REQ_FLUSH for a tiny amount, just to
see if there are any other FUA requests on my queue that can be
aggregated into a single flush.  I agree with you that the target
should never delay waiting for new io; that's asking for trouble.

- Joe


* Re: [dm-devel] Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
  2013-10-01  8:26             ` Joe Thornber
@ 2013-10-03  0:01               ` Mikulas Patocka
  2013-10-04  2:04                 ` Dave Chinner
  0 siblings, 1 reply; 25+ messages in thread
From: Mikulas Patocka @ 2013-10-03  0:01 UTC (permalink / raw)
  To: Joe Thornber
  Cc: Dave Chinner, devel, snitzer, gregkh, linux-kernel,
	Akira Hayakawa, dm-devel, agk, joe, akpm, dan.carpenter, ejt,
	cesarb, m.chehab



On Tue, 1 Oct 2013, Joe Thornber wrote:

> > Alternatively, delaying them will stall the filesystem because it's
> > waiting for said REQ_FUA IO to complete. For example, journal writes
> > in XFS are extremely IO latency sensitive in workloads that have a
> > significant number of ordering constraints (e.g. O_SYNC writes,
> > fsync, etc) and delaying even one REQ_FUA/REQ_FLUSH can stall the
> > filesystem for the majority of that barrier_deadline_ms.
> 
> Yes, this is a valid concern, but I assume Akira has benchmarked.
> With dm-thin, I delay the REQ_FUA/REQ_FLUSH for a tiny amount, just to
> see if there are any other FUA requests on my queue that can be
> aggregated into a single flush.  I agree with you that the target
> should never delay waiting for new io; that's asking for trouble.
> 
> - Joe

You could send the first REQ_FUA/REQ_FLUSH request directly to the disk 
and aggregate all the requests that were received while you processed the 
initial request. This way, you can do request batching without introducing 
artificial delays.
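
A userspace model of that batching pattern (purely illustrative; the
struct and function names are made up and no real block-layer API is
used):

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical batching state: one flush at the device at a time. */
struct flush_batcher {
	bool in_flight;	/* a flush is currently being processed by the disk */
	int pending;	/* flush requests that arrived while one was in flight */
};

static void issue_flush(struct flush_batcher *b)
{
	b->in_flight = true;
	printf("issue one flush to the disk\n");
}

/* A REQ_FLUSH/REQ_FUA request arrives from above. */
static void flush_request_arrived(struct flush_batcher *b)
{
	if (!b->in_flight)
		issue_flush(b);		/* the first one goes straight to the disk */
	else
		b->pending++;		/* aggregate the rest, no artificial delay */
}

/* The in-flight flush completed: one follow-up flush covers the whole batch. */
static void flush_completed(struct flush_batcher *b)
{
	b->in_flight = false;
	if (b->pending) {
		printf("%d batched requests satisfied by the next flush\n", b->pending);
		b->pending = 0;
		issue_flush(b);
	}
}

int main(void)
{
	struct flush_batcher b = { false, 0 };

	flush_request_arrived(&b);	/* dispatched immediately */
	flush_request_arrived(&b);	/* batched */
	flush_request_arrived(&b);	/* batched */
	flush_completed(&b);		/* issues one flush for both batched requests */
	flush_completed(&b);
	return 0;
}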

Mikulas


* Re: [dm-devel] Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
  2013-10-03  0:01               ` [dm-devel] " Mikulas Patocka
@ 2013-10-04  2:04                 ` Dave Chinner
  2013-10-05  7:51                   ` Akira Hayakawa
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2013-10-04  2:04 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Joe Thornber, devel, snitzer, gregkh, linux-kernel,
	Akira Hayakawa, dm-devel, agk, joe, akpm, dan.carpenter, ejt,
	cesarb, m.chehab

On Wed, Oct 02, 2013 at 08:01:45PM -0400, Mikulas Patocka wrote:
> 
> 
> On Tue, 1 Oct 2013, Joe Thornber wrote:
> 
> > > Alternatively, delaying them will stall the filesystem because it's
> > > waiting for said REQ_FUA IO to complete. For example, journal writes
> > > in XFS are extremely IO latency sensitive in workloads that have a
> > > significant number of ordering constraints (e.g. O_SYNC writes,
> > > fsync, etc) and delaying even one REQ_FUA/REQ_FLUSH can stall the
> > > filesystem for the majority of that barrier_deadline_ms.
> > 
> > Yes, this is a valid concern, but I assume Akira has benchmarked.
> > With dm-thin, I delay the REQ_FUA/REQ_FLUSH for a tiny amount, just to
> > see if there are any other FUA requests on my queue that can be
> > aggregated into a single flush.  I agree with you that the target
> > should never delay waiting for new io; that's asking for trouble.
> > 
> > - Joe
> 
> You could send the first REQ_FUA/REQ_FLUSH request directly to the disk 
> and aggregate all the requests that were received while you processed the 
> initial request. This way, you can do request batching without introducing 
> artificial delays.

Yes, that's what XFS does with its log when lots of fsync requests
come in. i.e. the first is dispatched immediately, and the others
are gathered into the next log buffer until it is either full or the
original REQ_FUA log write completes.

That's where arbitrary delays in the storage stack below XFS cause
problems - if the first FUA log write is delayed, the next log
buffer will get filled, issued and delayed, and when we run out of
log buffers (there are 8 maximum) the entire log subsystem will
stall, stopping *all* log commit operations until log buffer
IOs complete and become free again. i.e. it can stall modifications
across the entire filesystem while we wait for batch timeouts to
expire and issue and complete FUA requests.

IMNSHO, REQ_FUA/REQ_FLUSH optimisations should be done at the
point where they are issued - any attempt to further optimise them
by adding delays down in the stack to aggregate FUA operations will
only increase latency of the operations that the issuer wants to have
complete as fast as possible....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [dm-devel] Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
  2013-10-04  2:04                 ` Dave Chinner
@ 2013-10-05  7:51                   ` Akira Hayakawa
  2013-10-07 23:43                     ` Dave Chinner
  0 siblings, 1 reply; 25+ messages in thread
From: Akira Hayakawa @ 2013-10-05  7:51 UTC (permalink / raw)
  To: david
  Cc: mpatocka, thornber, devel, snitzer, gregkh, linux-kernel,
	dm-devel, agk, joe, akpm, dan.carpenter, ejt, cesarb, m.chehab,
	ruby.wktk

Dave,

> That's where arbitrary delays in the storage stack below XFS cause
> problems - if the first FUA log write is delayed, the next log
> buffer will get filled, issued and delayed, and when we run out of
> log buffers (there are 8 maximum) the entire log subsystem will
> stall, stopping *all* log commit operations until log buffer
> IOs complete and become free again. i.e. it can stall modifications
> across the entire filesystem while we wait for batch timeouts to
> expire and issue and complete FUA requests.
To me, this sounds like a design failure in the XFS log subsystem.
Or just a limitation of metadata journaling.

> IMNSHO, REQ_FUA/REQ_FLUSH optimisations should be done at the
> point where they are issued - any attempt to further optimise them
> by adding delays down in the stack to aggregate FUA operations will
> only increase latency of the operations that the issuer wants to have
> complete as fast as possible....
A lower layer stack that attempts to optimize further
can benefit any filesystem.
So, your opinion is not always correct, although
it is always correct in error handling or memory management.

I have proposed a future plan of using persistent memory.
I believe that with this leap forward
filesystems are free from doing such optimization
relevant to write barriers. For more detail, please see my post.
https://lkml.org/lkml/2013/10/4/186

However,
I think I should leave an option to disable the optimization
in case the upper layer doesn't like it.
Maybe writeboost should disable deferring barriers
if the barrier_deadline_ms parameter is specifically 0.
The Linux kernel's layered architecture is obviously not always perfect,
so there are similar cases at other boundaries,
such as O_DIRECT bypassing the page cache.

Maybe dm-thin and dm-cache should add such a switch.

Akira


* Re: [dm-devel] Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
  2013-10-05  7:51                   ` Akira Hayakawa
@ 2013-10-07 23:43                     ` Dave Chinner
  2013-10-08  9:41                       ` Christoph Hellwig
  2013-10-08 10:57                       ` [dm-devel] " Akira Hayakawa
  0 siblings, 2 replies; 25+ messages in thread
From: Dave Chinner @ 2013-10-07 23:43 UTC (permalink / raw)
  To: Akira Hayakawa
  Cc: mpatocka, thornber, devel, snitzer, gregkh, linux-kernel,
	dm-devel, agk, joe, akpm, dan.carpenter, ejt, cesarb, m.chehab

On Sat, Oct 05, 2013 at 04:51:16PM +0900, Akira Hayakawa wrote:
> Dave,
> 
> > That's where arbitrary delays in the storage stack below XFS cause
> > problems - if the first FUA log write is delayed, the next log
> > buffer will get filled, issued and delayed, and when we run out of
> > log buffers (there are 8 maximum) the entire log subsystem will
> > stall, stopping *all* log commit operations until log buffer
> > IOs complete and become free again. i.e. it can stall modifications
> > across the entire filesystem while we wait for batch timeouts to
> > expire and issue and complete FUA requests.
> To me, this sounds like a design failure in the XFS log subsystem.

If you say so. As it is, XFS is the best of all the linux
filesystems when it comes to performance under a heavy fsync
workload, so if you consider it broken by design then you've got a
horror show waiting for you on any other filesystem...

> Or just a limitation of metadata journaling.

It's a recovery limitation - the more uncompleted log buffers we
have outstanding, the more space in the log will be considered
unrecoverable during a crash...

> > IMNSHO, REQ_FUA/REQ_FLUSH optimisations should be done at the
> > point where they are issued - any attempt to further optimise them
> > by adding delays down in the stack to aggregate FUA operations will
> > only increase latency of the operations that the issuer wants to have
> > complete as fast as possible....
> A lower layer stack that attempts to optimize further
> can benefit any filesystem.
> So, your opinion is not always correct, although
> it is always correct in error handling or memory management.
> 
> I have proposed a future plan of using persistent memory.
> I believe that with this leap forward
> filesystems are free from doing such optimization
> relevant to write barriers. For more detail, please see my post.
> https://lkml.org/lkml/2013/10/4/186

Sure, we already do that in the storage stack to minimise the impact
of FUA operations - it's called a non-volatile write cache, and most
RAID controllers have them. They rely on immediate dispatch of FUA
operations to get them into the write cache as quickly as possible
(i.e. what filesystems do right now), and that is something your
proposed behaviour will prevent.

i.e. there's no point justifying a behaviour with "we could do this
in future so lets ignore the impact on current users"...

> However,
> I think I should leave option to disable the optimization
> in case the upper layer doesn't like it.
> Maybe, writeboost should disable deferring barriers
> if barrier_deadline_ms parameter is especially 0.
> Linux kernel's layered architecture is obviously not always perfect
> so there are similar cases in other boundaries
> such as O_DIRECT to bypass the page cache.

Right - but you can't detect O_DIRECT at the dm layer. IOWs, you're
relying on the user tweaking the correct knobs for their workload.

e.g. what happens if a user has a mixed workload - one where
performance benefits are only seen by delaying FUA, and another that
is seriously slowed down by delaying FUA requests?  This is where
knobs are problematic....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [dm-devel] Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
  2013-10-07 23:43                     ` Dave Chinner
@ 2013-10-08  9:41                       ` Christoph Hellwig
  2013-10-08 10:37                         ` Akira Hayakawa
  2013-10-08 10:57                       ` [dm-devel] " Akira Hayakawa
  1 sibling, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2013-10-08  9:41 UTC (permalink / raw)
  To: device-mapper development
  Cc: Akira Hayakawa, devel, thornber, snitzer, gregkh, linux-kernel,
	mpatocka, dan.carpenter, joe, akpm, m.chehab, ejt, agk, cesarb

On Tue, Oct 08, 2013 at 10:43:07AM +1100, Dave Chinner wrote:
> > Maybe writeboost should disable deferring barriers
> > if the barrier_deadline_ms parameter is specifically 0.
> > The Linux kernel's layered architecture is obviously not always perfect,
> > so there are similar cases at other boundaries,
> > such as O_DIRECT bypassing the page cache.
> 
> Right - but you can't detect O_DIRECT at the dm layer. IOWs, you're
> relying on the user tweaking the correct knobs for their workload.

You can detect O_DIRECT writes by second-guessing a special combination
of REQ_ flags only used there, as cfq tries to treat it specially:

#define WRITE_SYNC              (WRITE | REQ_SYNC | REQ_NOIDLE)
#define WRITE_ODIRECT           (WRITE | REQ_SYNC)

the lack of REQ_NOIDLE when REQ_SYNC is set gives it away.  Not related
to the FLUSH or FUA flags in any way, though.
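
As a rough illustration, a check along those lines could look like the
sketch below (userspace only; the bit values are placeholders, not the
kernel's actual REQ_ flag definitions):

#include <stdbool.h>
#include <stdio.h>

/* Placeholder bit values for illustration only -- the real definitions
 * live in the kernel headers. */
#define REQ_WRITE   (1u << 0)
#define REQ_SYNC    (1u << 1)
#define REQ_NOIDLE  (1u << 2)

/* WRITE_SYNC sets REQ_NOIDLE, WRITE_ODIRECT does not, so a sync write
 * without REQ_NOIDLE is (heuristically) an O_DIRECT write. */
static bool looks_like_odirect_write(unsigned long rw)
{
	return (rw & REQ_WRITE) && (rw & REQ_SYNC) && !(rw & REQ_NOIDLE);
}

int main(void)
{
	unsigned long write_sync = REQ_WRITE | REQ_SYNC | REQ_NOIDLE;
	unsigned long write_odirect = REQ_WRITE | REQ_SYNC;

	printf("WRITE_SYNC    -> %d\n", looks_like_odirect_write(write_sync));
	printf("WRITE_ODIRECT -> %d\n", looks_like_odirect_write(write_odirect));
	return 0;
}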

Akira, can you explain the workloads where your delay of FLUSH or FUA
requests helps you in any way?  I very much agree with Dave's reasoning,
but if you found workloads where your hack helps we should make sure we
fix them at the place where they are issued.


* Re: [dm-devel] Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
  2013-10-08  9:41                       ` Christoph Hellwig
@ 2013-10-08 10:37                         ` Akira Hayakawa
       [not found]                           ` <20131008152924.GA3644@redhat.com>
  0 siblings, 1 reply; 25+ messages in thread
From: Akira Hayakawa @ 2013-10-08 10:37 UTC (permalink / raw)
  To: hch
  Cc: dm-devel, devel, thornber, snitzer, gregkh, linux-kernel,
	mpatocka, dan.carpenter, joe, akpm, m.chehab, ejt, agk, cesarb,
	ruby.wktk

Christoph,

> You can detect O_DIRECT writes by second guession a special combination
> of REQ_ flags only used there, as cfg tries to treat it special:
> 
> #define WRITE_SYNC              (WRITE | REQ_SYNC | REQ_NOIDLE)
> #define WRITE_ODIRECT           (WRITE | REQ_SYNC)
> 
> the lack of REQ_NOIDLE when REQ_SYNC is set gives it away.  Not related
> to the FLUSH or FUA flags in any way, though.
Thanks.
But our problem is to detect whether the bio may or may not be deferred.
Is the REQ_NOIDLE flag the one to check?

> Akira, can you explain the workloads where your delay of FLUSH or FUA
> requests helps you in any way?  I very much agree with Dave's reasoning,
> but if you found workloads where your hack helps we should make sure we
> fix them at the place where they are issued.
One example is a file server accessed by multiple users.
A barrier is submitted when a user closes a file, for example.

As I said in my previous post
https://lkml.org/lkml/2013/10/4/186
writeboost has a RAM buffer and we want it to be
filled with writes and then flushed to the cache device,
which takes all the barriers away upon completion.
In that case we pay the minimum penalty for the barriers.
Interestingly, writeboost is happy with a lot of writes.

By deferring these barriers (FLUSH and FUA),
multiple barriers are likely to be merged on a RAM buffer
and then processed by replacing them with only one FLUSH.

Merging the barriers and replacing them with a single FLUSH
by accepting a lot of writes
is the reason for deferring barriers in writeboost.
If you want to know more, I recommend you
look at the source code to see
how queue_barrier_io() is used and
how the barriers are kidnapped in queue_flushing().
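
For the overall shape of that deferral, here is a userspace model
(illustrative only; the names and the deadline value are made up, and
this is not the code in queue_barrier_io()/queue_flushing()):

#include <stdio.h>

#define BARRIER_DEADLINE_MS 10	/* illustrative value for the tunable */

struct defer_state {
	int queued_barriers;	/* REQ_FLUSH/REQ_FUA bios held back */
	long first_queued_at_ms;	/* when the oldest queued barrier arrived */
};

/* A barrier bio arrives: hold it instead of acknowledging immediately. */
static void queue_barrier(struct defer_state *s, long now_ms)
{
	if (!s->queued_barriers)
		s->first_queued_at_ms = now_ms;
	s->queued_barriers++;
}

/* The RAM buffer is flushed (e.g. because it filled up): every queued
 * barrier is satisfied by that single flush, so ACK them all at once. */
static void ram_buffer_flushed(struct defer_state *s)
{
	printf("ACK %d barriers with one flush\n", s->queued_barriers);
	s->queued_barriers = 0;
}

/* Periodic check: if the oldest barrier has waited longer than the
 * deadline, force a flush even though the buffer is not full. */
static void check_deadline(struct defer_state *s, long now_ms)
{
	if (s->queued_barriers &&
	    now_ms - s->first_queued_at_ms >= BARRIER_DEADLINE_MS)
		ram_buffer_flushed(s);	/* forced flush */
}

int main(void)
{
	struct defer_state s = { 0, 0 };

	queue_barrier(&s, 0);
	queue_barrier(&s, 3);
	check_deadline(&s, 5);	/* deadline not reached: nothing happens */
	check_deadline(&s, 12);	/* deadline hit: one flush ACKs both */
	return 0;
}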

Akira


* Re: [dm-devel] Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
  2013-10-07 23:43                     ` Dave Chinner
  2013-10-08  9:41                       ` Christoph Hellwig
@ 2013-10-08 10:57                       ` Akira Hayakawa
  1 sibling, 0 replies; 25+ messages in thread
From: Akira Hayakawa @ 2013-10-08 10:57 UTC (permalink / raw)
  To: david
  Cc: mpatocka, thornber, devel, snitzer, gregkh, linux-kernel,
	dm-devel, agk, joe, akpm, dan.carpenter, ejt, cesarb, m.chehab,
	ruby.wktk

Dave,

> i.e. there's no point justifying a behaviour with "we could do this
> in future so lets ignore the impact on current users"...
Sure, I am happy if we find a solution that
is good for both of us, or in other words for the filesystem and block layers.

> e.g. what happens if a user has a mixed workload - one where
> performance benefits are only seen by delaying FUA, and another that
> is seriously slowed down by delaying FUA requests?  This is where
> knobs are problematic....
You are right.
But there is no perfect solution that satisfies everyone.
Dealing with each requirement will only complicate the code.
Stepping away from the user and
focusing on the filesystem-block boundary,
>> Maybe writeboost should disable deferring barriers
>> if the barrier_deadline_ms parameter is specifically 0.
adding a switch for the mounted filesystem to decide on/off
is a simple but effective solution, I believe.

Deciding on a per-bio basis instead of per-device could be another solution.
I am happy if I can check whether the bio "may or may not defer the barrier".

Akira


* Re: Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
       [not found]                           ` <20131008152924.GA3644@redhat.com>
@ 2013-10-09  1:07                             ` Akira Hayakawa
  0 siblings, 0 replies; 25+ messages in thread
From: Akira Hayakawa @ 2013-10-09  1:07 UTC (permalink / raw)
  To: snitzer
  Cc: hch, dm-devel, devel, thornber, gregkh, linux-kernel, mpatocka,
	dan.carpenter, joe, akpm, m.chehab, ejt, agk, cesarb, ruby.wktk

Mike,

I am happy to see that
people from the filesystem down to the block subsystem
have been discussing how to handle barriers in each layer
almost independently.

>> Merging the barriers and replacing them with a single FLUSH
>> by accepting a lot of writes
>> is the reason for deferring barriers in writeboost.
>> If you want to know more, I recommend you
>> look at the source code to see
>> how queue_barrier_io() is used and
>> how the barriers are kidnapped in queue_flushing().
> 
> AFAICT, this is an unfortunate hack resulting from dm-writeboost being a
> bio-based DM target.  The block layer already has support for FLUSH
> merging, see commit ae1b1539622fb4 ("block: reimplement FLUSH/FUA to
> support merge")

I have read the comments on this patch.
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=ae1b1539622fb46e51b4d13b3f9e5f4c713f86ae

My understanding is that
REQ_FUA and REQ_FLUSH are decomposed into more primitive flags
in accordance with the properties of the device.
{PRE|POST}FLUSH requests are queued in flush_queue[one of the two]
(which is often called the "pending" queue), and
blk_kick_flush is called, which defers flushing; later,
if a few conditions are satisfied, it actually inserts "a single" flush request
no matter how many flush requests are in the pending queue
(judged simply by !list_empty(pending)).

If my understanding is correct,
we are deferring flushes across three layers.

Let me summarize.
- For the filesystem, Dave said that metadata journaling defers
  barriers.
- For device-mapper, writeboost, dm-cache and dm-thin defer
  barriers.
- For the block layer, it defers barriers, which results in
  merging several requests into one after all.

I think writeboost cannot discard this deferring hack because
deferring the barriers is usually very effective at
making it likely to fill the RAM buffer, which
makes the write throughput higher and decreases the CPU usage.
However, for particular cases such as what Dave pointed out,
this hack is just a disturbance.
Even for writeboost, the hack in the patch
is unfortunately just a disturbance too.
That an upper layer dislikes a lower layer's hidden optimization is
just a limitation of the layered architecture of the Linux kernel.

I think these three layers are thinking almost the same thing:
these hacks are all good, and each layer
preparing a switch to turn the optimization on/off
is what we have to do as a compromise.

All the problems originate from the fact that
we have a volatile cache, and persistent memory can
take these problems away.

With persistent memory provided,
writeboost can switch off deferring barriers.
However,
I think all the servers being equipped with
persistent memory is a tale of the future.
So, my idea is to maintain both modes
for the RAM buffer type (volatile, non-volatile),
and in the case of the former type
the deferring hack is a good compromise.

Akira


Thread overview: 25+ messages
2013-09-01 11:10 [PATCH] staging: Add dm-writeboost Akira Hayakawa
2013-09-16 21:53 ` Mike Snitzer
2013-09-16 22:49   ` Dan Carpenter
2013-09-17 12:41   ` Akira Hayakawa
2013-09-17 20:18     ` Mike Snitzer
2013-09-17 12:43   ` Akira Hayakawa
2013-09-17 20:59     ` Mike Snitzer
2013-09-22  0:09       ` Reworking dm-writeboost [was: Re: staging: Add dm-writeboost] Akira Hayakawa
2013-09-24 12:20         ` Akira Hayakawa
2013-09-25 17:37           ` Mike Snitzer
2013-09-26  1:42             ` Akira Hayakawa
2013-09-26  1:47             ` Akira Hayakawa
2013-09-27 18:35               ` Mike Snitzer
2013-09-28 11:29                 ` Akira Hayakawa
2013-09-25 23:03           ` Greg KH
2013-09-26  3:43           ` Dave Chinner
2013-10-01  8:26             ` Joe Thornber
2013-10-03  0:01               ` [dm-devel] " Mikulas Patocka
2013-10-04  2:04                 ` Dave Chinner
2013-10-05  7:51                   ` Akira Hayakawa
2013-10-07 23:43                     ` Dave Chinner
2013-10-08  9:41                       ` Christoph Hellwig
2013-10-08 10:37                         ` Akira Hayakawa
     [not found]                           ` <20131008152924.GA3644@redhat.com>
2013-10-09  1:07                             ` Akira Hayakawa
2013-10-08 10:57                       ` [dm-devel] " Akira Hayakawa
