linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [patch v3]DM: dm-insitu-comp: a compressed DM target for SSD
@ 2014-02-18 10:13 Shaohua Li
  2014-03-07  7:57 ` Shaohua Li
  0 siblings, 1 reply; 11+ messages in thread
From: Shaohua Li @ 2014-02-18 10:13 UTC (permalink / raw)
  To: linux-kernel, dm-devel; +Cc: agk, snitzer, axboe


This is a simple DM target supporting compression for SSD only. Under layer SSD
must support 512B sector size, the target only supports 4k sector size.

Disk layout:
|super|...meta...|..data...|

Store unit is 4k (a block). Super is 1 block, which stores meta and data size
and compression algorithm. Meta is a bitmap. For each data block, there are 5
bits meta.

Data:
Data of a block is compressed. Compressed data is round up to 512B, which is
the payload. In disk, payload is stored at the begining of logical sector of
the block. Let's look at an example. Say we store data to block A, which is in
sector B(A*8), its orginal size is 4k, compressed size is 1500. Compressed data
(CD) will use 3 sectors (512B). The 3 sectors are the payload. Payload will be
stored at sector B.

---------------------------------------------------
... | CD1 | CD2 | CD3 |   |   |   |   |    | ...
---------------------------------------------------
    ^B    ^B+1  ^B+2                  ^B+7 ^B+8

For this block, we will not use sector B+3 to B+7 (a hole). We use 4 meta bits
to present payload size. The compressed size (1500) isn't stored in meta
directly. Instead, we store it at the last 32bits of payload. In this example,
we store it at the end of sector B+2. If compressed size + sizeof(32bits)
crosses a sector, payload size will increase one sector. If payload uses 8
sectors, we store uncompressed data directly.

If IO size is bigger than one block, we can store the data as an extent. Data
of the whole extent will compressed and stored in the similar way like above.
The first block of the extent is the head, all others are the tail. If extent
is 1 block, the block is head. We have 1 bit of meta to present if a block is
head or tail. If 4 meta bits of head block can't store extent payload size, we
will borrow tail block meta bits to store payload size. Max allowd extent size
is 128k, so we don't compress/decompress too big size data.

Meta:
Modifying data will modify meta too. Meta will be written(flush) to disk
depending on meta write policy. We support writeback and writethrough mode. In
writeback mode, meta will be written to disk in an interval or a FLUSH request.
In writethrough mode, data and meta data will be written to disk together.

Advantages:
1. simple. Since we store compressed data in-place, we don't need complicated
disk data management.
2. efficient. For each 4k, we only need 5 bits meta. 1T data will use less than
200M meta, so we can load all meta into memory. And actual compression size is
in payload. So if IO doesn't need RMW and we use writeback meta flush, we don't
need extra IO for meta.

Disadvantages:
1. hole. Since we store compressed data in-place, there are a lot of holes (in
above example, B+3 - B+7) Hole can impact IO, because we can't do IO merge.
2. 1:1 size. Compression doesn't change disk size. If disk is 1T, we can only store
1T data even we do compression.

But this is for SSD only. Generally SSD firmware has a FTL layer to map disk
sectors to flash nand. High end SSD firmware has filesystem-like FTL.
1. hole. Disk has a lot of holes, but SSD FTL can still store data continuous
in nand. Even we can't do IO merge in OS layer, SSD firmware can do it.
2. 1:1 size. On one side, we write compressed data to SSD, which means less
data is written to SSD. This will be very helpful to improve SSD garbage
collection, and so write speed and life cycle. So even this is a problem, the
target is still helpful. On the other side, advanced SSD FTL can easily do thin
provision. For example, if nand is 1T and we let SSD report it as 2T, and use
the SSD as compressed target. In such SSD, we don't have the 1:1 size issue.

So if SSD FTL can map non-continuous disk sectors to continuous nand and
support thin provision, the compressed target will work very well.

V2->V3:
Updated with new bio iter API

V1->V2:
1. Change name to insitu_comp, cleanup code, add comments and doc
2. Improve performance (extent locking, dedicated workqueue)

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 Documentation/device-mapper/insitu-comp.txt |   50 
 drivers/md/Kconfig                          |    6 
 drivers/md/Makefile                         |    1 
 drivers/md/dm-insitu-comp.c                 | 1480 ++++++++++++++++++++++++++++
 drivers/md/dm-insitu-comp.h                 |  158 ++
 5 files changed, 1695 insertions(+)

Index: linux/drivers/md/Kconfig
===================================================================
--- linux.orig/drivers/md/Kconfig	2014-02-17 17:34:45.431464714 +0800
+++ linux/drivers/md/Kconfig	2014-02-17 17:34:45.423464815 +0800
@@ -295,6 +295,12 @@ config DM_CACHE_CLEANER
          A simple cache policy that writes back all data to the
          origin.  Used when decommissioning a dm-cache.
 
+config DM_INSITU_COMPRESSION
+       tristate "Insitu compression target"
+       depends on BLK_DEV_DM
+       ---help---
+         Allow volume managers to insitu compress data for SSD.
+
 config DM_MIRROR
        tristate "Mirror target"
        depends on BLK_DEV_DM
Index: linux/drivers/md/Makefile
===================================================================
--- linux.orig/drivers/md/Makefile	2014-02-17 17:34:45.431464714 +0800
+++ linux/drivers/md/Makefile	2014-02-17 17:34:45.423464815 +0800
@@ -53,6 +53,7 @@ obj-$(CONFIG_DM_VERITY)		+= dm-verity.o
 obj-$(CONFIG_DM_CACHE)		+= dm-cache.o
 obj-$(CONFIG_DM_CACHE_MQ)	+= dm-cache-mq.o
 obj-$(CONFIG_DM_CACHE_CLEANER)	+= dm-cache-cleaner.o
+obj-$(CONFIG_DM_INSITU_COMPRESSION)		+= dm-insitu-comp.o
 
 ifeq ($(CONFIG_DM_UEVENT),y)
 dm-mod-objs			+= dm-uevent.o
Index: linux/drivers/md/dm-insitu-comp.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux/drivers/md/dm-insitu-comp.c	2014-02-17 20:16:38.093360018 +0800
@@ -0,0 +1,1480 @@
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/blkdev.h>
+#include <linux/bio.h>
+#include <linux/slab.h>
+#include <linux/device-mapper.h>
+#include <linux/dm-io.h>
+#include <linux/crypto.h>
+#include <linux/lzo.h>
+#include <linux/kthread.h>
+#include <linux/page-flags.h>
+#include <linux/completion.h>
+#include "dm-insitu-comp.h"
+
+#define DM_MSG_PREFIX "dm_insitu_comp"
+
+static struct insitu_comp_compressor_data compressors[] = {
+	[INSITU_COMP_ALG_LZO] = {
+		.name = "lzo",
+		.comp_len = lzo_comp_len,
+	},
+	[INSITU_COMP_ALG_ZLIB] = {
+		.name = "deflate",
+	},
+};
+static int default_compressor;
+
+static struct kmem_cache *insitu_comp_io_range_cachep;
+static struct kmem_cache *insitu_comp_meta_io_cachep;
+
+static struct insitu_comp_io_worker insitu_comp_io_workers[NR_CPUS];
+static struct workqueue_struct *insitu_comp_wq;
+
+/* each block has 5 bits metadata */
+static u8 insitu_comp_get_meta(struct insitu_comp_info *info, u64 block_index)
+{
+	u64 first_bit = block_index * INSITU_COMP_META_BITS;
+	int bits, offset;
+	u8 data, ret = 0;
+
+	offset = first_bit & 7;
+	bits = min_t(u8, INSITU_COMP_META_BITS, 8 - offset);
+
+	data = info->meta_bitmap[first_bit >> 3];
+	ret = (data >> offset) & ((1 << bits) - 1);
+
+	if (bits < INSITU_COMP_META_BITS) {
+		data = info->meta_bitmap[(first_bit >> 3) + 1];
+		bits = INSITU_COMP_META_BITS - bits;
+		ret |= (data & ((1 << bits) - 1)) <<
+			(INSITU_COMP_META_BITS - bits);
+	}
+	return ret;
+}
+
+static void insitu_comp_set_meta(struct insitu_comp_info *info,
+	u64 block_index, u8 meta, bool dirty_meta)
+{
+	u64 first_bit = block_index * INSITU_COMP_META_BITS;
+	int bits, offset;
+	u8 data;
+	struct page *page;
+
+	offset = first_bit & 7;
+	bits = min_t(u8, INSITU_COMP_META_BITS, 8 - offset);
+
+	data = info->meta_bitmap[first_bit >> 3];
+	data &= ~(((1 << bits) - 1) << offset);
+	data |= (meta & ((1 << bits) - 1)) << offset;
+	info->meta_bitmap[first_bit >> 3] = data;
+
+	/*
+	 * For writethrough, we write metadata directly. For writeback, if
+	 * request is FUA, we do this too; otherwise we just dirty the page,
+	 * which will be flush out in an interval
+	 */
+	if (info->write_mode == INSITU_COMP_WRITE_BACK) {
+		page = vmalloc_to_page(&info->meta_bitmap[first_bit >> 3]);
+		if (dirty_meta)
+			SetPageDirty(page);
+		else
+			ClearPageDirty(page);
+	}
+
+	if (bits < INSITU_COMP_META_BITS) {
+		meta >>= bits;
+		data = info->meta_bitmap[(first_bit >> 3) + 1];
+		bits = INSITU_COMP_META_BITS - bits;
+		data = (data >> bits) << bits;
+		data |= meta & ((1 << bits) - 1);
+		info->meta_bitmap[(first_bit >> 3) + 1] = data;
+
+		if (info->write_mode == INSITU_COMP_WRITE_BACK) {
+			page = vmalloc_to_page(&info->meta_bitmap[
+						(first_bit >> 3) + 1]);
+			if (dirty_meta)
+				SetPageDirty(page);
+			else
+				ClearPageDirty(page);
+		}
+	}
+}
+
+/*
+ * set metadata for an extent since block @block_index, length is
+ * @logical_blocks.  The extent uses @data_sectors sectors
+ */
+static void insitu_comp_set_extent(struct insitu_comp_req *req,
+	u64 block_index, u16 logical_blocks, sector_t data_sectors)
+{
+	int i;
+	u8 data;
+
+	for (i = 0; i < logical_blocks; i++) {
+		data = min_t(sector_t, data_sectors, 8);
+		data_sectors -= data;
+		if (i != 0)
+			data |= INSITU_COMP_TAIL_MASK;
+		/* For FUA, we write out meta data directly */
+		insitu_comp_set_meta(req->info, block_index + i, data,
+					!(insitu_req_rw(req) & REQ_FUA));
+	}
+}
+
+/*
+ * get metadata for an extent covering block @block_index. @first_block_index
+ * returns the first block of the extent. @logical_sectors returns the extent
+ * length. @data_sectors returns the sectors the extent uses
+ */
+static void insitu_comp_get_extent(struct insitu_comp_info *info,
+	u64 block_index, u64 *first_block_index, u16 *logical_sectors,
+	u16 *data_sectors)
+{
+	u8 data;
+
+	data = insitu_comp_get_meta(info, block_index);
+	while (data & INSITU_COMP_TAIL_MASK) {
+		block_index--;
+		data = insitu_comp_get_meta(info, block_index);
+	}
+	*first_block_index = block_index;
+	*logical_sectors = INSITU_COMP_BLOCK_SIZE >> 9;
+	*data_sectors = data & INSITU_COMP_LENGTH_MASK;
+	block_index++;
+	while (block_index < info->data_blocks) {
+		data = insitu_comp_get_meta(info, block_index);
+		if (!(data & INSITU_COMP_TAIL_MASK))
+			break;
+		*logical_sectors += INSITU_COMP_BLOCK_SIZE >> 9;
+		*data_sectors += data & INSITU_COMP_LENGTH_MASK;
+		block_index++;
+	}
+}
+
+static int insitu_comp_access_super(struct insitu_comp_info *info,
+	void *addr, int rw)
+{
+	struct dm_io_region region;
+	struct dm_io_request req;
+	unsigned long io_error = 0;
+	int ret;
+
+	region.bdev = info->dev->bdev;
+	region.sector = 0;
+	region.count = INSITU_COMP_BLOCK_SIZE >> 9;
+
+	req.bi_rw = rw;
+	req.mem.type = DM_IO_KMEM;
+	req.mem.offset = 0;
+	req.mem.ptr.addr = addr;
+	req.notify.fn = NULL;
+	req.client = info->io_client;
+
+	ret = dm_io(&req, 1, &region, &io_error);
+	if (ret || io_error)
+		return -EIO;
+	return 0;
+}
+
+static void insitu_comp_meta_io_done(unsigned long error, void *context)
+{
+	struct insitu_comp_meta_io *meta_io = context;
+
+	meta_io->fn(meta_io->data, error);
+	kmem_cache_free(insitu_comp_meta_io_cachep, meta_io);
+}
+
+static int insitu_comp_write_meta(struct insitu_comp_info *info,
+	u64 start_page, u64 end_page, void *data,
+	void (*fn)(void *data, unsigned long error), int rw)
+{
+	struct insitu_comp_meta_io *meta_io;
+
+	BUG_ON(end_page > info->meta_bitmap_pages);
+
+	meta_io = kmem_cache_alloc(insitu_comp_meta_io_cachep, GFP_NOIO);
+	if (!meta_io) {
+		fn(data, -ENOMEM);
+		return -ENOMEM;
+	}
+	meta_io->data = data;
+	meta_io->fn = fn;
+
+	meta_io->io_region.bdev = info->dev->bdev;
+	meta_io->io_region.sector = INSITU_COMP_META_START_SECTOR +
+					(start_page << (PAGE_SHIFT - 9));
+	meta_io->io_region.count = (end_page - start_page) << (PAGE_SHIFT - 9);
+
+	atomic64_add(meta_io->io_region.count << 9, &info->meta_write_size);
+
+	meta_io->io_req.bi_rw = rw;
+	meta_io->io_req.mem.type = DM_IO_VMA;
+	meta_io->io_req.mem.offset = 0;
+	meta_io->io_req.mem.ptr.addr = info->meta_bitmap +
+						(start_page << PAGE_SHIFT);
+	meta_io->io_req.notify.fn = insitu_comp_meta_io_done;
+	meta_io->io_req.notify.context = meta_io;
+	meta_io->io_req.client = info->io_client;
+
+	dm_io(&meta_io->io_req, 1, &meta_io->io_region, NULL);
+	return 0;
+}
+
+struct writeback_flush_data {
+	struct completion complete;
+	atomic_t cnt;
+};
+
+static void writeback_flush_io_done(void *data, unsigned long error)
+{
+	struct writeback_flush_data *wb = data;
+
+	if (atomic_dec_return(&wb->cnt))
+		return;
+	complete(&wb->complete);
+}
+
+static void insitu_comp_flush_dirty_meta(struct insitu_comp_info *info,
+			struct writeback_flush_data *data)
+{
+	struct page *page;
+	u64 start = 0, index;
+	u32 pending = 0, cnt = 0;
+	bool dirty;
+	struct blk_plug plug;
+
+	blk_start_plug(&plug);
+	for (index = 0; index < info->meta_bitmap_pages; index++, cnt++) {
+		if (cnt == 256) {
+			cnt = 0;
+			cond_resched();
+		}
+
+		page = vmalloc_to_page(info->meta_bitmap +
+					(index << PAGE_SHIFT));
+		dirty = TestClearPageDirty(page);
+
+		if (pending == 0 && dirty) {
+			start = index;
+			pending++;
+			continue;
+		} else if (pending == 0)
+			continue;
+		else if (pending > 0 && dirty) {
+			pending++;
+			continue;
+		}
+
+		/* pending > 0 && !dirty */
+		atomic_inc(&data->cnt);
+		insitu_comp_write_meta(info, start, start + pending, data,
+			writeback_flush_io_done, WRITE);
+		pending = 0;
+	}
+
+	if (pending > 0) {
+		atomic_inc(&data->cnt);
+		insitu_comp_write_meta(info, start, start + pending, data,
+			writeback_flush_io_done, WRITE);
+	}
+	blkdev_issue_flush(info->dev->bdev, GFP_NOIO, NULL);
+	blk_finish_plug(&plug);
+}
+
+/* writeback thread flushs all dirty metadata to disk in an interval */
+static int insitu_comp_meta_writeback_thread(void *data)
+{
+	struct insitu_comp_info *info = data;
+	struct writeback_flush_data wb;
+
+	atomic_set(&wb.cnt, 1);
+	init_completion(&wb.complete);
+
+	while (!kthread_should_stop()) {
+		schedule_timeout_interruptible(
+			msecs_to_jiffies(info->writeback_delay * 1000));
+		insitu_comp_flush_dirty_meta(info, &wb);
+	}
+
+	insitu_comp_flush_dirty_meta(info, &wb);
+
+	writeback_flush_io_done(&wb, 0);
+	wait_for_completion(&wb.complete);
+	return 0;
+}
+
+static int insitu_comp_init_meta(struct insitu_comp_info *info, bool new)
+{
+	struct dm_io_region region;
+	struct dm_io_request req;
+	unsigned long io_error = 0;
+	struct blk_plug plug;
+	int ret;
+	ssize_t len = DIV_ROUND_UP_ULL(info->meta_bitmap_bits, BITS_PER_LONG);
+
+	len *= sizeof(unsigned long);
+
+	region.bdev = info->dev->bdev;
+	region.sector = INSITU_COMP_META_START_SECTOR;
+	region.count = (len + 511) >> 9;
+
+	req.mem.type = DM_IO_VMA;
+	req.mem.offset = 0;
+	req.mem.ptr.addr = info->meta_bitmap;
+	req.notify.fn = NULL;
+	req.client = info->io_client;
+
+	blk_start_plug(&plug);
+	if (new) {
+		memset(info->meta_bitmap, 0, len);
+		req.bi_rw = WRITE_FLUSH;
+		ret = dm_io(&req, 1, &region, &io_error);
+	} else {
+		req.bi_rw = READ;
+		ret = dm_io(&req, 1, &region, &io_error);
+	}
+	blk_finish_plug(&plug);
+
+	if (ret || io_error) {
+		info->ti->error = "Access metadata error";
+		return -EIO;
+	}
+
+	if (info->write_mode == INSITU_COMP_WRITE_BACK) {
+		info->writeback_tsk = kthread_run(
+			insitu_comp_meta_writeback_thread,
+			info, "insitu_comp_writeback");
+		if (!info->writeback_tsk) {
+			info->ti->error = "Create writeback thread error";
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+static int insitu_comp_alloc_compressor(struct insitu_comp_info *info)
+{
+	int i;
+
+	for_each_possible_cpu(i) {
+		info->tfm[i] = crypto_alloc_comp(
+			compressors[info->comp_alg].name, 0, 0);
+		if (IS_ERR(info->tfm[i])) {
+			info->tfm[i] = NULL;
+			goto err;
+		}
+	}
+	return 0;
+err:
+	for_each_possible_cpu(i) {
+		if (info->tfm[i]) {
+			crypto_free_comp(info->tfm[i]);
+			info->tfm[i] = NULL;
+		}
+	}
+	return -ENOMEM;
+}
+
+static void insitu_comp_free_compressor(struct insitu_comp_info *info)
+{
+	int i;
+
+	for_each_possible_cpu(i) {
+		if (info->tfm[i]) {
+			crypto_free_comp(info->tfm[i]);
+			info->tfm[i] = NULL;
+		}
+	}
+}
+
+static int insitu_comp_read_or_create_super(struct insitu_comp_info *info)
+{
+	void *addr;
+	struct insitu_comp_super_block *super;
+	u64 total_blocks;
+	u64 data_blocks, meta_blocks;
+	u32 rem, cnt;
+	bool new_super = false;
+	int ret;
+	ssize_t len;
+
+	total_blocks = i_size_read(info->dev->bdev->bd_inode) >>
+					INSITU_COMP_BLOCK_SHIFT;
+	data_blocks = total_blocks - 1;
+	rem = do_div(data_blocks, INSITU_COMP_BLOCK_SIZE * 8 +
+			INSITU_COMP_META_BITS);
+	meta_blocks = data_blocks * INSITU_COMP_META_BITS;
+	data_blocks *= INSITU_COMP_BLOCK_SIZE * 8;
+
+	cnt = rem;
+	rem /= (INSITU_COMP_BLOCK_SIZE * 8 / INSITU_COMP_META_BITS + 1);
+	data_blocks += rem * (INSITU_COMP_BLOCK_SIZE * 8 /
+				INSITU_COMP_META_BITS);
+	meta_blocks += rem;
+
+	cnt %= (INSITU_COMP_BLOCK_SIZE * 8 / INSITU_COMP_META_BITS + 1);
+	meta_blocks += 1;
+	data_blocks += cnt - 1;
+
+	info->data_blocks = data_blocks;
+	info->data_start = (1 + meta_blocks) << INSITU_COMP_BLOCK_SECTOR_SHIFT;
+
+	addr = kzalloc(INSITU_COMP_BLOCK_SIZE, GFP_KERNEL);
+	if (!addr) {
+		info->ti->error = "Cannot allocate super";
+		return -ENOMEM;
+	}
+
+	super = addr;
+	ret = insitu_comp_access_super(info, addr, READ);
+	if (ret)
+		goto out;
+
+	if (le64_to_cpu(super->magic) == INSITU_COMP_SUPER_MAGIC) {
+		if (le64_to_cpu(super->version) != INSITU_COMP_VERSION ||
+		    le64_to_cpu(super->meta_blocks) != meta_blocks ||
+		    le64_to_cpu(super->data_blocks) != data_blocks) {
+			info->ti->error = "Super is invalid";
+			ret = -EINVAL;
+			goto out;
+		}
+		if (!crypto_has_comp(compressors[super->comp_alg].name, 0, 0)) {
+			info->ti->error =
+					"Compressor algorithm doesn't support";
+			ret = -EINVAL;
+			goto out;
+		}
+	} else {
+		super->magic = cpu_to_le64(INSITU_COMP_SUPER_MAGIC);
+		super->version = cpu_to_le64(INSITU_COMP_VERSION);
+		super->meta_blocks = cpu_to_le64(meta_blocks);
+		super->data_blocks = cpu_to_le64(data_blocks);
+		super->comp_alg = default_compressor;
+		ret = insitu_comp_access_super(info, addr, WRITE_FUA);
+		if (ret) {
+			info->ti->error = "Access super fails";
+			goto out;
+		}
+		new_super = true;
+	}
+
+	info->comp_alg = super->comp_alg;
+	if (insitu_comp_alloc_compressor(info)) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	info->meta_bitmap_bits = data_blocks * INSITU_COMP_META_BITS;
+	len = DIV_ROUND_UP_ULL(info->meta_bitmap_bits, BITS_PER_LONG);
+	len *= sizeof(unsigned long);
+	info->meta_bitmap_pages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	info->meta_bitmap = vmalloc(info->meta_bitmap_pages * PAGE_SIZE);
+	if (!info->meta_bitmap) {
+		ret = -ENOMEM;
+		goto bitmap_err;
+	}
+
+	ret = insitu_comp_init_meta(info, new_super);
+	if (ret)
+		goto meta_err;
+
+	return 0;
+meta_err:
+	vfree(info->meta_bitmap);
+bitmap_err:
+	insitu_comp_free_compressor(info);
+out:
+	kfree(addr);
+	return ret;
+}
+
+/*
+ * <dev> <writethough>/<writeback> <meta_commit_delay>
+ */
+static int insitu_comp_ctr(struct dm_target *ti, unsigned int argc, char **argv)
+{
+	struct insitu_comp_info *info;
+	char write_mode[15];
+	int ret, i;
+
+	if (argc < 2) {
+		ti->error = "Invalid argument count";
+		return -EINVAL;
+	}
+
+	info = kzalloc(sizeof(*info), GFP_KERNEL);
+	if (!info) {
+		ti->error = "Cannot allocate context";
+		return -ENOMEM;
+	}
+	info->ti = ti;
+
+	if (sscanf(argv[1], "%s", write_mode) != 1) {
+		ti->error = "Invalid argument";
+		ret = -EINVAL;
+		goto err_para;
+	}
+
+	if (strcmp(write_mode, "writeback") == 0) {
+		if (argc != 3) {
+			ti->error = "Invalid argument";
+			ret = -EINVAL;
+			goto err_para;
+		}
+		info->write_mode = INSITU_COMP_WRITE_BACK;
+		if (sscanf(argv[2], "%u", &info->writeback_delay) != 1) {
+			ti->error = "Invalid argument";
+			ret = -EINVAL;
+			goto err_para;
+		}
+	} else if (strcmp(write_mode, "writethrough") == 0) {
+		info->write_mode = INSITU_COMP_WRITE_THROUGH;
+	} else {
+		ti->error = "Invalid argument";
+		ret = -EINVAL;
+		goto err_para;
+	}
+
+	if (dm_get_device(ti, argv[0], dm_table_get_mode(ti->table),
+							&info->dev)) {
+		ti->error = "Can't get device";
+		ret = -EINVAL;
+		goto err_para;
+	}
+
+	info->io_client = dm_io_client_create();
+	if (!info->io_client) {
+		ti->error = "Can't create io client";
+		ret = -EINVAL;
+		goto err_ioclient;
+	}
+
+	if (bdev_logical_block_size(info->dev->bdev) != 512) {
+		ti->error = "Can't logical block size too big";
+		ret = -EINVAL;
+		goto err_blocksize;
+	}
+
+	ret = insitu_comp_read_or_create_super(info);
+	if (ret)
+		goto err_blocksize;
+
+	for (i = 0; i < BITMAP_HASH_LEN; i++) {
+		info->bitmap_locks[i].io_running = 0;
+		spin_lock_init(&info->bitmap_locks[i].wait_lock);
+		INIT_LIST_HEAD(&info->bitmap_locks[i].wait_list);
+	}
+
+	atomic64_set(&info->compressed_write_size, 0);
+	atomic64_set(&info->uncompressed_write_size, 0);
+	atomic64_set(&info->meta_write_size, 0);
+	ti->num_flush_bios = 1;
+	/* doesn't support discard yet */
+	ti->per_bio_data_size = sizeof(struct insitu_comp_req);
+	ti->private = info;
+	return 0;
+err_blocksize:
+	dm_io_client_destroy(info->io_client);
+err_ioclient:
+	dm_put_device(ti, info->dev);
+err_para:
+	kfree(info);
+	return ret;
+}
+
+static void insitu_comp_dtr(struct dm_target *ti)
+{
+	struct insitu_comp_info *info = ti->private;
+
+	if (info->write_mode == INSITU_COMP_WRITE_BACK)
+		kthread_stop(info->writeback_tsk);
+	insitu_comp_free_compressor(info);
+	vfree(info->meta_bitmap);
+	dm_io_client_destroy(info->io_client);
+	dm_put_device(ti, info->dev);
+	kfree(info);
+}
+
+static u64 insitu_comp_sector_to_block(sector_t sect)
+{
+	return sect >> INSITU_COMP_BLOCK_SECTOR_SHIFT;
+}
+
+static struct insitu_comp_hash_lock *
+insitu_comp_block_hash_lock(struct insitu_comp_info *info, u64 block_index)
+{
+	return &info->bitmap_locks[(block_index >> HASH_LOCK_SHIFT) &
+			BITMAP_HASH_MASK];
+}
+
+static struct insitu_comp_hash_lock *
+insitu_comp_trylock_block(struct insitu_comp_info *info,
+	struct insitu_comp_req *req, u64 block_index)
+{
+	struct insitu_comp_hash_lock *hash_lock;
+
+	hash_lock = insitu_comp_block_hash_lock(req->info, block_index);
+
+	spin_lock_irq(&hash_lock->wait_lock);
+	if (!hash_lock->io_running) {
+		hash_lock->io_running = 1;
+		spin_unlock_irq(&hash_lock->wait_lock);
+		return hash_lock;
+	}
+	list_add_tail(&req->sibling, &hash_lock->wait_list);
+	spin_unlock_irq(&hash_lock->wait_lock);
+	return NULL;
+}
+
+static void insitu_comp_queue_req_list(struct insitu_comp_info *info,
+	struct list_head *list);
+static void insitu_comp_unlock_block(struct insitu_comp_info *info,
+	struct insitu_comp_req *req, struct insitu_comp_hash_lock *hash_lock)
+{
+	LIST_HEAD(pending_list);
+	unsigned long flags;
+
+	spin_lock_irqsave(&hash_lock->wait_lock, flags);
+	/* wakeup all pending reqs to avoid live lock */
+	list_splice_init(&hash_lock->wait_list, &pending_list);
+	hash_lock->io_running = 0;
+	spin_unlock_irqrestore(&hash_lock->wait_lock, flags);
+
+	insitu_comp_queue_req_list(info, &pending_list);
+}
+
+static void insitu_comp_unlock_req_range(struct insitu_comp_req *req)
+{
+	insitu_comp_unlock_block(req->info, req, req->lock);
+}
+
+/* Check comments of HASH_LOCK_SHIFT. each request only need take one lock */
+static int insitu_comp_lock_req_range(struct insitu_comp_req *req)
+{
+	u64 block_index, tmp;
+
+	block_index = insitu_comp_sector_to_block(insitu_req_start_sector(req));
+	tmp = insitu_comp_sector_to_block(insitu_req_end_sector(req) - 1);
+	BUG_ON(insitu_comp_block_hash_lock(req->info, block_index) !=
+			insitu_comp_block_hash_lock(req->info, tmp));
+
+	req->lock = insitu_comp_trylock_block(req->info, req, block_index);
+	if (!req->lock)
+		return 0;
+
+	return 1;
+}
+
+static void insitu_comp_queue_req(struct insitu_comp_info *info,
+	struct insitu_comp_req *req)
+{
+	unsigned long flags;
+	struct insitu_comp_io_worker *worker =
+		&insitu_comp_io_workers[req->cpu];
+
+	spin_lock_irqsave(&worker->lock, flags);
+	list_add_tail(&req->sibling, &worker->pending);
+	spin_unlock_irqrestore(&worker->lock, flags);
+
+	queue_work_on(req->cpu, insitu_comp_wq, &worker->work);
+}
+
+static void insitu_comp_queue_req_list(struct insitu_comp_info *info,
+	struct list_head *list)
+{
+	struct insitu_comp_req *req;
+	while (!list_empty(list)) {
+		req = list_first_entry(list, struct insitu_comp_req, sibling);
+		list_del_init(&req->sibling);
+		insitu_comp_queue_req(info, req);
+	}
+}
+
+static void insitu_comp_get_req(struct insitu_comp_req *req)
+{
+	atomic_inc(&req->io_pending);
+}
+
+static void insitu_comp_free_io_range(struct insitu_comp_io_range *io)
+{
+	kfree(io->decomp_data);
+	kfree(io->comp_data);
+	kmem_cache_free(insitu_comp_io_range_cachep, io);
+}
+
+static void insitu_comp_put_req(struct insitu_comp_req *req)
+{
+	struct insitu_comp_io_range *io;
+
+	if (atomic_dec_return(&req->io_pending))
+		return;
+
+	if (req->stage == STAGE_INIT) /* waiting for locking */
+		return;
+
+	if (req->stage == STAGE_READ_DECOMP ||
+	    req->stage == STAGE_WRITE_COMP ||
+	    req->result)
+		req->stage = STAGE_DONE;
+
+	if (req->stage != STAGE_DONE) {
+		insitu_comp_queue_req(req->info, req);
+		return;
+	}
+
+	while (!list_empty(&req->all_io)) {
+		io = list_entry(req->all_io.next, struct insitu_comp_io_range,
+			next);
+		list_del(&io->next);
+		insitu_comp_free_io_range(io);
+	}
+
+	insitu_comp_unlock_req_range(req);
+
+	insitu_req_endio(req, req->result);
+}
+
+static void insitu_comp_io_range_done(unsigned long error, void *context)
+{
+	struct insitu_comp_io_range *io = context;
+
+	if (error)
+		io->req->result = error;
+	insitu_comp_put_req(io->req);
+}
+
+static inline int insitu_comp_compressor_len(struct insitu_comp_info *info,
+	int len)
+{
+	if (compressors[info->comp_alg].comp_len)
+		return compressors[info->comp_alg].comp_len(len);
+	return len;
+}
+
+/*
+ * caller should set region.sector, region.count. bi_rw. IO always to/from
+ * comp_data
+ */
+static struct insitu_comp_io_range *
+insitu_comp_create_io_range(struct insitu_comp_req *req, int comp_len,
+	int decomp_len)
+{
+	struct insitu_comp_io_range *io;
+
+	io = kmem_cache_alloc(insitu_comp_io_range_cachep, GFP_NOIO);
+	if (!io)
+		return NULL;
+
+	io->comp_data = kmalloc(insitu_comp_compressor_len(req->info, comp_len),
+								GFP_NOIO);
+	io->decomp_data = kmalloc(decomp_len, GFP_NOIO);
+	if (!io->decomp_data || !io->comp_data) {
+		kfree(io->decomp_data);
+		kfree(io->comp_data);
+		kmem_cache_free(insitu_comp_io_range_cachep, io);
+		return NULL;
+	}
+
+	io->io_req.notify.fn = insitu_comp_io_range_done;
+	io->io_req.notify.context = io;
+	io->io_req.client = req->info->io_client;
+	io->io_req.mem.type = DM_IO_KMEM;
+	io->io_req.mem.ptr.addr = io->comp_data;
+	io->io_req.mem.offset = 0;
+
+	io->io_region.bdev = req->info->dev->bdev;
+
+	io->decomp_len = decomp_len;
+	io->comp_len = comp_len;
+	io->req = req;
+	return io;
+}
+
+static void insitu_comp_req_copy(struct insitu_comp_req *req, off_t req_off, void *buf,
+		ssize_t len, bool to_buf)
+{
+	struct bio *bio = req->bio;
+	struct bvec_iter iter;
+	off_t buf_off = 0;
+	ssize_t size;
+	void *addr;
+
+	iter = bio->bi_iter;
+	bio_advance_iter(bio, &iter, req_off);
+
+	while (len) {
+		addr = kmap_atomic(bio_iter_page(bio, iter));
+		size = min_t(ssize_t, len, bio_iter_len(bio, iter));
+		if (to_buf)
+			memcpy(buf + buf_off, addr + bio_iter_offset(bio, iter),
+				size);
+		else
+			memcpy(addr + bio_iter_offset(bio, iter), buf + buf_off,
+				size);
+		kunmap_atomic(addr);
+
+		buf_off += size;
+		len -= size;
+
+		bio_advance_iter(bio, &iter, size);
+	}
+}
+
+/*
+ * return value:
+ * < 0 : error
+ * == 0 : ok
+ * == 1 : ok, but comp/decomp is skipped
+ * Compressed data size is roundup of 512, which makes the payload.
+ * We store the actual compressed length in the last u32 of the payload.
+ * If there is no free space, we add 512 to the payload size.
+ */
+static int insitu_comp_io_range_comp(struct insitu_comp_info *info,
+	void *comp_data, unsigned int *comp_len, void *decomp_data,
+	unsigned int decomp_len, bool do_comp)
+{
+	struct crypto_comp *tfm;
+	u32 *addr;
+	unsigned int actual_comp_len;
+	int ret;
+
+	if (do_comp) {
+		actual_comp_len = *comp_len;
+
+		tfm = info->tfm[get_cpu()];
+		ret = crypto_comp_compress(tfm, decomp_data, decomp_len,
+			comp_data, &actual_comp_len);
+		put_cpu();
+
+		atomic64_add(decomp_len, &info->uncompressed_write_size);
+		if (ret || decomp_len < actual_comp_len + sizeof(u32) + 512) {
+			*comp_len = decomp_len;
+			atomic64_add(*comp_len, &info->compressed_write_size);
+			return 1;
+		}
+
+		*comp_len = round_up(actual_comp_len, 512);
+		if (*comp_len - actual_comp_len < sizeof(u32))
+			*comp_len += 512;
+		atomic64_add(*comp_len, &info->compressed_write_size);
+		addr = comp_data + *comp_len;
+		addr--;
+		*addr = cpu_to_le32(actual_comp_len);
+	} else {
+		if (*comp_len == decomp_len)
+			return 1;
+		addr = comp_data + *comp_len;
+		addr--;
+		actual_comp_len = le32_to_cpu(*addr);
+
+		tfm = info->tfm[get_cpu()];
+		ret = crypto_comp_decompress(tfm, comp_data, actual_comp_len,
+			decomp_data, &decomp_len);
+		put_cpu();
+		if (ret)
+			return -EINVAL;
+	}
+	return 0;
+}
+
+/*
+ * compressed data is updated. We decompress it and fill req. If there is no
+ * valid compressed data, we just zero req
+ */
+static void insitu_comp_handle_read_decomp(struct insitu_comp_req *req)
+{
+	struct insitu_comp_io_range *io;
+	off_t req_off = 0;
+	int ret;
+
+	req->stage = STAGE_READ_DECOMP;
+
+	if (req->result)
+		return;
+
+	list_for_each_entry(io, &req->all_io, next) {
+		ssize_t dst_off = 0, src_off = 0, len;
+
+		io->io_region.sector -= req->info->data_start;
+
+		/* Do decomp here */
+		ret = insitu_comp_io_range_comp(req->info, io->comp_data,
+			&io->comp_len, io->decomp_data, io->decomp_len, false);
+		if (ret < 0) {
+			req->result = -EIO;
+			return;
+		}
+
+		if (io->io_region.sector >= insitu_req_start_sector(req))
+			dst_off = (io->io_region.sector - insitu_req_start_sector(req))
+				<< 9;
+		else
+			src_off = (insitu_req_start_sector(req) - io->io_region.sector)
+				<< 9;
+		len = min_t(ssize_t, io->decomp_len - src_off,
+			(insitu_req_sectors(req) << 9) - dst_off);
+
+		/* io range in all_io list is ordered for read IO */
+		while (req_off != dst_off) {
+			ssize_t size = min_t(ssize_t, PAGE_SIZE,
+					dst_off - req_off);
+			insitu_comp_req_copy(req, req_off,
+				empty_zero_page, size, false);
+			req_off += size;
+		}
+
+		if (ret == 1) /* uncompressed, valid data is in .comp_data */
+			insitu_comp_req_copy(req, dst_off,
+					io->comp_data + src_off, len, false);
+		else
+			insitu_comp_req_copy(req, dst_off,
+					io->decomp_data + src_off, len, false);
+		req_off = dst_off + len;
+	}
+
+	while (req_off != (insitu_req_sectors(req) << 9)) {
+		ssize_t size = min_t(ssize_t, PAGE_SIZE,
+			(insitu_req_sectors(req) << 9) - req_off);
+		insitu_comp_req_copy(req, req_off, empty_zero_page,
+			size, false);
+		req_off += size;
+	}
+}
+
+/*
+ * read one extent data from disk. The extent starts from block @block and has
+ * @data_sectors data
+ */
+static void insitu_comp_read_one_extent(struct insitu_comp_req *req, u64 block,
+	u16 logical_sectors, u16 data_sectors)
+{
+	struct insitu_comp_io_range *io;
+
+	io = insitu_comp_create_io_range(req, data_sectors << 9,
+		logical_sectors << 9);
+	if (!io) {
+		req->result = -EIO;
+		return;
+	}
+
+	insitu_comp_get_req(req);
+	list_add_tail(&io->next, &req->all_io);
+
+	io->io_region.sector = (block << INSITU_COMP_BLOCK_SECTOR_SHIFT) +
+				req->info->data_start;
+	io->io_region.count = data_sectors;
+
+	io->io_req.bi_rw = READ;
+	dm_io(&io->io_req, 1, &io->io_region, NULL);
+}
+
+static void insitu_comp_handle_read_read_existing(struct insitu_comp_req *req)
+{
+	u64 block_index, first_block_index;
+	u16 logical_sectors, data_sectors;
+
+	req->stage = STAGE_READ_EXISTING;
+
+	block_index = insitu_comp_sector_to_block(insitu_req_start_sector(req));
+again:
+	insitu_comp_get_extent(req->info, block_index, &first_block_index,
+		&logical_sectors, &data_sectors);
+	if (data_sectors > 0)
+		insitu_comp_read_one_extent(req, first_block_index,
+			logical_sectors, data_sectors);
+
+	if (req->result)
+		return;
+
+	block_index = first_block_index + (logical_sectors >>
+				INSITU_COMP_BLOCK_SECTOR_SHIFT);
+	/* the request might cover several extents */
+	if ((block_index << INSITU_COMP_BLOCK_SECTOR_SHIFT) <
+			insitu_req_end_sector(req))
+		goto again;
+
+	/* A shortcut if all data is in already */
+	if (list_empty(&req->all_io))
+		insitu_comp_handle_read_decomp(req);
+}
+
+static void insitu_comp_handle_read_request(struct insitu_comp_req *req)
+{
+	insitu_comp_get_req(req);
+
+	if (req->stage == STAGE_INIT) {
+		if (!insitu_comp_lock_req_range(req)) {
+			insitu_comp_put_req(req);
+			return;
+		}
+
+		insitu_comp_handle_read_read_existing(req);
+	} else if (req->stage == STAGE_READ_EXISTING)
+		insitu_comp_handle_read_decomp(req);
+
+	insitu_comp_put_req(req);
+}
+
+static void insitu_comp_write_meta_done(void *context, unsigned long error)
+{
+	struct insitu_comp_req *req = context;
+	insitu_comp_put_req(req);
+}
+
+static u64 insitu_comp_block_meta_page_index(u64 block, bool end)
+{
+	u64 bits = block * INSITU_COMP_META_BITS - !!end;
+	/* (1 << 3) bits per byte */
+	return bits >> (3 + PAGE_SHIFT);
+}
+
+/*
+ * the request covers some extents partially. Decompress data of the extents,
+ * compress remaining valid data, and finally write them out
+ */
+static int insitu_comp_handle_write_modify(struct insitu_comp_io_range *io,
+	u64 *meta_start, u64 *meta_end, bool *handle_req)
+{
+	struct insitu_comp_req *req = io->req;
+	sector_t start, count;
+	unsigned int comp_len;
+	off_t offset;
+	u64 page_index;
+	int ret;
+
+	io->io_region.sector -= req->info->data_start;
+
+	/* decompress original data */
+	ret = insitu_comp_io_range_comp(req->info, io->comp_data, &io->comp_len,
+			io->decomp_data, io->decomp_len, false);
+	if (ret < 0) {
+		req->result = -EINVAL;
+		return -EIO;
+	}
+
+	start = io->io_region.sector;
+	count = io->decomp_len >> 9;
+	if (start < insitu_req_start_sector(req) && start + count >
+					insitu_req_end_sector(req)) {
+		/* we don't split an extent */
+		if (ret == 1) {
+			memcpy(io->decomp_data, io->comp_data, io->decomp_len);
+			insitu_comp_req_copy(req, 0,
+			   io->decomp_data + ((insitu_req_start_sector(req) - start) <<
+			   9), insitu_req_sectors(req) << 9, true);
+		} else {
+			insitu_comp_req_copy(req, 0,
+			   io->decomp_data + ((insitu_req_start_sector(req) - start) <<
+			   9), insitu_req_sectors(req) << 9, true);
+			kfree(io->comp_data);
+			/* New compressed len might be bigger */
+			io->comp_data = kmalloc(insitu_comp_compressor_len(
+				req->info, io->decomp_len), GFP_NOIO);
+			io->comp_len = io->decomp_len;
+			if (!io->comp_data) {
+				req->result = -ENOMEM;
+				return -EIO;
+			}
+			io->io_req.mem.ptr.addr = io->comp_data;
+		}
+		/* need compress data */
+		ret = 0;
+		offset = 0;
+		*handle_req = false;
+	} else if (start < insitu_req_start_sector(req)) {
+		count = insitu_req_start_sector(req) - start;
+		offset = 0;
+	} else {
+		offset = insitu_req_end_sector(req) - start;
+		start = insitu_req_end_sector(req);
+		count = count - offset;
+	}
+
+	/* Original data is uncompressed, we don't need writeback */
+	if (ret == 1) {
+		comp_len = count << 9;
+		goto handle_meta;
+	}
+
+	/* assume compress less data uses less space (at least 4k lsess data) */
+	comp_len = io->comp_len;
+	ret = insitu_comp_io_range_comp(req->info, io->comp_data, &comp_len,
+		io->decomp_data + (offset << 9), count << 9, true);
+	if (ret < 0) {
+		req->result = -EIO;
+		return -EIO;
+	}
+
+	insitu_comp_get_req(req);
+	if (ret == 1)
+		io->io_req.mem.ptr.addr = io->decomp_data + (offset << 9);
+	io->io_region.count = comp_len >> 9;
+	io->io_region.sector = start + req->info->data_start;
+
+	io->io_req.bi_rw = insitu_req_rw(req);
+	dm_io(&io->io_req, 1, &io->io_region, NULL);
+handle_meta:
+	insitu_comp_set_extent(req, start >> INSITU_COMP_BLOCK_SECTOR_SHIFT,
+		count >> INSITU_COMP_BLOCK_SECTOR_SHIFT, comp_len >> 9);
+
+	page_index = insitu_comp_block_meta_page_index(start >>
+					INSITU_COMP_BLOCK_SECTOR_SHIFT, false);
+	if (*meta_start > page_index)
+		*meta_start = page_index;
+	page_index = insitu_comp_block_meta_page_index(
+		(start + count) >> INSITU_COMP_BLOCK_SECTOR_SHIFT, true);
+	if (*meta_end < page_index)
+		*meta_end = page_index;
+	return 0;
+}
+
+/* Compress data and write it out */
+static void insitu_comp_handle_write_comp(struct insitu_comp_req *req)
+{
+	struct insitu_comp_io_range *io;
+	sector_t count;
+	unsigned int comp_len;
+	u64 meta_start = -1L, meta_end = 0, page_index;
+	int ret;
+	bool handle_req = true;
+
+	req->stage = STAGE_WRITE_COMP;
+
+	if (req->result)
+		return;
+
+	list_for_each_entry(io, &req->all_io, next) {
+		if (insitu_comp_handle_write_modify(io, &meta_start, &meta_end,
+						&handle_req))
+			return;
+	}
+
+	if (!handle_req)
+		goto update_meta;
+
+	count = insitu_req_sectors(req);
+	io = insitu_comp_create_io_range(req, count << 9, count << 9);
+	if (!io) {
+		req->result = -EIO;
+		return;
+	}
+	insitu_comp_req_copy(req, 0, io->decomp_data, count << 9, true);
+
+	/* compress data */
+	comp_len = io->comp_len;
+	ret = insitu_comp_io_range_comp(req->info, io->comp_data, &comp_len,
+		io->decomp_data, count << 9, true);
+	if (ret < 0) {
+		insitu_comp_free_io_range(io);
+		req->result = -EIO;
+		return;
+	}
+
+	insitu_comp_get_req(req);
+	list_add_tail(&io->next, &req->all_io);
+	io->io_region.sector = insitu_req_start_sector(req) + req->info->data_start;
+	if (ret == 1)
+		io->io_req.mem.ptr.addr = io->decomp_data;
+	io->io_region.count = comp_len >> 9;
+	io->io_req.bi_rw = insitu_req_rw(req);
+	dm_io(&io->io_req, 1, &io->io_region, NULL);
+	insitu_comp_set_extent(req,
+		insitu_req_start_sector(req) >> INSITU_COMP_BLOCK_SECTOR_SHIFT,
+		count >> INSITU_COMP_BLOCK_SECTOR_SHIFT, comp_len >> 9);
+
+	page_index = insitu_comp_block_meta_page_index(
+		insitu_req_start_sector(req) >> INSITU_COMP_BLOCK_SECTOR_SHIFT, false);
+	if (meta_start > page_index)
+		meta_start = page_index;
+	page_index = insitu_comp_block_meta_page_index(
+		(insitu_req_start_sector(req) + count) >> INSITU_COMP_BLOCK_SECTOR_SHIFT,
+		true);
+	if (meta_end < page_index)
+		meta_end = page_index;
+update_meta:
+	if (req->info->write_mode == INSITU_COMP_WRITE_THROUGH ||
+						(insitu_req_rw(req) & REQ_FUA)) {
+		insitu_comp_get_req(req);
+		insitu_comp_write_meta(req->info, meta_start, meta_end + 1, req,
+			insitu_comp_write_meta_done, insitu_req_rw(req));
+	}
+}
+
+/* request might cover some extents partially, read them first */
+static void insitu_comp_handle_write_read_existing(struct insitu_comp_req *req)
+{
+	u64 block_index, first_block_index;
+	u16 logical_sectors, data_sectors;
+
+	req->stage = STAGE_READ_EXISTING;
+
+	block_index = insitu_comp_sector_to_block(insitu_req_start_sector(req));
+	insitu_comp_get_extent(req->info, block_index, &first_block_index,
+		&logical_sectors, &data_sectors);
+	if (data_sectors > 0 && (first_block_index < block_index ||
+	    first_block_index + insitu_comp_sector_to_block(logical_sectors) >
+	    insitu_comp_sector_to_block(insitu_req_end_sector(req))))
+		insitu_comp_read_one_extent(req, first_block_index,
+			logical_sectors, data_sectors);
+
+	if (req->result)
+		return;
+
+	if (first_block_index + insitu_comp_sector_to_block(logical_sectors) >=
+	    insitu_comp_sector_to_block(insitu_req_end_sector(req)))
+		goto out;
+
+	block_index = insitu_comp_sector_to_block(insitu_req_end_sector(req)) - 1;
+	insitu_comp_get_extent(req->info, block_index, &first_block_index,
+		&logical_sectors, &data_sectors);
+	if (data_sectors > 0 &&
+	    first_block_index + insitu_comp_sector_to_block(logical_sectors) >
+	    block_index + 1)
+		insitu_comp_read_one_extent(req, first_block_index,
+			logical_sectors, data_sectors);
+
+	if (req->result)
+		return;
+out:
+	if (list_empty(&req->all_io))
+		insitu_comp_handle_write_comp(req);
+}
+
+static void insitu_comp_handle_write_request(struct insitu_comp_req *req)
+{
+	insitu_comp_get_req(req);
+
+	if (req->stage == STAGE_INIT) {
+		if (!insitu_comp_lock_req_range(req)) {
+			insitu_comp_put_req(req);
+			return;
+		}
+
+		insitu_comp_handle_write_read_existing(req);
+	} else if (req->stage == STAGE_READ_EXISTING)
+		insitu_comp_handle_write_comp(req);
+
+	insitu_comp_put_req(req);
+}
+
+/* For writeback mode */
+static void insitu_comp_handle_flush_request(struct insitu_comp_req *req)
+{
+	struct writeback_flush_data wb;
+
+	atomic_set(&wb.cnt, 1);
+	init_completion(&wb.complete);
+
+	insitu_comp_flush_dirty_meta(req->info, &wb);
+
+	writeback_flush_io_done(&wb, 0);
+	wait_for_completion(&wb.complete);
+
+	insitu_req_endio(req, 0);
+}
+
+static void insitu_comp_handle_request(struct insitu_comp_req *req)
+{
+	if (insitu_req_rw(req) & REQ_FLUSH)
+		insitu_comp_handle_flush_request(req);
+	else if (insitu_req_rw(req) & REQ_WRITE)
+		insitu_comp_handle_write_request(req);
+	else
+		insitu_comp_handle_read_request(req);
+}
+
+static void insitu_comp_do_request_work(struct work_struct *work)
+{
+	struct insitu_comp_io_worker *worker = container_of(work,
+			struct insitu_comp_io_worker, work);
+	LIST_HEAD(list);
+	struct insitu_comp_req *req;
+	struct blk_plug plug;
+	bool repeat;
+
+	blk_start_plug(&plug);
+again:
+	spin_lock_irq(&worker->lock);
+	list_splice_init(&worker->pending, &list);
+	spin_unlock_irq(&worker->lock);
+
+	repeat = !list_empty(&list);
+	while (!list_empty(&list)) {
+		req = list_first_entry(&list, struct insitu_comp_req, sibling);
+		list_del(&req->sibling);
+
+		insitu_comp_handle_request(req);
+	}
+	if (repeat)
+		goto again;
+	blk_finish_plug(&plug);
+}
+
+static int insitu_comp_map(struct dm_target *ti, struct bio *bio)
+{
+	struct insitu_comp_info *info = ti->private;
+	struct insitu_comp_req *req;
+
+	req = dm_per_bio_data(bio, sizeof(struct insitu_comp_req));
+
+	if ((bio->bi_rw & REQ_FLUSH) &&
+			info->write_mode == INSITU_COMP_WRITE_THROUGH) {
+		bio->bi_bdev = info->dev->bdev;
+		return DM_MAPIO_REMAPPED;
+	}
+
+	req->bio = bio;
+	req->info = info;
+	atomic_set(&req->io_pending, 0);
+	INIT_LIST_HEAD(&req->all_io);
+	req->result = 0;
+	req->stage = STAGE_INIT;
+
+	req->cpu = raw_smp_processor_id();
+	insitu_comp_queue_req(info, req);
+
+	return DM_MAPIO_SUBMITTED;
+}
+
+/*
+ * INFO: uncompressed_data_size compressed_data_size metadata_size
+ * TABLE: writethrough/writeback commit_delay
+ */
+static void insitu_comp_status(struct dm_target *ti, status_type_t type,
+			  unsigned status_flags, char *result, unsigned maxlen)
+{
+	struct insitu_comp_info *info = ti->private;
+	unsigned int sz = 0;
+
+	switch (type) {
+	case STATUSTYPE_INFO:
+		DMEMIT("%lu %lu %lu",
+			atomic64_read(&info->uncompressed_write_size),
+			atomic64_read(&info->compressed_write_size),
+			atomic64_read(&info->meta_write_size));
+		break;
+	case STATUSTYPE_TABLE:
+		if (info->write_mode == INSITU_COMP_WRITE_BACK)
+			DMEMIT("%s %s %d", info->dev->name, "writeback",
+				info->writeback_delay);
+		else
+			DMEMIT("%s %s", info->dev->name, "writethrough");
+		break;
+	}
+}
+
+static int insitu_comp_iterate_devices(struct dm_target *ti,
+				  iterate_devices_callout_fn fn, void *data)
+{
+	struct insitu_comp_info *info = ti->private;
+
+	return fn(ti, info->dev, info->data_start,
+		info->data_blocks << INSITU_COMP_BLOCK_SECTOR_SHIFT, data);
+}
+
+static void insitu_comp_io_hints(struct dm_target *ti,
+			    struct queue_limits *limits)
+{
+	/* No blk_limits_logical_block_size */
+	limits->logical_block_size = limits->physical_block_size =
+		limits->io_min = INSITU_COMP_BLOCK_SIZE;
+	blk_limits_max_hw_sectors(limits, INSITU_COMP_MAX_SIZE >> 9);
+}
+
+static int insitu_comp_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
+			struct bio_vec *biovec, int max_size)
+{
+	/* Guarantee request can only cover one aligned 128k range */
+	return min_t(int, max_size, INSITU_COMP_MAX_SIZE - bvm->bi_size -
+			((bvm->bi_sector << 9) % INSITU_COMP_MAX_SIZE));
+}
+
+static struct target_type insitu_comp_target = {
+	.name   = "insitu_comp",
+	.version = {1, 0, 0},
+	.module = THIS_MODULE,
+	.ctr    = insitu_comp_ctr,
+	.dtr    = insitu_comp_dtr,
+	.map    = insitu_comp_map,
+	.status = insitu_comp_status,
+	.iterate_devices = insitu_comp_iterate_devices,
+	.io_hints = insitu_comp_io_hints,
+	.merge = insitu_comp_merge,
+};
+
+static int __init insitu_comp_init(void)
+{
+	int r;
+
+	for (r = 0; r < ARRAY_SIZE(compressors); r++)
+		if (crypto_has_comp(compressors[r].name, 0, 0))
+			break;
+	if (r >= ARRAY_SIZE(compressors)) {
+		DMWARN("No crypto compressors are supported");
+		return -EINVAL;
+	}
+
+	default_compressor = r;
+
+	r = -ENOMEM;
+	insitu_comp_io_range_cachep = kmem_cache_create("insitu_comp_io_range",
+		sizeof(struct insitu_comp_io_range), 0, 0, NULL);
+	if (!insitu_comp_io_range_cachep) {
+		DMWARN("Can't create io_range cache");
+		goto err;
+	}
+
+	insitu_comp_meta_io_cachep = kmem_cache_create("insitu_comp_meta_io",
+		sizeof(struct insitu_comp_meta_io), 0, 0, NULL);
+	if (!insitu_comp_meta_io_cachep) {
+		DMWARN("Can't create meta_io cache");
+		goto err;
+	}
+
+	insitu_comp_wq = alloc_workqueue("insitu_comp_io",
+		WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_CPU_INTENSIVE, 0);
+	if (!insitu_comp_wq) {
+		DMWARN("Can't create io workqueue");
+		goto err;
+	}
+
+	r = dm_register_target(&insitu_comp_target);
+	if (r < 0) {
+		DMWARN("target registration failed");
+		goto err;
+	}
+
+	for_each_possible_cpu(r) {
+		INIT_LIST_HEAD(&insitu_comp_io_workers[r].pending);
+		spin_lock_init(&insitu_comp_io_workers[r].lock);
+		INIT_WORK(&insitu_comp_io_workers[r].work,
+			insitu_comp_do_request_work);
+	}
+	return 0;
+err:
+	if (insitu_comp_io_range_cachep)
+		kmem_cache_destroy(insitu_comp_io_range_cachep);
+	if (insitu_comp_meta_io_cachep)
+		kmem_cache_destroy(insitu_comp_meta_io_cachep);
+	if (insitu_comp_wq)
+		destroy_workqueue(insitu_comp_wq);
+
+	return r;
+}
+
+static void __exit insitu_comp_exit(void)
+{
+	dm_unregister_target(&insitu_comp_target);
+	kmem_cache_destroy(insitu_comp_io_range_cachep);
+	kmem_cache_destroy(insitu_comp_meta_io_cachep);
+	destroy_workqueue(insitu_comp_wq);
+}
+
+module_init(insitu_comp_init);
+module_exit(insitu_comp_exit);
+
+MODULE_AUTHOR("Shaohua Li <shli@kernel.org>");
+MODULE_DESCRIPTION(DM_NAME " target with insitu data compression for SSD");
+MODULE_LICENSE("GPL");
Index: linux/drivers/md/dm-insitu-comp.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux/drivers/md/dm-insitu-comp.h	2014-02-17 18:37:07.108425465 +0800
@@ -0,0 +1,158 @@
+#ifndef __DM_INSITU_COMPRESSION_H__
+#define __DM_INSITU_COMPRESSION_H__
+#include <linux/types.h>
+
+struct insitu_comp_super_block {
+	__le64 magic;
+	__le64 version;
+	__le64 meta_blocks;
+	__le64 data_blocks;
+	u8 comp_alg;
+} __attribute__((packed));
+
+#define INSITU_COMP_SUPER_MAGIC 0x106526c206506c09
+#define INSITU_COMP_VERSION 1
+#define INSITU_COMP_ALG_LZO 0
+#define INSITU_COMP_ALG_ZLIB 1
+
+#ifdef __KERNEL__
+struct insitu_comp_compressor_data {
+	char *name;
+	int (*comp_len)(int comp_len);
+};
+
+static inline int lzo_comp_len(int comp_len)
+{
+	return lzo1x_worst_compress(comp_len);
+}
+
+/*
+ * Minium logical sector size of this target is 4096 byte, which is a block.
+ * Data of a block is compressed. Compressed data is round up to 512B, which is
+ * the payload. For each block, we have 5 bits meta data. bit 0 - 3 stands
+ * payload length. 0 - 8 sectors. If compressed payload length is 8 sectors, we
+ * just store uncompressed data. Actual compressed data length is stored at the
+ * last 32 bits of payload if data is compressed. In disk, payload is stored at
+ * the begining of logical sector of the block. If IO size is bigger than one
+ * block, we store the whole data as an extent. Bit 4 stands tail for an
+ * extent. Max allowed extent size is 128k.
+ */
+#define INSITU_COMP_BLOCK_SIZE 4096
+#define INSITU_COMP_BLOCK_SHIFT 12
+#define INSITU_COMP_BLOCK_SECTOR_SHIFT (INSITU_COMP_BLOCK_SHIFT - 9)
+
+#define INSITU_COMP_MIN_SIZE 4096
+/* Change this should change HASH_LOCK_SHIFT too */
+#define INSITU_COMP_MAX_SIZE (128 * 1024)
+
+#define INSITU_COMP_LENGTH_MASK ((1 << 4) - 1)
+#define INSITU_COMP_TAIL_MASK (1 << 4)
+#define INSITU_COMP_META_BITS 5
+
+#define INSITU_COMP_META_START_SECTOR (INSITU_COMP_BLOCK_SIZE >> 9)
+
+enum INSITU_COMP_WRITE_MODE {
+	INSITU_COMP_WRITE_BACK,
+	INSITU_COMP_WRITE_THROUGH,
+};
+
+/*
+ * request can cover one aligned 128k (4k * (1 << 5)) range. Since maxium
+ * request size is 128k, we only need take one lock for each request
+ */
+#define HASH_LOCK_SHIFT 5
+
+#define BITMAP_HASH_SHIFT 9
+#define BITMAP_HASH_MASK ((1 << BITMAP_HASH_SHIFT) - 1)
+#define BITMAP_HASH_LEN (1 << BITMAP_HASH_SHIFT)
+
+struct insitu_comp_hash_lock {
+	int io_running;
+	spinlock_t wait_lock;
+	struct list_head wait_list;
+};
+
+struct insitu_comp_info {
+	struct dm_target *ti;
+	struct dm_dev *dev;
+
+	int comp_alg;
+	struct crypto_comp *tfm[NR_CPUS];
+
+	sector_t data_start;
+	u64 data_blocks;
+
+	char *meta_bitmap;
+	u64 meta_bitmap_bits;
+	u64 meta_bitmap_pages;
+	struct insitu_comp_hash_lock bitmap_locks[BITMAP_HASH_LEN];
+
+	enum INSITU_COMP_WRITE_MODE write_mode;
+	unsigned int writeback_delay; /* second unit */
+	struct task_struct *writeback_tsk;
+	struct dm_io_client *io_client;
+
+	atomic64_t compressed_write_size;
+	atomic64_t uncompressed_write_size;
+	atomic64_t meta_write_size;
+};
+
+struct insitu_comp_meta_io {
+	struct dm_io_request io_req;
+	struct dm_io_region io_region;
+	void *data;
+	void (*fn)(void *data, unsigned long error);
+};
+
+struct insitu_comp_io_range {
+	struct dm_io_request io_req;
+	struct dm_io_region io_region;
+	void *decomp_data;
+	unsigned int decomp_len;
+	void *comp_data;
+	unsigned int comp_len; /* For write, this is estimated */
+	struct list_head next;
+	struct insitu_comp_req *req;
+};
+
+enum INSITU_COMP_REQ_STAGE {
+	STAGE_INIT,
+	STAGE_READ_EXISTING,
+	STAGE_READ_DECOMP,
+	STAGE_WRITE_COMP,
+	STAGE_DONE,
+};
+
+struct insitu_comp_req {
+	struct bio *bio;
+	struct insitu_comp_info *info;
+	struct list_head sibling;
+
+	struct list_head all_io;
+	atomic_t io_pending;
+	enum INSITU_COMP_REQ_STAGE stage;
+
+	struct insitu_comp_hash_lock *lock;
+	int result;
+
+	int cpu;
+};
+
+#define insitu_req_start_sector(req) (req->bio->bi_iter.bi_sector)
+#define insitu_req_end_sector(req) (bio_end_sector(req->bio))
+#define insitu_req_rw(req) (req->bio->bi_rw)
+#define insitu_req_sectors(req) (bio_sectors(req->bio))
+
+static inline void insitu_req_endio(struct insitu_comp_req *req, int error)
+{
+	bio_endio(req->bio, error);
+}
+
+struct insitu_comp_io_worker {
+	struct list_head pending;
+	spinlock_t lock;
+	struct work_struct work;
+};
+#endif
+
+#endif
Index: linux/Documentation/device-mapper/insitu-comp.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux/Documentation/device-mapper/insitu-comp.txt	2014-02-17 17:34:45.427464765 +0800
@@ -0,0 +1,50 @@
+This is a simple DM target supporting compression for SSD only. Under layer SSD
+must support 512B sector size, the target only supports 4k sector size.
+
+Disk layout:
+|super|...meta...|..data...|
+
+Store unit is 4k (a block). Super is 1 block, which stores meta and data size
+and compression algorithm. Meta is a bitmap. For each data block, there are 5
+bits meta.
+
+Data:
+Data of a block is compressed. Compressed data is round up to 512B, which is
+the payload. In disk, payload is stored at the begining of logical sector of
+the block. Let's look at an example. Say we store data to block A, which is in
+sector B(A*8), its orginal size is 4k, compressed size is 1500. Compressed data
+(CD) will use 3 sectors (512B). The 3 sectors are the payload. Payload will be
+stored at sector B.
+
+---------------------------------------------------
+... | CD1 | CD2 | CD3 |   |   |   |   |    | ...
+---------------------------------------------------
+    ^B    ^B+1  ^B+2                  ^B+7 ^B+8
+
+For this block, we will not use sector B+3 to B+7 (a hole). We use 4 meta bits
+to present payload size. The compressed size (1500) isn't stored in meta
+directly. Instead, we store it at the last 32bits of payload. In this example,
+we store it at the end of sector B+2. If compressed size + sizeof(32bits)
+crosses a sector, payload size will increase one sector. If payload uses 8
+sectors, we store uncompressed data directly.
+
+If IO size is bigger than one block, we can store the data as an extent. Data
+of the whole extent will compressed and stored in the similar way like above.
+The first block of the extent is the head, all others are the tail. If extent
+is 1 block, the block is head. We have 1 bit of meta to present if a block is
+head or tail. If 4 meta bits of head block can't store extent payload size, we
+will borrow tail block meta bits to store payload size. Max allowd extent size
+is 128k, so we don't compress/decompress too big size data.
+
+Meta:
+Modifying data will modify meta too. Meta will be written(flush) to disk
+depending on meta write policy. We support writeback and writethrough mode. In
+writeback mode, meta will be written to disk in an interval or a FLUSH request.
+In writethrough mode, data and meta data will be written to disk together.
+
+=========================
+Parameters: <dev> [<writethrough>|<writeback> <meta_commit_delay>]
+   <dev>: underlying device
+   <writethrough>: metadata flush to disk with writetrough mode
+   <writeback>: metadata flush to disk with writeback mode
+   <meta_commit_delay>: metadata flush to disk interval in writeback mode

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [patch v3]DM: dm-insitu-comp: a compressed DM target for SSD
  2014-02-18 10:13 [patch v3]DM: dm-insitu-comp: a compressed DM target for SSD Shaohua Li
@ 2014-03-07  7:57 ` Shaohua Li
  2014-03-10 13:52   ` Mike Snitzer
  0 siblings, 1 reply; 11+ messages in thread
From: Shaohua Li @ 2014-03-07  7:57 UTC (permalink / raw)
  To: linux-kernel, dm-devel; +Cc: agk, snitzer, axboe

ping!

On Tue, Feb 18, 2014 at 06:13:04PM +0800, Shaohua Li wrote:
> 
> This is a simple DM target supporting compression for SSD only. Under layer SSD
> must support 512B sector size, the target only supports 4k sector size.
> 
> Disk layout:
> |super|...meta...|..data...|
> 
> Store unit is 4k (a block). Super is 1 block, which stores meta and data size
> and compression algorithm. Meta is a bitmap. For each data block, there are 5
> bits meta.
> 
> Data:
> Data of a block is compressed. Compressed data is round up to 512B, which is
> the payload. In disk, payload is stored at the begining of logical sector of
> the block. Let's look at an example. Say we store data to block A, which is in
> sector B(A*8), its orginal size is 4k, compressed size is 1500. Compressed data
> (CD) will use 3 sectors (512B). The 3 sectors are the payload. Payload will be
> stored at sector B.
> 
> ---------------------------------------------------
> ... | CD1 | CD2 | CD3 |   |   |   |   |    | ...
> ---------------------------------------------------
>     ^B    ^B+1  ^B+2                  ^B+7 ^B+8
> 
> For this block, we will not use sector B+3 to B+7 (a hole). We use 4 meta bits
> to present payload size. The compressed size (1500) isn't stored in meta
> directly. Instead, we store it at the last 32bits of payload. In this example,
> we store it at the end of sector B+2. If compressed size + sizeof(32bits)
> crosses a sector, payload size will increase one sector. If payload uses 8
> sectors, we store uncompressed data directly.
> 
> If IO size is bigger than one block, we can store the data as an extent. Data
> of the whole extent will compressed and stored in the similar way like above.
> The first block of the extent is the head, all others are the tail. If extent
> is 1 block, the block is head. We have 1 bit of meta to present if a block is
> head or tail. If 4 meta bits of head block can't store extent payload size, we
> will borrow tail block meta bits to store payload size. Max allowd extent size
> is 128k, so we don't compress/decompress too big size data.
> 
> Meta:
> Modifying data will modify meta too. Meta will be written(flush) to disk
> depending on meta write policy. We support writeback and writethrough mode. In
> writeback mode, meta will be written to disk in an interval or a FLUSH request.
> In writethrough mode, data and meta data will be written to disk together.
> 
> Advantages:
> 1. simple. Since we store compressed data in-place, we don't need complicated
> disk data management.
> 2. efficient. For each 4k, we only need 5 bits meta. 1T data will use less than
> 200M meta, so we can load all meta into memory. And actual compression size is
> in payload. So if IO doesn't need RMW and we use writeback meta flush, we don't
> need extra IO for meta.
> 
> Disadvantages:
> 1. hole. Since we store compressed data in-place, there are a lot of holes (in
> above example, B+3 - B+7) Hole can impact IO, because we can't do IO merge.
> 2. 1:1 size. Compression doesn't change disk size. If disk is 1T, we can only store
> 1T data even we do compression.
> 
> But this is for SSD only. Generally SSD firmware has a FTL layer to map disk
> sectors to flash nand. High end SSD firmware has filesystem-like FTL.
> 1. hole. Disk has a lot of holes, but SSD FTL can still store data continuous
> in nand. Even we can't do IO merge in OS layer, SSD firmware can do it.
> 2. 1:1 size. On one side, we write compressed data to SSD, which means less
> data is written to SSD. This will be very helpful to improve SSD garbage
> collection, and so write speed and life cycle. So even this is a problem, the
> target is still helpful. On the other side, advanced SSD FTL can easily do thin
> provision. For example, if nand is 1T and we let SSD report it as 2T, and use
> the SSD as compressed target. In such SSD, we don't have the 1:1 size issue.
> 
> So if SSD FTL can map non-continuous disk sectors to continuous nand and
> support thin provision, the compressed target will work very well.
> 
> V2->V3:
> Updated with new bio iter API
> 
> V1->V2:
> 1. Change name to insitu_comp, cleanup code, add comments and doc
> 2. Improve performance (extent locking, dedicated workqueue)
> 
> Signed-off-by: Shaohua Li <shli@fusionio.com>
> ---
>  Documentation/device-mapper/insitu-comp.txt |   50 
>  drivers/md/Kconfig                          |    6 
>  drivers/md/Makefile                         |    1 
>  drivers/md/dm-insitu-comp.c                 | 1480 ++++++++++++++++++++++++++++
>  drivers/md/dm-insitu-comp.h                 |  158 ++
>  5 files changed, 1695 insertions(+)
> 
> Index: linux/drivers/md/Kconfig
> ===================================================================
> --- linux.orig/drivers/md/Kconfig	2014-02-17 17:34:45.431464714 +0800
> +++ linux/drivers/md/Kconfig	2014-02-17 17:34:45.423464815 +0800
> @@ -295,6 +295,12 @@ config DM_CACHE_CLEANER
>           A simple cache policy that writes back all data to the
>           origin.  Used when decommissioning a dm-cache.
>  
> +config DM_INSITU_COMPRESSION
> +       tristate "Insitu compression target"
> +       depends on BLK_DEV_DM
> +       ---help---
> +         Allow volume managers to insitu compress data for SSD.
> +
>  config DM_MIRROR
>         tristate "Mirror target"
>         depends on BLK_DEV_DM
> Index: linux/drivers/md/Makefile
> ===================================================================
> --- linux.orig/drivers/md/Makefile	2014-02-17 17:34:45.431464714 +0800
> +++ linux/drivers/md/Makefile	2014-02-17 17:34:45.423464815 +0800
> @@ -53,6 +53,7 @@ obj-$(CONFIG_DM_VERITY)		+= dm-verity.o
>  obj-$(CONFIG_DM_CACHE)		+= dm-cache.o
>  obj-$(CONFIG_DM_CACHE_MQ)	+= dm-cache-mq.o
>  obj-$(CONFIG_DM_CACHE_CLEANER)	+= dm-cache-cleaner.o
> +obj-$(CONFIG_DM_INSITU_COMPRESSION)		+= dm-insitu-comp.o
>  
>  ifeq ($(CONFIG_DM_UEVENT),y)
>  dm-mod-objs			+= dm-uevent.o
> Index: linux/drivers/md/dm-insitu-comp.c
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux/drivers/md/dm-insitu-comp.c	2014-02-17 20:16:38.093360018 +0800
> @@ -0,0 +1,1480 @@
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/blkdev.h>
> +#include <linux/bio.h>
> +#include <linux/slab.h>
> +#include <linux/device-mapper.h>
> +#include <linux/dm-io.h>
> +#include <linux/crypto.h>
> +#include <linux/lzo.h>
> +#include <linux/kthread.h>
> +#include <linux/page-flags.h>
> +#include <linux/completion.h>
> +#include "dm-insitu-comp.h"
> +
> +#define DM_MSG_PREFIX "dm_insitu_comp"
> +
> +static struct insitu_comp_compressor_data compressors[] = {
> +	[INSITU_COMP_ALG_LZO] = {
> +		.name = "lzo",
> +		.comp_len = lzo_comp_len,
> +	},
> +	[INSITU_COMP_ALG_ZLIB] = {
> +		.name = "deflate",
> +	},
> +};
> +static int default_compressor;
> +
> +static struct kmem_cache *insitu_comp_io_range_cachep;
> +static struct kmem_cache *insitu_comp_meta_io_cachep;
> +
> +static struct insitu_comp_io_worker insitu_comp_io_workers[NR_CPUS];
> +static struct workqueue_struct *insitu_comp_wq;
> +
> +/* each block has 5 bits metadata */
> +static u8 insitu_comp_get_meta(struct insitu_comp_info *info, u64 block_index)
> +{
> +	u64 first_bit = block_index * INSITU_COMP_META_BITS;
> +	int bits, offset;
> +	u8 data, ret = 0;
> +
> +	offset = first_bit & 7;
> +	bits = min_t(u8, INSITU_COMP_META_BITS, 8 - offset);
> +
> +	data = info->meta_bitmap[first_bit >> 3];
> +	ret = (data >> offset) & ((1 << bits) - 1);
> +
> +	if (bits < INSITU_COMP_META_BITS) {
> +		data = info->meta_bitmap[(first_bit >> 3) + 1];
> +		bits = INSITU_COMP_META_BITS - bits;
> +		ret |= (data & ((1 << bits) - 1)) <<
> +			(INSITU_COMP_META_BITS - bits);
> +	}
> +	return ret;
> +}
> +
> +static void insitu_comp_set_meta(struct insitu_comp_info *info,
> +	u64 block_index, u8 meta, bool dirty_meta)
> +{
> +	u64 first_bit = block_index * INSITU_COMP_META_BITS;
> +	int bits, offset;
> +	u8 data;
> +	struct page *page;
> +
> +	offset = first_bit & 7;
> +	bits = min_t(u8, INSITU_COMP_META_BITS, 8 - offset);
> +
> +	data = info->meta_bitmap[first_bit >> 3];
> +	data &= ~(((1 << bits) - 1) << offset);
> +	data |= (meta & ((1 << bits) - 1)) << offset;
> +	info->meta_bitmap[first_bit >> 3] = data;
> +
> +	/*
> +	 * For writethrough, we write metadata directly. For writeback, if
> +	 * request is FUA, we do this too; otherwise we just dirty the page,
> +	 * which will be flush out in an interval
> +	 */
> +	if (info->write_mode == INSITU_COMP_WRITE_BACK) {
> +		page = vmalloc_to_page(&info->meta_bitmap[first_bit >> 3]);
> +		if (dirty_meta)
> +			SetPageDirty(page);
> +		else
> +			ClearPageDirty(page);
> +	}
> +
> +	if (bits < INSITU_COMP_META_BITS) {
> +		meta >>= bits;
> +		data = info->meta_bitmap[(first_bit >> 3) + 1];
> +		bits = INSITU_COMP_META_BITS - bits;
> +		data = (data >> bits) << bits;
> +		data |= meta & ((1 << bits) - 1);
> +		info->meta_bitmap[(first_bit >> 3) + 1] = data;
> +
> +		if (info->write_mode == INSITU_COMP_WRITE_BACK) {
> +			page = vmalloc_to_page(&info->meta_bitmap[
> +						(first_bit >> 3) + 1]);
> +			if (dirty_meta)
> +				SetPageDirty(page);
> +			else
> +				ClearPageDirty(page);
> +		}
> +	}
> +}
> +
> +/*
> + * set metadata for an extent since block @block_index, length is
> + * @logical_blocks.  The extent uses @data_sectors sectors
> + */
> +static void insitu_comp_set_extent(struct insitu_comp_req *req,
> +	u64 block_index, u16 logical_blocks, sector_t data_sectors)
> +{
> +	int i;
> +	u8 data;
> +
> +	for (i = 0; i < logical_blocks; i++) {
> +		data = min_t(sector_t, data_sectors, 8);
> +		data_sectors -= data;
> +		if (i != 0)
> +			data |= INSITU_COMP_TAIL_MASK;
> +		/* For FUA, we write out meta data directly */
> +		insitu_comp_set_meta(req->info, block_index + i, data,
> +					!(insitu_req_rw(req) & REQ_FUA));
> +	}
> +}
> +
> +/*
> + * get metadata for an extent covering block @block_index. @first_block_index
> + * returns the first block of the extent. @logical_sectors returns the extent
> + * length. @data_sectors returns the sectors the extent uses
> + */
> +static void insitu_comp_get_extent(struct insitu_comp_info *info,
> +	u64 block_index, u64 *first_block_index, u16 *logical_sectors,
> +	u16 *data_sectors)
> +{
> +	u8 data;
> +
> +	data = insitu_comp_get_meta(info, block_index);
> +	while (data & INSITU_COMP_TAIL_MASK) {
> +		block_index--;
> +		data = insitu_comp_get_meta(info, block_index);
> +	}
> +	*first_block_index = block_index;
> +	*logical_sectors = INSITU_COMP_BLOCK_SIZE >> 9;
> +	*data_sectors = data & INSITU_COMP_LENGTH_MASK;
> +	block_index++;
> +	while (block_index < info->data_blocks) {
> +		data = insitu_comp_get_meta(info, block_index);
> +		if (!(data & INSITU_COMP_TAIL_MASK))
> +			break;
> +		*logical_sectors += INSITU_COMP_BLOCK_SIZE >> 9;
> +		*data_sectors += data & INSITU_COMP_LENGTH_MASK;
> +		block_index++;
> +	}
> +}
> +
> +static int insitu_comp_access_super(struct insitu_comp_info *info,
> +	void *addr, int rw)
> +{
> +	struct dm_io_region region;
> +	struct dm_io_request req;
> +	unsigned long io_error = 0;
> +	int ret;
> +
> +	region.bdev = info->dev->bdev;
> +	region.sector = 0;
> +	region.count = INSITU_COMP_BLOCK_SIZE >> 9;
> +
> +	req.bi_rw = rw;
> +	req.mem.type = DM_IO_KMEM;
> +	req.mem.offset = 0;
> +	req.mem.ptr.addr = addr;
> +	req.notify.fn = NULL;
> +	req.client = info->io_client;
> +
> +	ret = dm_io(&req, 1, &region, &io_error);
> +	if (ret || io_error)
> +		return -EIO;
> +	return 0;
> +}
> +
> +static void insitu_comp_meta_io_done(unsigned long error, void *context)
> +{
> +	struct insitu_comp_meta_io *meta_io = context;
> +
> +	meta_io->fn(meta_io->data, error);
> +	kmem_cache_free(insitu_comp_meta_io_cachep, meta_io);
> +}
> +
> +static int insitu_comp_write_meta(struct insitu_comp_info *info,
> +	u64 start_page, u64 end_page, void *data,
> +	void (*fn)(void *data, unsigned long error), int rw)
> +{
> +	struct insitu_comp_meta_io *meta_io;
> +
> +	BUG_ON(end_page > info->meta_bitmap_pages);
> +
> +	meta_io = kmem_cache_alloc(insitu_comp_meta_io_cachep, GFP_NOIO);
> +	if (!meta_io) {
> +		fn(data, -ENOMEM);
> +		return -ENOMEM;
> +	}
> +	meta_io->data = data;
> +	meta_io->fn = fn;
> +
> +	meta_io->io_region.bdev = info->dev->bdev;
> +	meta_io->io_region.sector = INSITU_COMP_META_START_SECTOR +
> +					(start_page << (PAGE_SHIFT - 9));
> +	meta_io->io_region.count = (end_page - start_page) << (PAGE_SHIFT - 9);
> +
> +	atomic64_add(meta_io->io_region.count << 9, &info->meta_write_size);
> +
> +	meta_io->io_req.bi_rw = rw;
> +	meta_io->io_req.mem.type = DM_IO_VMA;
> +	meta_io->io_req.mem.offset = 0;
> +	meta_io->io_req.mem.ptr.addr = info->meta_bitmap +
> +						(start_page << PAGE_SHIFT);
> +	meta_io->io_req.notify.fn = insitu_comp_meta_io_done;
> +	meta_io->io_req.notify.context = meta_io;
> +	meta_io->io_req.client = info->io_client;
> +
> +	dm_io(&meta_io->io_req, 1, &meta_io->io_region, NULL);
> +	return 0;
> +}
> +
> +struct writeback_flush_data {
> +	struct completion complete;
> +	atomic_t cnt;
> +};
> +
> +static void writeback_flush_io_done(void *data, unsigned long error)
> +{
> +	struct writeback_flush_data *wb = data;
> +
> +	if (atomic_dec_return(&wb->cnt))
> +		return;
> +	complete(&wb->complete);
> +}
> +
> +static void insitu_comp_flush_dirty_meta(struct insitu_comp_info *info,
> +			struct writeback_flush_data *data)
> +{
> +	struct page *page;
> +	u64 start = 0, index;
> +	u32 pending = 0, cnt = 0;
> +	bool dirty;
> +	struct blk_plug plug;
> +
> +	blk_start_plug(&plug);
> +	for (index = 0; index < info->meta_bitmap_pages; index++, cnt++) {
> +		if (cnt == 256) {
> +			cnt = 0;
> +			cond_resched();
> +		}
> +
> +		page = vmalloc_to_page(info->meta_bitmap +
> +					(index << PAGE_SHIFT));
> +		dirty = TestClearPageDirty(page);
> +
> +		if (pending == 0 && dirty) {
> +			start = index;
> +			pending++;
> +			continue;
> +		} else if (pending == 0)
> +			continue;
> +		else if (pending > 0 && dirty) {
> +			pending++;
> +			continue;
> +		}
> +
> +		/* pending > 0 && !dirty */
> +		atomic_inc(&data->cnt);
> +		insitu_comp_write_meta(info, start, start + pending, data,
> +			writeback_flush_io_done, WRITE);
> +		pending = 0;
> +	}
> +
> +	if (pending > 0) {
> +		atomic_inc(&data->cnt);
> +		insitu_comp_write_meta(info, start, start + pending, data,
> +			writeback_flush_io_done, WRITE);
> +	}
> +	blkdev_issue_flush(info->dev->bdev, GFP_NOIO, NULL);
> +	blk_finish_plug(&plug);
> +}
> +
> +/* writeback thread flushs all dirty metadata to disk in an interval */
> +static int insitu_comp_meta_writeback_thread(void *data)
> +{
> +	struct insitu_comp_info *info = data;
> +	struct writeback_flush_data wb;
> +
> +	atomic_set(&wb.cnt, 1);
> +	init_completion(&wb.complete);
> +
> +	while (!kthread_should_stop()) {
> +		schedule_timeout_interruptible(
> +			msecs_to_jiffies(info->writeback_delay * 1000));
> +		insitu_comp_flush_dirty_meta(info, &wb);
> +	}
> +
> +	insitu_comp_flush_dirty_meta(info, &wb);
> +
> +	writeback_flush_io_done(&wb, 0);
> +	wait_for_completion(&wb.complete);
> +	return 0;
> +}
> +
> +static int insitu_comp_init_meta(struct insitu_comp_info *info, bool new)
> +{
> +	struct dm_io_region region;
> +	struct dm_io_request req;
> +	unsigned long io_error = 0;
> +	struct blk_plug plug;
> +	int ret;
> +	ssize_t len = DIV_ROUND_UP_ULL(info->meta_bitmap_bits, BITS_PER_LONG);
> +
> +	len *= sizeof(unsigned long);
> +
> +	region.bdev = info->dev->bdev;
> +	region.sector = INSITU_COMP_META_START_SECTOR;
> +	region.count = (len + 511) >> 9;
> +
> +	req.mem.type = DM_IO_VMA;
> +	req.mem.offset = 0;
> +	req.mem.ptr.addr = info->meta_bitmap;
> +	req.notify.fn = NULL;
> +	req.client = info->io_client;
> +
> +	blk_start_plug(&plug);
> +	if (new) {
> +		memset(info->meta_bitmap, 0, len);
> +		req.bi_rw = WRITE_FLUSH;
> +		ret = dm_io(&req, 1, &region, &io_error);
> +	} else {
> +		req.bi_rw = READ;
> +		ret = dm_io(&req, 1, &region, &io_error);
> +	}
> +	blk_finish_plug(&plug);
> +
> +	if (ret || io_error) {
> +		info->ti->error = "Access metadata error";
> +		return -EIO;
> +	}
> +
> +	if (info->write_mode == INSITU_COMP_WRITE_BACK) {
> +		info->writeback_tsk = kthread_run(
> +			insitu_comp_meta_writeback_thread,
> +			info, "insitu_comp_writeback");
> +		if (!info->writeback_tsk) {
> +			info->ti->error = "Create writeback thread error";
> +			return -EINVAL;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static int insitu_comp_alloc_compressor(struct insitu_comp_info *info)
> +{
> +	int i;
> +
> +	for_each_possible_cpu(i) {
> +		info->tfm[i] = crypto_alloc_comp(
> +			compressors[info->comp_alg].name, 0, 0);
> +		if (IS_ERR(info->tfm[i])) {
> +			info->tfm[i] = NULL;
> +			goto err;
> +		}
> +	}
> +	return 0;
> +err:
> +	for_each_possible_cpu(i) {
> +		if (info->tfm[i]) {
> +			crypto_free_comp(info->tfm[i]);
> +			info->tfm[i] = NULL;
> +		}
> +	}
> +	return -ENOMEM;
> +}
> +
> +static void insitu_comp_free_compressor(struct insitu_comp_info *info)
> +{
> +	int i;
> +
> +	for_each_possible_cpu(i) {
> +		if (info->tfm[i]) {
> +			crypto_free_comp(info->tfm[i]);
> +			info->tfm[i] = NULL;
> +		}
> +	}
> +}
> +
> +static int insitu_comp_read_or_create_super(struct insitu_comp_info *info)
> +{
> +	void *addr;
> +	struct insitu_comp_super_block *super;
> +	u64 total_blocks;
> +	u64 data_blocks, meta_blocks;
> +	u32 rem, cnt;
> +	bool new_super = false;
> +	int ret;
> +	ssize_t len;
> +
> +	total_blocks = i_size_read(info->dev->bdev->bd_inode) >>
> +					INSITU_COMP_BLOCK_SHIFT;
> +	data_blocks = total_blocks - 1;
> +	rem = do_div(data_blocks, INSITU_COMP_BLOCK_SIZE * 8 +
> +			INSITU_COMP_META_BITS);
> +	meta_blocks = data_blocks * INSITU_COMP_META_BITS;
> +	data_blocks *= INSITU_COMP_BLOCK_SIZE * 8;
> +
> +	cnt = rem;
> +	rem /= (INSITU_COMP_BLOCK_SIZE * 8 / INSITU_COMP_META_BITS + 1);
> +	data_blocks += rem * (INSITU_COMP_BLOCK_SIZE * 8 /
> +				INSITU_COMP_META_BITS);
> +	meta_blocks += rem;
> +
> +	cnt %= (INSITU_COMP_BLOCK_SIZE * 8 / INSITU_COMP_META_BITS + 1);
> +	meta_blocks += 1;
> +	data_blocks += cnt - 1;
> +
> +	info->data_blocks = data_blocks;
> +	info->data_start = (1 + meta_blocks) << INSITU_COMP_BLOCK_SECTOR_SHIFT;
> +
> +	addr = kzalloc(INSITU_COMP_BLOCK_SIZE, GFP_KERNEL);
> +	if (!addr) {
> +		info->ti->error = "Cannot allocate super";
> +		return -ENOMEM;
> +	}
> +
> +	super = addr;
> +	ret = insitu_comp_access_super(info, addr, READ);
> +	if (ret)
> +		goto out;
> +
> +	if (le64_to_cpu(super->magic) == INSITU_COMP_SUPER_MAGIC) {
> +		if (le64_to_cpu(super->version) != INSITU_COMP_VERSION ||
> +		    le64_to_cpu(super->meta_blocks) != meta_blocks ||
> +		    le64_to_cpu(super->data_blocks) != data_blocks) {
> +			info->ti->error = "Super is invalid";
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +		if (!crypto_has_comp(compressors[super->comp_alg].name, 0, 0)) {
> +			info->ti->error =
> +					"Compressor algorithm doesn't support";
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +	} else {
> +		super->magic = cpu_to_le64(INSITU_COMP_SUPER_MAGIC);
> +		super->version = cpu_to_le64(INSITU_COMP_VERSION);
> +		super->meta_blocks = cpu_to_le64(meta_blocks);
> +		super->data_blocks = cpu_to_le64(data_blocks);
> +		super->comp_alg = default_compressor;
> +		ret = insitu_comp_access_super(info, addr, WRITE_FUA);
> +		if (ret) {
> +			info->ti->error = "Access super fails";
> +			goto out;
> +		}
> +		new_super = true;
> +	}
> +
> +	info->comp_alg = super->comp_alg;
> +	if (insitu_comp_alloc_compressor(info)) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	info->meta_bitmap_bits = data_blocks * INSITU_COMP_META_BITS;
> +	len = DIV_ROUND_UP_ULL(info->meta_bitmap_bits, BITS_PER_LONG);
> +	len *= sizeof(unsigned long);
> +	info->meta_bitmap_pages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
> +	info->meta_bitmap = vmalloc(info->meta_bitmap_pages * PAGE_SIZE);
> +	if (!info->meta_bitmap) {
> +		ret = -ENOMEM;
> +		goto bitmap_err;
> +	}
> +
> +	ret = insitu_comp_init_meta(info, new_super);
> +	if (ret)
> +		goto meta_err;
> +
> +	return 0;
> +meta_err:
> +	vfree(info->meta_bitmap);
> +bitmap_err:
> +	insitu_comp_free_compressor(info);
> +out:
> +	kfree(addr);
> +	return ret;
> +}
> +
> +/*
> + * <dev> <writethough>/<writeback> <meta_commit_delay>
> + */
> +static int insitu_comp_ctr(struct dm_target *ti, unsigned int argc, char **argv)
> +{
> +	struct insitu_comp_info *info;
> +	char write_mode[15];
> +	int ret, i;
> +
> +	if (argc < 2) {
> +		ti->error = "Invalid argument count";
> +		return -EINVAL;
> +	}
> +
> +	info = kzalloc(sizeof(*info), GFP_KERNEL);
> +	if (!info) {
> +		ti->error = "Cannot allocate context";
> +		return -ENOMEM;
> +	}
> +	info->ti = ti;
> +
> +	if (sscanf(argv[1], "%s", write_mode) != 1) {
> +		ti->error = "Invalid argument";
> +		ret = -EINVAL;
> +		goto err_para;
> +	}
> +
> +	if (strcmp(write_mode, "writeback") == 0) {
> +		if (argc != 3) {
> +			ti->error = "Invalid argument";
> +			ret = -EINVAL;
> +			goto err_para;
> +		}
> +		info->write_mode = INSITU_COMP_WRITE_BACK;
> +		if (sscanf(argv[2], "%u", &info->writeback_delay) != 1) {
> +			ti->error = "Invalid argument";
> +			ret = -EINVAL;
> +			goto err_para;
> +		}
> +	} else if (strcmp(write_mode, "writethrough") == 0) {
> +		info->write_mode = INSITU_COMP_WRITE_THROUGH;
> +	} else {
> +		ti->error = "Invalid argument";
> +		ret = -EINVAL;
> +		goto err_para;
> +	}
> +
> +	if (dm_get_device(ti, argv[0], dm_table_get_mode(ti->table),
> +							&info->dev)) {
> +		ti->error = "Can't get device";
> +		ret = -EINVAL;
> +		goto err_para;
> +	}
> +
> +	info->io_client = dm_io_client_create();
> +	if (!info->io_client) {
> +		ti->error = "Can't create io client";
> +		ret = -EINVAL;
> +		goto err_ioclient;
> +	}
> +
> +	if (bdev_logical_block_size(info->dev->bdev) != 512) {
> +		ti->error = "Can't logical block size too big";
> +		ret = -EINVAL;
> +		goto err_blocksize;
> +	}
> +
> +	ret = insitu_comp_read_or_create_super(info);
> +	if (ret)
> +		goto err_blocksize;
> +
> +	for (i = 0; i < BITMAP_HASH_LEN; i++) {
> +		info->bitmap_locks[i].io_running = 0;
> +		spin_lock_init(&info->bitmap_locks[i].wait_lock);
> +		INIT_LIST_HEAD(&info->bitmap_locks[i].wait_list);
> +	}
> +
> +	atomic64_set(&info->compressed_write_size, 0);
> +	atomic64_set(&info->uncompressed_write_size, 0);
> +	atomic64_set(&info->meta_write_size, 0);
> +	ti->num_flush_bios = 1;
> +	/* doesn't support discard yet */
> +	ti->per_bio_data_size = sizeof(struct insitu_comp_req);
> +	ti->private = info;
> +	return 0;
> +err_blocksize:
> +	dm_io_client_destroy(info->io_client);
> +err_ioclient:
> +	dm_put_device(ti, info->dev);
> +err_para:
> +	kfree(info);
> +	return ret;
> +}
> +
> +static void insitu_comp_dtr(struct dm_target *ti)
> +{
> +	struct insitu_comp_info *info = ti->private;
> +
> +	if (info->write_mode == INSITU_COMP_WRITE_BACK)
> +		kthread_stop(info->writeback_tsk);
> +	insitu_comp_free_compressor(info);
> +	vfree(info->meta_bitmap);
> +	dm_io_client_destroy(info->io_client);
> +	dm_put_device(ti, info->dev);
> +	kfree(info);
> +}
> +
> +static u64 insitu_comp_sector_to_block(sector_t sect)
> +{
> +	return sect >> INSITU_COMP_BLOCK_SECTOR_SHIFT;
> +}
> +
> +static struct insitu_comp_hash_lock *
> +insitu_comp_block_hash_lock(struct insitu_comp_info *info, u64 block_index)
> +{
> +	return &info->bitmap_locks[(block_index >> HASH_LOCK_SHIFT) &
> +			BITMAP_HASH_MASK];
> +}
> +
> +static struct insitu_comp_hash_lock *
> +insitu_comp_trylock_block(struct insitu_comp_info *info,
> +	struct insitu_comp_req *req, u64 block_index)
> +{
> +	struct insitu_comp_hash_lock *hash_lock;
> +
> +	hash_lock = insitu_comp_block_hash_lock(req->info, block_index);
> +
> +	spin_lock_irq(&hash_lock->wait_lock);
> +	if (!hash_lock->io_running) {
> +		hash_lock->io_running = 1;
> +		spin_unlock_irq(&hash_lock->wait_lock);
> +		return hash_lock;
> +	}
> +	list_add_tail(&req->sibling, &hash_lock->wait_list);
> +	spin_unlock_irq(&hash_lock->wait_lock);
> +	return NULL;
> +}
> +
> +static void insitu_comp_queue_req_list(struct insitu_comp_info *info,
> +	struct list_head *list);
> +static void insitu_comp_unlock_block(struct insitu_comp_info *info,
> +	struct insitu_comp_req *req, struct insitu_comp_hash_lock *hash_lock)
> +{
> +	LIST_HEAD(pending_list);
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&hash_lock->wait_lock, flags);
> +	/* wakeup all pending reqs to avoid live lock */
> +	list_splice_init(&hash_lock->wait_list, &pending_list);
> +	hash_lock->io_running = 0;
> +	spin_unlock_irqrestore(&hash_lock->wait_lock, flags);
> +
> +	insitu_comp_queue_req_list(info, &pending_list);
> +}
> +
> +static void insitu_comp_unlock_req_range(struct insitu_comp_req *req)
> +{
> +	insitu_comp_unlock_block(req->info, req, req->lock);
> +}
> +
> +/* Check comments of HASH_LOCK_SHIFT. each request only need take one lock */
> +static int insitu_comp_lock_req_range(struct insitu_comp_req *req)
> +{
> +	u64 block_index, tmp;
> +
> +	block_index = insitu_comp_sector_to_block(insitu_req_start_sector(req));
> +	tmp = insitu_comp_sector_to_block(insitu_req_end_sector(req) - 1);
> +	BUG_ON(insitu_comp_block_hash_lock(req->info, block_index) !=
> +			insitu_comp_block_hash_lock(req->info, tmp));
> +
> +	req->lock = insitu_comp_trylock_block(req->info, req, block_index);
> +	if (!req->lock)
> +		return 0;
> +
> +	return 1;
> +}
> +
> +static void insitu_comp_queue_req(struct insitu_comp_info *info,
> +	struct insitu_comp_req *req)
> +{
> +	unsigned long flags;
> +	struct insitu_comp_io_worker *worker =
> +		&insitu_comp_io_workers[req->cpu];
> +
> +	spin_lock_irqsave(&worker->lock, flags);
> +	list_add_tail(&req->sibling, &worker->pending);
> +	spin_unlock_irqrestore(&worker->lock, flags);
> +
> +	queue_work_on(req->cpu, insitu_comp_wq, &worker->work);
> +}
> +
> +static void insitu_comp_queue_req_list(struct insitu_comp_info *info,
> +	struct list_head *list)
> +{
> +	struct insitu_comp_req *req;
> +	while (!list_empty(list)) {
> +		req = list_first_entry(list, struct insitu_comp_req, sibling);
> +		list_del_init(&req->sibling);
> +		insitu_comp_queue_req(info, req);
> +	}
> +}
> +
> +static void insitu_comp_get_req(struct insitu_comp_req *req)
> +{
> +	atomic_inc(&req->io_pending);
> +}
> +
> +static void insitu_comp_free_io_range(struct insitu_comp_io_range *io)
> +{
> +	kfree(io->decomp_data);
> +	kfree(io->comp_data);
> +	kmem_cache_free(insitu_comp_io_range_cachep, io);
> +}
> +
> +static void insitu_comp_put_req(struct insitu_comp_req *req)
> +{
> +	struct insitu_comp_io_range *io;
> +
> +	if (atomic_dec_return(&req->io_pending))
> +		return;
> +
> +	if (req->stage == STAGE_INIT) /* waiting for locking */
> +		return;
> +
> +	if (req->stage == STAGE_READ_DECOMP ||
> +	    req->stage == STAGE_WRITE_COMP ||
> +	    req->result)
> +		req->stage = STAGE_DONE;
> +
> +	if (req->stage != STAGE_DONE) {
> +		insitu_comp_queue_req(req->info, req);
> +		return;
> +	}
> +
> +	while (!list_empty(&req->all_io)) {
> +		io = list_entry(req->all_io.next, struct insitu_comp_io_range,
> +			next);
> +		list_del(&io->next);
> +		insitu_comp_free_io_range(io);
> +	}
> +
> +	insitu_comp_unlock_req_range(req);
> +
> +	insitu_req_endio(req, req->result);
> +}
> +
> +static void insitu_comp_io_range_done(unsigned long error, void *context)
> +{
> +	struct insitu_comp_io_range *io = context;
> +
> +	if (error)
> +		io->req->result = error;
> +	insitu_comp_put_req(io->req);
> +}
> +
> +static inline int insitu_comp_compressor_len(struct insitu_comp_info *info,
> +	int len)
> +{
> +	if (compressors[info->comp_alg].comp_len)
> +		return compressors[info->comp_alg].comp_len(len);
> +	return len;
> +}
> +
> +/*
> + * caller should set region.sector, region.count. bi_rw. IO always to/from
> + * comp_data
> + */
> +static struct insitu_comp_io_range *
> +insitu_comp_create_io_range(struct insitu_comp_req *req, int comp_len,
> +	int decomp_len)
> +{
> +	struct insitu_comp_io_range *io;
> +
> +	io = kmem_cache_alloc(insitu_comp_io_range_cachep, GFP_NOIO);
> +	if (!io)
> +		return NULL;
> +
> +	io->comp_data = kmalloc(insitu_comp_compressor_len(req->info, comp_len),
> +								GFP_NOIO);
> +	io->decomp_data = kmalloc(decomp_len, GFP_NOIO);
> +	if (!io->decomp_data || !io->comp_data) {
> +		kfree(io->decomp_data);
> +		kfree(io->comp_data);
> +		kmem_cache_free(insitu_comp_io_range_cachep, io);
> +		return NULL;
> +	}
> +
> +	io->io_req.notify.fn = insitu_comp_io_range_done;
> +	io->io_req.notify.context = io;
> +	io->io_req.client = req->info->io_client;
> +	io->io_req.mem.type = DM_IO_KMEM;
> +	io->io_req.mem.ptr.addr = io->comp_data;
> +	io->io_req.mem.offset = 0;
> +
> +	io->io_region.bdev = req->info->dev->bdev;
> +
> +	io->decomp_len = decomp_len;
> +	io->comp_len = comp_len;
> +	io->req = req;
> +	return io;
> +}
> +
> +static void insitu_comp_req_copy(struct insitu_comp_req *req, off_t req_off, void *buf,
> +		ssize_t len, bool to_buf)
> +{
> +	struct bio *bio = req->bio;
> +	struct bvec_iter iter;
> +	off_t buf_off = 0;
> +	ssize_t size;
> +	void *addr;
> +
> +	iter = bio->bi_iter;
> +	bio_advance_iter(bio, &iter, req_off);
> +
> +	while (len) {
> +		addr = kmap_atomic(bio_iter_page(bio, iter));
> +		size = min_t(ssize_t, len, bio_iter_len(bio, iter));
> +		if (to_buf)
> +			memcpy(buf + buf_off, addr + bio_iter_offset(bio, iter),
> +				size);
> +		else
> +			memcpy(addr + bio_iter_offset(bio, iter), buf + buf_off,
> +				size);
> +		kunmap_atomic(addr);
> +
> +		buf_off += size;
> +		len -= size;
> +
> +		bio_advance_iter(bio, &iter, size);
> +	}
> +}
> +
> +/*
> + * return value:
> + * < 0 : error
> + * == 0 : ok
> + * == 1 : ok, but comp/decomp is skipped
> + * Compressed data size is roundup of 512, which makes the payload.
> + * We store the actual compressed length in the last u32 of the payload.
> + * If there is no free space, we add 512 to the payload size.
> + */
> +static int insitu_comp_io_range_comp(struct insitu_comp_info *info,
> +	void *comp_data, unsigned int *comp_len, void *decomp_data,
> +	unsigned int decomp_len, bool do_comp)
> +{
> +	struct crypto_comp *tfm;
> +	u32 *addr;
> +	unsigned int actual_comp_len;
> +	int ret;
> +
> +	if (do_comp) {
> +		actual_comp_len = *comp_len;
> +
> +		tfm = info->tfm[get_cpu()];
> +		ret = crypto_comp_compress(tfm, decomp_data, decomp_len,
> +			comp_data, &actual_comp_len);
> +		put_cpu();
> +
> +		atomic64_add(decomp_len, &info->uncompressed_write_size);
> +		if (ret || decomp_len < actual_comp_len + sizeof(u32) + 512) {
> +			*comp_len = decomp_len;
> +			atomic64_add(*comp_len, &info->compressed_write_size);
> +			return 1;
> +		}
> +
> +		*comp_len = round_up(actual_comp_len, 512);
> +		if (*comp_len - actual_comp_len < sizeof(u32))
> +			*comp_len += 512;
> +		atomic64_add(*comp_len, &info->compressed_write_size);
> +		addr = comp_data + *comp_len;
> +		addr--;
> +		*addr = cpu_to_le32(actual_comp_len);
> +	} else {
> +		if (*comp_len == decomp_len)
> +			return 1;
> +		addr = comp_data + *comp_len;
> +		addr--;
> +		actual_comp_len = le32_to_cpu(*addr);
> +
> +		tfm = info->tfm[get_cpu()];
> +		ret = crypto_comp_decompress(tfm, comp_data, actual_comp_len,
> +			decomp_data, &decomp_len);
> +		put_cpu();
> +		if (ret)
> +			return -EINVAL;
> +	}
> +	return 0;
> +}
> +
> +/*
> + * compressed data is updated. We decompress it and fill req. If there is no
> + * valid compressed data, we just zero req
> + */
> +static void insitu_comp_handle_read_decomp(struct insitu_comp_req *req)
> +{
> +	struct insitu_comp_io_range *io;
> +	off_t req_off = 0;
> +	int ret;
> +
> +	req->stage = STAGE_READ_DECOMP;
> +
> +	if (req->result)
> +		return;
> +
> +	list_for_each_entry(io, &req->all_io, next) {
> +		ssize_t dst_off = 0, src_off = 0, len;
> +
> +		io->io_region.sector -= req->info->data_start;
> +
> +		/* Do decomp here */
> +		ret = insitu_comp_io_range_comp(req->info, io->comp_data,
> +			&io->comp_len, io->decomp_data, io->decomp_len, false);
> +		if (ret < 0) {
> +			req->result = -EIO;
> +			return;
> +		}
> +
> +		if (io->io_region.sector >= insitu_req_start_sector(req))
> +			dst_off = (io->io_region.sector - insitu_req_start_sector(req))
> +				<< 9;
> +		else
> +			src_off = (insitu_req_start_sector(req) - io->io_region.sector)
> +				<< 9;
> +		len = min_t(ssize_t, io->decomp_len - src_off,
> +			(insitu_req_sectors(req) << 9) - dst_off);
> +
> +		/* io range in all_io list is ordered for read IO */
> +		while (req_off != dst_off) {
> +			ssize_t size = min_t(ssize_t, PAGE_SIZE,
> +					dst_off - req_off);
> +			insitu_comp_req_copy(req, req_off,
> +				empty_zero_page, size, false);
> +			req_off += size;
> +		}
> +
> +		if (ret == 1) /* uncompressed, valid data is in .comp_data */
> +			insitu_comp_req_copy(req, dst_off,
> +					io->comp_data + src_off, len, false);
> +		else
> +			insitu_comp_req_copy(req, dst_off,
> +					io->decomp_data + src_off, len, false);
> +		req_off = dst_off + len;
> +	}
> +
> +	while (req_off != (insitu_req_sectors(req) << 9)) {
> +		ssize_t size = min_t(ssize_t, PAGE_SIZE,
> +			(insitu_req_sectors(req) << 9) - req_off);
> +		insitu_comp_req_copy(req, req_off, empty_zero_page,
> +			size, false);
> +		req_off += size;
> +	}
> +}
> +
> +/*
> + * read one extent data from disk. The extent starts from block @block and has
> + * @data_sectors data
> + */
> +static void insitu_comp_read_one_extent(struct insitu_comp_req *req, u64 block,
> +	u16 logical_sectors, u16 data_sectors)
> +{
> +	struct insitu_comp_io_range *io;
> +
> +	io = insitu_comp_create_io_range(req, data_sectors << 9,
> +		logical_sectors << 9);
> +	if (!io) {
> +		req->result = -EIO;
> +		return;
> +	}
> +
> +	insitu_comp_get_req(req);
> +	list_add_tail(&io->next, &req->all_io);
> +
> +	io->io_region.sector = (block << INSITU_COMP_BLOCK_SECTOR_SHIFT) +
> +				req->info->data_start;
> +	io->io_region.count = data_sectors;
> +
> +	io->io_req.bi_rw = READ;
> +	dm_io(&io->io_req, 1, &io->io_region, NULL);
> +}
> +
> +static void insitu_comp_handle_read_read_existing(struct insitu_comp_req *req)
> +{
> +	u64 block_index, first_block_index;
> +	u16 logical_sectors, data_sectors;
> +
> +	req->stage = STAGE_READ_EXISTING;
> +
> +	block_index = insitu_comp_sector_to_block(insitu_req_start_sector(req));
> +again:
> +	insitu_comp_get_extent(req->info, block_index, &first_block_index,
> +		&logical_sectors, &data_sectors);
> +	if (data_sectors > 0)
> +		insitu_comp_read_one_extent(req, first_block_index,
> +			logical_sectors, data_sectors);
> +
> +	if (req->result)
> +		return;
> +
> +	block_index = first_block_index + (logical_sectors >>
> +				INSITU_COMP_BLOCK_SECTOR_SHIFT);
> +	/* the request might cover several extents */
> +	if ((block_index << INSITU_COMP_BLOCK_SECTOR_SHIFT) <
> +			insitu_req_end_sector(req))
> +		goto again;
> +
> +	/* A shortcut if all data is in already */
> +	if (list_empty(&req->all_io))
> +		insitu_comp_handle_read_decomp(req);
> +}
> +
> +static void insitu_comp_handle_read_request(struct insitu_comp_req *req)
> +{
> +	insitu_comp_get_req(req);
> +
> +	if (req->stage == STAGE_INIT) {
> +		if (!insitu_comp_lock_req_range(req)) {
> +			insitu_comp_put_req(req);
> +			return;
> +		}
> +
> +		insitu_comp_handle_read_read_existing(req);
> +	} else if (req->stage == STAGE_READ_EXISTING)
> +		insitu_comp_handle_read_decomp(req);
> +
> +	insitu_comp_put_req(req);
> +}
> +
> +static void insitu_comp_write_meta_done(void *context, unsigned long error)
> +{
> +	struct insitu_comp_req *req = context;
> +	insitu_comp_put_req(req);
> +}
> +
> +static u64 insitu_comp_block_meta_page_index(u64 block, bool end)
> +{
> +	u64 bits = block * INSITU_COMP_META_BITS - !!end;
> +	/* (1 << 3) bits per byte */
> +	return bits >> (3 + PAGE_SHIFT);
> +}
> +
> +/*
> + * the request covers some extents partially. Decompress data of the extents,
> + * compress remaining valid data, and finally write them out
> + */
> +static int insitu_comp_handle_write_modify(struct insitu_comp_io_range *io,
> +	u64 *meta_start, u64 *meta_end, bool *handle_req)
> +{
> +	struct insitu_comp_req *req = io->req;
> +	sector_t start, count;
> +	unsigned int comp_len;
> +	off_t offset;
> +	u64 page_index;
> +	int ret;
> +
> +	io->io_region.sector -= req->info->data_start;
> +
> +	/* decompress original data */
> +	ret = insitu_comp_io_range_comp(req->info, io->comp_data, &io->comp_len,
> +			io->decomp_data, io->decomp_len, false);
> +	if (ret < 0) {
> +		req->result = -EINVAL;
> +		return -EIO;
> +	}
> +
> +	start = io->io_region.sector;
> +	count = io->decomp_len >> 9;
> +	if (start < insitu_req_start_sector(req) && start + count >
> +					insitu_req_end_sector(req)) {
> +		/* we don't split an extent */
> +		if (ret == 1) {
> +			memcpy(io->decomp_data, io->comp_data, io->decomp_len);
> +			insitu_comp_req_copy(req, 0,
> +			   io->decomp_data + ((insitu_req_start_sector(req) - start) <<
> +			   9), insitu_req_sectors(req) << 9, true);
> +		} else {
> +			insitu_comp_req_copy(req, 0,
> +			   io->decomp_data + ((insitu_req_start_sector(req) - start) <<
> +			   9), insitu_req_sectors(req) << 9, true);
> +			kfree(io->comp_data);
> +			/* New compressed len might be bigger */
> +			io->comp_data = kmalloc(insitu_comp_compressor_len(
> +				req->info, io->decomp_len), GFP_NOIO);
> +			io->comp_len = io->decomp_len;
> +			if (!io->comp_data) {
> +				req->result = -ENOMEM;
> +				return -EIO;
> +			}
> +			io->io_req.mem.ptr.addr = io->comp_data;
> +		}
> +		/* need compress data */
> +		ret = 0;
> +		offset = 0;
> +		*handle_req = false;
> +	} else if (start < insitu_req_start_sector(req)) {
> +		count = insitu_req_start_sector(req) - start;
> +		offset = 0;
> +	} else {
> +		offset = insitu_req_end_sector(req) - start;
> +		start = insitu_req_end_sector(req);
> +		count = count - offset;
> +	}
> +
> +	/* Original data is uncompressed, we don't need writeback */
> +	if (ret == 1) {
> +		comp_len = count << 9;
> +		goto handle_meta;
> +	}
> +
> +	/* assume compress less data uses less space (at least 4k lsess data) */
> +	comp_len = io->comp_len;
> +	ret = insitu_comp_io_range_comp(req->info, io->comp_data, &comp_len,
> +		io->decomp_data + (offset << 9), count << 9, true);
> +	if (ret < 0) {
> +		req->result = -EIO;
> +		return -EIO;
> +	}
> +
> +	insitu_comp_get_req(req);
> +	if (ret == 1)
> +		io->io_req.mem.ptr.addr = io->decomp_data + (offset << 9);
> +	io->io_region.count = comp_len >> 9;
> +	io->io_region.sector = start + req->info->data_start;
> +
> +	io->io_req.bi_rw = insitu_req_rw(req);
> +	dm_io(&io->io_req, 1, &io->io_region, NULL);
> +handle_meta:
> +	insitu_comp_set_extent(req, start >> INSITU_COMP_BLOCK_SECTOR_SHIFT,
> +		count >> INSITU_COMP_BLOCK_SECTOR_SHIFT, comp_len >> 9);
> +
> +	page_index = insitu_comp_block_meta_page_index(start >>
> +					INSITU_COMP_BLOCK_SECTOR_SHIFT, false);
> +	if (*meta_start > page_index)
> +		*meta_start = page_index;
> +	page_index = insitu_comp_block_meta_page_index(
> +		(start + count) >> INSITU_COMP_BLOCK_SECTOR_SHIFT, true);
> +	if (*meta_end < page_index)
> +		*meta_end = page_index;
> +	return 0;
> +}
> +
> +/* Compress data and write it out */
> +static void insitu_comp_handle_write_comp(struct insitu_comp_req *req)
> +{
> +	struct insitu_comp_io_range *io;
> +	sector_t count;
> +	unsigned int comp_len;
> +	u64 meta_start = -1L, meta_end = 0, page_index;
> +	int ret;
> +	bool handle_req = true;
> +
> +	req->stage = STAGE_WRITE_COMP;
> +
> +	if (req->result)
> +		return;
> +
> +	list_for_each_entry(io, &req->all_io, next) {
> +		if (insitu_comp_handle_write_modify(io, &meta_start, &meta_end,
> +						&handle_req))
> +			return;
> +	}
> +
> +	if (!handle_req)
> +		goto update_meta;
> +
> +	count = insitu_req_sectors(req);
> +	io = insitu_comp_create_io_range(req, count << 9, count << 9);
> +	if (!io) {
> +		req->result = -EIO;
> +		return;
> +	}
> +	insitu_comp_req_copy(req, 0, io->decomp_data, count << 9, true);
> +
> +	/* compress data */
> +	comp_len = io->comp_len;
> +	ret = insitu_comp_io_range_comp(req->info, io->comp_data, &comp_len,
> +		io->decomp_data, count << 9, true);
> +	if (ret < 0) {
> +		insitu_comp_free_io_range(io);
> +		req->result = -EIO;
> +		return;
> +	}
> +
> +	insitu_comp_get_req(req);
> +	list_add_tail(&io->next, &req->all_io);
> +	io->io_region.sector = insitu_req_start_sector(req) + req->info->data_start;
> +	if (ret == 1)
> +		io->io_req.mem.ptr.addr = io->decomp_data;
> +	io->io_region.count = comp_len >> 9;
> +	io->io_req.bi_rw = insitu_req_rw(req);
> +	dm_io(&io->io_req, 1, &io->io_region, NULL);
> +	insitu_comp_set_extent(req,
> +		insitu_req_start_sector(req) >> INSITU_COMP_BLOCK_SECTOR_SHIFT,
> +		count >> INSITU_COMP_BLOCK_SECTOR_SHIFT, comp_len >> 9);
> +
> +	page_index = insitu_comp_block_meta_page_index(
> +		insitu_req_start_sector(req) >> INSITU_COMP_BLOCK_SECTOR_SHIFT, false);
> +	if (meta_start > page_index)
> +		meta_start = page_index;
> +	page_index = insitu_comp_block_meta_page_index(
> +		(insitu_req_start_sector(req) + count) >> INSITU_COMP_BLOCK_SECTOR_SHIFT,
> +		true);
> +	if (meta_end < page_index)
> +		meta_end = page_index;
> +update_meta:
> +	if (req->info->write_mode == INSITU_COMP_WRITE_THROUGH ||
> +						(insitu_req_rw(req) & REQ_FUA)) {
> +		insitu_comp_get_req(req);
> +		insitu_comp_write_meta(req->info, meta_start, meta_end + 1, req,
> +			insitu_comp_write_meta_done, insitu_req_rw(req));
> +	}
> +}
> +
> +/* request might cover some extents partially, read them first */
> +static void insitu_comp_handle_write_read_existing(struct insitu_comp_req *req)
> +{
> +	u64 block_index, first_block_index;
> +	u16 logical_sectors, data_sectors;
> +
> +	req->stage = STAGE_READ_EXISTING;
> +
> +	block_index = insitu_comp_sector_to_block(insitu_req_start_sector(req));
> +	insitu_comp_get_extent(req->info, block_index, &first_block_index,
> +		&logical_sectors, &data_sectors);
> +	if (data_sectors > 0 && (first_block_index < block_index ||
> +	    first_block_index + insitu_comp_sector_to_block(logical_sectors) >
> +	    insitu_comp_sector_to_block(insitu_req_end_sector(req))))
> +		insitu_comp_read_one_extent(req, first_block_index,
> +			logical_sectors, data_sectors);
> +
> +	if (req->result)
> +		return;
> +
> +	if (first_block_index + insitu_comp_sector_to_block(logical_sectors) >=
> +	    insitu_comp_sector_to_block(insitu_req_end_sector(req)))
> +		goto out;
> +
> +	block_index = insitu_comp_sector_to_block(insitu_req_end_sector(req)) - 1;
> +	insitu_comp_get_extent(req->info, block_index, &first_block_index,
> +		&logical_sectors, &data_sectors);
> +	if (data_sectors > 0 &&
> +	    first_block_index + insitu_comp_sector_to_block(logical_sectors) >
> +	    block_index + 1)
> +		insitu_comp_read_one_extent(req, first_block_index,
> +			logical_sectors, data_sectors);
> +
> +	if (req->result)
> +		return;
> +out:
> +	if (list_empty(&req->all_io))
> +		insitu_comp_handle_write_comp(req);
> +}
> +
> +static void insitu_comp_handle_write_request(struct insitu_comp_req *req)
> +{
> +	insitu_comp_get_req(req);
> +
> +	if (req->stage == STAGE_INIT) {
> +		if (!insitu_comp_lock_req_range(req)) {
> +			insitu_comp_put_req(req);
> +			return;
> +		}
> +
> +		insitu_comp_handle_write_read_existing(req);
> +	} else if (req->stage == STAGE_READ_EXISTING)
> +		insitu_comp_handle_write_comp(req);
> +
> +	insitu_comp_put_req(req);
> +}
> +
> +/* For writeback mode */
> +static void insitu_comp_handle_flush_request(struct insitu_comp_req *req)
> +{
> +	struct writeback_flush_data wb;
> +
> +	atomic_set(&wb.cnt, 1);
> +	init_completion(&wb.complete);
> +
> +	insitu_comp_flush_dirty_meta(req->info, &wb);
> +
> +	writeback_flush_io_done(&wb, 0);
> +	wait_for_completion(&wb.complete);
> +
> +	insitu_req_endio(req, 0);
> +}
> +
> +static void insitu_comp_handle_request(struct insitu_comp_req *req)
> +{
> +	if (insitu_req_rw(req) & REQ_FLUSH)
> +		insitu_comp_handle_flush_request(req);
> +	else if (insitu_req_rw(req) & REQ_WRITE)
> +		insitu_comp_handle_write_request(req);
> +	else
> +		insitu_comp_handle_read_request(req);
> +}
> +
> +static void insitu_comp_do_request_work(struct work_struct *work)
> +{
> +	struct insitu_comp_io_worker *worker = container_of(work,
> +			struct insitu_comp_io_worker, work);
> +	LIST_HEAD(list);
> +	struct insitu_comp_req *req;
> +	struct blk_plug plug;
> +	bool repeat;
> +
> +	blk_start_plug(&plug);
> +again:
> +	spin_lock_irq(&worker->lock);
> +	list_splice_init(&worker->pending, &list);
> +	spin_unlock_irq(&worker->lock);
> +
> +	repeat = !list_empty(&list);
> +	while (!list_empty(&list)) {
> +		req = list_first_entry(&list, struct insitu_comp_req, sibling);
> +		list_del(&req->sibling);
> +
> +		insitu_comp_handle_request(req);
> +	}
> +	if (repeat)
> +		goto again;
> +	blk_finish_plug(&plug);
> +}
> +
> +static int insitu_comp_map(struct dm_target *ti, struct bio *bio)
> +{
> +	struct insitu_comp_info *info = ti->private;
> +	struct insitu_comp_req *req;
> +
> +	req = dm_per_bio_data(bio, sizeof(struct insitu_comp_req));
> +
> +	if ((bio->bi_rw & REQ_FLUSH) &&
> +			info->write_mode == INSITU_COMP_WRITE_THROUGH) {
> +		bio->bi_bdev = info->dev->bdev;
> +		return DM_MAPIO_REMAPPED;
> +	}
> +
> +	req->bio = bio;
> +	req->info = info;
> +	atomic_set(&req->io_pending, 0);
> +	INIT_LIST_HEAD(&req->all_io);
> +	req->result = 0;
> +	req->stage = STAGE_INIT;
> +
> +	req->cpu = raw_smp_processor_id();
> +	insitu_comp_queue_req(info, req);
> +
> +	return DM_MAPIO_SUBMITTED;
> +}
> +
> +/*
> + * INFO: uncompressed_data_size compressed_data_size metadata_size
> + * TABLE: writethrough/writeback commit_delay
> + */
> +static void insitu_comp_status(struct dm_target *ti, status_type_t type,
> +			  unsigned status_flags, char *result, unsigned maxlen)
> +{
> +	struct insitu_comp_info *info = ti->private;
> +	unsigned int sz = 0;
> +
> +	switch (type) {
> +	case STATUSTYPE_INFO:
> +		DMEMIT("%lu %lu %lu",
> +			atomic64_read(&info->uncompressed_write_size),
> +			atomic64_read(&info->compressed_write_size),
> +			atomic64_read(&info->meta_write_size));
> +		break;
> +	case STATUSTYPE_TABLE:
> +		if (info->write_mode == INSITU_COMP_WRITE_BACK)
> +			DMEMIT("%s %s %d", info->dev->name, "writeback",
> +				info->writeback_delay);
> +		else
> +			DMEMIT("%s %s", info->dev->name, "writethrough");
> +		break;
> +	}
> +}
> +
> +static int insitu_comp_iterate_devices(struct dm_target *ti,
> +				  iterate_devices_callout_fn fn, void *data)
> +{
> +	struct insitu_comp_info *info = ti->private;
> +
> +	return fn(ti, info->dev, info->data_start,
> +		info->data_blocks << INSITU_COMP_BLOCK_SECTOR_SHIFT, data);
> +}
> +
> +static void insitu_comp_io_hints(struct dm_target *ti,
> +			    struct queue_limits *limits)
> +{
> +	/* No blk_limits_logical_block_size */
> +	limits->logical_block_size = limits->physical_block_size =
> +		limits->io_min = INSITU_COMP_BLOCK_SIZE;
> +	blk_limits_max_hw_sectors(limits, INSITU_COMP_MAX_SIZE >> 9);
> +}
> +
> +static int insitu_comp_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
> +			struct bio_vec *biovec, int max_size)
> +{
> +	/* Guarantee request can only cover one aligned 128k range */
> +	return min_t(int, max_size, INSITU_COMP_MAX_SIZE - bvm->bi_size -
> +			((bvm->bi_sector << 9) % INSITU_COMP_MAX_SIZE));
> +}
> +
> +static struct target_type insitu_comp_target = {
> +	.name   = "insitu_comp",
> +	.version = {1, 0, 0},
> +	.module = THIS_MODULE,
> +	.ctr    = insitu_comp_ctr,
> +	.dtr    = insitu_comp_dtr,
> +	.map    = insitu_comp_map,
> +	.status = insitu_comp_status,
> +	.iterate_devices = insitu_comp_iterate_devices,
> +	.io_hints = insitu_comp_io_hints,
> +	.merge = insitu_comp_merge,
> +};
> +
> +static int __init insitu_comp_init(void)
> +{
> +	int r;
> +
> +	for (r = 0; r < ARRAY_SIZE(compressors); r++)
> +		if (crypto_has_comp(compressors[r].name, 0, 0))
> +			break;
> +	if (r >= ARRAY_SIZE(compressors)) {
> +		DMWARN("No crypto compressors are supported");
> +		return -EINVAL;
> +	}
> +
> +	default_compressor = r;
> +
> +	r = -ENOMEM;
> +	insitu_comp_io_range_cachep = kmem_cache_create("insitu_comp_io_range",
> +		sizeof(struct insitu_comp_io_range), 0, 0, NULL);
> +	if (!insitu_comp_io_range_cachep) {
> +		DMWARN("Can't create io_range cache");
> +		goto err;
> +	}
> +
> +	insitu_comp_meta_io_cachep = kmem_cache_create("insitu_comp_meta_io",
> +		sizeof(struct insitu_comp_meta_io), 0, 0, NULL);
> +	if (!insitu_comp_meta_io_cachep) {
> +		DMWARN("Can't create meta_io cache");
> +		goto err;
> +	}
> +
> +	insitu_comp_wq = alloc_workqueue("insitu_comp_io",
> +		WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_CPU_INTENSIVE, 0);
> +	if (!insitu_comp_wq) {
> +		DMWARN("Can't create io workqueue");
> +		goto err;
> +	}
> +
> +	r = dm_register_target(&insitu_comp_target);
> +	if (r < 0) {
> +		DMWARN("target registration failed");
> +		goto err;
> +	}
> +
> +	for_each_possible_cpu(r) {
> +		INIT_LIST_HEAD(&insitu_comp_io_workers[r].pending);
> +		spin_lock_init(&insitu_comp_io_workers[r].lock);
> +		INIT_WORK(&insitu_comp_io_workers[r].work,
> +			insitu_comp_do_request_work);
> +	}
> +	return 0;
> +err:
> +	if (insitu_comp_io_range_cachep)
> +		kmem_cache_destroy(insitu_comp_io_range_cachep);
> +	if (insitu_comp_meta_io_cachep)
> +		kmem_cache_destroy(insitu_comp_meta_io_cachep);
> +	if (insitu_comp_wq)
> +		destroy_workqueue(insitu_comp_wq);
> +
> +	return r;
> +}
> +
> +static void __exit insitu_comp_exit(void)
> +{
> +	dm_unregister_target(&insitu_comp_target);
> +	kmem_cache_destroy(insitu_comp_io_range_cachep);
> +	kmem_cache_destroy(insitu_comp_meta_io_cachep);
> +	destroy_workqueue(insitu_comp_wq);
> +}
> +
> +module_init(insitu_comp_init);
> +module_exit(insitu_comp_exit);
> +
> +MODULE_AUTHOR("Shaohua Li <shli@kernel.org>");
> +MODULE_DESCRIPTION(DM_NAME " target with insitu data compression for SSD");
> +MODULE_LICENSE("GPL");
> Index: linux/drivers/md/dm-insitu-comp.h
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux/drivers/md/dm-insitu-comp.h	2014-02-17 18:37:07.108425465 +0800
> @@ -0,0 +1,158 @@
> +#ifndef __DM_INSITU_COMPRESSION_H__
> +#define __DM_INSITU_COMPRESSION_H__
> +#include <linux/types.h>
> +
> +struct insitu_comp_super_block {
> +	__le64 magic;
> +	__le64 version;
> +	__le64 meta_blocks;
> +	__le64 data_blocks;
> +	u8 comp_alg;
> +} __attribute__((packed));
> +
> +#define INSITU_COMP_SUPER_MAGIC 0x106526c206506c09
> +#define INSITU_COMP_VERSION 1
> +#define INSITU_COMP_ALG_LZO 0
> +#define INSITU_COMP_ALG_ZLIB 1
> +
> +#ifdef __KERNEL__
> +struct insitu_comp_compressor_data {
> +	char *name;
> +	int (*comp_len)(int comp_len);
> +};
> +
> +static inline int lzo_comp_len(int comp_len)
> +{
> +	return lzo1x_worst_compress(comp_len);
> +}
> +
> +/*
> + * Minium logical sector size of this target is 4096 byte, which is a block.
> + * Data of a block is compressed. Compressed data is round up to 512B, which is
> + * the payload. For each block, we have 5 bits meta data. bit 0 - 3 stands
> + * payload length. 0 - 8 sectors. If compressed payload length is 8 sectors, we
> + * just store uncompressed data. Actual compressed data length is stored at the
> + * last 32 bits of payload if data is compressed. In disk, payload is stored at
> + * the begining of logical sector of the block. If IO size is bigger than one
> + * block, we store the whole data as an extent. Bit 4 stands tail for an
> + * extent. Max allowed extent size is 128k.
> + */
> +#define INSITU_COMP_BLOCK_SIZE 4096
> +#define INSITU_COMP_BLOCK_SHIFT 12
> +#define INSITU_COMP_BLOCK_SECTOR_SHIFT (INSITU_COMP_BLOCK_SHIFT - 9)
> +
> +#define INSITU_COMP_MIN_SIZE 4096
> +/* Change this should change HASH_LOCK_SHIFT too */
> +#define INSITU_COMP_MAX_SIZE (128 * 1024)
> +
> +#define INSITU_COMP_LENGTH_MASK ((1 << 4) - 1)
> +#define INSITU_COMP_TAIL_MASK (1 << 4)
> +#define INSITU_COMP_META_BITS 5
> +
> +#define INSITU_COMP_META_START_SECTOR (INSITU_COMP_BLOCK_SIZE >> 9)
> +
> +enum INSITU_COMP_WRITE_MODE {
> +	INSITU_COMP_WRITE_BACK,
> +	INSITU_COMP_WRITE_THROUGH,
> +};
> +
> +/*
> + * request can cover one aligned 128k (4k * (1 << 5)) range. Since maxium
> + * request size is 128k, we only need take one lock for each request
> + */
> +#define HASH_LOCK_SHIFT 5
> +
> +#define BITMAP_HASH_SHIFT 9
> +#define BITMAP_HASH_MASK ((1 << BITMAP_HASH_SHIFT) - 1)
> +#define BITMAP_HASH_LEN (1 << BITMAP_HASH_SHIFT)
> +
> +struct insitu_comp_hash_lock {
> +	int io_running;
> +	spinlock_t wait_lock;
> +	struct list_head wait_list;
> +};
> +
> +struct insitu_comp_info {
> +	struct dm_target *ti;
> +	struct dm_dev *dev;
> +
> +	int comp_alg;
> +	struct crypto_comp *tfm[NR_CPUS];
> +
> +	sector_t data_start;
> +	u64 data_blocks;
> +
> +	char *meta_bitmap;
> +	u64 meta_bitmap_bits;
> +	u64 meta_bitmap_pages;
> +	struct insitu_comp_hash_lock bitmap_locks[BITMAP_HASH_LEN];
> +
> +	enum INSITU_COMP_WRITE_MODE write_mode;
> +	unsigned int writeback_delay; /* second unit */
> +	struct task_struct *writeback_tsk;
> +	struct dm_io_client *io_client;
> +
> +	atomic64_t compressed_write_size;
> +	atomic64_t uncompressed_write_size;
> +	atomic64_t meta_write_size;
> +};
> +
> +struct insitu_comp_meta_io {
> +	struct dm_io_request io_req;
> +	struct dm_io_region io_region;
> +	void *data;
> +	void (*fn)(void *data, unsigned long error);
> +};
> +
> +struct insitu_comp_io_range {
> +	struct dm_io_request io_req;
> +	struct dm_io_region io_region;
> +	void *decomp_data;
> +	unsigned int decomp_len;
> +	void *comp_data;
> +	unsigned int comp_len; /* For write, this is estimated */
> +	struct list_head next;
> +	struct insitu_comp_req *req;
> +};
> +
> +enum INSITU_COMP_REQ_STAGE {
> +	STAGE_INIT,
> +	STAGE_READ_EXISTING,
> +	STAGE_READ_DECOMP,
> +	STAGE_WRITE_COMP,
> +	STAGE_DONE,
> +};
> +
> +struct insitu_comp_req {
> +	struct bio *bio;
> +	struct insitu_comp_info *info;
> +	struct list_head sibling;
> +
> +	struct list_head all_io;
> +	atomic_t io_pending;
> +	enum INSITU_COMP_REQ_STAGE stage;
> +
> +	struct insitu_comp_hash_lock *lock;
> +	int result;
> +
> +	int cpu;
> +};
> +
> +#define insitu_req_start_sector(req) (req->bio->bi_iter.bi_sector)
> +#define insitu_req_end_sector(req) (bio_end_sector(req->bio))
> +#define insitu_req_rw(req) (req->bio->bi_rw)
> +#define insitu_req_sectors(req) (bio_sectors(req->bio))
> +
> +static inline void insitu_req_endio(struct insitu_comp_req *req, int error)
> +{
> +	bio_endio(req->bio, error);
> +}
> +
> +struct insitu_comp_io_worker {
> +	struct list_head pending;
> +	spinlock_t lock;
> +	struct work_struct work;
> +};
> +#endif
> +
> +#endif
> Index: linux/Documentation/device-mapper/insitu-comp.txt
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux/Documentation/device-mapper/insitu-comp.txt	2014-02-17 17:34:45.427464765 +0800
> @@ -0,0 +1,50 @@
> +This is a simple DM target supporting compression for SSD only. Under layer SSD
> +must support 512B sector size, the target only supports 4k sector size.
> +
> +Disk layout:
> +|super|...meta...|..data...|
> +
> +Store unit is 4k (a block). Super is 1 block, which stores meta and data size
> +and compression algorithm. Meta is a bitmap. For each data block, there are 5
> +bits meta.
> +
> +Data:
> +Data of a block is compressed. Compressed data is round up to 512B, which is
> +the payload. In disk, payload is stored at the begining of logical sector of
> +the block. Let's look at an example. Say we store data to block A, which is in
> +sector B(A*8), its orginal size is 4k, compressed size is 1500. Compressed data
> +(CD) will use 3 sectors (512B). The 3 sectors are the payload. Payload will be
> +stored at sector B.
> +
> +---------------------------------------------------
> +... | CD1 | CD2 | CD3 |   |   |   |   |    | ...
> +---------------------------------------------------
> +    ^B    ^B+1  ^B+2                  ^B+7 ^B+8
> +
> +For this block, we will not use sector B+3 to B+7 (a hole). We use 4 meta bits
> +to present payload size. The compressed size (1500) isn't stored in meta
> +directly. Instead, we store it at the last 32bits of payload. In this example,
> +we store it at the end of sector B+2. If compressed size + sizeof(32bits)
> +crosses a sector, payload size will increase one sector. If payload uses 8
> +sectors, we store uncompressed data directly.
> +
> +If IO size is bigger than one block, we can store the data as an extent. Data
> +of the whole extent will compressed and stored in the similar way like above.
> +The first block of the extent is the head, all others are the tail. If extent
> +is 1 block, the block is head. We have 1 bit of meta to present if a block is
> +head or tail. If 4 meta bits of head block can't store extent payload size, we
> +will borrow tail block meta bits to store payload size. Max allowd extent size
> +is 128k, so we don't compress/decompress too big size data.
> +
> +Meta:
> +Modifying data will modify meta too. Meta will be written(flush) to disk
> +depending on meta write policy. We support writeback and writethrough mode. In
> +writeback mode, meta will be written to disk in an interval or a FLUSH request.
> +In writethrough mode, data and meta data will be written to disk together.
> +
> +=========================
> +Parameters: <dev> [<writethrough>|<writeback> <meta_commit_delay>]
> +   <dev>: underlying device
> +   <writethrough>: metadata flush to disk with writetrough mode
> +   <writeback>: metadata flush to disk with writeback mode
> +   <meta_commit_delay>: metadata flush to disk interval in writeback mode

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [patch v3]DM: dm-insitu-comp: a compressed DM target for SSD
  2014-03-07  7:57 ` Shaohua Li
@ 2014-03-10 13:52   ` Mike Snitzer
  2014-03-14  9:40     ` Shaohua Li
  0 siblings, 1 reply; 11+ messages in thread
From: Mike Snitzer @ 2014-03-10 13:52 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-kernel, dm-devel, axboe, agk

On Fri, Mar 07 2014 at  2:57am -0500,
Shaohua Li <shli@kernel.org> wrote:

> ping!

Hi,

I intend to get dm-insitu-comp reviewed for 3.15.  Sorry I haven't
gotten back with you before now, been busy tending to 3.14-rc issues.

I took a quick first pass over your code a couple weeks ago.  Looks to
be in great shape relative to coding conventions and the more DM
specific conventions.  Clearly demonstrates you have a good command of
DM concepts and quirks.

But one thing that would really help get dm-insitu-comp into 3.15 is to
show that the code is working as you'd expect.  To that end, it'd be
great if you'd be willing to add dm-insitu-comp support to the
device-mapper-test-suite, see:
https://github.com/jthornber/device-mapper-test-suite

I recently added barebones/simple dm-crypt support, see:
https://github.com/jthornber/device-mapper-test-suite/commit/c865bcd4e48228e18626d94327fb2485cf9ec9a1

But It may be that activation/test code for the other targets (e.g. thin
or cache) are more useful examples to follow for implemnting
dm-insitu-comp stack activation, see:
https://github.com/jthornber/device-mapper-test-suite/blob/master/lib/dmtest/pool-stack.rb
https://github.com/jthornber/device-mapper-test-suite/blob/master/lib/dmtest/cache_stack.rb

All said, implementing dm-insitu-comp support for dmts (including some
tests that establish it is working as intended) isn't a hard requirement
for getting the target upstream but it would _really_ help.

Mike

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [patch v3]DM: dm-insitu-comp: a compressed DM target for SSD
  2014-03-10 13:52   ` Mike Snitzer
@ 2014-03-14  9:40     ` Shaohua Li
  2014-03-14 22:44       ` Mike Snitzer
  0 siblings, 1 reply; 11+ messages in thread
From: Shaohua Li @ 2014-03-14  9:40 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: linux-kernel, dm-devel, axboe, agk

[-- Attachment #1: Type: text/plain, Size: 1665 bytes --]

On Mon, Mar 10, 2014 at 09:52:56AM -0400, Mike Snitzer wrote:
> On Fri, Mar 07 2014 at  2:57am -0500,
> Shaohua Li <shli@kernel.org> wrote:
> 
> > ping!
> 
> Hi,
> 
> I intend to get dm-insitu-comp reviewed for 3.15.  Sorry I haven't
> gotten back with you before now, been busy tending to 3.14-rc issues.
> 
> I took a quick first pass over your code a couple weeks ago.  Looks to
> be in great shape relative to coding conventions and the more DM
> specific conventions.  Clearly demonstrates you have a good command of
> DM concepts and quirks.
> 
> But one thing that would really help get dm-insitu-comp into 3.15 is to
> show that the code is working as you'd expect.  To that end, it'd be
> great if you'd be willing to add dm-insitu-comp support to the
> device-mapper-test-suite, see:
> https://github.com/jthornber/device-mapper-test-suite
> 
> I recently added barebones/simple dm-crypt support, see:
> https://github.com/jthornber/device-mapper-test-suite/commit/c865bcd4e48228e18626d94327fb2485cf9ec9a1
> 
> But It may be that activation/test code for the other targets (e.g. thin
> or cache) are more useful examples to follow for implemnting
> dm-insitu-comp stack activation, see:
> https://github.com/jthornber/device-mapper-test-suite/blob/master/lib/dmtest/pool-stack.rb
> https://github.com/jthornber/device-mapper-test-suite/blob/master/lib/dmtest/cache_stack.rb
> 
> All said, implementing dm-insitu-comp support for dmts (including some
> tests that establish it is working as intended) isn't a hard requirement
> for getting the target upstream but it would _really_ help.

Ok, I added some simple tests in the test suites.

Thanks,
Shaohua

[-- Attachment #2: comp.patch --]
[-- Type: text/x-diff, Size: 3688 bytes --]

---
 lib/dmtest/suites/insitu-comp.rb                  |    1 
 lib/dmtest/tests/insitu-comp/insitu-comp_tests.rb |  120 ++++++++++++++++++++++
 2 files changed, 121 insertions(+)

Index: device-mapper-test-suite/lib/dmtest/suites/insitu-comp.rb
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ device-mapper-test-suite/lib/dmtest/suites/insitu-comp.rb	2014-03-14 17:16:14.043519177 +0800
@@ -0,0 +1 @@
+require 'dmtest/tests/insitu-comp/insitu-comp_tests'
Index: device-mapper-test-suite/lib/dmtest/tests/insitu-comp/insitu-comp_tests.rb
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ device-mapper-test-suite/lib/dmtest/tests/insitu-comp/insitu-comp_tests.rb	2014-03-14 17:16:14.043519177 +0800
@@ -0,0 +1,120 @@
+require 'dmtest/config'
+require 'dmtest/git'
+require 'dmtest/log'
+require 'dmtest/utils'
+require 'dmtest/fs'
+require 'dmtest/tags'
+require 'dmtest/thinp-test'
+require 'dmtest/cache-status'
+require 'dmtest/disk-units'
+require 'dmtest/test-utils'
+require 'dmtest/tests/cache/fio_subvolume_scenario'
+
+require 'pp'
+
+#------------------------------------------------------------
+
+class InsitucompStack
+  include DM
+  include DM::LexicalOperators
+  include Utils
+
+  def initialize(dm, dev, opts)
+    @dm = dm
+    @dev = dev
+    @opts = opts
+  end
+
+  def activate(&block)
+   with_dev(table) do |comp|
+     @comp = comp
+     block.call(comp)
+   end
+  end
+
+  def table
+    total_blocks = dev_size(@dev) >> 3
+    data_blocks = total_blocks - 1
+    rem = data_blocks % (4096 * 8 + 5)
+    data_blocks /= 4096 * 8 + 5
+    meta_blocks = data_blocks * 5
+    data_blocks *= 4096 * 8
+
+    cnt = rem
+    rem /= (4096 * 8 / 5 + 1)
+    data_blocks += rem * (4096 * 8 / 5)
+    meta_blocks += rem
+
+    cnt %= (4096 * 8 / 5 + 1)
+    meta_blocks += 1
+    data_blocks += cnt - 1
+
+    sector_count = data_blocks << 3
+
+    writethrough = @opts.fetch(:writethrough, true)
+    if writethrough
+      t = Table.new(Target.new('insitu_comp', sector_count, @dev, 'writethrough'))
+    else
+      wb_interval = @opts.fetch(:writeback_interval, 5)
+      t = Table.new(Target.new('insitu_comp', sector_count, @dev, 'writeback', wb_interval))
+    end
+    t
+  end
+
+  private
+  def dm_interface
+    @dm
+  end
+end
+
+#------------------------------------------------------------
+
+class InsitucompTests < ThinpTestCase
+  include Utils
+  include DiskUnits
+  include FioSubVolumeScenario
+
+  def test_basic_setup_writethrough
+    test_basic_setup()
+  end
+
+  def test_basic_setup_writeback
+    test_basic_setup(false, 5)
+  end
+
+  def test_fio_writethrough
+    test_fio()
+  end
+
+  def test_fio_writeback
+    test_fio(false, 5)
+  end
+
+  private
+  def alloc_stack(writethrough, wbinterval)
+    if writethrough
+      stack = InsitucompStack.new(@dm, @data_dev, :writethrough => true)
+    else
+      stack = InsitucompStack.new(@dm, @data_dev, :writethrough => false, :writeback_interval => wbinterval)
+    end
+    stack
+  end
+
+  private
+  def test_basic_setup(writethrough = true, wbinterval = 5)
+    stack = alloc_stack(writethrough, wbinterval)
+    stack.activate do |comp|
+      wipe_device(comp)
+    end
+  end
+
+  private
+  def test_fio(writethrough = true, wbinterval = 5)
+    stack = alloc_stack(writethrough, wbinterval)
+    stack.activate do |comp|
+      do_fio(comp, :ext4,
+             :outfile => AP("fio_dm_insitu-comp" + (writethrough ? "-wt.out" : "-wb.out")),
+             :cfgfile => LP("tests/cache/database-funtime.fio"))
+    end
+  end
+end

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [patch v3]DM: dm-insitu-comp: a compressed DM target for SSD
  2014-03-14  9:40     ` Shaohua Li
@ 2014-03-14 22:44       ` Mike Snitzer
  2014-03-17  9:56         ` Shaohua Li
  0 siblings, 1 reply; 11+ messages in thread
From: Mike Snitzer @ 2014-03-14 22:44 UTC (permalink / raw)
  To: Shaohua Li; +Cc: axboe, dm-devel, linux-kernel, agk

On Fri, Mar 14 2014 at  5:40am -0400,
Shaohua Li <shli@kernel.org> wrote:

> On Mon, Mar 10, 2014 at 09:52:56AM -0400, Mike Snitzer wrote:
> > On Fri, Mar 07 2014 at  2:57am -0500,
> > Shaohua Li <shli@kernel.org> wrote:
> > 
> > > ping!
> > 
> > Hi,
> > 
> > I intend to get dm-insitu-comp reviewed for 3.15.  Sorry I haven't
> > gotten back with you before now, been busy tending to 3.14-rc issues.
> > 
> > I took a quick first pass over your code a couple weeks ago.  Looks to
> > be in great shape relative to coding conventions and the more DM
> > specific conventions.  Clearly demonstrates you have a good command of
> > DM concepts and quirks.

Think I need to eat my words from above at least partially.  Given you
haven't implemented any of the target suspend or resume hooks this
target will _not_ work properly across suspend + resume cycles that all
DM targets must support.

But we can obviously work through it with urgency for 3.15.

I've pulled your v3 patch into git and have overlayed edits from my
first pass.  Lots of funky wrapping to conform to 80 columns.  But
whitespace aside, I've added FIXME:s in the relevant files.  If you work
on any of these FIXMEs please send follow-up patches so that we don't
step on each others' toes.

Please see the 'for-3.15-insitu-comp' branch of this git repo:
git://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git

https://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=for-3.15-insitu-comp

> > But one thing that would really help get dm-insitu-comp into 3.15 is to
> > show that the code is working as you'd expect.  To that end, it'd be
> > great if you'd be willing to add dm-insitu-comp support to the
> > device-mapper-test-suite, see:
> > https://github.com/jthornber/device-mapper-test-suite
> > 
> > I recently added barebones/simple dm-crypt support, see:
> > https://github.com/jthornber/device-mapper-test-suite/commit/c865bcd4e48228e18626d94327fb2485cf9ec9a1
> > 
> > But It may be that activation/test code for the other targets (e.g. thin
> > or cache) are more useful examples to follow for implemnting
> > dm-insitu-comp stack activation, see:
> > https://github.com/jthornber/device-mapper-test-suite/blob/master/lib/dmtest/pool-stack.rb
> > https://github.com/jthornber/device-mapper-test-suite/blob/master/lib/dmtest/cache_stack.rb
> > 
> > All said, implementing dm-insitu-comp support for dmts (including some
> > tests that establish it is working as intended) isn't a hard requirement
> > for getting the target upstream but it would _really_ help.
> 
> Ok, I added some simple tests in the test suites.

OK, I missed this before because it was an attachment.  I was confused
as to whether you already added or will add support.  Now that I've
replied to this mail mutt pulled in the attachment ;)

I'll take for a spin on Monday (or over the weekend if I'm bored).

Thanks,
Mike

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [patch v3]DM: dm-insitu-comp: a compressed DM target for SSD
  2014-03-14 22:44       ` Mike Snitzer
@ 2014-03-17  9:56         ` Shaohua Li
  2014-03-17 20:00           ` Mike Snitzer
  0 siblings, 1 reply; 11+ messages in thread
From: Shaohua Li @ 2014-03-17  9:56 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: axboe, dm-devel, linux-kernel, agk

On Fri, Mar 14, 2014 at 06:44:45PM -0400, Mike Snitzer wrote:
> On Fri, Mar 14 2014 at  5:40am -0400,
> Shaohua Li <shli@kernel.org> wrote:
> 
> > On Mon, Mar 10, 2014 at 09:52:56AM -0400, Mike Snitzer wrote:
> > > On Fri, Mar 07 2014 at  2:57am -0500,
> > > Shaohua Li <shli@kernel.org> wrote:
> > > 
> > > > ping!
> > > 
> > > Hi,
> > > 
> > > I intend to get dm-insitu-comp reviewed for 3.15.  Sorry I haven't
> > > gotten back with you before now, been busy tending to 3.14-rc issues.
> > > 
> > > I took a quick first pass over your code a couple weeks ago.  Looks to
> > > be in great shape relative to coding conventions and the more DM
> > > specific conventions.  Clearly demonstrates you have a good command of
> > > DM concepts and quirks.
> 
> Think I need to eat my words from above at least partially.  Given you
> haven't implemented any of the target suspend or resume hooks this
> target will _not_ work properly across suspend + resume cycles that all
> DM targets must support.
> 
> But we can obviously work through it with urgency for 3.15.
> 
> I've pulled your v3 patch into git and have overlayed edits from my
> first pass.  Lots of funky wrapping to conform to 80 columns.  But
> whitespace aside, I've added FIXME:s in the relevant files.  If you work
> on any of these FIXMEs please send follow-up patches so that we don't
> step on each others' toes.
> 
> Please see the 'for-3.15-insitu-comp' branch of this git repo:
> git://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git

Thanks for your to look at it. I fixed them against your tree. Please check below patch.


Subject: dm-insitu-comp: fix different issues

Fix different issues pointed out by Mike and add suspend/resume support.

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/dm-insitu-comp.c |  108 ++++++++++++++++++++++++++++++--------------
 drivers/md/dm-insitu-comp.h |    8 +++
 2 files changed, 84 insertions(+), 32 deletions(-)

Index: linux/drivers/md/dm-insitu-comp.c
===================================================================
--- linux.orig/drivers/md/dm-insitu-comp.c	2014-03-17 12:37:37.850751341 +0800
+++ linux/drivers/md/dm-insitu-comp.c	2014-03-17 17:40:01.106660303 +0800
@@ -17,6 +17,7 @@
 #include "dm-insitu-comp.h"
 
 #define DM_MSG_PREFIX "insitu-comp"
+#define DEFAULT_WRITEBACK_DELAY 5
 
 static inline int lzo_comp_len(int comp_len)
 {
@@ -40,6 +41,10 @@ static struct kmem_cache *insitu_comp_me
 static struct insitu_comp_io_worker insitu_comp_io_workers[NR_CPUS];
 static struct workqueue_struct *insitu_comp_wq;
 
+#define BYTE_BITS 8
+#define BYTE_BITS_SHIFT 3
+#define BYTE_BITS_MASK (BYTE_BITS - 1)
+
 /* each block has 5 bits of metadata */
 static u8 insitu_comp_get_meta(struct insitu_comp_info *info, u64 block_index)
 {
@@ -47,15 +52,14 @@ static u8 insitu_comp_get_meta(struct in
 	int bits, offset;
 	u8 data, ret = 0;
 
-	// FIXME: "magic" numbers in this function (7, 3)
-	offset = first_bit & 7;
-	bits = min_t(u8, INSITU_COMP_META_BITS, 8 - offset);
+	offset = first_bit & BYTE_BITS_MASK;
+	bits = min_t(u8, INSITU_COMP_META_BITS, BYTE_BITS - offset);
 
-	data = info->meta_bitmap[first_bit >> 3];
+	data = info->meta_bitmap[first_bit >> BYTE_BITS_SHIFT];
 	ret = (data >> offset) & ((1 << bits) - 1);
 
 	if (bits < INSITU_COMP_META_BITS) {
-		data = info->meta_bitmap[(first_bit >> 3) + 1];
+		data = info->meta_bitmap[(first_bit >> BYTE_BITS_SHIFT) + 1];
 		bits = INSITU_COMP_META_BITS - bits;
 		ret |= (data & ((1 << bits) - 1)) <<
 			(INSITU_COMP_META_BITS - bits);
@@ -71,14 +75,13 @@ static void insitu_comp_set_meta(struct
 	u8 data;
 	struct page *page;
 
-	// FIXME: "magic" numbers in this function (7, 3)
-	offset = first_bit & 7;
-	bits = min_t(u8, INSITU_COMP_META_BITS, 8 - offset);
+	offset = first_bit & BYTE_BITS_MASK;
+	bits = min_t(u8, INSITU_COMP_META_BITS, BYTE_BITS - offset);
 
-	data = info->meta_bitmap[first_bit >> 3];
+	data = info->meta_bitmap[first_bit >> BYTE_BITS_SHIFT];
 	data &= ~(((1 << bits) - 1) << offset);
 	data |= (meta & ((1 << bits) - 1)) << offset;
-	info->meta_bitmap[first_bit >> 3] = data;
+	info->meta_bitmap[first_bit >> BYTE_BITS_SHIFT] = data;
 
 	/*
 	 * For writethrough, we write metadata directly.  For writeback, if
@@ -301,9 +304,24 @@ static int insitu_comp_meta_writeback_th
 	init_completion(&wb.complete);
 
 	while (!kthread_should_stop()) {
-		// FIXME: writeback_delay should be in secs
-		schedule_timeout_interruptible(msecs_to_jiffies(info->writeback_delay * 1000));
+		schedule_timeout_interruptible(
+		    msecs_to_jiffies(info->writeback_delay * MSEC_PER_SEC));
 		insitu_comp_flush_dirty_meta(info, &wb);
+
+		if (info->wb_thread_suspend_status != WB_THREAD_RESUMED) {
+			writeback_flush_io_done(&wb, 0);
+			wait_for_completion(&wb.complete);
+
+			info->wb_thread_suspend_status = WB_THREAD_SUSPENDED;
+			wake_up_interruptible(&info->wb_thread_suspend_wq);
+
+			wait_event_interruptible(info->wb_thread_suspend_wq,
+			  info->wb_thread_suspend_status == WB_THREAD_RESUMED ||
+			  kthread_should_stop());
+
+			atomic_set(&wb.cnt, 1);
+			init_completion(&wb.complete);
+		}
 	}
 
 	insitu_comp_flush_dirty_meta(info, &wb);
@@ -357,6 +375,8 @@ static int insitu_comp_init_meta(struct
 			info->ti->error = "Create writeback thread error";
 			return -EINVAL;
 		}
+		info->wb_thread_suspend_status = WB_THREAD_RESUMED;
+		init_waitqueue_head(&info->wb_thread_suspend_wq);
 	}
 
 	return 0;
@@ -410,7 +430,6 @@ static int insitu_comp_read_or_create_su
 
 	total_blocks = i_size_read(info->dev->bdev->bd_inode) >> INSITU_COMP_BLOCK_SHIFT;
 	data_blocks = total_blocks - 1;
-	// FIXME: 64bit divide on 32bit?  must compile/work on 32bit
 	rem = do_div(data_blocks, INSITU_COMP_BLOCK_SIZE * 8 + INSITU_COMP_META_BITS);
 	meta_blocks = data_blocks * INSITU_COMP_META_BITS;
 	data_blocks *= INSITU_COMP_BLOCK_SIZE * 8;
@@ -507,13 +526,11 @@ out:
  */
 static int insitu_comp_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 {
-	// FIXME: add proper feature arg processing.
-	// FIXME: pick default metadata write mode.
 	struct insitu_comp_info *info;
 	char write_mode[15];
 	int ret, i;
 
-	if (argc < 2) {
+	if (argc < 1) {
 		ti->error = "Invalid argument count";
 		return -EINVAL;
 	}
@@ -525,19 +542,18 @@ static int insitu_comp_ctr(struct dm_tar
 	}
 	info->ti = ti;
 
+	info->write_mode = INSITU_COMP_WRITE_BACK;
+	info->writeback_delay = DEFAULT_WRITEBACK_DELAY;
+	if (argc == 1)
+		goto skip_optargs;
+
 	if (sscanf(argv[1], "%s", write_mode) != 1) {
 		ti->error = "Invalid argument";
 		ret = -EINVAL;
 		goto err_para;
 	}
 
-	if (strcmp(write_mode, "writeback") == 0) {
-		if (argc != 3) {
-			ti->error = "Invalid argument";
-			ret = -EINVAL;
-			goto err_para;
-		}
-		info->write_mode = INSITU_COMP_WRITE_BACK;
+	if (strcmp(write_mode, "writeback") == 0 && argc == 3) {
 		if (sscanf(argv[2], "%u", &info->writeback_delay) != 1) {
 			ti->error = "Invalid argument";
 			ret = -EINVAL;
@@ -551,6 +567,7 @@ static int insitu_comp_ctr(struct dm_tar
 		goto err_para;
 	}
 
+skip_optargs:
 	if (dm_get_device(ti, argv[0], dm_table_get_mode(ti->table),
 							&info->dev)) {
 		ti->error = "Can't get device";
@@ -1348,6 +1365,28 @@ static int insitu_comp_map(struct dm_tar
 	return DM_MAPIO_SUBMITTED;
 }
 
+static void insitu_comp_postsuspend(struct dm_target *ti)
+{
+	struct insitu_comp_info *info = ti->private;
+	/* all requests are finished already */
+	if (info->write_mode != INSITU_COMP_WRITE_BACK)
+		return;
+	info->wb_thread_suspend_status = WB_THREAD_SUSPENDING;
+	wake_up_process(info->writeback_tsk);
+
+	wait_event_interruptible(info->wb_thread_suspend_wq,
+		info->wb_thread_suspend_status == WB_THREAD_SUSPENDED);
+}
+
+static void insitu_comp_resume(struct dm_target *ti)
+{
+	struct insitu_comp_info *info = ti->private;
+	if (info->write_mode != INSITU_COMP_WRITE_BACK)
+		return;
+	info->wb_thread_suspend_status = WB_THREAD_RESUMED;
+	wake_up_interruptible(&info->wb_thread_suspend_wq);
+}
+
 /*
  * INFO: uncompressed_data_size compressed_data_size metadata_size
  * TABLE: writethrough/writeback commit_delay
@@ -1360,10 +1399,10 @@ static void insitu_comp_status(struct dm
 
 	switch (type) {
 	case STATUSTYPE_INFO:
-		DMEMIT("%lu %lu %lu",
-		       atomic64_read(&info->uncompressed_write_size),
-		       atomic64_read(&info->compressed_write_size),
-		       atomic64_read(&info->meta_write_size));
+		DMEMIT("%llu %llu %llu",
+		    (long long)atomic64_read(&info->uncompressed_write_size),
+		    (long long)atomic64_read(&info->compressed_write_size),
+		    (long long)atomic64_read(&info->meta_write_size));
 		break;
 	case STATUSTYPE_TABLE:
 		if (info->write_mode == INSITU_COMP_WRITE_BACK)
@@ -1407,8 +1446,8 @@ static struct target_type insitu_comp_ta
 	.ctr    = insitu_comp_ctr,
 	.dtr    = insitu_comp_dtr,
 	.map    = insitu_comp_map,
-	// FIXME: no .postsuspend or .preresume or .resume!?
-	// need to flush workqueue at a minimum.  what about commit?  see pool_target or cache_target
+	.postsuspend = insitu_comp_postsuspend,
+	.resume = insitu_comp_resume,
 	.status = insitu_comp_status,
 	.iterate_devices = insitu_comp_iterate_devices,
 	.io_hints = insitu_comp_io_hints,
@@ -1430,14 +1469,19 @@ static int __init insitu_comp_init(void)
 	default_compressor = r;
 
 	r = -ENOMEM;
-	// FIXME: add dm_ prefix to at least these 2 structs so slabs are attributed to dm
-	insitu_comp_io_range_cachep = KMEM_CACHE(insitu_comp_io_range, 0);
+	insitu_comp_io_range_cachep = kmem_cache_create("dm_insitu_comp_io_range",
+		sizeof(struct insitu_comp_io_range),
+		__alignof__(struct insitu_comp_io_range),
+		0, NULL);
 	if (!insitu_comp_io_range_cachep) {
 		DMWARN("Can't create io_range cache");
 		goto err;
 	}
 
-	insitu_comp_meta_io_cachep = KMEM_CACHE(insitu_comp_meta_io, 0);
+	insitu_comp_meta_io_cachep = kmem_cache_create("dm_insitu_comp_meta_io",
+		sizeof(struct insitu_comp_meta_io),
+		__alignof__(struct insitu_comp_meta_io),
+		0, NULL);
 	if (!insitu_comp_meta_io_cachep) {
 		DMWARN("Can't create meta_io cache");
 		goto err;
Index: linux/drivers/md/dm-insitu-comp.h
===================================================================
--- linux.orig/drivers/md/dm-insitu-comp.h	2014-03-17 12:37:37.850751341 +0800
+++ linux/drivers/md/dm-insitu-comp.h	2014-03-17 16:22:24.553201921 +0800
@@ -92,6 +92,8 @@ struct insitu_comp_info {
 	enum INSITU_COMP_WRITE_MODE write_mode;
 	unsigned int writeback_delay; /* second unit */
 	struct task_struct *writeback_tsk;
+	int wb_thread_suspend_status;
+	wait_queue_head_t wb_thread_suspend_wq;
 	struct dm_io_client *io_client;
 
 	atomic64_t compressed_write_size;
@@ -99,6 +101,12 @@ struct insitu_comp_info {
 	atomic64_t meta_write_size;
 };
 
+enum {
+	WB_THREAD_RESUMED = 0,
+	WB_THREAD_SUSPENDING = 1,
+	WB_THREAD_SUSPENDED = 2,
+};
+
 struct insitu_comp_meta_io {
 	struct dm_io_request io_req;
 	struct dm_io_region io_region;

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [patch v3]DM: dm-insitu-comp: a compressed DM target for SSD
  2014-03-17  9:56         ` Shaohua Li
@ 2014-03-17 20:00           ` Mike Snitzer
  2014-03-18  7:41             ` Shaohua Li
  0 siblings, 1 reply; 11+ messages in thread
From: Mike Snitzer @ 2014-03-17 20:00 UTC (permalink / raw)
  To: Shaohua Li; +Cc: axboe, dm-devel, linux-kernel, agk

On Mon, Mar 17 2014 at  5:56am -0400,
Shaohua Li <shli@kernel.org> wrote:

> On Fri, Mar 14, 2014 at 06:44:45PM -0400, Mike Snitzer wrote:
> > On Fri, Mar 14 2014 at  5:40am -0400,
> > Shaohua Li <shli@kernel.org> wrote:
> > 
> > > On Mon, Mar 10, 2014 at 09:52:56AM -0400, Mike Snitzer wrote:
> > > > On Fri, Mar 07 2014 at  2:57am -0500,
> > > > Shaohua Li <shli@kernel.org> wrote:
> > > > 
> > > > > ping!
> > > > 
> > > > Hi,
> > > > 
> > > > I intend to get dm-insitu-comp reviewed for 3.15.  Sorry I haven't
> > > > gotten back with you before now, been busy tending to 3.14-rc issues.
> > > > 
> > > > I took a quick first pass over your code a couple weeks ago.  Looks to
> > > > be in great shape relative to coding conventions and the more DM
> > > > specific conventions.  Clearly demonstrates you have a good command of
> > > > DM concepts and quirks.
> > 
> > Think I need to eat my words from above at least partially.  Given you
> > haven't implemented any of the target suspend or resume hooks this
> > target will _not_ work properly across suspend + resume cycles that all
> > DM targets must support.
> > 
> > But we can obviously work through it with urgency for 3.15.
> > 
> > I've pulled your v3 patch into git and have overlayed edits from my
> > first pass.  Lots of funky wrapping to conform to 80 columns.  But
> > whitespace aside, I've added FIXME:s in the relevant files.  If you work
> > on any of these FIXMEs please send follow-up patches so that we don't
> > step on each others' toes.
> > 
> > Please see the 'for-3.15-insitu-comp' branch of this git repo:
> > git://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git
> 
> Thanks for your to look at it. I fixed them against your tree. Please check below patch.

I folded your changes in, and then committed a patch ontop that cleans
some code up.  But added 2 FIXMEs that still speak to pretty fundamental
problems with the architecture of the dm-insitu-comp target, see:
https://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=for-3.15-insitu-comp&id=8565ab6b04837591d03c94851c2f9f9162ce12f4

Unfortunately the single insitu_comp_wq workqueue that all insitu-comp
targets are to share isn't a workable solution.  Each target needs to
have resource isolation from other targets (imagine insitu-comp used for
multiple SSDs).  This is important for suspend too because you'll need
to flush/stop the workqueue.

You introduced a state machine for tracking suspending, suspended,
resumed.  This really isn't necessary.  During suspend you need to
flush_workqueue().  On resume you shouldn't need to do anything special.

As I noted in the commit, the thin and cache targets can serve as
references for how you can manage the workqueue across suspend/resume
and the lifetime of these workqueues relative to .ctr and .dtr.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [patch v3]DM: dm-insitu-comp: a compressed DM target for SSD
  2014-03-17 20:00           ` Mike Snitzer
@ 2014-03-18  7:41             ` Shaohua Li
  2014-03-18 21:28               ` Mike Snitzer
  0 siblings, 1 reply; 11+ messages in thread
From: Shaohua Li @ 2014-03-18  7:41 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: axboe, dm-devel, linux-kernel, agk

On Mon, Mar 17, 2014 at 04:00:40PM -0400, Mike Snitzer wrote:
> On Mon, Mar 17 2014 at  5:56am -0400,
> Shaohua Li <shli@kernel.org> wrote:
> 
> > On Fri, Mar 14, 2014 at 06:44:45PM -0400, Mike Snitzer wrote:
> > > On Fri, Mar 14 2014 at  5:40am -0400,
> > > Shaohua Li <shli@kernel.org> wrote:
> > > 
> > > > On Mon, Mar 10, 2014 at 09:52:56AM -0400, Mike Snitzer wrote:
> > > > > On Fri, Mar 07 2014 at  2:57am -0500,
> > > > > Shaohua Li <shli@kernel.org> wrote:
> > > > > 
> > > > > > ping!
> > > > > 
> > > > > Hi,
> > > > > 
> > > > > I intend to get dm-insitu-comp reviewed for 3.15.  Sorry I haven't
> > > > > gotten back with you before now, been busy tending to 3.14-rc issues.
> > > > > 
> > > > > I took a quick first pass over your code a couple weeks ago.  Looks to
> > > > > be in great shape relative to coding conventions and the more DM
> > > > > specific conventions.  Clearly demonstrates you have a good command of
> > > > > DM concepts and quirks.
> > > 
> > > Think I need to eat my words from above at least partially.  Given you
> > > haven't implemented any of the target suspend or resume hooks this
> > > target will _not_ work properly across suspend + resume cycles that all
> > > DM targets must support.
> > > 
> > > But we can obviously work through it with urgency for 3.15.
> > > 
> > > I've pulled your v3 patch into git and have overlayed edits from my
> > > first pass.  Lots of funky wrapping to conform to 80 columns.  But
> > > whitespace aside, I've added FIXME:s in the relevant files.  If you work
> > > on any of these FIXMEs please send follow-up patches so that we don't
> > > step on each others' toes.
> > > 
> > > Please see the 'for-3.15-insitu-comp' branch of this git repo:
> > > git://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git
> > 
> > Thanks for your to look at it. I fixed them against your tree. Please check below patch.
> 
> I folded your changes in, and then committed a patch ontop that cleans
> some code up.  But added 2 FIXMEs that still speak to pretty fundamental
> problems with the architecture of the dm-insitu-comp target, see:
> https://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=for-3.15-insitu-comp&id=8565ab6b04837591d03c94851c2f9f9162ce12f4
> 
> Unfortunately the single insitu_comp_wq workqueue that all insitu-comp
> targets are to share isn't a workable solution.  Each target needs to
> have resource isolation from other targets (imagine insitu-comp used for
> multiple SSDs).  This is important for suspend too because you'll need
> to flush/stop the workqueue.

Is this just because of suspend? I didn't see fundamental reason why the
workqueue can't be shared even for several targets.
 
> You introduced a state machine for tracking suspending, suspended,
> resumed.  This really isn't necessary.  During suspend you need to
> flush_workqueue().  On resume you shouldn't need to do anything special.
> 
> As I noted in the commit, the thin and cache targets can serve as
> references for how you can manage the workqueue across suspend/resume
> and the lifetime of these workqueues relative to .ctr and .dtr.

As far as I checking the code, .postsuspend is called after all requests are
finished. This already guarantees no pending requests running in insitu-comp
workqueue. Doing a workqueue flush isn't required. The writeback thread is
running in background and waiting for requests completion can't guarantee the
thread isn't running, so we must make sure it is safely parked.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [patch v3]DM: dm-insitu-comp: a compressed DM target for SSD
  2014-03-18  7:41             ` Shaohua Li
@ 2014-03-18 21:28               ` Mike Snitzer
  2014-03-19  1:45                 ` Shaohua Li
  0 siblings, 1 reply; 11+ messages in thread
From: Mike Snitzer @ 2014-03-18 21:28 UTC (permalink / raw)
  To: Shaohua Li; +Cc: axboe, dm-devel, linux-kernel, agk

On Tue, Mar 18 2014 at  3:41am -0400,
Shaohua Li <shli@kernel.org> wrote:

> On Mon, Mar 17, 2014 at 04:00:40PM -0400, Mike Snitzer wrote:
> > 
> > I folded your changes in, and then committed a patch ontop that cleans
> > some code up.  But added 2 FIXMEs that still speak to pretty fundamental
> > problems with the architecture of the dm-insitu-comp target, see:
> > https://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=for-3.15-insitu-comp&id=8565ab6b04837591d03c94851c2f9f9162ce12f4
> > 
> > Unfortunately the single insitu_comp_wq workqueue that all insitu-comp
> > targets are to share isn't a workable solution.  Each target needs to
> > have resource isolation from other targets (imagine insitu-comp used for
> > multiple SSDs).  This is important for suspend too because you'll need
> > to flush/stop the workqueue.
> 
> Is this just because of suspend? I didn't see fundamental reason why the
> workqueue can't be shared even for several targets.

I'm not seeing how you are guaranteeing that all queued work is
completed during suspend.  insitu_comp_queue_req() just calls
queue_work_on().

BTW, queue_work_on()'s comment above its implementation says:
"We queue the work to a specific CPU, the caller must ensure it can't go
away." -- you're not able to insure a cpu isn't hotplugged so... but I
also see you've used it in your raid5 perf improvement changes so you
obviously have experience with using this interface.

> > You introduced a state machine for tracking suspending, suspended,
> > resumed.  This really isn't necessary.  During suspend you need to
> > flush_workqueue().  On resume you shouldn't need to do anything special.
> > 
> > As I noted in the commit, the thin and cache targets can serve as
> > references for how you can manage the workqueue across suspend/resume
> > and the lifetime of these workqueues relative to .ctr and .dtr.
> 
> As far as I checking the code, .postsuspend is called after all requests are
> finished. This already guarantees no pending requests running in insitu-comp
> workqueue.

I could easily be missing something obvious, but I don't see where that
guarantee is implemented.

> Doing a workqueue flush isn't required. The writeback thread is
> running in background and waiting for requests completion can't guarantee the
> thread isn't running, so we must make sure it is safely parked.

Sure, but you don't need a state machine to do that.  The DM core takes
care of calling these hooks, so you just need to stop the writeback
thread during suspend and (re)start/kick it on resume (preresume).

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [patch v3]DM: dm-insitu-comp: a compressed DM target for SSD
  2014-03-18 21:28               ` Mike Snitzer
@ 2014-03-19  1:45                 ` Shaohua Li
  2014-03-19 16:16                   ` Mike Snitzer
  0 siblings, 1 reply; 11+ messages in thread
From: Shaohua Li @ 2014-03-19  1:45 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: axboe, dm-devel, linux-kernel, agk

On Tue, Mar 18, 2014 at 05:28:43PM -0400, Mike Snitzer wrote:
> On Tue, Mar 18 2014 at  3:41am -0400,
> Shaohua Li <shli@kernel.org> wrote:
> 
> > On Mon, Mar 17, 2014 at 04:00:40PM -0400, Mike Snitzer wrote:
> > > 
> > > I folded your changes in, and then committed a patch ontop that cleans
> > > some code up.  But added 2 FIXMEs that still speak to pretty fundamental
> > > problems with the architecture of the dm-insitu-comp target, see:
> > > https://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=for-3.15-insitu-comp&id=8565ab6b04837591d03c94851c2f9f9162ce12f4
> > > 
> > > Unfortunately the single insitu_comp_wq workqueue that all insitu-comp
> > > targets are to share isn't a workable solution.  Each target needs to
> > > have resource isolation from other targets (imagine insitu-comp used for
> > > multiple SSDs).  This is important for suspend too because you'll need
> > > to flush/stop the workqueue.
> > 
> > Is this just because of suspend? I didn't see fundamental reason why the
> > workqueue can't be shared even for several targets.
> 
> I'm not seeing how you are guaranteeing that all queued work is
> completed during suspend.  insitu_comp_queue_req() just calls
> queue_work_on().
> 
> BTW, queue_work_on()'s comment above its implementation says:
> "We queue the work to a specific CPU, the caller must ensure it can't go
> away." -- you're not able to insure a cpu isn't hotplugged so... but I
> also see you've used it in your raid5 perf improvement changes so you
> obviously have experience with using this interface.

Good point, I did miss this. the raid5 case hold a lock, while this case
doesn't. A fix is attached below.

> > > You introduced a state machine for tracking suspending, suspended,
> > > resumed.  This really isn't necessary.  During suspend you need to
> > > flush_workqueue().  On resume you shouldn't need to do anything special.
> > > 
> > > As I noted in the commit, the thin and cache targets can serve as
> > > references for how you can manage the workqueue across suspend/resume
> > > and the lifetime of these workqueues relative to .ctr and .dtr.
> > 
> > As far as I checking the code, .postsuspend is called after all requests are
> > finished. This already guarantees no pending requests running in insitu-comp
> > workqueue.
> 
> I could easily be missing something obvious, but I don't see where that
> guarantee is implemented.

Alright, so this is the divergence. dm_suspend() calls dm_wait_for_completion()
and then dm_table_postsuspend_targets(). As far as I understand,
dm_wait_for_completion() will wait all pending requests to finish. The comments
declaim this too. Am I missing anything?

Basically the two kinds of IO. IO requests from upper layer, which
dm_wait_for_completion() will guarantee they are finished. Metadata IO
requests, which .postsuspend should make sure they are finished.
 
> > Doing a workqueue flush isn't required. The writeback thread is
> > running in background and waiting for requests completion can't guarantee the
> > thread isn't running, so we must make sure it is safely parked.
> 
> Sure, but you don't need a state machine to do that.  The DM core takes
> care of calling these hooks, so you just need to stop the writeback
> thread during suspend and (re)start/kick it on resume (preresume).

Yep, I need wait the writeback thread finish all pending metadata IO, the state
machine works well here.


---
 drivers/md/dm-insitu-comp.c |    9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

Index: linux/drivers/md/dm-insitu-comp.c
===================================================================
--- linux.orig/drivers/md/dm-insitu-comp.c	2014-03-17 17:40:01.106660303 +0800
+++ linux/drivers/md/dm-insitu-comp.c	2014-03-19 08:49:12.314627050 +0800
@@ -704,14 +704,19 @@ static void insitu_comp_queue_req(struct
 				  struct insitu_comp_req *req)
 {
 	unsigned long flags;
-	struct insitu_comp_io_worker *worker =
-		&insitu_comp_io_workers[req->cpu];
+	struct insitu_comp_io_worker *worker;
+
+	preempt_disable();
+	if (!cpu_online(req->cpu))
+		req->cpu = cpumask_any(cpu_online_mask);
+	worker = &insitu_comp_io_workers[req->cpu];
 
 	spin_lock_irqsave(&worker->lock, flags);
 	list_add_tail(&req->sibling, &worker->pending);
 	spin_unlock_irqrestore(&worker->lock, flags);
 
 	queue_work_on(req->cpu, insitu_comp_wq, &worker->work);
+	preempt_enable();
 }
 
 static void insitu_comp_queue_req_list(struct insitu_comp_info *info,

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [patch v3]DM: dm-insitu-comp: a compressed DM target for SSD
  2014-03-19  1:45                 ` Shaohua Li
@ 2014-03-19 16:16                   ` Mike Snitzer
  0 siblings, 0 replies; 11+ messages in thread
From: Mike Snitzer @ 2014-03-19 16:16 UTC (permalink / raw)
  To: Shaohua Li; +Cc: axboe, dm-devel, linux-kernel, agk

On Tue, Mar 18 2014 at  9:45pm -0400,
Shaohua Li <shli@kernel.org> wrote:

> On Tue, Mar 18, 2014 at 05:28:43PM -0400, Mike Snitzer wrote:
> > On Tue, Mar 18 2014 at  3:41am -0400,
> > Shaohua Li <shli@kernel.org> wrote:
> > 
> > > On Mon, Mar 17, 2014 at 04:00:40PM -0400, Mike Snitzer wrote:
> > > > 
> > > > I folded your changes in, and then committed a patch ontop that cleans
> > > > some code up.  But added 2 FIXMEs that still speak to pretty fundamental
> > > > problems with the architecture of the dm-insitu-comp target, see:
> > > > https://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=for-3.15-insitu-comp&id=8565ab6b04837591d03c94851c2f9f9162ce12f4
> > > > 
> > > > Unfortunately the single insitu_comp_wq workqueue that all insitu-comp
> > > > targets are to share isn't a workable solution.  Each target needs to
> > > > have resource isolation from other targets (imagine insitu-comp used for
> > > > multiple SSDs).  This is important for suspend too because you'll need
> > > > to flush/stop the workqueue.
> > > 
> > > Is this just because of suspend? I didn't see fundamental reason why the
> > > workqueue can't be shared even for several targets.
> > 
> > I'm not seeing how you are guaranteeing that all queued work is
> > completed during suspend.  insitu_comp_queue_req() just calls
> > queue_work_on().
> > 
> > BTW, queue_work_on()'s comment above its implementation says:
> > "We queue the work to a specific CPU, the caller must ensure it can't go
> > away." -- you're not able to insure a cpu isn't hotplugged so... but I
> > also see you've used it in your raid5 perf improvement changes so you
> > obviously have experience with using this interface.
> 
> Good point, I did miss this. the raid5 case hold a lock, while this case
> doesn't. A fix is attached below.

I've pushed it to the branch.
 
> > > > You introduced a state machine for tracking suspending, suspended,
> > > > resumed.  This really isn't necessary.  During suspend you need to
> > > > flush_workqueue().  On resume you shouldn't need to do anything special.
> > > > 
> > > > As I noted in the commit, the thin and cache targets can serve as
> > > > references for how you can manage the workqueue across suspend/resume
> > > > and the lifetime of these workqueues relative to .ctr and .dtr.
> > > 
> > > As far as I checking the code, .postsuspend is called after all requests are
> > > finished. This already guarantees no pending requests running in insitu-comp
> > > workqueue.
> > 
> > I could easily be missing something obvious, but I don't see where that
> > guarantee is implemented.
> 
> Alright, so this is the divergence. dm_suspend() calls dm_wait_for_completion()
> and then dm_table_postsuspend_targets(). As far as I understand,
> dm_wait_for_completion() will wait all pending requests to finish. The comments
> declaim this too. Am I missing anything?
> 
> Basically the two kinds of IO. IO requests from upper layer, which
> dm_wait_for_completion() will guarantee they are finished. Metadata IO
> requests, which .postsuspend should make sure they are finished.

OK, I see.  You don't have the notion of a transaction, that I can see,
so this this begs the question: what kind of crash consistency/recovery
are you providing?  Seems that the metadata writeback support is
allowing for more potential for lost data on a crash.

Also, it isn't clear how you're coping with the potential for a crash
while updating a extent (when metadata bits are borrowed from the tail,
etc).  Without transaction a block (or extent of blocks) could be
partially updated, how are you guaranteeing the corresponding data is
either entirely old or new?  The compressed nature of this data makes a
requirement for atomic updates to occur here.  Are you somehow
leveraging Fusion-io SSD to provide such guarantee?

So effectively there are concerns about data integrity of this target
that need answering.  Unfortunately I'm running low on time I can
dedicate to continued review of this target and need to transition to
other priorities.

> > > Doing a workqueue flush isn't required. The writeback thread is
> > > running in background and waiting for requests completion can't guarantee the
> > > thread isn't running, so we must make sure it is safely parked.
> > 
> > Sure, but you don't need a state machine to do that.  The DM core takes
> > care of calling these hooks, so you just need to stop the writeback
> > thread during suspend and (re)start/kick it on resume (preresume).
> 
> Yep, I need wait the writeback thread finish all pending metadata IO, the state
> machine works well here.

I see what you've done, I get that you're using the state machine to
wait but I still contend it isn't needed.  Like I said before, just
stop writeback in suspend and restart on resume.  No state is needed.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2014-03-19 16:16 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-02-18 10:13 [patch v3]DM: dm-insitu-comp: a compressed DM target for SSD Shaohua Li
2014-03-07  7:57 ` Shaohua Li
2014-03-10 13:52   ` Mike Snitzer
2014-03-14  9:40     ` Shaohua Li
2014-03-14 22:44       ` Mike Snitzer
2014-03-17  9:56         ` Shaohua Li
2014-03-17 20:00           ` Mike Snitzer
2014-03-18  7:41             ` Shaohua Li
2014-03-18 21:28               ` Mike Snitzer
2014-03-19  1:45                 ` Shaohua Li
2014-03-19 16:16                   ` Mike Snitzer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).