* [PATCH 0/3] dm-zoned target for ZBC devices
@ 2016-07-19 14:02 Hannes Reinecke
  2016-07-19 14:02 ` [PATCH 1/3] block: add flag for single-threaded submission Hannes Reinecke
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Hannes Reinecke @ 2016-07-19 14:02 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: dm-devel-redhat.com, Damien Le Moal, linux-scsi, linux-block,
	Christoph Hellwig, Jens Axboe, Hannes Reinecke

Hi all,

This patchset implements a 'dm-zoned' device-mapper target to handle
ZBC host-managed (and host-aware) devices as 'normal' disks.

The patches have been made against Tejun's 'libata/for-4.8' repository.
They rely on the patchsets "Support for zoned block devices" and
"Add support for ZBC host-managed devices" posted earlier.

As usual, comments and reviews are welcome.

Damien Le Moal (1):
  dm-zoned: New device mapper target for zoned block devices

Hannes Reinecke (2):
  block: add flag for single-threaded submission
  sd: enable single-threaded I/O submission for zoned devices

 Documentation/device-mapper/dm-zoned.txt |  147 +++
 block/blk-core.c                         |    2 +
 drivers/md/Kconfig                       |   14 +
 drivers/md/Makefile                      |    2 +
 drivers/md/dm-zoned-io.c                 | 1186 ++++++++++++++++++
 drivers/md/dm-zoned-meta.c               | 1950 ++++++++++++++++++++++++++++++
 drivers/md/dm-zoned-reclaim.c            |  770 ++++++++++++
 drivers/md/dm-zoned.h                    |  687 +++++++++++
 drivers/scsi/sd.c                        |    1 +
 include/linux/blkdev.h                   |    2 +
 10 files changed, 4761 insertions(+)
 create mode 100644 Documentation/device-mapper/dm-zoned.txt
 create mode 100644 drivers/md/dm-zoned-io.c
 create mode 100644 drivers/md/dm-zoned-meta.c
 create mode 100644 drivers/md/dm-zoned-reclaim.c
 create mode 100644 drivers/md/dm-zoned.h

-- 
1.8.5.6


* [PATCH 1/3] block: add flag for single-threaded submission
  2016-07-19 14:02 [PATCH 0/3] dm-zoned target for ZBC devices Hannes Reinecke
@ 2016-07-19 14:02 ` Hannes Reinecke
  2016-07-20  1:13     ` Damien Le Moal
  2016-07-21  5:54   ` Christoph Hellwig
  2016-07-19 14:02 ` [PATCH 2/3] sd: enable single-threaded I/O submission for zoned devices Hannes Reinecke
  2016-07-19 14:02 ` [PATCH 3/3] dm-zoned: New device mapper target for zoned block devices Hannes Reinecke
  2 siblings, 2 replies; 15+ messages in thread
From: Hannes Reinecke @ 2016-07-19 14:02 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: dm-devel-redhat.com, Damien Le Moal, linux-scsi, linux-block,
	Christoph Hellwig, Jens Axboe, Hannes Reinecke

Some devices (most notably SMR drives) support only a single
I/O stream, e.g. to ensure ordered I/O submission.
This patch adds a new block queue flag,
'QUEUE_FLAG_SINGLE', to support these devices.
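
For illustration, a driver that needs this behaviour would simply set the
new flag on its request queue, as patch 2 does for sd. The sketch below is
illustrative only and not part of this patch:

#include <linux/blkdev.h>

/* Hypothetical driver setup path: restrict the queue to a single
 * concurrent request_fn invocation so that command ordering is preserved.
 */
static void example_enable_single_submission(struct request_queue *q)
{
	queue_flag_set_unlocked(QUEUE_FLAG_SINGLE, q);
}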

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 block/blk-core.c       | 2 ++
 include/linux/blkdev.h | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 4bcf30a..ff08d77 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -320,6 +320,8 @@ inline void __blk_run_queue_uncond(struct request_queue *q)
 	 * number of active request_fn invocations such that blk_drain_queue()
 	 * can wait until all these request_fn calls have finished.
 	 */
+	if (blk_queue_single(q) && q->request_fn_active)
+		return;
 	q->request_fn_active++;
 	q->request_fn(q);
 	q->request_fn_active--;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c351444..2f7775a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -539,6 +539,7 @@ struct request_queue {
 #define QUEUE_FLAG_WC	       23	/* Write back caching */
 #define QUEUE_FLAG_FUA	       24	/* device supports FUA writes */
 #define QUEUE_FLAG_FLUSH_NQ    25	/* flush not queueuable */
+#define QUEUE_FLAG_SINGLE      26	/* single-threaded submission only */
 
 #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_STACKABLE)	|	\
@@ -628,6 +629,7 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
 #define blk_queue_discard(q)	test_bit(QUEUE_FLAG_DISCARD, &(q)->queue_flags)
 #define blk_queue_secdiscard(q)	(blk_queue_discard(q) && \
 	test_bit(QUEUE_FLAG_SECDISCARD, &(q)->queue_flags))
+#define blk_queue_single(q)	test_bit(QUEUE_FLAG_SINGLE, &(q)->queue_flags)
 
 #define blk_noretry_request(rq) \
 	((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
-- 
1.8.5.6


* [PATCH 2/3] sd: enable single-threaded I/O submission for zoned devices
  2016-07-19 14:02 [PATCH 0/3] dm-zoned target for ZBC devices Hannes Reinecke
  2016-07-19 14:02 ` [PATCH 1/3] block: add flag for single-threaded submission Hannes Reinecke
@ 2016-07-19 14:02 ` Hannes Reinecke
  2016-07-20  1:15     ` Damien Le Moal
  2016-07-19 14:02 ` [PATCH 3/3] dm-zoned: New device mapper target for zoned block devices Hannes Reinecke
  2 siblings, 1 reply; 15+ messages in thread
From: Hannes Reinecke @ 2016-07-19 14:02 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: dm-devel-redhat.com, Damien Le Moal, linux-scsi, linux-block,
	Christoph Hellwig, Jens Axboe, Hannes Reinecke

Zoned devices require single-threaded I/O submission to guarantee
sequential I/O, so enable the block layer flag for them.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/scsi/sd.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 4b704b0..44960fd 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2154,6 +2154,7 @@ static void sd_read_zones(struct scsi_disk *sdkp, unsigned char *buffer)
 	blk_queue_chunk_sectors(sdkp->disk->queue,
 				logical_to_sectors(sdkp->device, zone_len));
 	sd_config_discard(sdkp, SD_ZBC_RESET_WP);
+	queue_flag_set_unlocked(QUEUE_FLAG_SINGLE, sdkp->disk->queue);
 
 	sd_zbc_setup(sdkp, buffer, SD_BUF_SIZE);
 }
-- 
1.8.5.6


* [PATCH 3/3] dm-zoned: New device mapper target for zoned block devices
  2016-07-19 14:02 [PATCH 0/3] dm-zoned target for ZBC devices Hannes Reinecke
  2016-07-19 14:02 ` [PATCH 1/3] block: add flag for single-threaded submission Hannes Reinecke
  2016-07-19 14:02 ` [PATCH 2/3] sd: enable single-threaded I/O submission for zoned devices Hannes Reinecke
@ 2016-07-19 14:02 ` Hannes Reinecke
  2 siblings, 0 replies; 15+ messages in thread
From: Hannes Reinecke @ 2016-07-19 14:02 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: dm-devel-redhat.com, Damien Le Moal, linux-scsi, linux-block,
	Christoph Hellwig, Jens Axboe, Hannes Reinecke

From: Damien Le Moal <damien.lemoal@hgst.com>

dm-zoned presents a zoned block device as a regular, fully randomly
writeable block device, hiding the sequential write constraints of
host-managed zoned block devices and mitigating potential performance
degradation of host-aware devices.
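
The logical device exposed by the target is split into zone-sized chunks,
each mapped to one data zone; writes that are not aligned with a zone's
write pointer are redirected to a buffer zone until reclaim merges them
back. A rough sketch of the chunk lookup idea (illustrative only, these
are not the actual helpers used by the patch):

#include <linux/bio.h>
#include <linux/types.h>

/* Map a BIO to its logical chunk index; one chunk spans one zone. */
static sector_t example_bio_chunk(struct bio *bio,
				  unsigned int zone_nr_sectors_shift)
{
	return bio->bi_iter.bi_sector >> zone_nr_sectors_shift;
}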

Signed-off-by: Damien Le Moal <damien.lemoal@hgst.com>
Signed-off-by: Hannes Reinecke <hare@suse.com>
---
 Documentation/device-mapper/dm-zoned.txt |  147 +++
 drivers/md/Kconfig                       |   14 +
 drivers/md/Makefile                      |    2 +
 drivers/md/dm-zoned-io.c                 | 1186 ++++++++++++++++++
 drivers/md/dm-zoned-meta.c               | 1950 ++++++++++++++++++++++++++++++
 drivers/md/dm-zoned-reclaim.c            |  770 ++++++++++++
 drivers/md/dm-zoned.h                    |  687 +++++++++++
 7 files changed, 4756 insertions(+)
 create mode 100644 Documentation/device-mapper/dm-zoned.txt
 create mode 100644 drivers/md/dm-zoned-io.c
 create mode 100644 drivers/md/dm-zoned-meta.c
 create mode 100644 drivers/md/dm-zoned-reclaim.c
 create mode 100644 drivers/md/dm-zoned.h

diff --git a/Documentation/device-mapper/dm-zoned.txt b/Documentation/device-mapper/dm-zoned.txt
new file mode 100644
index 0000000..28595ef
--- /dev/null
+++ b/Documentation/device-mapper/dm-zoned.txt
@@ -0,0 +1,147 @@
+dm-zoned
+========
+
+The dm-zoned device mapper target provides transparent write access to
+zoned block devices (ZBC and ZAC compliant devices). It hides from the
+device user (a file system or an application doing raw block device
+accesses) the sequential write constraints of host-managed devices and
+can mitigate potential device performance degradation with host-aware
+zoned devices.
+
+For a more detailed description of the zoned block device models and
+constraints see (for SCSI devices):
+
+http://www.t10.org/drafts.htm#ZBC_Family
+
+And (for ATA devices):
+
+http://www.t13.org/Documents/UploadedDocuments/docs2015/di537r05-Zoned_Device_ATA_Command_Set_ZAC.pdf
+
+
+Algorithm
+=========
+
+The zones of the device are separated into 3 sets:
+1) Metadata zones: these are randomly writeable zones used to store metadata.
+Randomly writeable zones may be conventional zones or sequential write preferred
+zones (host-aware devices only). These zones have a fixed mapping and must be
+available at the beginning of the device address space (from LBA 0).
+2) Buffer zones: these are randomly writeable zones used to temporarily
+buffer unaligned writes to data zones. Buffer zones may be conventional
+zones or sequential write preferred zones (host-aware devices only), and any
+random zone in the device address space can be used as a buffer zone (there
+is no constraint on the location of these zones).
+3) Data zones: all remaining zones. Most will likely be sequential zones,
+either sequential write required zones (host-managed devices) or sequential
+write preferred zones (host-aware devices). Conventional zones not used as
+metadata or buffer zones are part of the set of data zones. dm-zoned
+tries to efficiently allocate and map these zones to limit the performance
+impact of buffering random writes for chunks of the logical device that are
+being heavily randomly written.
+
+dm-zoned exposes a logical device with a sector size of 4096 bytes, irrespective
+of the physical sector size of the backend device being used.  This reduces
+the amount of metadata needed to manage valid blocks (blocks that have been
+written) and the buffering of random writes. In more detail, the on-disk
+metadata format is as follows:
+1) Block 0 contains the super block, which describes the number of metadata
+blocks used, the number of buffer zones reserved, their position on disk and
+the data zones being buffered.
+2) Following block 0, a set of blocks is used to describe the mapping of the
+logical chunks of the logical device to data zones (the size of a logical chunk
+is equal to the device zone size).
+3) A set of blocks is used to store bitmaps indicating the validity of blocks in
+the buffer zones and data zones. A valid block is a block that was written and
+not discarded. For a buffered data zone, a block can be valid only in the data
+zone or in the buffer zone.
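+For example, assuming 256 MB zones (a value chosen purely for illustration) and
+the 4096 B block size used by dm-zoned, a zone contains 65536 blocks, so the
+validity bitmap of one zone occupies 65536 / 8 = 8192 bytes, i.e. two metadata
+blocks.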
+
+For a logical chunk mapped to a conventional data zone, all write operations are
+processed by directly writing to the data zone. If the chunk is mapped to a
+sequential zone, the write operation is processed directly if and only if the
+write offset within the logical chunk equals the write pointer offset within
+the data zone (i.e. the write operation is aligned with the zone write pointer).
+
+Otherwise, write operations are processed indirectly using a buffer zone: a
+buffer zone is allocated and assigned to the data zone being accessed, and the
+data is written to it. This invalidates the written blocks in the data zone and
+validates them in the buffer zone.
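+For example, if the write pointer of a sequential data zone is at block 1000, a
+write to chunk blocks 1000-1015 goes directly to the data zone, while a write to
+chunk blocks 2000-2015 is redirected to the buffer zone assigned to that data
+zone (allocating one if needed).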
+
+Read operations are processed according to the block validity information provided
+by the bitmaps: valid blocks are read either from the data zone or, if the data
+zone is buffered, from the buffer zone assigned to the data zone.
+
+After some time, the limited number of available buffer zones may be exhausted
+and unaligned writes to unbuffered zones become impossible. To avoid such a
+situation, a reclaim process regularly scans used buffer zones and tries to
+"reclaim" them by sequentially rewriting the buffered data blocks together with
+the valid blocks of the buffered data zone into a new data zone. This "merge"
+operation completes with the remapping of the data zone chunk to the newly
+written data zone and the release of the buffer zone.
+
+This reclaim process is optimized to detect data zones that are being heavily
+randomly written and, when possible, to perform the merge operation into a
+conventional data zone.
+
+Usage
+=====
+
+Parameters: <zoned device path> [Options]
+Options:
+	debug             : Enable debug messages.
+	format            : Reset and format the device metadata. This will
+			    invalidate all blocks of the device and trigger
+			    a write pointer reset of all zones, causing the
+			    loss of all previously written data.
+	num_bzones=<num>  : If the format option is specified, change the
+			    default number of buffer zones from 64 to <num>.
+			    If <num> is too large and cannot be accommodated
+			    with the number of available random zones, the
+			    maximum possible number of buffer zones is used.
+	align_wp=<blocks> : Use the WRITE SAME command to move an SMR zone
+			    write pointer to the offset of a write request,
+			    limiting the write same operation to at most
+			    <blocks> blocks. This can reduce the use of buffer
+			    zones, but can also significantly decrease the
+			    usable disk throughput. Set to 0 (default) to
+			    disable this feature. The maximum allowed value is
+			    half the disk zone size.
+
+Example scripts
+===============
+
+[[
+#!/bin/sh
+
+if [ $# -lt 1 ]; then
+	echo "Usage: $0 <Zoned device path> [Options]"
+	echo "Options:"
+	echo "    debug             : Enable debug messages."
+	echo "    format            : Reset and format the device metadata. This will"
+	echo "                        invalidate all blocks of the device and trigger"
+	echo "                        a write pointer reset of all zones, causing the"
+	echo "                        loss of all previously written data."
+	echo "    num_bzones=<num>  : If the format option is specified, change the"
+	echo "                        default number of buffer zones from 64 to <num>."
+	echo "                        If <num> is too large and cannot be accommodated"
+	echo "                        with the number of available random zones, the"
+	echo "                        maximum possible number of buffer zones is used."
+	echo "    align_wp=<blocks> : Use the WRITE SAME command to move an SMR zone"
+	echo "                        write pointer to the offset of a write request,"
+	echo "                        limiting the write same operation to at most"
+	echo "                        <blocks> blocks. This can reduce the use of buffer"
+	echo "                        zones, but can also significantly decrease the"
+	echo "                        usable disk throughput. Set to 0 (default) to"
+	echo "                        disable this feature. The maximum allowed value is"
+	echo "                        half the disk zone size."
+	exit 1
+fi
+
+dev="${1}"
+shift
+options="$@"
+
+modprobe dm-zoned
+
+echo "0 `blockdev --getsize ${dev}` dm-zoned ${dev} ${options}" | dmsetup create zoned-`basename ${dev}`
+]]
+
diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 02a5345..4f31863 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -500,4 +500,18 @@ config DM_LOG_WRITES
 
 	  If unsure, say N.
 
+config DM_ZONED
+	tristate "Zoned block device cache write target support (EXPERIMENTAL)"
+	depends on BLK_DEV_DM && BLK_DEV_ZONED
+	default n
+	---help---
+	  This device-mapper target implements an on-disk caching layer for
+	  zoned block devices (ZBC), hiding the random write constraints of
+	  the backend device.
+
+	  To compile this code as a module, choose M here: the module will
+	  be called dm-zoned.
+
+	  If unsure, say N.
+
 endif # MD
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 52ba8dd..2d61be5 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -18,6 +18,7 @@ dm-era-y	+= dm-era-target.o
 dm-verity-y	+= dm-verity-target.o
 md-mod-y	+= md.o bitmap.o
 raid456-y	+= raid5.o raid5-cache.o
+dm-zoned-y      += dm-zoned-io.o dm-zoned-meta.o dm-zoned-reclaim.o
 
 # Note: link order is important.  All raid personalities
 # and must come before md.o, as they each initialise 
@@ -58,6 +59,7 @@ obj-$(CONFIG_DM_CACHE_SMQ)	+= dm-cache-smq.o
 obj-$(CONFIG_DM_CACHE_CLEANER)	+= dm-cache-cleaner.o
 obj-$(CONFIG_DM_ERA)		+= dm-era.o
 obj-$(CONFIG_DM_LOG_WRITES)	+= dm-log-writes.o
+obj-$(CONFIG_DM_ZONED)          += dm-zoned.o
 
 ifeq ($(CONFIG_DM_UEVENT),y)
 dm-mod-objs			+= dm-uevent.o
diff --git a/drivers/md/dm-zoned-io.c b/drivers/md/dm-zoned-io.c
new file mode 100644
index 0000000..347510a
--- /dev/null
+++ b/drivers/md/dm-zoned-io.c
@@ -0,0 +1,1186 @@
+/*
+ * (C) Copyright 2016 Western Digital.
+ *
+ * This software is distributed under the terms of the GNU Lesser General
+ * Public License version 2, or any later version, "as is," without technical
+ * support, and WITHOUT ANY WARRANTY, without even the implied warranty
+ * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ * Author: Damien Le Moal <damien.lemoal@hgst.com>
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/version.h>
+
+#include "dm-zoned.h"
+
+/**
+ * Target BIO completion.
+ */
+static inline void
+dm_zoned_bio_end(struct bio *bio, int err)
+{
+	struct dm_zoned_bioctx *bioctx
+		= dm_per_bio_data(bio, sizeof(struct dm_zoned_bioctx));
+
+	if (err)
+		bioctx->error = err;
+
+	if (atomic_dec_and_test(&bioctx->ref)) {
+		/* I/O Completed */
+		if (bioctx->dzone)
+			dm_zoned_put_dzone(bioctx->target, bioctx->dzone);
+		bio->bi_error = bioctx->error;
+		bio_endio(bio);
+	}
+}
+
+/**
+ * I/O request completion callback. This terminates
+ * the target BIO when there are no more references
+ * on the BIO context.
+ */
+static void
+dm_zoned_bio_end_io(struct bio *bio)
+{
+	struct dm_zoned_bioctx *bioctx = bio->bi_private;
+	struct dm_zoned_zone *dzone = bioctx->dzone;
+	int err = bio->bi_error;
+	unsigned long flags;
+
+	dm_zoned_lock_zone(dzone, flags);
+	dm_zoned_assert(dzone->zwork);
+	if (atomic_dec_and_test(&dzone->zwork->bio_count)) {
+		clear_bit_unlock(DM_ZONE_ACTIVE_BIO, &dzone->flags);
+		smp_mb__after_atomic();
+		wake_up_bit(&dzone->flags, DM_ZONE_ACTIVE_BIO);
+	}
+	dm_zoned_unlock_zone(dzone, flags);
+
+	dm_zoned_bio_end(bioctx->bio, err);
+
+	bio_put(bio);
+
+}
+
+/**
+ * Issue a request to process a BIO.
+ * Processing of the BIO may be partial.
+ */
+static int
+dm_zoned_submit_zone_bio(struct dm_zoned_target *dzt,
+			 struct dm_zoned_zone *zone,
+			 struct bio *dzt_bio,
+			 sector_t chunk_block,
+			 unsigned int nr_blocks)
+{
+	struct dm_zoned_bioctx *bioctx
+		= dm_per_bio_data(dzt_bio, sizeof(struct dm_zoned_bioctx));
+	unsigned int nr_sectors = dm_zoned_block_to_sector(nr_blocks);
+	unsigned int size = nr_sectors << SECTOR_SHIFT;
+	struct dm_zoned_zone *dzone = bioctx->dzone;
+	unsigned long flags;
+	struct bio *clone;
+
+	dm_zoned_dev_assert(dzt, size != 0);
+	dm_zoned_dev_assert(dzt, size <= dzt_bio->bi_iter.bi_size);
+
+	clone = bio_clone_fast(dzt_bio, GFP_NOIO, dzt->bio_set);
+	if (!clone)
+		return -ENOMEM;
+
+	/* Setup the clone */
+	clone->bi_bdev = dzt->zbd;
+	clone->bi_rw = dzt_bio->bi_rw;
+	clone->bi_iter.bi_sector = dm_zoned_zone_start_sector(zone)
+		+ dm_zoned_block_to_sector(chunk_block);
+	clone->bi_iter.bi_size = size;
+	clone->bi_end_io = dm_zoned_bio_end_io;
+	clone->bi_private = bioctx;
+
+	bio_advance(dzt_bio, size);
+
+	/* Submit the clone */
+	dm_zoned_lock_zone(dzone, flags);
+	if (atomic_inc_return(&dzone->zwork->bio_count) == 1)
+		set_bit(DM_ZONE_ACTIVE_BIO, &dzone->flags);
+	atomic_inc(&bioctx->ref);
+	dm_zoned_unlock_zone(dzone, flags);
+	generic_make_request(clone);
+
+	return 0;
+}
+
+/**
+ * Zero out part of the buffer of a read BIO.
+ */
+static void
+dm_zoned_handle_read_zero(struct dm_zoned_target *dzt,
+			  struct dm_zoned_zone *zone,
+			  struct bio *bio,
+			  sector_t chunk_block,
+			  unsigned int nr_blocks)
+{
+	unsigned int size = nr_blocks << DM_ZONED_BLOCK_SHIFT;
+
+#ifdef __DM_ZONED_DEBUG
+	if (zone)
+		dm_zoned_dev_debug(dzt, "=> ZERO READ chunk %zu -> zone %lu, block %zu, %u blocks\n",
+				 dm_zoned_bio_chunk(dzt, bio),
+				 zone->id,
+				 chunk_block,
+				 nr_blocks);
+	else
+		dm_zoned_dev_debug(dzt, "=> ZERO READ unmapped chunk %zu, block %zu, %u blocks\n",
+				 dm_zoned_bio_chunk(dzt, bio),
+				 chunk_block,
+				 nr_blocks);
+#endif
+
+	dm_zoned_dev_assert(dzt, size != 0);
+	dm_zoned_dev_assert(dzt, size <= bio->bi_iter.bi_size);
+	dm_zoned_dev_assert(dzt, bio_data_dir(bio) == READ);
+
+	/* Clear nr_blocks */
+	swap(bio->bi_iter.bi_size, size);
+	zero_fill_bio(bio);
+	swap(bio->bi_iter.bi_size, size);
+
+	bio_advance(bio, size);
+}
+
+/**
+ * Issue a read request to process
+ * all or part of a read BIO.
+ */
+static int
+dm_zoned_handle_read_bio(struct dm_zoned_target *dzt,
+			 struct dm_zoned_zone *zone,
+			 struct bio *bio,
+			 sector_t chunk_block,
+			 unsigned int nr_blocks)
+{
+
+	dm_zoned_dev_debug(dzt, "=> %s READ zone %lu, block %zu, %u blocks\n",
+			 (dm_zoned_zone_buf(zone) ? "BUF" : "SMR"),
+			 zone->id,
+			 chunk_block,
+			 nr_blocks);
+
+	if (!nr_blocks)
+		return -EIO;
+
+	/* Submit read */
+	return dm_zoned_submit_zone_bio(dzt, zone, bio, chunk_block, nr_blocks);
+}
+
+/**
+ * Process a read BIO.
+ */
+static int
+dm_zoned_handle_read(struct dm_zoned_target *dzt,
+		     struct dm_zoned_zone *zone,
+		     struct bio *bio)
+{
+	struct dm_zoned_zone *bzone;
+	sector_t chunk_block = dm_zoned_bio_chunk_block(dzt, bio);
+	unsigned int nr_blocks = dm_zoned_bio_blocks(bio);
+	sector_t end_block = chunk_block + nr_blocks;
+	int ret = -EIO;
+
+	/* Reads into unmapped chunks only need the BIO buffer zeroed */
+	if (!zone) {
+		dm_zoned_handle_read_zero(dzt, NULL, bio, chunk_block, nr_blocks);
+		return 0;
+	}
+
+	/* If this is an empty SMR zone that is also not */
+	/* buffered, all its blocks are invalid.         */
+	bzone = zone->bzone;
+	if (!bzone && dm_zoned_zone_is_smr(zone) && dm_zoned_zone_empty(zone)) {
+		dm_zoned_handle_read_zero(dzt, zone, bio, chunk_block, nr_blocks);
+		return 0;
+	}
+
+	/* Check block validity to determine the read location  */
+	while (chunk_block < end_block) {
+
+		if (dm_zoned_zone_is_cmr(zone)
+		    || chunk_block < zone->wp_block) {
+			/* Test block validity in the data zone */
+			ret = dm_zoned_block_valid(dzt, zone, chunk_block);
+			if (ret < 0)
+				return ret;
+			if (ret > 0) {
+				/* Read data zone blocks */
+				nr_blocks = min_t(unsigned int, ret,
+						  end_block - chunk_block);
+				ret = dm_zoned_handle_read_bio(dzt, zone, bio,
+							       chunk_block,
+							       nr_blocks);
+				if (ret < 0)
+					return ret;
+				chunk_block += nr_blocks;
+				continue;
+			}
+		}
+
+		/* Check the buffer zone, if there is one */
+		if (bzone) {
+			ret = dm_zoned_block_valid(dzt, bzone, chunk_block);
+			if (ret < 0)
+				return ret;
+			if (ret > 0) {
+				/* Read buffer zone blocks */
+				nr_blocks = min_t(unsigned int, ret,
+						  end_block - chunk_block);
+				ret = dm_zoned_handle_read_bio(dzt, bzone, bio,
+							       chunk_block,
+							       nr_blocks);
+				if (ret < 0)
+					return ret;
+				chunk_block += nr_blocks;
+				continue;
+			}
+		}
+
+		/* No valid block: zeroout the block in the BIO */
+		dm_zoned_handle_read_zero(dzt, zone, bio, chunk_block, 1);
+		chunk_block++;
+
+	}
+
+	return 0;
+}
+
+/**
+ * Write blocks in the buffer zone of @zone.
+ * If no buffer zone is assigned yet, get one.
+ * Called with @zone write locked.
+ */
+static int
+dm_zoned_handle_buffered_write(struct dm_zoned_target *dzt,
+			       struct dm_zoned_zone *zone,
+			       struct bio *bio,
+			       sector_t chunk_block,
+			       unsigned int nr_blocks)
+{
+	struct dm_zoned_zone *bzone;
+	int ret;
+
+	/* Make sure we have a buffer zone */
+	bzone = dm_zoned_alloc_bzone(dzt, zone);
+	if (!bzone)
+		return -EBUSY;
+
+	dm_zoned_dev_debug(dzt, "=> WRITE BUF zone %lu, block %zu, %u blocks\n",
+			 bzone->id,
+			 chunk_block,
+			 nr_blocks);
+
+	/* Submit write */
+	ret = dm_zoned_submit_zone_bio(dzt, bzone, bio, chunk_block, nr_blocks);
+	if (ret)
+		return -EIO;
+
+	/* Stats */
+	zone->mtime = jiffies;
+	zone->wr_buf_blocks += nr_blocks;
+
+	/* Validate the blocks in the buffer zone */
+	/* and invalidate in the data zone.       */
+	ret = dm_zoned_validate_blocks(dzt, bzone, chunk_block, nr_blocks);
+	if (ret == 0 && chunk_block < zone->wp_block)
+		ret = dm_zoned_invalidate_blocks(dzt, zone,
+						 chunk_block, nr_blocks);
+
+	return ret;
+}
+
+/**
+ * Write blocks directly in a data zone, at the write pointer.
+ * If a buffer zone is assigned, invalidate the blocks written
+ * in place.
+ */
+static int
+dm_zoned_handle_direct_write(struct dm_zoned_target *dzt,
+			     struct dm_zoned_zone *zone,
+			     struct bio *bio,
+			     sector_t chunk_block,
+			     unsigned int nr_blocks)
+{
+	struct dm_zoned_zone *bzone = zone->bzone;
+	int ret;
+
+	dm_zoned_dev_debug(dzt, "=> WRITE %s zone %lu, block %zu, %u blocks\n",
+			 (dm_zoned_zone_is_cmr(zone) ? "CMR" : "SMR"),
+			 zone->id,
+			 chunk_block,
+			 nr_blocks);
+
+	/* Submit write */
+	ret = dm_zoned_submit_zone_bio(dzt, zone, bio, chunk_block, nr_blocks);
+	if (ret)
+		return -EIO;
+
+	if (dm_zoned_zone_is_smr(zone))
+		zone->wp_block += nr_blocks;
+
+	/* Stats */
+	zone->mtime = jiffies;
+	zone->wr_dir_blocks += nr_blocks;
+
+	/* Validate the blocks in the data zone */
+	/* and invalidate in the buffer zone.   */
+	ret = dm_zoned_validate_blocks(dzt, zone, chunk_block, nr_blocks);
+	if (ret == 0 && bzone) {
+		dm_zoned_dev_assert(dzt, dm_zoned_zone_is_smr(zone));
+		ret = dm_zoned_invalidate_blocks(dzt, bzone,
+						 chunk_block, nr_blocks);
+	}
+
+	return ret;
+}
+
+/**
+ * Determine if an unaligned write in an SMR zone can be aligned.
+ * If yes, advance the zone write pointer.
+ */
+static int
+dm_zoned_align_write(struct dm_zoned_target *dzt,
+		     struct dm_zoned_zone *dzone,
+		     sector_t chunk_block)
+{
+	sector_t hole_blocks;
+
+	if (!test_bit(DM_ZONED_ALIGN_WP, &dzt->flags))
+		return 0;
+
+	hole_blocks = chunk_block - dzone->wp_block;
+	if (dzone->bzone || hole_blocks > dzt->align_wp_max_blocks)
+		return 0;
+
+	return dm_zoned_advance_zone_wp(dzt, dzone, hole_blocks) == 0;
+}
+
+/**
+ * Process a write BIO.
+ */
+static int
+dm_zoned_handle_write(struct dm_zoned_target *dzt,
+		      struct dm_zoned_zone *dzone,
+		      struct bio *bio)
+{
+	unsigned int nr_blocks = dm_zoned_bio_blocks(bio);
+	sector_t chunk_block = dm_zoned_bio_chunk_block(dzt, bio);
+	int ret;
+
+	/* Writes into unmapped chunks happen  */
+	/* only if we ran out of data zones... */
+	if (!dzone) {
+		dm_zoned_dev_debug(dzt, "WRITE unmapped chunk %zu, block %zu, %u blocks\n",
+				 dm_zoned_bio_chunk(dzt, bio),
+				 chunk_block,
+				 nr_blocks);
+		return -ENOSPC;
+	}
+
+	dm_zoned_dev_debug(dzt, "WRITE chunk %zu -> zone %lu, block %zu, %u blocks (wp block %zu)\n",
+			 dm_zoned_bio_chunk(dzt, bio),
+			 dzone->id,
+			 chunk_block,
+			 nr_blocks,
+			 dzone->wp_block);
+
+	if (dm_zoned_zone_readonly(dzone)) {
+		dm_zoned_dev_error(dzt, "Write to readonly zone %lu\n",
+				 dzone->id);
+		return -EROFS;
+	}
+
+	/* Write in CMR zone ? */
+	if (dm_zoned_zone_is_cmr(dzone))
+		return dm_zoned_handle_direct_write(dzt, dzone, bio,
+						    chunk_block, nr_blocks);
+
+	/* Writing to an SMR zone: direct write the part of the BIO */
+	/* that aligns with the zone write pointer and buffer write */
+	/* what cannot, which may be the entire BIO.                */
+	if (chunk_block < dzone->wp_block) {
+		unsigned int wblocks = min(nr_blocks,
+			(unsigned int)(dzone->wp_block - chunk_block));
+		ret = dm_zoned_handle_buffered_write(dzt, dzone, bio,
+						     chunk_block, wblocks);
+		if (ret)
+			goto out;
+		nr_blocks -= wblocks;
+		chunk_block += wblocks;
+	}
+
+	if (nr_blocks) {
+		if (chunk_block == dzone->wp_block)
+			ret = dm_zoned_handle_direct_write(dzt, dzone, bio,
+							   chunk_block,
+							   nr_blocks);
+		else {
+			/*
+			 * Writing after the write pointer: try to align
+			 * the write if the zone is not already buffered.
+			 * If that fails, fallback to buffered write.
+			 */
+			if (dm_zoned_align_write(dzt, dzone, chunk_block)) {
+				ret = dm_zoned_handle_direct_write(dzt, dzone,
+								   bio,
+								   chunk_block,
+								   nr_blocks);
+				if (ret == 0)
+					goto out;
+			}
+			ret = dm_zoned_handle_buffered_write(dzt, dzone, bio,
+							     chunk_block,
+							     nr_blocks);
+		}
+	}
+
+out:
+	dm_zoned_validate_bzone(dzt, dzone);
+
+	return ret;
+}
+
+static int
+dm_zoned_handle_discard(struct dm_zoned_target *dzt,
+			struct dm_zoned_zone *zone,
+			struct bio *bio)
+{
+	struct dm_zoned_zone *bzone;
+	unsigned int nr_blocks = dm_zoned_bio_blocks(bio);
+	sector_t chunk_block = dm_zoned_bio_chunk_block(dzt, bio);
+	int ret;
+
+	/* For discard into unmapped chunks, there is nothing to do */
+	if (!zone) {
+		dm_zoned_dev_debug(dzt, "DISCARD unmapped chunk %zu, block %zu, %u blocks\n",
+				 dm_zoned_bio_chunk(dzt, bio),
+				 chunk_block,
+				 nr_blocks);
+		return 0;
+	}
+
+	dm_zoned_dev_debug(dzt, "DISCARD chunk %zu -> zone %lu, block %zu, %u blocks\n",
+			 dm_zoned_bio_chunk(dzt, bio),
+			 zone->id,
+			 chunk_block,
+			 nr_blocks);
+
+	if (dm_zoned_zone_readonly(zone)) {
+		dm_zoned_dev_error(dzt, "Discard in readonly zone %lu\n",
+				 zone->id);
+		return -EROFS;
+	}
+
+	/* Wait for all ongoing write I/Os to complete */
+	dm_zoned_wait_for_stable_zone(zone);
+
+	/* Invalidate blocks in the data zone. If a */
+	/* buffer zone is assigned, do the same.    */
+	/* The data zone write pointer may be reset */
+	bzone = zone->bzone;
+	if (bzone) {
+		ret = dm_zoned_invalidate_blocks(dzt, bzone,
+						 chunk_block, nr_blocks);
+		if (ret)
+			goto out;
+	}
+
+	/* If this is an empty SMR zone, there is nothing to do */
+	if (!dm_zoned_zone_is_smr(zone) ||
+	    !dm_zoned_zone_empty(zone))
+		ret = dm_zoned_invalidate_blocks(dzt, zone,
+						 chunk_block, nr_blocks);
+
+out:
+	dm_zoned_validate_bzone(dzt, zone);
+	dm_zoned_validate_dzone(dzt, zone);
+
+	return ret;
+}
+
+/**
+ * Process a BIO targeting a data zone. The BIO
+ * is completed unless processing returned -EBUSY.
+ */
+static void
+dm_zoned_handle_zone_bio(struct dm_zoned_target *dzt,
+			 struct dm_zoned_zone *dzone,
+			 struct bio *bio)
+{
+	int ret;
+
+	/* Process the BIO */
+	if (bio_data_dir(bio) == READ)
+		ret = dm_zoned_handle_read(dzt, dzone, bio);
+	else if (bio->bi_rw & REQ_DISCARD)
+		ret = dm_zoned_handle_discard(dzt, dzone, bio);
+	else if (bio->bi_rw & REQ_WRITE)
+		ret = dm_zoned_handle_write(dzt, dzone, bio);
+	else {
+		dm_zoned_dev_error(dzt, "Unknown BIO type 0x%lx\n",
+				 bio->bi_rw);
+		ret = -EIO;
+	}
+
+	if (ret != -EBUSY)
+		dm_zoned_bio_end(bio, ret);
+
+	return;
+}
+
+/**
+ * Zone I/O work function.
+ */
+void
+dm_zoned_zone_work(struct work_struct *work)
+{
+	struct dm_zoned_zwork *zwork =
+		container_of(work, struct dm_zoned_zwork, work);
+	struct dm_zoned_zone *dzone = zwork->dzone;
+	struct dm_zoned_target *dzt = zwork->target;
+	int n = DM_ZONE_WORK_MAX_BIO;
+	unsigned long flags;
+	struct bio *bio;
+
+	dm_zoned_lock_zone(dzone, flags);
+
+	dm_zoned_dev_assert(dzt, dzone->zwork == zwork);
+
+	while (n && bio_list_peek(&zwork->bio_list)) {
+
+		/* Process the first BIO in the list */
+		bio = bio_list_pop(&zwork->bio_list);
+		dm_zoned_unlock_zone(dzone, flags);
+
+		dm_zoned_handle_zone_bio(dzt, dzone, bio);
+
+		dm_zoned_lock_zone(dzone, flags);
+		if (test_bit(DM_ZONE_ACTIVE_WAIT, &dzone->flags)) {
+			bio_list_add_head(&zwork->bio_list, bio);
+			break;
+		}
+
+		n--;
+
+	}
+
+	dm_zoned_run_dzone(dzt, dzone);
+
+	dm_zoned_unlock_zone(dzone, flags);
+
+	dm_zoned_put_dzone(dzt, dzone);
+}
+
+/**
+ * Process a flush request. Device mapper core
+ * ensures that no other I/O is in flight. So just
+ * propagate the flush to the backend and sync metadata.
+ */
+static void
+dm_zoned_handle_flush(struct dm_zoned_target *dzt,
+		      struct bio *bio)
+{
+
+	dm_zoned_dev_debug(dzt, "FLUSH (%d active zones, %d wait active zones)\n",
+			 atomic_read(&dzt->dz_nr_active),
+			 atomic_read(&dzt->dz_nr_active_wait));
+
+	dm_zoned_bio_end(bio, dm_zoned_flush(dzt));
+}
+
+/**
+ * Flush work.
+ */
+static void
+dm_zoned_flush_work(struct work_struct *work)
+{
+	struct dm_zoned_target *dzt =
+		container_of(work, struct dm_zoned_target, flush_work);
+	struct bio *bio;
+	unsigned long flags;
+
+	spin_lock_irqsave(&dzt->flush_lock, flags);
+	while ((bio = bio_list_pop(&dzt->flush_list))) {
+		spin_unlock_irqrestore(&dzt->flush_lock, flags);
+		dm_zoned_handle_flush(dzt, bio);
+		spin_lock_irqsave(&dzt->flush_lock, flags);
+	}
+	spin_unlock_irqrestore(&dzt->flush_lock, flags);
+}
+
+/*
+ * Process a new BIO.
+ * Return values:
+ *  DM_MAPIO_SUBMITTED : The target has submitted the bio request.
+ *  DM_MAPIO_REMAPPED  : Bio request is remapped, device mapper should submit bio.
+ *  DM_MAPIO_REQUEUE   : Request that the BIO be submitted again.
+ */
+static int
+dm_zoned_map(struct dm_target *ti,
+	     struct bio *bio)
+{
+	struct dm_zoned_target *dzt = ti->private;
+	struct dm_zoned_bioctx *bioctx
+		= dm_per_bio_data(bio, sizeof(struct dm_zoned_bioctx));
+	unsigned int nr_sectors = dm_zoned_bio_sectors(bio);
+	struct dm_zoned_zone *dzone;
+	sector_t chunk_sector;
+	unsigned long flags;
+
+	bio->bi_bdev = dzt->zbd;
+	if (!nr_sectors && !(bio->bi_rw & REQ_FLUSH)) {
+		bio->bi_bdev = dzt->zbd;
+		return DM_MAPIO_REMAPPED;
+	}
+
+	/* The BIO should be block aligned */
+	if ((nr_sectors & DM_ZONED_BLOCK_SECTORS_MASK) ||
+	    (dm_zoned_bio_sector(bio) & DM_ZONED_BLOCK_SECTORS_MASK)) {
+		dm_zoned_dev_error(dzt, "Unaligned BIO sector %zu, len %u\n",
+				 dm_zoned_bio_sector(bio),
+				 nr_sectors);
+		return -EIO;
+	}
+
+	dzt->last_bio_time = jiffies;
+
+	/* Initialize the IO context */
+	bioctx->target = dzt;
+	bioctx->dzone = NULL;
+	bioctx->bio = bio;
+	atomic_set(&bioctx->ref, 1);
+	bioctx->error = 0;
+
+	/* Set the BIO pending in the flush list */
+	if (bio->bi_rw & REQ_FLUSH) {
+		spin_lock_irqsave(&dzt->flush_lock, flags);
+		bio_list_add(&dzt->flush_list, bio);
+		spin_unlock_irqrestore(&dzt->flush_lock, flags);
+		queue_work(dzt->flush_wq, &dzt->flush_work);
+		return DM_MAPIO_SUBMITTED;
+	}
+
+	/* Split zone BIOs to fit entirely into a zone */
+	chunk_sector = dm_zoned_bio_chunk_sector(dzt, bio);
+	if (chunk_sector + nr_sectors > dzt->zone_nr_sectors)
+		dm_accept_partial_bio(bio, dzt->zone_nr_sectors - chunk_sector);
+
+	dm_zoned_dev_debug(dzt, "BIO sector %zu, len %u -> chunk %zu\n",
+			 dm_zoned_bio_sector(bio),
+			 dm_zoned_bio_sectors(bio),
+			 dm_zoned_bio_chunk(dzt, bio));
+
+	/* Get the zone mapping the chunk the BIO belongs to. */
+	/* If the chunk is unmapped, process the BIO directly */
+	/* without going through the zone work.               */
+	dzone = dm_zoned_bio_map(dzt, bio);
+	if (IS_ERR(dzone))
+		return PTR_ERR(dzone);
+	if (!dzone)
+		dm_zoned_handle_zone_bio(dzt, NULL, bio);
+
+	return DM_MAPIO_SUBMITTED;
+}
+
+/**
+ * Parse dmsetup arguments.
+ */
+static int
+dm_zoned_parse_args(struct dm_target *ti,
+		    struct dm_arg_set *as,
+		    struct dm_zoned_target_config *conf)
+{
+	const char *arg;
+	int ret = 0;
+
+	/* Check arguments */
+	if (as->argc < 1) {
+		ti->error = "No target device specified";
+		return -EINVAL;
+	}
+
+	/* Set defaults */
+	conf->dev_path = (char *) dm_shift_arg(as);
+	conf->format = 0;
+	conf->nr_buf_zones = DM_ZONED_NR_BZONES;
+	conf->align_wp = DM_ZONED_ALIGN_WP_MAX_BLOCK;
+	conf->debug = 0;
+
+	while (as->argc) {
+
+		arg = dm_shift_arg(as);
+
+		if (strcmp(arg, "debug") == 0) {
+#ifdef __DM_ZONED_DEBUG
+			dm_zoned_info("Debug messages enabled\n");
+			conf->debug = 1;
+#else
+			dm_zoned_info("Debug message support not enabled: ignoring option \"debug\"\n");
+#endif
+			continue;
+		}
+
+		if (strcmp(arg, "format") == 0) {
+			conf->format = 1;
+			continue;
+		}
+
+		if (strncmp(arg, "num_bzones=", 11) == 0) {
+			if (kstrtoul(arg + 11, 0, &conf->nr_buf_zones) < 0) {
+				ti->error = "Invalid number of buffer zones";
+				break;
+			}
+			continue;
+		}
+
+		if (strncmp(arg, "align_wp=", 9) == 0) {
+			if (kstrtoul(arg + 9, 0, &conf->align_wp) < 0) {
+				ti->error = "Invalid number of blocks";
+				break;
+			}
+			continue;
+		}
+
+		ti->error = "Unknown argument";
+		return -EINVAL;
+
+	}
+
+	return ret;
+
+}
+
+/**
+ * Setup target.
+ */
+static int
+dm_zoned_ctr(struct dm_target *ti,
+	     unsigned int argc,
+	     char **argv)
+{
+	struct dm_zoned_target_config conf;
+	struct dm_zoned_target *dzt;
+	struct dm_arg_set as;
+	char wq_name[32];
+	int ret;
+
+	/* Parse arguments */
+	as.argc = argc;
+	as.argv = argv;
+	ret = dm_zoned_parse_args(ti, &as, &conf);
+	if (ret)
+		return ret;
+
+	dm_zoned_info("Initializing device %s\n", conf.dev_path);
+
+	/* Allocate and initialize the target descriptor */
+	dzt = kzalloc(sizeof(struct dm_zoned_target), GFP_KERNEL);
+	if (!dzt) {
+		ti->error = "Allocate target descriptor failed";
+		return -ENOMEM;
+	}
+	dm_zoned_account_mem(dzt, sizeof(struct dm_zoned_target));
+
+	/* Get the target device */
+	ret = dm_get_device(ti, conf.dev_path, dm_table_get_mode(ti->table),
+			    &dzt->ddev);
+	if (ret != 0) {
+		ti->error = "Get target device failed";
+		goto err;
+	}
+
+	dzt->zbd = dzt->ddev->bdev;
+	dzt->zbd_capacity = i_size_read(dzt->zbd->bd_inode) >> SECTOR_SHIFT;
+	if (ti->begin ||
+	    (ti->len != dzt->zbd_capacity)) {
+		ti->error = "Partial mapping not supported";
+		ret = -EINVAL;
+		goto err;
+	}
+
+	(void)bdevname(dzt->zbd, dzt->zbd_name);
+	dzt->zbdq = bdev_get_queue(dzt->zbd);
+	dzt->zbd_metablk_shift = DM_ZONED_BLOCK_SHIFT -
+		dzt->zbd->bd_inode->i_sb->s_blocksize_bits;
+	if (conf.debug)
+		set_bit(DM_ZONED_DEBUG, &dzt->flags);
+
+	mutex_init(&dzt->map_lock);
+	INIT_LIST_HEAD(&dzt->bz_lru_list);
+	INIT_LIST_HEAD(&dzt->bz_free_list);
+	INIT_LIST_HEAD(&dzt->bz_wait_list);
+	INIT_LIST_HEAD(&dzt->dz_unmap_smr_list);
+	INIT_LIST_HEAD(&dzt->dz_unmap_cmr_list);
+	INIT_LIST_HEAD(&dzt->dz_map_cmr_list);
+	INIT_LIST_HEAD(&dzt->dz_empty_list);
+	atomic_set(&dzt->dz_nr_active, 0);
+	atomic_set(&dzt->dz_nr_active_wait, 0);
+
+	dm_zoned_dev_info(dzt, "Initializing device %s\n",
+			dzt->zbd_name);
+
+	ret = dm_zoned_init_meta(dzt, &conf);
+	if (ret != 0) {
+		ti->error = "Metadata initialization failed";
+		goto err;
+	}
+
+	/* Set target (no write same support) */
+	ti->private = dzt;
+	ti->max_io_len = dzt->zone_nr_sectors << 9;
+	ti->num_flush_bios = 1;
+	ti->num_discard_bios = 1;
+	ti->num_write_same_bios = 0;
+	ti->per_io_data_size = sizeof(struct dm_zoned_bioctx);
+	ti->flush_supported = true;
+	ti->discards_supported = true;
+	ti->split_discard_bios = true;
+	ti->discard_zeroes_data_unsupported = true;
+	ti->len = dzt->zone_nr_sectors * dzt->nr_data_zones;
+
+	if (conf.align_wp) {
+		set_bit(DM_ZONED_ALIGN_WP, &dzt->flags);
+		dzt->align_wp_max_blocks = min_t(unsigned int, conf.align_wp,
+						 dzt->zone_nr_blocks >> 1);
+	}
+
+	/* BIO set */
+	dzt->bio_set = bioset_create(DM_ZONED_MIN_BIOS, 0);
+	if (!dzt->bio_set) {
+		ti->error = "Create BIO set failed";
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	/* Zone I/O work queue */
+	snprintf(wq_name, sizeof(wq_name), "dm_zoned_zwq_%s", dzt->zbd_name);
+	dzt->zone_wq = create_workqueue(wq_name);
+	if (!dzt->zone_wq) {
+		ti->error = "Create zone workqueue failed";
+		ret = -ENOMEM;
+		goto err;
+	}
+	dm_zoned_dev_info(dzt, "Allowing at most %d zone workers\n",
+			  min_t(int, dzt->nr_buf_zones * 2, DM_ZONE_WORK_MAX));
+	workqueue_set_max_active(dzt->zone_wq,
+				 min_t(int, dzt->nr_buf_zones * 2,
+				       DM_ZONE_WORK_MAX));
+
+	/* Flush work */
+	spin_lock_init(&dzt->flush_lock);
+	bio_list_init(&dzt->flush_list);
+	INIT_WORK(&dzt->flush_work, dm_zoned_flush_work);
+	snprintf(wq_name, sizeof(wq_name), "dm_zoned_fwq_%s", dzt->zbd_name);
+	dzt->flush_wq = create_singlethread_workqueue(wq_name);
+	if (!dzt->flush_wq) {
+		ti->error = "Create flush workqueue failed";
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	/* Buffer zones reclaim work */
+	dzt->reclaim_client = dm_io_client_create();
+	if (IS_ERR(dzt->reclaim_client)) {
+		ti->error = "Create GC I/O client failed";
+		ret = PTR_ERR(dzt->reclaim_client);
+		dzt->reclaim_client = NULL;
+		goto err;
+	}
+	INIT_DELAYED_WORK(&dzt->reclaim_work, dm_zoned_reclaim_work);
+	snprintf(wq_name, sizeof(wq_name), "dm_zoned_rwq_%s", dzt->zbd_name);
+	dzt->reclaim_wq = create_singlethread_workqueue(wq_name);
+	if (!dzt->reclaim_wq) {
+		ti->error = "Create reclaim workqueue failed";
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	snprintf(wq_name, sizeof(wq_name), "dm_zoned_rzwq_%s", dzt->zbd_name);
+	dzt->reclaim_zwq = create_workqueue(wq_name);
+	if (!dzt->reclaim_zwq) {
+		ti->error = "Create reclaim zone workqueue failed";
+		ret = -ENOMEM;
+		goto err;
+	}
+	workqueue_set_max_active(dzt->reclaim_zwq,
+				 DM_ZONED_RECLAIM_MAX_WORKERS);
+
+	dm_zoned_dev_info(dzt,
+			  "Target device: %zu 512-byte logical sectors (%zu blocks)\n",
+			  ti->len,
+			  dm_zoned_sector_to_block(ti->len));
+
+	dzt->last_bio_time = jiffies;
+	dm_zoned_trigger_reclaim(dzt);
+
+	return 0;
+
+err:
+
+	if (dzt->ddev) {
+		if (dzt->reclaim_wq)
+			destroy_workqueue(dzt->reclaim_wq);
+		if (dzt->reclaim_client)
+			dm_io_client_destroy(dzt->reclaim_client);
+		if (dzt->flush_wq)
+			destroy_workqueue(dzt->flush_wq);
+		if (dzt->zone_wq)
+			destroy_workqueue(dzt->zone_wq);
+		if (dzt->bio_set)
+			bioset_free(dzt->bio_set);
+		dm_zoned_cleanup_meta(dzt);
+		dm_put_device(ti, dzt->ddev);
+	}
+
+	kfree(dzt);
+
+	return ret;
+
+}
+
+/**
+ * Cleanup target.
+ */
+static void
+dm_zoned_dtr(struct dm_target *ti)
+{
+	struct dm_zoned_target *dzt = ti->private;
+
+	dm_zoned_dev_info(dzt, "Removing target device\n");
+
+	dm_zoned_flush(dzt);
+
+	flush_workqueue(dzt->zone_wq);
+	destroy_workqueue(dzt->zone_wq);
+
+	flush_workqueue(dzt->reclaim_zwq);
+	cancel_delayed_work_sync(&dzt->reclaim_work);
+	destroy_workqueue(dzt->reclaim_zwq);
+	destroy_workqueue(dzt->reclaim_wq);
+	dm_io_client_destroy(dzt->reclaim_client);
+
+	flush_workqueue(dzt->flush_wq);
+	destroy_workqueue(dzt->flush_wq);
+
+	bioset_free(dzt->bio_set);
+
+	dm_zoned_cleanup_meta(dzt);
+
+	dm_put_device(ti, dzt->ddev);
+
+	kfree(dzt);
+}
+
+/**
+ * Setup target request queue limits.
+ */
+static void
+dm_zoned_io_hints(struct dm_target *ti,
+		  struct queue_limits *limits)
+{
+	struct dm_zoned_target *dzt = ti->private;
+	unsigned int chunk_sectors = dzt->zone_nr_sectors;
+
+	BUG_ON(!is_power_of_2(chunk_sectors));
+
+	/* Align to zone size */
+	limits->chunk_sectors = chunk_sectors;
+	limits->max_sectors = chunk_sectors;
+
+	blk_limits_io_min(limits, DM_ZONED_BLOCK_SIZE);
+	blk_limits_io_opt(limits, DM_ZONED_BLOCK_SIZE);
+
+	limits->logical_block_size = DM_ZONED_BLOCK_SIZE;
+	limits->physical_block_size = DM_ZONED_BLOCK_SIZE;
+
+	limits->discard_alignment = DM_ZONED_BLOCK_SIZE;
+	limits->discard_granularity = DM_ZONED_BLOCK_SIZE;
+	limits->max_discard_sectors = chunk_sectors;
+	limits->max_hw_discard_sectors = chunk_sectors;
+	limits->discard_zeroes_data = true;
+
+}
+
+/**
+ * Pass on ioctl to the backend device.
+ */
+static int
+dm_zoned_prepare_ioctl(struct dm_target *ti,
+		       struct block_device **bdev,
+		       fmode_t *mode)
+{
+	struct dm_zoned_target *dzt = ti->private;
+
+	*bdev = dzt->zbd;
+
+	return 0;
+}
+
+/**
+ * Stop reclaim before suspend.
+ */
+static void
+dm_zoned_presuspend(struct dm_target *ti)
+{
+	struct dm_zoned_target *dzt = ti->private;
+
+	dm_zoned_dev_debug(dzt, "Pre-suspend\n");
+
+	/* Enter suspend state */
+	set_bit(DM_ZONED_SUSPENDED, &dzt->flags);
+	smp_mb__after_atomic();
+
+	/* Stop reclaim */
+	cancel_delayed_work_sync(&dzt->reclaim_work);
+}
+
+/**
+ * Restart reclaim if suspend failed.
+ */
+static void
+dm_zoned_presuspend_undo(struct dm_target *ti)
+{
+	struct dm_zoned_target *dzt = ti->private;
+
+	dm_zoned_dev_debug(dzt, "Pre-suspend undo\n");
+
+	/* Clear suspend state */
+	clear_bit_unlock(DM_ZONED_SUSPENDED, &dzt->flags);
+	smp_mb__after_atomic();
+
+	/* Restart reclaim */
+	mod_delayed_work(dzt->reclaim_wq, &dzt->reclaim_work, 0);
+}
+
+/**
+ * Stop works and flush on suspend.
+ */
+static void
+dm_zoned_postsuspend(struct dm_target *ti)
+{
+	struct dm_zoned_target *dzt = ti->private;
+
+	dm_zoned_dev_debug(dzt, "Post-suspend\n");
+
+	/* Stop works and flush */
+	flush_workqueue(dzt->zone_wq);
+	flush_workqueue(dzt->flush_wq);
+
+	dm_zoned_flush(dzt);
+}
+
+/**
+ * Refresh zone information before resuming.
+ */
+static int
+dm_zoned_preresume(struct dm_target *ti)
+{
+	struct dm_zoned_target *dzt = ti->private;
+
+	if (!test_bit(DM_ZONED_SUSPENDED, &dzt->flags))
+		return 0;
+
+	dm_zoned_dev_debug(dzt, "Pre-resume\n");
+
+	/* Refresh zone information */
+	return dm_zoned_resume_meta(dzt);
+}
+
+/**
+ * Resume.
+ */
+static void
+dm_zoned_resume(struct dm_target *ti)
+{
+	struct dm_zoned_target *dzt = ti->private;
+
+	if (!test_bit(DM_ZONED_SUSPENDED, &dzt->flags))
+		return;
+
+	dm_zoned_dev_debug(dzt, "Resume\n");
+
+	/* Clear suspend state */
+	clear_bit_unlock(DM_ZONED_SUSPENDED, &dzt->flags);
+	smp_mb__after_atomic();
+
+	/* Restart reclaim */
+	mod_delayed_work(dzt->reclaim_wq, &dzt->reclaim_work, 0);
+
+}
+
+static int
+dm_zoned_iterate_devices(struct dm_target *ti,
+			 iterate_devices_callout_fn fn,
+			 void *data)
+{
+	struct dm_zoned_target *dzt = ti->private;
+
+	return fn(ti, dzt->ddev, dzt->nr_meta_zones * dzt->zone_nr_sectors,
+		  ti->len, data);
+}
+
+/**
+ * Module definition.
+ */
+static struct target_type dm_zoned_type = {
+	.name		 = "dm-zoned",
+	.version	 = {1, 0, 0},
+	.module		 = THIS_MODULE,
+	.ctr		 = dm_zoned_ctr,
+	.dtr		 = dm_zoned_dtr,
+	.map		 = dm_zoned_map,
+	.io_hints	 = dm_zoned_io_hints,
+	.prepare_ioctl	 = dm_zoned_prepare_ioctl,
+	.presuspend	 = dm_zoned_presuspend,
+	.presuspend_undo = dm_zoned_presuspend_undo,
+	.postsuspend	 = dm_zoned_postsuspend,
+	.preresume	 = dm_zoned_preresume,
+	.resume		 = dm_zoned_resume,
+	.iterate_devices = dm_zoned_iterate_devices,
+};
+
+struct kmem_cache *dm_zoned_zone_cache;
+
+static int __init dm_zoned_init(void)
+{
+	int ret;
+
+	dm_zoned_info("Version %d.%d, (C) Western Digital\n",
+		    DM_ZONED_VER_MAJ,
+		    DM_ZONED_VER_MIN);
+
+	dm_zoned_zone_cache = KMEM_CACHE(dm_zoned_zone, 0);
+	if (!dm_zoned_zone_cache)
+		return -ENOMEM;
+
+	ret = dm_register_target(&dm_zoned_type);
+	if (ret != 0) {
+		dm_zoned_error("Register dm-zoned target failed %d\n", ret);
+		kmem_cache_destroy(dm_zoned_zone_cache);
+		return ret;
+	}
+
+	return 0;
+}
+
+static void __exit dm_zoned_exit(void)
+{
+	dm_unregister_target(&dm_zoned_type);
+	kmem_cache_destroy(dm_zoned_zone_cache);
+}
+
+module_init(dm_zoned_init);
+module_exit(dm_zoned_exit);
+
+MODULE_DESCRIPTION(DM_NAME " target for ZBC/ZAC devices (host-managed and host-aware)");
+MODULE_AUTHOR("Damien Le Moal <damien.lemoal@hgst.com>");
+MODULE_LICENSE("GPL");
diff --git a/drivers/md/dm-zoned-meta.c b/drivers/md/dm-zoned-meta.c
new file mode 100644
index 0000000..b9e5161
--- /dev/null
+++ b/drivers/md/dm-zoned-meta.c
@@ -0,0 +1,1950 @@
+/*
+ * (C) Copyright 2016 Western Digital.
+ *
+ * This software is distributed under the terms of the GNU Lesser General
+ * Public License version 2, or any later version, "as is," without technical
+ * support, and WITHOUT ANY WARRANTY, without even the implied warranty
+ * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ * Author: Damien Le Moal <damien.lemoal@hgst.com>
+ */
+
+#include <linux/module.h>
+#include <linux/version.h>
+#include <linux/slab.h>
+
+#include "dm-zoned.h"
+
+/**
+ * Free the zone descriptors.
+ */
+static void
+dm_zoned_drop_zones(struct dm_zoned_target *dzt)
+{
+	struct blk_zone *blkz;
+	sector_t sector = 0;
+
+	/* Free the zone descriptors */
+	while (sector < dzt->zbd_capacity) {
+		blkz = blk_lookup_zone(dzt->zbdq, sector);
+		if (blkz && blkz->private_data) {
+			kmem_cache_free(dm_zoned_zone_cache,
+					blkz->private_data);
+			blkz->private_data = NULL;
+		}
+		sector = blkz->start + blkz->len;
+	}
+}
+
+/**
+ * Allocate and initialize zone descriptors
+ * using the zone information from disk.
+ */
+static int
+dm_zoned_init_zones(struct dm_zoned_target *dzt)
+{
+	struct dm_zoned_zone *zone, *last_meta_zone = NULL;
+	struct blk_zone *blkz;
+	sector_t sector = 0;
+	int ret = -ENXIO;
+
+	/* Allocate and initialize zone descriptors */
+	while (sector < dzt->zbd_capacity) {
+
+		blkz = blk_lookup_zone(dzt->zbdq, sector);
+		if (!blkz) {
+			dm_zoned_dev_error(dzt,
+				"Unable to get zone at sector %zu\n",
+				sector);
+			goto out;
+		}
+
+		zone = kmem_cache_alloc(dm_zoned_zone_cache, GFP_KERNEL);
+		if (!zone) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		dm_zoned_account_mem(dzt, sizeof(struct dm_zoned_zone));
+
+		/* Assume at this stage that all zones are unmapped */
+		/* data zones. This will be corrected later using   */
+		/* the buffer and data zone mapping tables.         */
+		blkz->private_data = zone;
+		INIT_LIST_HEAD(&zone->link);
+		INIT_LIST_HEAD(&zone->elink);
+		zone->id = dzt->nr_zones;
+		zone->blkz = blkz;
+		zone->flags = DM_ZONE_DATA;
+		zone->zwork = NULL;
+		zone->map = DM_ZONED_MAP_UNMAPPED;
+		zone->bzone = NULL;
+
+		if (!dzt->nr_zones)
+			dzt->zone_nr_sectors = blkz->len;
+
+		if (dm_zoned_zone_is_smr(zone)) {
+			zone->wp_block = dm_zoned_sector_to_block(blkz->wp)
+				- dm_zoned_zone_start_block(zone);
+			list_add_tail(&zone->link, &dzt->dz_unmap_smr_list);
+			dzt->nr_smr_zones++;
+		} else {
+			zone->wp_block = 0;
+			list_add_tail(&zone->link, &dzt->dz_unmap_cmr_list);
+			dzt->nr_cmr_zones++;
+		}
+
+		dm_zoned_zone_reset_stats(zone);
+
+		if (dm_zoned_zone_is_rnd(zone)) {
+			dzt->nr_rnd_zones++;
+			if ((!last_meta_zone) ||
+			    dm_zoned_zone_next_sector(last_meta_zone) ==
+			    sector) {
+				dzt->nr_meta_zones++;
+				last_meta_zone = zone;
+			}
+		}
+
+		dzt->nr_zones++;
+		sector = dm_zoned_zone_next_sector(zone);
+
+	}
+
+	if (!dzt->nr_zones) {
+		dm_zoned_dev_error(dzt, "No zones information\n");
+		goto out;
+	}
+
+	if (!dzt->nr_rnd_zones) {
+		dm_zoned_dev_error(dzt, "No randomly writable zones found\n");
+		goto out;
+	}
+
+	if (!dzt->nr_meta_zones) {
+		dm_zoned_dev_error(dzt, "No metadata zones found\n");
+		goto out;
+	}
+
+	/* Temporary? We could make this work for any zone size... */
+	if (!is_power_of_2(dzt->zone_nr_sectors)) {
+		dm_zoned_dev_error(dzt,
+			"Sectors per zone %zu is not a power of 2\n",
+			dzt->zone_nr_sectors);
+		goto out;
+	}
+
+	dzt->zone_nr_sectors_shift = ilog2(dzt->zone_nr_sectors);
+	dzt->zone_nr_sectors_mask = dzt->zone_nr_sectors - 1;
+
+	dzt->zone_nr_blocks = dm_zoned_sector_to_block(dzt->zone_nr_sectors);
+	dzt->zone_nr_blocks_shift = ilog2(dzt->zone_nr_blocks);
+	dzt->zone_nr_blocks_mask = dzt->zone_nr_blocks - 1;
+
+	dzt->zone_bitmap_size = dzt->zone_nr_blocks >> 3;
+	dzt->zone_nr_bitmap_blocks = dzt->zone_bitmap_size >>
+		DM_ZONED_BLOCK_SHIFT;
+
+	ret = 0;
+
+out:
+
+	if (ret != 0)
+		dm_zoned_drop_zones(dzt);
+
+	return ret;
+}
+
+/**
+ * Check zone information after a resume.
+ */
+static int
+dm_zoned_check_zones(struct dm_zoned_target *dzt)
+{
+	struct dm_zoned_zone *zone;
+	struct blk_zone *blkz;
+	sector_t sector = 0;
+	sector_t wp_block;
+
+	/* Check all zone descriptors */
+	while (sector < dzt->zbd_capacity) {
+
+		blkz = blk_lookup_zone(dzt->zbdq, sector);
+		if (!blkz) {
+			dm_zoned_dev_error(dzt,
+				"Unable to get zone at sector %zu\n", sector);
+			return -EIO;
+		}
+
+		zone = blkz->private_data;
+		if (!zone) {
+			dm_zoned_dev_error(dzt,
+				"Lost private data of zone at sector %zu\n",
+				sector);
+			return -EIO;
+		}
+
+		if (zone->blkz != blkz) {
+			dm_zoned_dev_error(dzt,
+				"Inconsistent private data of zone at sector %zu\n",
+				sector);
+			return -EIO;
+		}
+
+		wp_block = dm_zoned_sector_to_block(blkz->wp) -
+			dm_zoned_zone_start_block(zone);
+		if (!dm_zoned_zone_is_smr(zone))
+			zone->wp_block = 0;
+		else if (zone->wp_block != wp_block) {
+			dm_zoned_dev_error(dzt,
+				"Zone %lu: Inconsistent write pointer position (%zu / %zu)\n",
+				zone->id, zone->wp_block, wp_block);
+			zone->wp_block = wp_block;
+			dm_zoned_invalidate_blocks(dzt, zone, zone->wp_block,
+				dzt->zone_nr_blocks - zone->wp_block);
+			dm_zoned_validate_dzone(dzt, zone);
+		}
+
+		sector = dm_zoned_zone_next_sector(zone);
+
+	}
+
+	return 0;
+}
+
+/**
+ * Lookup a zone containing the specified sector.
+ */
+static inline struct dm_zoned_zone *
+dm_zoned_lookup_zone(struct dm_zoned_target *dzt,
+		     sector_t sector)
+{
+	struct blk_zone *blkz = blk_lookup_zone(dzt->zbdq, sector);
+
+	return blkz ? blkz->private_data : NULL;
+}
+
+/**
+ * Lookup a zone using a zone ID.
+ */
+static inline struct dm_zoned_zone *
+dm_zoned_lookup_zone_by_id(struct dm_zoned_target *dzt,
+			   unsigned int zone_id)
+{
+	return dm_zoned_lookup_zone(dzt, (sector_t)zone_id <<
+				    dzt->zone_nr_sectors_shift);
+}
+
+/**
+ * Set a zone write pointer.
+ */
+int
+dm_zoned_advance_zone_wp(struct dm_zoned_target *dzt,
+			 struct dm_zoned_zone *zone,
+			 sector_t nr_blocks)
+{
+	int ret;
+
+	if (!dm_zoned_zone_is_smr(zone) ||
+	    zone->wp_block + nr_blocks > dm_zoned_zone_next_block(zone))
+		return -EIO;
+
+	/* Zeroout the space between the write */
+	/* pointer and the requested position. */
+	ret = blkdev_issue_zeroout(dzt->zbd,
+		dm_zoned_block_to_sector(dm_zoned_zone_start_block(zone) +
+					 zone->wp_block),
+		dm_zoned_block_to_sector(nr_blocks), GFP_KERNEL, false);
+	if (ret) {
+		dm_zoned_dev_error(dzt,
+			"Advance zone %lu wp block %zu by %zu blocks failed %d\n",
+			zone->id, zone->wp_block, nr_blocks, ret);
+		return ret;
+	}
+
+	zone->wp_block += nr_blocks;
+
+	return 0;
+}
+
+/**
+ * Reset a zone write pointer.
+ */
+int
+dm_zoned_reset_zone_wp(struct dm_zoned_target *dzt,
+		       struct dm_zoned_zone *zone)
+{
+	int ret;
+
+	/* Ignore offline zones, read only zones, */
+	/* CMR zones and empty SMR zones.         */
+	if (dm_zoned_zone_offline(zone)
+	    || dm_zoned_zone_readonly(zone)
+	    || dm_zoned_zone_is_cmr(zone)
+	    || dm_zoned_zone_empty(zone))
+		return 0;
+
+	/* Discard the zone */
+	ret = blkdev_issue_discard(dzt->zbd,
+				   dm_zoned_zone_start_sector(zone),
+				   dm_zoned_zone_sectors(zone),
+				   GFP_KERNEL, 0);
+	if (ret) {
+		dm_zoned_dev_error(dzt, "Reset zone %lu failed %d\n",
+				 zone->id, ret);
+		return ret;
+	}
+
+	/* Rewind */
+	zone->wp_block = 0;
+
+	return 0;
+}
+
+/**
+ * Reset the write pointer of all zones.
+ */
+static int
+dm_zoned_reset_zones(struct dm_zoned_target *dzt)
+{
+	struct dm_zoned_zone *zone;
+	sector_t sector = 0;
+	int ret = 0;
+
+	dm_zoned_dev_debug(dzt, "Resetting all zones\n");
+
+	while ((zone = dm_zoned_lookup_zone(dzt, sector))) {
+		ret = dm_zoned_reset_zone_wp(dzt, zone);
+		if (ret)
+			return ret;
+		sector = dm_zoned_zone_next_sector(zone);
+	}
+
+	return 0;
+}
+
+/**
+ * Get a metadata block from the cache or read it from disk.
+ */
+static struct buffer_head *
+dm_zoned_get_meta(struct dm_zoned_target *dzt,
+		  sector_t block)
+{
+	struct buffer_head *bh;
+
+	/* Get block */
+	bh = __bread(dzt->zbd,
+		     block << dzt->zbd_metablk_shift,
+		     DM_ZONED_BLOCK_SIZE);
+	if (!bh) {
+		dm_zoned_dev_error(dzt, "Read block %zu failed\n",
+				 block);
+		return ERR_PTR(-EIO);
+	}
+
+	return bh;
+}
+
+/**
+ * Mark a metadata block dirty.
+ */
+static inline void
+dm_zoned_dirty_meta(struct dm_zoned_target *dzt,
+		    struct buffer_head *bh)
+{
+	mark_buffer_dirty_inode(bh, dzt->zbd->bd_inode);
+}
+
+/**
+ * Zero fill a metadata block.
+ */
+static int
+dm_zoned_zero_meta(struct dm_zoned_target *dzt,
+		   sector_t block)
+{
+	struct buffer_head *bh = dm_zoned_get_meta(dzt, block);
+
+	if (IS_ERR(bh))
+		return PTR_ERR(bh);
+
+	memset(bh->b_data, 0, DM_ZONED_BLOCK_SIZE);
+	dm_zoned_dirty_meta(dzt, bh);
+	__brelse(bh);
+
+	return 0;
+}
+
+/**
+ * Flush dirty meta-data.
+ */
+int
+dm_zoned_flush(struct dm_zoned_target *dzt)
+{
+	int ret;
+
+	/* Sync meta-data */
+	ret = sync_mapping_buffers(dzt->zbd->bd_inode->i_mapping);
+	if (ret) {
+		dm_zoned_dev_error(dzt, "Sync metadata failed %d\n", ret);
+		return ret;
+	}
+
+	/* Flush drive cache (this will also sync data) */
+	return blkdev_issue_flush(dzt->zbd, GFP_KERNEL, NULL);
+}
+
+/**
+ * Format buffer zone mapping.
+ */
+static int
+dm_zoned_format_bzone_mapping(struct dm_zoned_target *dzt)
+{
+	struct dm_zoned_super *sb =
+		(struct dm_zoned_super *) dzt->sb_bh->b_data;
+	struct dm_zoned_zone *zone;
+	int z, b = 0;
+
+	/* Set buffer zones mapping entries */
+	dzt->bz_map = sb->bz_map;
+	for (z = dzt->nr_meta_zones;
+	     (z < dzt->nr_zones) && (b < dzt->nr_buf_zones); z++) {
+		zone = dm_zoned_lookup_zone_by_id(dzt, z);
+		if (!zone)
+			return -ENXIO;
+		if (dm_zoned_zone_is_rnd(zone)) {
+			dzt->bz_map[b].bzone_id = cpu_to_le32(zone->id);
+			dzt->bz_map[b].dzone_id =
+				cpu_to_le32(DM_ZONED_MAP_UNMAPPED);
+			b++;
+		}
+	}
+
+	if (b < dzt->nr_buf_zones) {
+		dm_zoned_dev_error(dzt,
+			"Broken format: %d/%u buffer zones set\n",
+			b, dzt->nr_buf_zones);
+		return -ENXIO;
+	}
+
+	return 0;
+}
+
+/**
+ * Initialize buffer zone mapping.
+ */
+static int
+dm_zoned_load_bzone_mapping(struct dm_zoned_target *dzt)
+{
+	struct dm_zoned_super *sb =
+		(struct dm_zoned_super *) dzt->sb_bh->b_data;
+	struct dm_zoned_zone *bzone, *dzone;
+	unsigned long bzone_id, dzone_id;
+	int i, b = 0;
+
+	/* Process buffer zones mapping entries */
+	dzt->bz_map = sb->bz_map;
+	for (i = 0; i < dzt->nr_buf_zones; i++) {
+
+		bzone_id = le32_to_cpu(dzt->bz_map[i].bzone_id);
+		if (!bzone_id || bzone_id >= dzt->nr_zones) {
+			dm_zoned_dev_error(dzt,
+				"Invalid buffer zone %lu in mapping table entry %d\n",
+				bzone_id, i);
+			return -ENXIO;
+		}
+
+		bzone = dm_zoned_lookup_zone_by_id(dzt, bzone_id);
+		if (!bzone) {
+			dm_zoned_dev_error(dzt, "Buffer zone %lu not found\n",
+					   bzone_id);
+			return -ENXIO;
+		}
+
+		/* Fix the zone type */
+		bzone->flags = DM_ZONE_BUF;
+		list_del_init(&bzone->link);
+		bzone->map = i;
+
+		dzone_id = le32_to_cpu(dzt->bz_map[i].dzone_id);
+		if (dzone_id != DM_ZONED_MAP_UNMAPPED) {
+			if (dzone_id >= dzt->nr_zones) {
+				dm_zoned_dev_error(dzt,
+					"Invalid data zone %lu in mapping table entry %d\n",
+					dzone_id, i);
+				return -ENXIO;
+			}
+			dzone = dm_zoned_lookup_zone_by_id(dzt, dzone_id);
+			if (!dzone) {
+				dm_zoned_dev_error(dzt,
+					"Data zone %lu not found\n", dzone_id);
+				return -ENXIO;
+			}
+		} else
+			dzone = NULL;
+
+		if (dzone) {
+			dm_zoned_dev_debug(dzt,
+				"Zone %lu is buffering zone %lu\n",
+				bzone->id, dzone->id);
+			dzone->bzone = bzone;
+			bzone->bzone = dzone;
+			list_add_tail(&bzone->link, &dzt->bz_lru_list);
+		} else {
+			list_add_tail(&bzone->link, &dzt->bz_free_list);
+			atomic_inc(&dzt->bz_nr_free);
+		}
+
+		b++;
+
+	}
+
+	if (b != dzt->nr_buf_zones) {
+		dm_zoned_dev_error(dzt,
+			"Invalid buffer zone mapping (%d / %u valid entries)\n",
+			b, dzt->nr_buf_zones);
+		return -ENXIO;
+	}
+
+	dzt->bz_nr_free_low = dzt->nr_buf_zones * DM_ZONED_NR_BZONES_LOW / 100;
+	if (dzt->bz_nr_free_low < DM_ZONED_NR_BZONES_LOW_MIN)
+		dzt->bz_nr_free_low = DM_ZONED_NR_BZONES_LOW_MIN;
+
+	return 0;
+}
+
+/**
+ * Set a buffer zone mapping.
+ */
+static void
+dm_zoned_set_bzone_mapping(struct dm_zoned_target *dzt,
+			   struct dm_zoned_zone *bzone,
+			   unsigned int dzone_id)
+{
+	struct dm_zoned_bz_map *bz_map = &dzt->bz_map[bzone->map];
+
+	dm_zoned_dev_assert(dzt, le32_to_cpu(bz_map->bzone_id) == bzone->id);
+
+	lock_buffer(dzt->sb_bh);
+	bz_map->dzone_id = cpu_to_le32(dzone_id);
+	dm_zoned_dirty_meta(dzt, dzt->sb_bh);
+	unlock_buffer(dzt->sb_bh);
+}
+
+/**
+ * Change a buffer zone mapping.
+ */
+static void
+dm_zoned_change_bzone_mapping(struct dm_zoned_target *dzt,
+			      struct dm_zoned_zone *bzone,
+			      struct dm_zoned_zone *new_bzone,
+			      unsigned int dzone_id)
+{
+	struct dm_zoned_bz_map *bz_map = &dzt->bz_map[bzone->map];
+
+	new_bzone->map = bzone->map;
+	bzone->map = DM_ZONED_MAP_UNMAPPED;
+
+	lock_buffer(dzt->sb_bh);
+	bz_map->bzone_id = cpu_to_le32(new_bzone->id);
+	bz_map->dzone_id = cpu_to_le32(dzone_id);
+	dm_zoned_dirty_meta(dzt, dzt->sb_bh);
+	unlock_buffer(dzt->sb_bh);
+}
+
+/**
+ * Get an unused buffer zone and associate it
+ * with @dzone.
+ */
+struct dm_zoned_zone *
+dm_zoned_alloc_bzone(struct dm_zoned_target *dzt,
+		     struct dm_zoned_zone *dzone)
+{
+	struct dm_zoned_zone *bzone;
+
+	dm_zoned_map_lock(dzt);
+
+	/* If the data zone already has a buffer */
+	/* zone assigned, keep using it.         */
+	dm_zoned_dev_assert(dzt, dm_zoned_zone_data(dzone));
+	bzone = dzone->bzone;
+	if (bzone)
+		goto out;
+
+	/* If there is no free buffer zone, make the data zone wait */
+	if (!atomic_read(&dzt->bz_nr_free)) {
+		unsigned long flags;
+		dm_zoned_lock_zone(dzone, flags);
+		dm_zoned_dev_assert(dzt, test_bit(DM_ZONE_ACTIVE,
+						  &dzone->flags));
+		dm_zoned_dev_assert(dzt, dzone->zwork);
+		if (!test_and_set_bit(DM_ZONE_ACTIVE_WAIT, &dzone->flags)) {
+			list_add_tail(&dzone->zwork->link, &dzt->bz_wait_list);
+			atomic_inc(&dzt->dz_nr_active_wait);
+		}
+		dm_zoned_unlock_zone(dzone, flags);
+		dm_zoned_trigger_reclaim(dzt);
+		goto out;
+	}
+
+	/* Otherwise, get a free buffer zone */
+	bzone = list_first_entry(&dzt->bz_free_list,
+				 struct dm_zoned_zone, link);
+	list_del_init(&bzone->link);
+	list_add_tail(&bzone->link, &dzt->bz_lru_list);
+	atomic_dec(&dzt->bz_nr_free);
+	dm_zoned_schedule_reclaim(dzt, DM_ZONED_RECLAIM_PERIOD);
+
+	/* Assign the buffer zone to the data zone */
+	bzone->bzone = dzone;
+	dm_zoned_set_bzone_mapping(dzt, bzone, dzone->id);
+
+	dzone->bzone = bzone;
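+	/* Make the buffer zone assignment visible before the flag is set */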
+	smp_mb__before_atomic();
+	set_bit(DM_ZONE_BUFFERED, &dzone->flags);
+	smp_mb__after_atomic();
+
+	dm_zoned_dev_debug(dzt, "Buffer zone %lu assigned to zone %lu\n",
+			   bzone->id, dzone->id);
+
+out:
+
+	dm_zoned_map_unlock(dzt);
+
+	return bzone;
+}
+
+/**
+ * Wake up buffer zone waiter.
+ */
+static void
+dm_zoned_wake_bzone_waiter(struct dm_zoned_target *dzt)
+{
+	struct dm_zoned_zwork *zwork;
+	struct dm_zoned_zone *dzone;
+	unsigned long flags;
+
+	if (list_empty(&dzt->bz_wait_list))
+		return;
+
+	/* Wake up the first buffer waiting zone */
+	zwork = list_first_entry(&dzt->bz_wait_list,
+				 struct dm_zoned_zwork, link);
+	list_del_init(&zwork->link);
+	dzone = zwork->dzone;
+	dm_zoned_lock_zone(dzone, flags);
+	clear_bit_unlock(DM_ZONE_ACTIVE_WAIT, &dzone->flags);
+	atomic_dec(&dzt->dz_nr_active_wait);
+	smp_mb__after_atomic();
+	dm_zoned_run_dzone(dzt, dzone);
+	dm_zoned_unlock_zone(dzone, flags);
+}
+
+/**
+ * Unmap and free the buffer zone of a data zone.
+ */
+void
+dm_zoned_free_bzone(struct dm_zoned_target *dzt,
+		    struct dm_zoned_zone *bzone)
+{
+	struct dm_zoned_zone *dzone = bzone->bzone;
+
+	dm_zoned_map_lock(dzt);
+
+	dm_zoned_dev_assert(dzt, dm_zoned_zone_buf(bzone));
+	dm_zoned_dev_assert(dzt, dzone);
+	dm_zoned_dev_assert(dzt, dm_zoned_zone_data(dzone));
+
+	/* Return the buffer zone into the free list */
+	smp_mb__before_atomic();
+	clear_bit(DM_ZONE_DIRTY, &bzone->flags);
+	clear_bit(DM_ZONE_BUFFERED, &dzone->flags);
+	smp_mb__after_atomic();
+
+	bzone->bzone = NULL;
+
+	dzone->bzone = NULL;
+	dzone->wr_buf_blocks = 0;
+
+	list_del_init(&bzone->link);
+	list_add_tail(&bzone->link, &dzt->bz_free_list);
+	atomic_inc(&dzt->bz_nr_free);
+	dm_zoned_set_bzone_mapping(dzt, bzone, DM_ZONED_MAP_UNMAPPED);
+	dm_zoned_wake_bzone_waiter(dzt);
+
+	dm_zoned_dev_debug(dzt, "Freed buffer zone %lu\n", bzone->id);
+
+	dm_zoned_map_unlock(dzt);
+}
+
+/**
+ * After a write or a discard, the buffer zone of
+ * a data zone may become entirely invalid and can be freed.
+ * Check this here.
+ */
+void
+dm_zoned_validate_bzone(struct dm_zoned_target *dzt,
+			struct dm_zoned_zone *dzone)
+{
+	struct dm_zoned_zone *bzone = dzone->bzone;
+
+	dm_zoned_dev_assert(dzt, dm_zoned_zone_data(dzone));
+	dm_zoned_dev_assert(dzt, test_bit(DM_ZONE_ACTIVE, &dzone->flags));
+
+	if (!bzone || !test_and_clear_bit(DM_ZONE_DIRTY, &bzone->flags))
+		return;
+
+	/* If all blocks are invalid, free it */
+	if (dm_zoned_zone_weight(dzt, bzone) == 0) {
+		dm_zoned_free_bzone(dzt, bzone);
+		return;
+	}
+
+	/* LRU update the list of buffered data zones */
+	dm_zoned_map_lock(dzt);
+	list_del_init(&bzone->link);
+	list_add_tail(&bzone->link, &dzt->bz_lru_list);
+	dm_zoned_map_unlock(dzt);
+}
+
+/**
+ * Format data zone mapping.
+ */
+static int
+dm_zoned_format_dzone_mapping(struct dm_zoned_target *dzt)
+{
+	struct buffer_head *map_bh;
+	unsigned int *map;
+	int i, j;
+
+	/* Zero fill the data zone mapping table */
+	for (i = 0; i < dzt->nr_map_blocks; i++) {
+		map_bh = dm_zoned_get_meta(dzt, i + 1);
+		if (IS_ERR(map_bh))
+			return PTR_ERR(map_bh);
+		map = (unsigned int *) map_bh->b_data;
+		lock_buffer(map_bh);
+		for (j = 0; j < DM_ZONED_MAP_ENTRIES_PER_BLOCK; j++)
+			map[j] = cpu_to_le32(DM_ZONED_MAP_UNMAPPED);
+		dm_zoned_dirty_meta(dzt, map_bh);
+		unlock_buffer(map_bh);
+		__brelse(map_bh);
+	}
+
+	return 0;
+}
+
+/**
+ * Cleanup resources used for the data zone mapping table.
+ */
+static void
+dm_zoned_cleanup_dzone_mapping(struct dm_zoned_target *dzt)
+{
+	int i;
+
+	/* Cleanup zone mapping resources */
+	if (!dzt->dz_map_bh)
+		return;
+
+	for (i = 0; i < dzt->nr_map_blocks; i++)
+		brelse(dzt->dz_map_bh[i]);
+
+	kfree(dzt->dz_map_bh);
+}
+
+/**
+ * Initialize data zone mapping.
+ */
+static int
+dm_zoned_load_dzone_mapping(struct dm_zoned_target *dzt)
+{
+	struct dm_zoned_zone *zone;
+	struct buffer_head *map_bh;
+	unsigned int *map;
+	unsigned long dzone_id;
+	int i, j, chunk = 0;
+	int ret = 0;
+
+	/* Data zone mapping table blocks array */
+	dzt->dz_map_bh = kzalloc(sizeof(struct buffer_head *) *
+				 dzt->nr_map_blocks, GFP_KERNEL);
+	if (!dzt->dz_map_bh)
+		return -ENOMEM;
+	dm_zoned_account_mem(dzt, sizeof(struct buffer_head *) *
+			     dzt->nr_map_blocks);
+
+	/* Get data zone mapping blocks and initialize zone mapping */
+	for (i = 0; i < dzt->nr_map_blocks; i++) {
+
+		/* Get mapping block */
+		map_bh = dm_zoned_get_meta(dzt, i + 1);
+		if (IS_ERR(map_bh)) {
+			ret = PTR_ERR(map_bh);
+			goto out;
+		}
+		dzt->dz_map_bh[i] = map_bh;
+		dm_zoned_account_mem(dzt, DM_ZONED_BLOCK_SIZE);
+
+		/* Process entries */
+		map = (unsigned int *) map_bh->b_data;
+		for (j = 0; j < DM_ZONED_MAP_ENTRIES_PER_BLOCK &&
+			    chunk < dzt->nr_data_zones; j++) {
+			dzone_id = le32_to_cpu(map[j]);
+			if (dzone_id != DM_ZONED_MAP_UNMAPPED) {
+				zone = dm_zoned_lookup_zone_by_id(dzt,
+								  dzone_id);
+				if (!zone) {
+					dm_zoned_dev_error(dzt,
+						"Mapping entry %d: zone %lu not found\n",
+						chunk, dzone_id);
+					map[j] = DM_ZONED_MAP_UNMAPPED;
+					dm_zoned_dirty_meta(dzt, map_bh);
+				} else {
+					zone->map = chunk;
+					dzt->dz_nr_unmap--;
+					list_del_init(&zone->link);
+					if (dm_zoned_zone_is_cmr(zone))
+						list_add_tail(&zone->link,
+							&dzt->dz_map_cmr_list);
+				}
+			}
+			chunk++;
+		}
+
+	}
+
+out:
+	if (ret)
+		dm_zoned_cleanup_dzone_mapping(dzt);
+
+	return ret;
+}
+
+/**
+ * Set the data zone mapping entry for a chunk of the logical disk.
+ */
+static void
+dm_zoned_set_dzone_mapping(struct dm_zoned_target *dzt,
+			   unsigned int chunk,
+			   unsigned int dzone_id)
+{
+	struct buffer_head *map_bh =
+		dzt->dz_map_bh[chunk >> DM_ZONED_MAP_ENTRIES_SHIFT];
+	unsigned int *map = (unsigned int *) map_bh->b_data;
+
+	lock_buffer(map_bh);
+	map[chunk & DM_ZONED_MAP_ENTRIES_MASK] = cpu_to_le32(dzone_id);
+	dm_zoned_dirty_meta(dzt, map_bh);
+	unlock_buffer(map_bh);
+}
+
+/**
+ * Get the data zone mapping of a chunk of the logical disk.
+ */
+static unsigned int
+dm_zoned_get_dzone_mapping(struct dm_zoned_target *dzt,
+			   unsigned int chunk)
+{
+	struct buffer_head *map_bh =
+		dzt->dz_map_bh[chunk >> DM_ZONED_MAP_ENTRIES_SHIFT];
+	unsigned int *map = (unsigned int *) map_bh->b_data;
+
+	return le32_to_cpu(map[chunk & DM_ZONED_MAP_ENTRIES_MASK]);
+}
+
+/**
+ * Get an unmapped data zone and map it to chunk.
+ * This must be called with the mapping lock held.
+ */
+struct dm_zoned_zone *
+dm_zoned_alloc_dzone(struct dm_zoned_target *dzt,
+		     unsigned int chunk,
+		     unsigned int type_hint)
+{
+	struct dm_zoned_zone *dzone = NULL;
+
+again:
+
+	/* Get an unmapped data zone: if asked to, try to get */
+	/* an unmapped randomly writable zone. Otherwise,     */
+	/* get a sequential zone.                             */
+	switch (type_hint) {
+	case DM_DZONE_CMR:
+		dzone = list_first_entry_or_null(&dzt->dz_unmap_cmr_list,
+						 struct dm_zoned_zone, link);
+		if (dzone)
+			break;
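+		/* Fall through: no free CMR zone, try an SMR zone */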
+	case DM_DZONE_SMR:
+	default:
+		dzone = list_first_entry_or_null(&dzt->dz_unmap_smr_list,
+						 struct dm_zoned_zone, link);
+		if (dzone)
+			break;
+		dzone = list_first_entry_or_null(&dzt->dz_unmap_cmr_list,
+						 struct dm_zoned_zone, link);
+		break;
+	}
+
+	if (dzone) {
+		list_del_init(&dzone->link);
+		dzt->dz_nr_unmap--;
+		if (dm_zoned_zone_offline(dzone)) {
+			dm_zoned_dev_error(dzt, "Ignoring offline dzone %lu\n",
+					   dzone->id);
+			goto again;
+		}
+
+		dm_zoned_dev_debug(dzt, "Allocated %s dzone %lu\n",
+				 dm_zoned_zone_is_cmr(dzone) ? "CMR" : "SMR",
+				 dzone->id);
+
+		/* Set the zone chunk mapping */
+		if (chunk != DM_ZONED_MAP_UNMAPPED) {
+			dm_zoned_set_dzone_mapping(dzt, chunk, dzone->id);
+			dzone->map = chunk;
+			if (dm_zoned_zone_is_cmr(dzone))
+				list_add_tail(&dzone->link,
+					      &dzt->dz_map_cmr_list);
+		}
+
+	}
+
+	return dzone;
+}
+
+/**
+ * Unmap and free a chunk data zone. The mapping lock is
+ * taken internally, so it must not be held by the caller.
+ */
+void
+dm_zoned_free_dzone(struct dm_zoned_target *dzt,
+		    struct dm_zoned_zone *dzone)
+{
+
+	dm_zoned_dev_assert(dzt, dm_zoned_zone_data(dzone));
+	dm_zoned_dev_assert(dzt, !test_bit(DM_ZONE_BUFFERED, &dzone->flags));
+
+	/* Reset the zone */
+	dm_zoned_wait_for_stable_zone(dzone);
+	dm_zoned_reset_zone_wp(dzt, dzone);
+	dm_zoned_zone_reset_stats(dzone);
+
+	dm_zoned_map_lock(dzt);
+
+	/* Clear the zone chunk mapping */
+	if (dzone->map != DM_ZONED_MAP_UNMAPPED) {
+		dm_zoned_set_dzone_mapping(dzt, dzone->map,
+					   DM_ZONED_MAP_UNMAPPED);
+		dzone->map = DM_ZONED_MAP_UNMAPPED;
+	}
+
+	/* If the zone was already marked as empty after */
+	/* a discard, remove it from the empty list.     */
+	if (test_and_clear_bit(DM_ZONE_EMPTY, &dzone->flags))
+		list_del_init(&dzone->elink);
+
+	/* Return the zone to the unmap list */
+	smp_mb__before_atomic();
+	clear_bit(DM_ZONE_DIRTY, &dzone->flags);
+	smp_mb__after_atomic();
+	if (dm_zoned_zone_is_cmr(dzone)) {
+		list_del_init(&dzone->link);
+		list_add_tail(&dzone->link, &dzt->dz_unmap_cmr_list);
+	} else
+		list_add_tail(&dzone->link, &dzt->dz_unmap_smr_list);
+	dzt->dz_nr_unmap++;
+
+	dm_zoned_dev_debug(dzt, "Freed data zone %lu\n", dzone->id);
+
+	dm_zoned_map_unlock(dzt);
+}
+
+/**
+ * After a failed write or a discard, a data zone may become
+ * entirely invalid and can be freed. Check this here.
+ */
+void
+dm_zoned_validate_dzone(struct dm_zoned_target *dzt,
+			struct dm_zoned_zone *dzone)
+{
+	int dweight;
+
+	dm_zoned_dev_assert(dzt, dm_zoned_zone_data(dzone));
+
+	if (dzone->bzone ||
+	    !test_and_clear_bit(DM_ZONE_DIRTY, &dzone->flags))
+		return;
+
+	dweight = dm_zoned_zone_weight(dzt, dzone);
+	dm_zoned_map_lock(dzt);
+	if (dweight == 0 &&
+	    !test_and_set_bit_lock(DM_ZONE_EMPTY, &dzone->flags)) {
+		list_add_tail(&dzone->elink, &dzt->dz_empty_list);
+		dm_zoned_schedule_reclaim(dzt, DM_ZONED_RECLAIM_PERIOD);
+	}
+	dm_zoned_map_unlock(dzt);
+}
+
+/**
+ * Change the mapping of the chunk served by @from_dzone
+ * to @to_dzone (used by reclaim). @from_dzone is unmapped;
+ * the caller is expected to invalidate and free it.
+ */
+void
+dm_zoned_remap_dzone(struct dm_zoned_target *dzt,
+		     struct dm_zoned_zone *from_dzone,
+		     struct dm_zoned_zone *to_dzone)
+{
+	unsigned int chunk = from_dzone->map;
+
+	dm_zoned_map_lock(dzt);
+
+	dm_zoned_dev_assert(dzt, dm_zoned_zone_data(from_dzone));
+	dm_zoned_dev_assert(dzt, chunk != DM_ZONED_MAP_UNMAPPED);
+	dm_zoned_dev_assert(dzt, dm_zoned_zone_data(to_dzone));
+	dm_zoned_dev_assert(dzt, to_dzone->map == DM_ZONED_MAP_UNMAPPED);
+
+	from_dzone->map = DM_ZONED_MAP_UNMAPPED;
+	if (dm_zoned_zone_is_cmr(from_dzone))
+		list_del_init(&from_dzone->link);
+
+	dm_zoned_set_dzone_mapping(dzt, chunk, to_dzone->id);
+	to_dzone->map = chunk;
+	if (dm_zoned_zone_is_cmr(to_dzone))
+		list_add_tail(&to_dzone->link, &dzt->dz_map_cmr_list);
+
+	dm_zoned_map_unlock(dzt);
+}
+
+/**
+ * Change the type of @bzone to data zone and map it
+ * to the chunk being mapped by its current data zone.
+ * In the buffer zone mapping table, replace @bzone
+ * with @new_bzone.
+ */
+void
+dm_zoned_remap_bzone(struct dm_zoned_target *dzt,
+		     struct dm_zoned_zone *bzone,
+		     struct dm_zoned_zone *new_bzone)
+{
+	struct dm_zoned_zone *dzone = bzone->bzone;
+	unsigned int chunk = dzone->map;
+
+	dm_zoned_map_lock(dzt);
+
+	dm_zoned_dev_assert(dzt, dm_zoned_zone_buf(bzone));
+	dm_zoned_dev_assert(dzt, dm_zoned_zone_data(new_bzone));
+	dm_zoned_dev_assert(dzt, chunk != DM_ZONED_MAP_UNMAPPED);
+	dm_zoned_dev_assert(dzt, new_bzone->map == DM_ZONED_MAP_UNMAPPED);
+
+	/* Cleanup dzone */
+	smp_mb__before_atomic();
+	clear_bit(DM_ZONE_BUFFERED, &dzone->flags);
+	smp_mb__after_atomic();
+	dzone->bzone = NULL;
+	dzone->map = DM_ZONED_MAP_UNMAPPED;
+
+	/* new_bzone becomes a free buffer zone */
+	new_bzone->flags = DM_ZONE_BUF;
+	smp_mb__before_atomic();
+	set_bit(DM_ZONE_RECLAIM, &new_bzone->flags);
+	smp_mb__after_atomic();
+	dm_zoned_change_bzone_mapping(dzt, bzone, new_bzone,
+				    DM_ZONED_MAP_UNMAPPED);
+	list_add_tail(&new_bzone->link, &dzt->bz_free_list);
+	atomic_inc(&dzt->bz_nr_free);
+	dm_zoned_wake_bzone_waiter(dzt);
+
+	/* bzone becomes a mapped data zone */
+	bzone->bzone = NULL;
+	list_del_init(&bzone->link);
+	bzone->flags = DM_ZONE_DATA;
+	smp_mb__before_atomic();
+	set_bit(DM_ZONE_DIRTY, &bzone->flags);
+	set_bit(DM_ZONE_RECLAIM, &bzone->flags);
+	smp_mb__after_atomic();
+	bzone->map = chunk;
+	dm_zoned_set_dzone_mapping(dzt, chunk, bzone->id);
+	list_add_tail(&bzone->link, &dzt->dz_map_cmr_list);
+
+	dm_zoned_map_unlock(dzt);
+}
+
+/**
+ * Get the data zone mapping the chunk of the BIO.
+ * There may be no mapping.
+ */
+struct dm_zoned_zone *
+dm_zoned_bio_map(struct dm_zoned_target *dzt,
+		 struct bio *bio)
+{
+	struct dm_zoned_bioctx *bioctx =
+		dm_per_bio_data(bio, sizeof(struct dm_zoned_bioctx));
+	struct dm_zoned_zwork *zwork;
+	struct dm_zoned_zone *dzone;
+	unsigned long flags;
+	unsigned int dzone_id;
+	unsigned int chunk;
+
+	/* Allocate a work struct in case the mapping zone must be activated. */
+	zwork = kmalloc(sizeof(struct dm_zoned_zwork), GFP_KERNEL);
+	if (unlikely(!zwork))
+		return ERR_PTR(-ENOMEM);
+
+again:
+	dzone = NULL;
+	dm_zoned_map_lock(dzt);
+
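+	/* Chunk of the logical device targeted by the BIO */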
+	chunk = bio->bi_iter.bi_sector >> dzt->zone_nr_sectors_shift;
+	dzone_id = dm_zoned_get_dzone_mapping(dzt, chunk);
+
+	/* For writes to unmapped chunks, try */
+	/* to allocate an unused data zone.  */
+	if (dzone_id != DM_ZONED_MAP_UNMAPPED)
+		dzone = dm_zoned_lookup_zone_by_id(dzt, dzone_id);
+	else if ((bio->bi_rw & REQ_WRITE) &&
+		 (!(bio->bi_rw & REQ_DISCARD)))
+		dzone = dm_zoned_alloc_dzone(dzt, chunk, DM_DZONE_ANY);
+
+	if (!dzone)
+		/* No mapping: no work needed */
+		goto out;
+
+	dm_zoned_lock_zone(dzone, flags);
+
+	/* If the zone buffer is being reclaimed, wait */
+	if (test_bit(DM_ZONE_RECLAIM, &dzone->flags)) {
+		dm_zoned_dev_debug(dzt, "Wait for zone %lu reclaim (%lx)\n",
+				 dzone->id,
+				 dzone->flags);
+		dm_zoned_unlock_zone(dzone, flags);
+		dm_zoned_map_unlock(dzt);
+		wait_on_bit_io(&dzone->flags, DM_ZONE_RECLAIM,
+			       TASK_UNINTERRUPTIBLE);
+		goto again;
+	}
+
+	if (test_and_clear_bit(DM_ZONE_EMPTY, &dzone->flags))
+		list_del_init(&dzone->elink);
+
+	/* Got the mapping zone: set it active */
+	if (!test_and_set_bit(DM_ZONE_ACTIVE, &dzone->flags)) {
+		INIT_WORK(&zwork->work, dm_zoned_zone_work);
+		zwork->target = dzt;
+		zwork->dzone = dzone;
+		INIT_LIST_HEAD(&zwork->link);
+		atomic_set(&zwork->ref, 0);
+		bio_list_init(&zwork->bio_list);
+		atomic_set(&zwork->bio_count, 0);
+		dzone->zwork = zwork;
+		atomic_inc(&dzt->dz_nr_active);
+	} else {
+		kfree(zwork);
+		zwork = dzone->zwork;
+		dm_zoned_dev_assert(dzt, zwork);
+	}
+
+	bioctx->dzone = dzone;
+	atomic_inc(&zwork->ref);
+	bio_list_add(&zwork->bio_list, bio);
+
+	dm_zoned_run_dzone(dzt, dzone);
+	zwork = NULL;
+
+	dm_zoned_unlock_zone(dzone, flags);
+
+out:
+	dm_zoned_map_unlock(dzt);
+
+	if (zwork)
+		kfree(zwork);
+
+	return dzone;
+}
+
+/**
+ * If needed and possible, queue an active zone work.
+ */
+void
+dm_zoned_run_dzone(struct dm_zoned_target *dzt,
+		   struct dm_zoned_zone *dzone)
+{
+	struct dm_zoned_zwork *zwork = dzone->zwork;
+
+	dm_zoned_dev_assert(dzt, test_bit(DM_ZONE_ACTIVE, &dzone->flags));
+	dm_zoned_dev_assert(dzt, zwork != NULL);
+	dm_zoned_dev_assert(dzt, atomic_read(&zwork->ref) > 0);
+
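+	/* Queue the zone work only if BIOs are pending and the */
+	/* zone is not waiting for a free buffer zone.           */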
+	if (bio_list_peek(&zwork->bio_list) &&
+	    !test_bit(DM_ZONE_ACTIVE_WAIT, &dzone->flags)) {
+		if (queue_work(dzt->zone_wq, &zwork->work))
+			atomic_inc(&zwork->ref);
+	}
+}
+
+/**
+ * Release an active data zone: the last put will
+ * deactivate the zone and free its work struct.
+ */
+void
+dm_zoned_put_dzone(struct dm_zoned_target *dzt,
+		   struct dm_zoned_zone *dzone)
+{
+	struct dm_zoned_zwork *zwork = dzone->zwork;
+	unsigned long flags;
+
+	dm_zoned_dev_assert(dzt, test_bit(DM_ZONE_ACTIVE, &dzone->flags));
+	dm_zoned_dev_assert(dzt, zwork != NULL);
+	dm_zoned_dev_assert(dzt, atomic_read(&zwork->ref) > 0);
+
+	dm_zoned_lock_zone(dzone, flags);
+
+	if (atomic_dec_and_test(&zwork->ref)) {
+		kfree(zwork);
+		dzone->zwork = NULL;
+		clear_bit_unlock(DM_ZONE_ACTIVE, &dzone->flags);
+		smp_mb__after_atomic();
+		atomic_dec(&dzt->dz_nr_active);
+		wake_up_bit(&dzone->flags, DM_ZONE_ACTIVE);
+	}
+
+	dm_zoned_unlock_zone(dzone, flags);
+}
+
+/**
+ * Determine the metadata layout: number of metadata, buffer
+ * and data zones, and of mapping and bitmap blocks.
+ */
+static int
+dm_zoned_format(struct dm_zoned_target *dzt,
+		struct dm_zoned_target_config *conf)
+{
+	unsigned int nr_meta_blocks, nr_meta_zones = 1;
+	unsigned int nr_buf_zones, nr_data_zones;
+	unsigned int nr_bitmap_blocks, nr_map_blocks;
+
+	dm_zoned_dev_info(dzt, "Formatting device with %lu buffer zones\n",
+			conf->nr_buf_zones);
+
+	if (conf->nr_buf_zones < DM_ZONED_NR_BZONES_MIN) {
+		conf->nr_buf_zones = DM_ZONED_NR_BZONES_MIN;
+		dm_zoned_dev_info(dzt,
+			"    Number of buffer zones too low: using %lu\n",
+			conf->nr_buf_zones);
+	}
+
+	if (conf->nr_buf_zones > DM_ZONED_NR_BZONES_MAX) {
+		conf->nr_buf_zones = DM_ZONED_NR_BZONES_MAX;
+		dm_zoned_dev_info(dzt,
+			"    Number of buffer zones too large: using %lu\n",
+			conf->nr_buf_zones);
+	}
+
+	nr_buf_zones = conf->nr_buf_zones;
+
+again:
+
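+	/* Compute the size of the metadata (mapping table and bitmaps) */
+	/* and the number of zones needed to store it.                  */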
+	nr_data_zones = dzt->nr_zones - nr_buf_zones - nr_meta_zones;
+	nr_map_blocks = nr_data_zones >> DM_ZONED_MAP_ENTRIES_SHIFT;
+	if (nr_data_zones & DM_ZONED_MAP_ENTRIES_MASK)
+		nr_map_blocks++;
+	nr_bitmap_blocks = (dzt->nr_zones - nr_meta_zones) *
+		dzt->zone_nr_bitmap_blocks;
+	nr_meta_blocks = 1 + nr_map_blocks + nr_bitmap_blocks;
+	nr_meta_zones = (nr_meta_blocks + dzt->zone_nr_blocks_mask) >>
+		dzt->zone_nr_blocks_shift;
+
+	if (nr_meta_zones > dzt->nr_meta_zones) {
+		dm_zoned_dev_error(dzt,
+			"Insufficient random write space for metadata (need %u zones, have %u)\n",
+			nr_meta_zones, dzt->nr_meta_zones);
+		return -ENXIO;
+	}
+
+	if ((nr_meta_zones + nr_buf_zones) > dzt->nr_rnd_zones) {
+		nr_buf_zones = dzt->nr_rnd_zones - nr_meta_zones;
+		dm_zoned_dev_info(dzt,
+			"Insufficient random zones: retrying with %u buffer zones\n",
+			nr_buf_zones);
+		goto again;
+	}
+
+	/* Fixup everything */
+	dzt->nr_meta_zones = nr_meta_zones;
+	dzt->nr_buf_zones = nr_buf_zones;
+	dzt->nr_data_zones = dzt->nr_zones - nr_buf_zones - nr_meta_zones;
+	dzt->nr_map_blocks = dzt->nr_data_zones >> DM_ZONED_MAP_ENTRIES_SHIFT;
+	if (dzt->nr_data_zones & DM_ZONED_MAP_ENTRIES_MASK)
+		dzt->nr_map_blocks++;
+	dzt->nr_bitmap_blocks = (dzt->nr_buf_zones + dzt->nr_data_zones) *
+		dzt->zone_nr_bitmap_blocks;
+	dzt->bitmap_block = 1 + dzt->nr_map_blocks;
+
+	return 0;
+}
+
+/**
+ * Format the target device metadata.
+ */
+static int
+dm_zoned_format_meta(struct dm_zoned_target *dzt,
+		     struct dm_zoned_target_config *conf)
+{
+	struct dm_zoned_super *sb;
+	int b, ret;
+
+	/* Reset all zones */
+	ret = dm_zoned_reset_zones(dzt);
+	if (ret)
+		return ret;
+
+	/* Initialize the super block data */
+	ret = dm_zoned_format(dzt, conf);
+	if (ret)
+		return ret;
+
+	/* Format the buffer zone mapping */
+	ret = dm_zoned_format_bzone_mapping(dzt);
+	if (ret)
+		return ret;
+
+	/* Format the data zone mapping */
+	ret = dm_zoned_format_dzone_mapping(dzt);
+	if (ret)
+		return ret;
+
+	/* Clear bitmaps */
+	for (b = 0; b < dzt->nr_bitmap_blocks; b++) {
+		ret = dm_zoned_zero_meta(dzt, dzt->bitmap_block + b);
+		if (ret)
+			return ret;
+	}
+
+	/* Finally, write super block */
+	sb = (struct dm_zoned_super *) dzt->sb_bh->b_data;
+	lock_buffer(dzt->sb_bh);
+	sb->magic = cpu_to_le32(DM_ZONED_MAGIC);
+	sb->version = cpu_to_le32(DM_ZONED_META_VER);
+	sb->nr_map_blocks = cpu_to_le32(dzt->nr_map_blocks);
+	sb->nr_bitmap_blocks = cpu_to_le32(dzt->nr_bitmap_blocks);
+	sb->nr_buf_zones = cpu_to_le32(dzt->nr_buf_zones);
+	sb->nr_data_zones = cpu_to_le32(dzt->nr_data_zones);
+	dm_zoned_dirty_meta(dzt, dzt->sb_bh);
+	unlock_buffer(dzt->sb_bh);
+
+	return dm_zoned_flush(dzt);
+}
+
+/**
+ * Count zones in a list.
+ */
+static int
+dm_zoned_zone_count(struct list_head *list)
+{
+	struct dm_zoned_zone *zone;
+	int n = 0;
+
+	list_for_each_entry(zone, list, link) {
+		n++;
+	}
+
+	return n;
+}
+
+/**
+ * Shuffle the data zone list: file systems tend to spread
+ * accesses across a disk to achieve stable performance
+ * over time. Allocating and mapping these spread accesses
+ * to contiguous data zones in LBA order would achieve the
+ * opposite result (fast accesses initially, slower later).
+ * Make sure this does not happen by shuffling the initially
+ * LBA-ordered list of SMR data zones.
+ * Shuffling: the LBA-ordered zone list 0,1,2,3,4,5,6,7 [...]
+ * is reorganized as 0,4,1,5,2,6,3,7 [...]
+ */
+static void
+dm_zoned_shuffle_dzones(struct dm_zoned_target *dzt)
+{
+	struct dm_zoned_zone *dzone;
+	struct list_head tmp1;
+	struct list_head tmp2;
+	int n = 0;
+
+	INIT_LIST_HEAD(&tmp1);
+	INIT_LIST_HEAD(&tmp2);
+
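+	/* Move the first half of the unmapped SMR zone list to tmp1 */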
+	while (!list_empty(&dzt->dz_unmap_smr_list) &&
+	      n < dzt->nr_smr_data_zones / 2) {
+		dzone = list_first_entry(&dzt->dz_unmap_smr_list,
+					 struct dm_zoned_zone, link);
+		list_del_init(&dzone->link);
+		list_add_tail(&dzone->link, &tmp1);
+		n++;
+	}
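+	/* Move the remaining unmapped SMR zones to tmp2 */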
+	while (!list_empty(&dzt->dz_unmap_smr_list)) {
+		dzone = list_first_entry(&dzt->dz_unmap_smr_list,
+					 struct dm_zoned_zone, link);
+		list_del_init(&dzone->link);
+		list_add_tail(&dzone->link, &tmp2);
+	}
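+	/* Interleave tmp1 and tmp2 back into the unmapped SMR zone list */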
+	while (!list_empty(&tmp1) && !list_empty(&tmp2)) {
+		dzone = list_first_entry_or_null(&tmp1,
+						 struct dm_zoned_zone, link);
+		if (dzone) {
+			list_del_init(&dzone->link);
+			list_add_tail(&dzone->link, &dzt->dz_unmap_smr_list);
+		}
+		dzone = list_first_entry_or_null(&tmp2,
+						 struct dm_zoned_zone, link);
+		if (dzone) {
+			list_del_init(&dzone->link);
+			list_add_tail(&dzone->link, &dzt->dz_unmap_smr_list);
+		}
+	}
+}
+
+/**
+ * Load meta data from disk.
+ */
+static int
+dm_zoned_load_meta(struct dm_zoned_target *dzt)
+{
+	struct dm_zoned_super *sb =
+		(struct dm_zoned_super *) dzt->sb_bh->b_data;
+	struct dm_zoned_zone *zone;
+	int i, ret;
+
+	/* Check super block */
+	if (le32_to_cpu(sb->magic) != DM_ZONED_MAGIC) {
+		dm_zoned_dev_error(dzt, "Invalid meta magic "
+				   "(need 0x%08x, got 0x%08x)\n",
+				   DM_ZONED_MAGIC, le32_to_cpu(sb->magic));
+		return -ENXIO;
+	}
+	if (le32_to_cpu(sb->version) != DM_ZONED_META_VER) {
+		dm_zoned_dev_error(dzt, "Invalid meta version "
+				   "(need %d, got %d)\n",
+				   DM_ZONED_META_VER, le32_to_cpu(sb->version));
+		return -ENXIO;
+	}
+
+	dzt->nr_buf_zones = le32_to_cpu(sb->nr_buf_zones);
+	dzt->nr_data_zones = le32_to_cpu(sb->nr_data_zones);
+	if ((dzt->nr_buf_zones + dzt->nr_data_zones) > dzt->nr_zones) {
+		dm_zoned_dev_error(dzt, "Invalid format: %u buffer zones "
+				   "+ %u data zones > %u zones\n",
+				   dzt->nr_buf_zones,
+				   dzt->nr_data_zones,
+				   dzt->nr_zones);
+		return -ENXIO;
+	}
+	dzt->nr_meta_zones = dzt->nr_zones -
+		(dzt->nr_buf_zones + dzt->nr_data_zones);
+	dzt->nr_map_blocks = le32_to_cpu(sb->nr_map_blocks);
+	dzt->nr_bitmap_blocks = le32_to_cpu(sb->nr_bitmap_blocks);
+	dzt->nr_data_zones = le32_to_cpu(sb->nr_data_zones);
+	dzt->bitmap_block = dzt->nr_map_blocks + 1;
+	dzt->dz_nr_unmap = dzt->nr_data_zones;
+
+	/* Load the buffer zones mapping table */
+	ret = dm_zoned_load_bzone_mapping(dzt);
+	if (ret) {
+		dm_zoned_dev_error(dzt, "Load buffer zone mapping failed %d\n",
+				 ret);
+		return ret;
+	}
+
+	/* Load the data zone mapping table */
+	ret = dm_zoned_load_dzone_mapping(dzt);
+	if (ret) {
+		dm_zoned_dev_error(dzt, "Load data zone mapping failed %d\n",
+				 ret);
+		return ret;
+	}
+
+	/* The first nr_meta_zones are still marked */
+	/* as unmapped data zones: fix this         */
+	for (i = 0; i < dzt->nr_meta_zones; i++) {
+		zone = dm_zoned_lookup_zone_by_id(dzt, i);
+		if (!zone) {
+			dm_zoned_dev_error(dzt, "Meta zone %d not found\n", i);
+			return -ENXIO;
+		}
+		zone->flags = DM_ZONE_META;
+		list_del_init(&zone->link);
+	}
+	dzt->nr_cmr_data_zones = dm_zoned_zone_count(&dzt->dz_map_cmr_list) +
+		dm_zoned_zone_count(&dzt->dz_unmap_cmr_list);
+	dzt->nr_smr_data_zones = dzt->nr_data_zones - dzt->nr_cmr_data_zones;
+
+	dm_zoned_shuffle_dzones(dzt);
+
+	dm_zoned_dev_info(dzt, "Backend device:\n");
+	dm_zoned_dev_info(dzt,
+		"    %zu 512-byte logical sectors\n",
+		(sector_t)dzt->nr_zones << dzt->zone_nr_sectors_shift);
+	dm_zoned_dev_info(dzt,
+		"    %u zones of %zu 512-byte logical sectors\n",
+		dzt->nr_zones, dzt->zone_nr_sectors);
+	dm_zoned_dev_info(dzt,
+		"    %u CMR zones, %u SMR zones (%u random write zones)\n",
+		dzt->nr_cmr_zones,
+		dzt->nr_smr_zones,
+		dzt->nr_rnd_zones);
+	dm_zoned_dev_info(dzt,
+		"    %u metadata zones\n", dzt->nr_meta_zones);
+	dm_zoned_dev_info(dzt,
+		"    %u buffer zones (%d free zones, %u low threshold)\n",
+		dzt->nr_buf_zones,  atomic_read(&dzt->bz_nr_free),
+		dzt->bz_nr_free_low);
+	dm_zoned_dev_info(dzt,
+		"    %u data zones (%u SMR zones, %u CMR zones), %u unmapped zones\n",
+		dzt->nr_data_zones, dzt->nr_smr_data_zones,
+		dzt->nr_cmr_data_zones, dzt->dz_nr_unmap);
+
+#ifdef __DM_ZONED_DEBUG
+	dm_zoned_dev_info(dzt, "Format:\n");
+	dm_zoned_dev_info(dzt,
+		"        %u data zone mapping blocks from block 1\n",
+		dzt->nr_map_blocks);
+	dm_zoned_dev_info(dzt,
+		"        %u bitmap blocks from block %zu (%u blocks per zone)\n",
+		dzt->nr_bitmap_blocks, dzt->bitmap_block,
+		dzt->zone_nr_bitmap_blocks);
+	dm_zoned_dev_info(dzt,
+		"Using %zu KiB of memory\n", dzt->used_mem >> 10);
+#endif
+
+	return 0;
+}
+
+/**
+ * Initialize the target metadata.
+ */
+int
+dm_zoned_init_meta(struct dm_zoned_target *dzt,
+		   struct dm_zoned_target_config *conf)
+{
+	int ret;
+
+	/* Flush the target device */
+	blkdev_issue_flush(dzt->zbd, GFP_NOFS, NULL);
+
+	/* Initialize zone descriptors */
+	ret = dm_zoned_init_zones(dzt);
+	if (ret)
+		goto out;
+
+	/* Get super block */
+	dzt->sb_bh = dm_zoned_get_meta(dzt, 0);
+	if (IS_ERR(dzt->sb_bh)) {
+		ret = PTR_ERR(dzt->sb_bh);
+		dzt->sb_bh = NULL;
+		dm_zoned_dev_error(dzt, "Read super block failed %d\n", ret);
+		goto out;
+	}
+	dm_zoned_account_mem(dzt, DM_ZONED_BLOCK_SIZE);
+
+	/* If asked to reformat */
+	if (conf->format) {
+		ret = dm_zoned_format_meta(dzt, conf);
+		if (ret)
+			goto out;
+	}
+
+	/* Load meta-data */
+	ret = dm_zoned_load_meta(dzt);
+	if (ret)
+		goto out;
+
+out:
+	if (ret)
+		dm_zoned_cleanup_meta(dzt);
+
+	return ret;
+}
+
+/**
+ * Check metadata on resume.
+ */
+int
+dm_zoned_resume_meta(struct dm_zoned_target *dzt)
+{
+	return dm_zoned_check_zones(dzt);
+}
+
+/**
+ * Cleanup the target metadata resources.
+ */
+void
+dm_zoned_cleanup_meta(struct dm_zoned_target *dzt)
+{
+
+	dm_zoned_cleanup_dzone_mapping(dzt);
+	brelse(dzt->sb_bh);
+	dm_zoned_drop_zones(dzt);
+}
+
+/**
+ * Set @nr_bits bits in @bitmap starting from @bit.
+ * Return the number of bits changed from 0 to 1.
+ */
+static unsigned int
+dm_zoned_set_bits(unsigned long *bitmap,
+		  unsigned int bit,
+		  unsigned int nr_bits)
+{
+	unsigned long *addr;
+	unsigned int end = bit + nr_bits;
+	unsigned int n = 0;
+
+	while (bit < end) {
+
+		if (((bit & (BITS_PER_LONG - 1)) == 0) &&
+		    ((end - bit) >= BITS_PER_LONG)) {
+			/* Try to set the whole word at once */
+			addr = bitmap + BIT_WORD(bit);
+			if (*addr == 0) {
+				*addr = ULONG_MAX;
+				n += BITS_PER_LONG;
+				bit += BITS_PER_LONG;
+				continue;
+			}
+		}
+
+		if (!test_and_set_bit(bit, bitmap))
+			n++;
+		bit++;
+	}
+
+	return n;
+
+}
+
+/**
+ * Get the bitmap block storing the bit for @chunk_block
+ * in @zone.
+ */
+static struct buffer_head *
+dm_zoned_get_bitmap(struct dm_zoned_target *dzt,
+		    struct dm_zoned_zone *zone,
+		    sector_t chunk_block)
+{
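+	/* Bitmap blocks are stored contiguously per zone, starting at */
+	/* dzt->bitmap_block. Metadata zones have no bitmap.           */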
+	sector_t bitmap_block = dzt->bitmap_block
+		+ ((sector_t)(zone->id - dzt->nr_meta_zones)
+		   * dzt->zone_nr_bitmap_blocks)
+		+ (chunk_block >> DM_ZONED_BLOCK_SHIFT_BITS);
+
+	return dm_zoned_get_meta(dzt, bitmap_block);
+}
+
+/**
+ * Validate (set bit) all the blocks in
+ * the range [@chunk_block..@chunk_block+@nr_blocks-1].
+ */
+int
+dm_zoned_validate_blocks(struct dm_zoned_target *dzt,
+			 struct dm_zoned_zone *zone,
+			 sector_t chunk_block,
+			 unsigned int nr_blocks)
+{
+	unsigned int count, bit, nr_bits;
+	struct buffer_head *bh;
+
+	dm_zoned_dev_debug(dzt, "=> VALIDATE zone %lu, block %zu, %u blocks\n",
+			 zone->id,
+			 chunk_block,
+			 nr_blocks);
+
+	dm_zoned_dev_assert(dzt, !dm_zoned_zone_meta(zone));
+	dm_zoned_dev_assert(dzt,
+			    (chunk_block + nr_blocks) <= dzt->zone_nr_blocks);
+
+	while (nr_blocks) {
+
+		/* Get bitmap block */
+		bh = dm_zoned_get_bitmap(dzt, zone, chunk_block);
+		if (IS_ERR(bh))
+			return PTR_ERR(bh);
+
+		/* Set bits */
+		bit = chunk_block & DM_ZONED_BLOCK_MASK_BITS;
+		nr_bits = min(nr_blocks, DM_ZONED_BLOCK_SIZE_BITS - bit);
+
+		lock_buffer(bh);
+		count = dm_zoned_set_bits((unsigned long *) bh->b_data,
+					bit, nr_bits);
+		if (count) {
+			dm_zoned_dirty_meta(dzt, bh);
+			set_bit(DM_ZONE_DIRTY, &zone->flags);
+		}
+		unlock_buffer(bh);
+		__brelse(bh);
+
+		nr_blocks -= nr_bits;
+		chunk_block += nr_bits;
+
+	}
+
+	return 0;
+}
+
+/**
+ * Clear @nr_bits bits in @bitmap starting from @bit.
+ * Return the number of bits changed from 1 to 0.
+ */
+static int
+dm_zoned_clear_bits(unsigned long *bitmap,
+		    int bit,
+		    int nr_bits)
+{
+	unsigned long *addr;
+	int end = bit + nr_bits;
+	int n = 0;
+
+	while (bit < end) {
+
+		if (((bit & (BITS_PER_LONG - 1)) == 0) &&
+		    ((end - bit) >= BITS_PER_LONG)) {
+			/* Try to clear whole word at once */
+			addr = bitmap + BIT_WORD(bit);
+			if (*addr == ULONG_MAX) {
+				*addr = 0;
+				n += BITS_PER_LONG;
+				bit += BITS_PER_LONG;
+				continue;
+			}
+		}
+
+		if (test_and_clear_bit(bit, bitmap))
+			n++;
+		bit++;
+	}
+
+	return n;
+
+}
+
+/**
+ * Invalidate (clear bit) all the blocks in
+ * the range [@chunk_block..@chunk_block+@nr_blocks-1].
+ */
+int
+dm_zoned_invalidate_blocks(struct dm_zoned_target *dzt,
+			   struct dm_zoned_zone *zone,
+			   sector_t chunk_block,
+			   unsigned int nr_blocks)
+{
+	unsigned int count, bit, nr_bits;
+	struct buffer_head *bh;
+
+	dm_zoned_dev_debug(dzt, "INVALIDATE zone %lu, block %zu, %u blocks\n",
+			 zone->id,
+			 chunk_block,
+			 nr_blocks);
+
+	dm_zoned_dev_assert(dzt, !dm_zoned_zone_meta(zone));
+	dm_zoned_dev_assert(dzt,
+			    (chunk_block + nr_blocks) <= dzt->zone_nr_blocks);
+
+	while (nr_blocks) {
+
+		/* Get bitmap block */
+		bh = dm_zoned_get_bitmap(dzt, zone, chunk_block);
+		if (IS_ERR(bh))
+			return PTR_ERR(bh);
+
+		/* Clear bits */
+		bit = chunk_block & DM_ZONED_BLOCK_MASK_BITS;
+		nr_bits = min(nr_blocks, DM_ZONED_BLOCK_SIZE_BITS - bit);
+
+		lock_buffer(bh);
+		count = dm_zoned_clear_bits((unsigned long *) bh->b_data,
+					  bit, nr_bits);
+		if (count) {
+			dm_zoned_dirty_meta(dzt, bh);
+			set_bit(DM_ZONE_DIRTY, &zone->flags);
+		}
+		unlock_buffer(bh);
+		__brelse(bh);
+
+		nr_blocks -= nr_bits;
+		chunk_block += nr_bits;
+
+	}
+
+	return 0;
+}
+
+/**
+ * Get a block bit value.
+ */
+static int
+dm_zoned_test_block(struct dm_zoned_target *dzt,
+		    struct dm_zoned_zone *zone,
+		    sector_t chunk_block)
+{
+	struct buffer_head *bh;
+	int ret;
+
+	/* Get bitmap block */
+	bh = dm_zoned_get_bitmap(dzt, zone, chunk_block);
+	if (IS_ERR(bh))
+		return PTR_ERR(bh);
+
+	/* Get offset */
+	ret = test_bit(chunk_block & DM_ZONED_BLOCK_MASK_BITS,
+		       (unsigned long *) bh->b_data) != 0;
+
+	__brelse(bh);
+
+	return ret;
+}
+
+/**
+ * Return the offset from @chunk_block to the first block
+ * with a bit value set to @set. Search at most @nr_blocks
+ * blocks from @chunk_block.
+ */
+static int
+dm_zoned_offset_to_block(struct dm_zoned_target *dzt,
+			 struct dm_zoned_zone *zone,
+			 sector_t chunk_block,
+			 unsigned int nr_blocks,
+			 int set)
+{
+	struct buffer_head *bh;
+	unsigned int bit, set_bit, nr_bits;
+	unsigned long *bitmap;
+	int n = 0;
+
+	while (nr_blocks) {
+
+		/* Get bitmap block */
+		bh = dm_zoned_get_bitmap(dzt, zone, chunk_block);
+		if (IS_ERR(bh))
+			return PTR_ERR(bh);
+
+		/* Get offset */
+		bitmap = (unsigned long *) bh->b_data;
+		bit = chunk_block & DM_ZONED_BLOCK_MASK_BITS;
+		nr_bits = min(nr_blocks, DM_ZONED_BLOCK_SIZE_BITS - bit);
+		if (set)
+			set_bit = find_next_bit(bitmap,
+				DM_ZONED_BLOCK_SIZE_BITS, bit);
+		else
+			set_bit = find_next_zero_bit(bitmap,
+				DM_ZONED_BLOCK_SIZE_BITS, bit);
+		__brelse(bh);
+
+		n += set_bit - bit;
+		if (set_bit < DM_ZONED_BLOCK_SIZE_BITS)
+			break;
+
+		nr_blocks -= nr_bits;
+		chunk_block += nr_bits;
+
+	}
+
+	return n;
+}
+
+/**
+ * Test if @chunk_block is valid. If it is, return the number
+ * of consecutive valid blocks starting from @chunk_block.
+ */
+int
+dm_zoned_block_valid(struct dm_zoned_target *dzt,
+		     struct dm_zoned_zone *zone,
+		     sector_t chunk_block)
+{
+	int valid;
+
+	dm_zoned_dev_assert(dzt, !dm_zoned_zone_meta(zone));
+	dm_zoned_dev_assert(dzt, chunk_block < dzt->zone_nr_blocks);
+
+	/* Test block */
+	valid = dm_zoned_test_block(dzt, zone, chunk_block);
+	if (valid <= 0)
+		return valid;
+
+	/* The block is valid: get the number of valid blocks from block */
+	return dm_zoned_offset_to_block(dzt, zone, chunk_block,
+				      dzt->zone_nr_blocks - chunk_block,
+				      0);
+}
+
+/**
+ * Count the number of bits set starting from @bit
+ * up to @bit + @nr_bits - 1.
+ */
+static int
+dm_zoned_count_bits(void *bitmap,
+		    int bit,
+		    int nr_bits)
+{
+	unsigned long *addr;
+	int end = bit + nr_bits;
+	int n = 0;
+
+	while (bit < end) {
+
+		if (((bit & (BITS_PER_LONG - 1)) == 0) &&
+		    ((end - bit) >= BITS_PER_LONG)) {
+			addr = (unsigned long *)bitmap + BIT_WORD(bit);
+			if (*addr == ULONG_MAX) {
+				n += BITS_PER_LONG;
+				bit += BITS_PER_LONG;
+				continue;
+			}
+		}
+
+		if (test_bit(bit, bitmap))
+			n++;
+		bit++;
+	}
+
+	return n;
+
+}
+
+/**
+ * Return the number of valid blocks in the range
+ * [@chunk_block..@chunk_block+@nr_blocks-1].
+ */
+int
+dm_zoned_valid_blocks(struct dm_zoned_target *dzt,
+		      struct dm_zoned_zone *zone,
+		      sector_t chunk_block,
+		      unsigned int nr_blocks)
+{
+	struct buffer_head *bh;
+	unsigned int bit, nr_bits;
+	void *bitmap;
+	int n = 0;
+
+	dm_zoned_dev_assert(dzt, !dm_zoned_zone_meta(zone));
+	dm_zoned_dev_assert(dzt,
+			    (chunk_block + nr_blocks) <= dzt->zone_nr_blocks);
+
+	while (nr_blocks) {
+
+		/* Get bitmap block */
+		bh = dm_zoned_get_bitmap(dzt, zone, chunk_block);
+		if (IS_ERR(bh))
+			return PTR_ERR(bh);
+
+		/* Count bits in this block */
+		bitmap = bh->b_data;
+		bit = chunk_block & DM_ZONED_BLOCK_MASK_BITS;
+		nr_bits = min(nr_blocks, DM_ZONED_BLOCK_SIZE_BITS - bit);
+		n += dm_zoned_count_bits(bitmap, bit, nr_bits);
+
+		__brelse(bh);
+
+		nr_blocks -= nr_bits;
+		chunk_block += nr_bits;
+
+	}
+
+	return n;
+}
diff --git a/drivers/md/dm-zoned-reclaim.c b/drivers/md/dm-zoned-reclaim.c
new file mode 100644
index 0000000..3b6cfa5
--- /dev/null
+++ b/drivers/md/dm-zoned-reclaim.c
@@ -0,0 +1,770 @@
+/*
+ * (C) Copyright 2016 Western Digital.
+ *
+ * This software is distributed under the terms of the GNU Lesser General
+ * Public License version 2, or any later version, "as is," without technical
+ * support, and WITHOUT ANY WARRANTY, without even the implied warranty
+ * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ * Author: Damien Le Moal <damien.lemoal@hgst.com>
+ */
+
+#include <linux/module.h>
+#include <linux/version.h>
+#include <linux/slab.h>
+
+#include "dm-zoned.h"
+
+/**
+ * Free a page list.
+ */
+static void
+dm_zoned_reclaim_free_page_list(struct dm_zoned_target *dzt,
+				struct page_list *pl,
+				unsigned int nr_blocks)
+{
+	unsigned int nr_pages;
+	int i;
+
+	nr_pages = ((nr_blocks << DM_ZONED_BLOCK_SHIFT) +
+		    PAGE_SIZE - 1) >> PAGE_SHIFT;
+	for (i = 0; i < nr_pages; i++) {
+		if (pl[i].page)
+			put_page(pl[i].page);
+	}
+	kfree(pl);
+}
+
+/**
+ * Allocate a page list.
+ */
+static struct page_list *
+dm_zoned_reclaim_alloc_page_list(struct dm_zoned_target *dzt,
+				 unsigned int nr_blocks)
+{
+	struct page_list *pl;
+	unsigned int nr_pages;
+	int i;
+
+	/* Get a page list */
+	nr_pages = ((nr_blocks << DM_ZONED_BLOCK_SHIFT) +
+		    PAGE_SIZE - 1) >> PAGE_SHIFT;
+	pl = kzalloc(sizeof(struct page_list) * nr_pages, GFP_KERNEL);
+	if (!pl)
+		return NULL;
+
+	/* Get pages */
+	for (i = 0; i < nr_pages; i++) {
+		pl[i].page = alloc_page(GFP_KERNEL);
+		if (!pl[i].page) {
+			dm_zoned_reclaim_free_page_list(dzt, pl, i);
+			return NULL;
+		}
+		if (i > 0)
+			pl[i - 1].next = &pl[i];
+	}
+
+	return pl;
+}
+
+/**
+ * Read blocks.
+ */
+static int
+dm_zoned_reclaim_read(struct dm_zoned_target *dzt,
+		      struct dm_zoned_zone *zone,
+		      sector_t chunk_block,
+		      unsigned int nr_blocks,
+		      struct page_list *pl)
+{
+	struct dm_io_request ioreq;
+	struct dm_io_region ioreg;
+	int ret;
+
+	dm_zoned_dev_debug(dzt, "Reclaim: Read %s zone %lu, "
+			   "block %zu, %u blocks\n",
+			   dm_zoned_zone_is_cmr(zone) ? "CMR" : "SMR",
+			   zone->id, chunk_block, nr_blocks);
+
+	/* Setup I/O request and region */
+	ioreq.bi_rw = READ;
+	ioreq.mem.type = DM_IO_PAGE_LIST;
+	ioreq.mem.offset = 0;
+	ioreq.mem.ptr.pl = pl;
+	ioreq.notify.fn = NULL;
+	ioreq.notify.context = NULL;
+	ioreq.client = dzt->reclaim_client;
+	ioreg.bdev = dzt->zbd;
+	ioreg.sector = dm_zoned_block_to_sector(dm_zoned_zone_start_block(zone)
+						+ chunk_block);
+	ioreg.count = dm_zoned_block_to_sector(nr_blocks);
+
+	/* Do read */
+	ret = dm_io(&ioreq, 1, &ioreg, NULL);
+	if (ret) {
+		dm_zoned_dev_error(dzt, "Reclaim: Read %s zone %lu, "
+				   "block %zu, %u blocks failed %d\n",
+				   dm_zoned_zone_is_cmr(zone) ? "CMR" : "SMR",
+				   zone->id, chunk_block, nr_blocks, ret);
+		return ret;
+	}
+
+	return 0;
+}
+
+/**
+ * Write blocks.
+ */
+static int
+dm_zoned_reclaim_write(struct dm_zoned_target *dzt,
+		       struct dm_zoned_zone *zone,
+		       sector_t chunk_block,
+		       unsigned int nr_blocks,
+		       struct page_list *pl)
+{
+	struct dm_io_request ioreq;
+	struct dm_io_region ioreg;
+	int ret;
+
+	dm_zoned_dev_debug(dzt, "Reclaim: Write %s zone %lu, block %zu, %u blocks\n",
+			 dm_zoned_zone_is_cmr(zone) ? "CMR" : "SMR",
+			 zone->id,
+			 chunk_block,
+			 nr_blocks);
+
+	/* Fill holes between writes */
+	if (dm_zoned_zone_is_smr(zone) && chunk_block > zone->wp_block) {
+		ret = dm_zoned_advance_zone_wp(dzt, zone, chunk_block - zone->wp_block);
+		if (ret)
+			return ret;
+	}
+
+	/* Setup I/O request and region */
+	ioreq.bi_rw = REQ_WRITE;
+	ioreq.mem.type = DM_IO_PAGE_LIST;
+	ioreq.mem.offset = 0;
+	ioreq.mem.ptr.pl = pl;
+	ioreq.notify.fn = NULL;
+	ioreq.notify.context = NULL;
+	ioreq.client = dzt->reclaim_client;
+	ioreg.bdev = dzt->zbd;
+	ioreg.sector = dm_zoned_block_to_sector(dm_zoned_zone_start_block(zone) + chunk_block);
+	ioreg.count = dm_zoned_block_to_sector(nr_blocks);
+
+	/* Do write */
+	ret = dm_io(&ioreq, 1, &ioreg, NULL);
+	if (ret) {
+		dm_zoned_dev_error(dzt, "Reclaim: Write %s zone %lu, block %zu, %u blocks failed %d\n",
+				 dm_zoned_zone_is_cmr(zone) ? "CMR" : "SMR",
+				 zone->id,
+				 chunk_block,
+				 nr_blocks,
+				 ret);
+		return ret;
+	}
+
+	if (dm_zoned_zone_is_smr(zone))
+		zone->wp_block += nr_blocks;
+
+	return 0;
+}
+
+/**
+ * Copy blocks between zones.
+ */
+static int
+dm_zoned_reclaim_copy(struct dm_zoned_target *dzt,
+		      struct dm_zoned_zone *from_zone,
+		      struct dm_zoned_zone *to_zone,
+		      sector_t chunk_block,
+		      unsigned int nr_blocks)
+{
+	struct page_list *pl;
+	sector_t block = chunk_block;
+	unsigned int blocks = nr_blocks;
+	unsigned int count, max_count;
+	int ret;
+
+	/* Get a page list */
+	max_count = min_t(unsigned int, nr_blocks, DM_ZONED_RECLAIM_MAX_BLOCKS);
+	pl = dm_zoned_reclaim_alloc_page_list(dzt, max_count);
+	if (!pl) {
+		dm_zoned_dev_error(dzt, "Reclaim: Allocate %u pages failed\n",
+				 max_count);
+		return -ENOMEM;
+	}
+
+	while (blocks) {
+
+		/* Read blocks */
+		count = min_t(unsigned int, blocks, max_count);
+		ret = dm_zoned_reclaim_read(dzt, from_zone, block, count, pl);
+		if (ret)
+			goto out;
+
+		/* Write blocks */
+		ret = dm_zoned_reclaim_write(dzt, to_zone, block, count, pl);
+		if (ret)
+			goto out;
+
+		block += count;
+		blocks -= count;
+
+	}
+
+	/* Validate written blocks */
+	ret = dm_zoned_validate_blocks(dzt, to_zone, chunk_block, nr_blocks);
+
+out:
+	dm_zoned_reclaim_free_page_list(dzt, pl, max_count);
+
+	return ret;
+}
+
+/**
+ * Get a zone for reclaim.
+ */
+static inline int
+dm_zoned_reclaim_lock(struct dm_zoned_target *dzt,
+		      struct dm_zoned_zone *dzone)
+{
+	unsigned long flags;
+	int ret = 0;
+
+	/* Skip active zones */
+	dm_zoned_lock_zone(dzone, flags);
+	if (!test_bit(DM_ZONE_ACTIVE, &dzone->flags)
+	    && !test_and_set_bit(DM_ZONE_RECLAIM, &dzone->flags))
+		ret = 1;
+	dm_zoned_unlock_zone(dzone, flags);
+
+	return ret;
+}
+
+/**
+ * Clear a zone reclaim flag.
+ */
+static inline void
+dm_zoned_reclaim_unlock(struct dm_zoned_target *dzt,
+			struct dm_zoned_zone *dzone)
+{
+	unsigned long flags;
+
+	dm_zoned_lock_zone(dzone, flags);
+	clear_bit_unlock(DM_ZONE_RECLAIM, &dzone->flags);
+	smp_mb__after_atomic();
+	wake_up_bit(&dzone->flags, DM_ZONE_RECLAIM);
+	dm_zoned_unlock_zone(dzone, flags);
+}
+
+/**
+ * Write valid blocks of @dzone into its buffer zone
+ * and swap the buffer zone with @wzone.
+ */
+static void
+dm_zoned_reclaim_remap_buffer(struct dm_zoned_target *dzt,
+			      struct dm_zoned_zone *dzone,
+			      struct dm_zoned_zone *wzone)
+{
+	struct dm_zoned_zone *bzone = dzone->bzone;
+	unsigned int nr_blocks;
+	sector_t chunk_block = 0;
+	int ret = 0;
+
+	dm_zoned_dev_debug(dzt, "Reclaim: Remap bzone %lu as dzone "
+			   "(new bzone %lu, %s dzone %lu)\n",
+			   bzone->id,
+			   wzone->id,
+			   dm_zoned_zone_is_cmr(dzone) ? "CMR" : "SMR",
+			   dzone->id);
+
+	while (chunk_block < dzt->zone_nr_blocks) {
+
+		/* Test block validity in the data zone */
+		if (chunk_block < dzone->wp_block) {
+			ret = dm_zoned_block_valid(dzt, dzone, chunk_block);
+			if (ret < 0)
+				break;
+		}
+		if (!ret) {
+			chunk_block++;
+			continue;
+		}
+
+		/* Copy and validate blocks */
+		nr_blocks = ret;
+		ret = dm_zoned_reclaim_copy(dzt, dzone, bzone, chunk_block, nr_blocks);
+		if (ret)
+			break;
+
+		chunk_block += nr_blocks;
+
+	}
+
+	if (ret) {
+		/* Free the target data zone */
+		dm_zoned_invalidate_zone(dzt, wzone);
+		dm_zoned_free_dzone(dzt, wzone);
+		goto out;
+	}
+
+	/* Remap bzone to dzone chunk and set wzone as a buffer zone */
+	dm_zoned_reclaim_lock(dzt, bzone);
+	dm_zoned_remap_bzone(dzt, bzone, wzone);
+
+	/* Invalidate all blocks in the data zone and free it */
+	dm_zoned_invalidate_zone(dzt, dzone);
+	dm_zoned_free_dzone(dzt, dzone);
+
+out:
+	dm_zoned_reclaim_unlock(dzt, bzone);
+	dm_zoned_reclaim_unlock(dzt, wzone);
+}
+
+/**
+ * Merge valid blocks of @dzone and of its buffer zone into @wzone.
+ */
+static void
+dm_zoned_reclaim_merge_buffer(struct dm_zoned_target *dzt,
+			      struct dm_zoned_zone *dzone,
+			      struct dm_zoned_zone *wzone)
+{
+	struct dm_zoned_zone *bzone = dzone->bzone;
+	struct dm_zoned_zone *rzone;
+	unsigned int nr_blocks;
+	sector_t chunk_block = 0;
+	int ret = 0;
+
+	dm_zoned_dev_debug(dzt, "Reclaim: Merge zones %lu and %lu into %s dzone %lu\n",
+			   bzone->id,
+			   dzone->id,
+			   dm_zoned_zone_is_cmr(wzone) ? "CMR" : "SMR",
+			   wzone->id);
+
+	while (chunk_block < dzt->zone_nr_blocks) {
+
+		/* Test block validity in the data zone */
+		rzone = dzone;
+		if (chunk_block < dzone->wp_block) {
+			ret = dm_zoned_block_valid(dzt, dzone, chunk_block);
+			if (ret < 0)
+				break;
+		}
+		if (!ret) {
+			/* Check the buffer zone */
+			rzone = bzone;
+			ret = dm_zoned_block_valid(dzt, bzone, chunk_block);
+			if (ret < 0)
+				break;
+			if (!ret) {
+				chunk_block++;
+				continue;
+			}
+		}
+
+		/* Copy and validate blocks */
+		nr_blocks = ret;
+		ret = dm_zoned_reclaim_copy(dzt, rzone, wzone, chunk_block, nr_blocks);
+		if (ret)
+			break;
+
+		chunk_block += nr_blocks;
+
+	}
+
+	if (ret) {
+		/* Free the target data zone */
+		dm_zoned_invalidate_zone(dzt, wzone);
+		dm_zoned_free_dzone(dzt, wzone);
+		goto out;
+	}
+
+	/* Invalidate all blocks of the buffer zone and free it */
+	dm_zoned_invalidate_zone(dzt, bzone);
+	dm_zoned_free_bzone(dzt, bzone);
+
+	/* Finally, remap dzone to wzone */
+	dm_zoned_remap_dzone(dzt, dzone, wzone);
+	dm_zoned_invalidate_zone(dzt, dzone);
+	dm_zoned_free_dzone(dzt, dzone);
+
+out:
+	dm_zoned_reclaim_unlock(dzt, wzone);
+}
+
+/**
+ * Move valid blocks of the buffer zone into the data zone.
+ */
+static void
+dm_zoned_reclaim_flush_buffer(struct dm_zoned_target *dzt,
+			      struct dm_zoned_zone *dzone)
+{
+	struct dm_zoned_zone *bzone = dzone->bzone;
+	unsigned int nr_blocks;
+	sector_t chunk_block = 0;
+	int ret = 0;
+
+	dm_zoned_dev_debug(dzt, "Reclaim: Flush buffer zone %lu into %s dzone %lu\n",
+			   bzone->id,
+			   dm_zoned_zone_is_cmr(dzone) ? "CMR" : "SMR",
+			   dzone->id);
+
+	/* The data zone may be empty due to discard after writes. */
+	/* So reset it before writing the buffer zone blocks.      */
+	dm_zoned_reset_zone_wp(dzt, dzone);
+
+	while (chunk_block < dzt->zone_nr_blocks) {
+
+		/* Test block validity */
+		ret = dm_zoned_block_valid(dzt, bzone, chunk_block);
+		if (ret < 0)
+			break;
+		if (!ret) {
+			chunk_block++;
+			continue;
+		}
+
+		/* Copy and validate blocks */
+		nr_blocks = ret;
+		ret = dm_zoned_reclaim_copy(dzt, bzone, dzone, chunk_block, nr_blocks);
+		if (ret)
+			break;
+
+		chunk_block += nr_blocks;
+
+	}
+
+	if (ret) {
+		/* Cleanup the data zone */
+		dm_zoned_invalidate_zone(dzt, dzone);
+		dm_zoned_reset_zone_wp(dzt, dzone);
+		return;
+	}
+
+	/* Invalidate all blocks of the buffer zone and free it */
+	dm_zoned_invalidate_zone(dzt, bzone);
+	dm_zoned_free_bzone(dzt, bzone);
+}
+
+/**
+ * Free empty data zone and buffer zone.
+ */
+static void
+dm_zoned_reclaim_empty(struct dm_zoned_target *dzt,
+		       struct dm_zoned_zone *dzone)
+{
+
+	dm_zoned_dev_debug(dzt, "Reclaim: Chunk %zu, free empty dzone %lu\n",
+			   dzone->map,
+			   dzone->id);
+
+	if (dzone->bzone)
+		dm_zoned_free_bzone(dzt, dzone->bzone);
+	dm_zoned_free_dzone(dzt, dzone);
+}
+
+/**
+ * Choose a reclaim zone target for merging/flushing a buffer zone.
+ */
+static struct dm_zoned_zone *
+dm_zoned_reclaim_target(struct dm_zoned_target *dzt,
+			struct dm_zoned_zone *dzone)
+{
+	struct dm_zoned_zone *wzone;
+	unsigned int blocks = dzone->wr_dir_blocks + dzone->wr_buf_blocks;
+	int type = DM_DZONE_ANY;
+
+	dm_zoned_dev_debug(dzt, "Reclaim: Zone %lu, %lu%% buffered blocks\n",
+			   dzone->id,
+			   (blocks ? dzone->wr_buf_blocks * 100 / blocks : 0));
+
+	/* Over 75 % of random write blocks -> cmr */
+	if (!dzone->wr_dir_blocks
+	    || (blocks &&
+		(dzone->wr_buf_blocks * 100 / blocks) >= 75))
+		type = DM_DZONE_CMR;
+
+	/* Get a data zone for merging */
+	dm_zoned_map_lock(dzt);
+	wzone = dm_zoned_alloc_dzone(dzt, DM_ZONED_MAP_UNMAPPED, type);
+	if (wzone) {
+		/*
+		 * Once the merge zone is remapped, it may be
+		 * accessed right away. Mark it as being reclaimed
+		 * so that the source data zone can be properly
+		 * cleaned up before any access.
+		 */
+		dm_zoned_reclaim_lock(dzt, wzone);
+	}
+	dm_zoned_map_unlock(dzt);
+
+	dm_zoned_zone_reset_stats(dzone);
+
+	return wzone;
+}
+
+/**
+ * Reclaim the buffer zone @bzone and the data zone it buffers.
+ */
+static void
+dm_zoned_reclaim_bzone(struct dm_zoned_target *dzt,
+		       struct dm_zoned_zone *bzone)
+{
+	struct dm_zoned_zone *dzone = bzone->bzone;
+	struct dm_zoned_zone *wzone;
+	int bweight, dweight;
+
+	/* Paranoia checks */
+	dm_zoned_dev_assert(dzt, dzone != NULL);
+	dm_zoned_dev_assert(dzt, dzone->bzone == bzone);
+	dm_zoned_dev_assert(dzt, !test_bit(DM_ZONE_ACTIVE, &dzone->flags));
+
+	dweight = dm_zoned_zone_weight(dzt, dzone);
+	bweight = dm_zoned_zone_weight(dzt, bzone);
+	dm_zoned_dev_debug(dzt, "Reclaim: Chunk %zu, dzone %lu (weight %d), "
+			   "bzone %lu (weight %d)\n",
+			   dzone->map, dzone->id, dweight,
+			   bzone->id, bweight);
+
+	/* If everything is invalid, free the zones */
+	if (!dweight && !bweight) {
+		dm_zoned_reclaim_empty(dzt, dzone);
+		goto out;
+	}
+
+	/* If all valid blocks are in the buffer zone, */
+	/* move them directly into the data zone.      */
+	if (!dweight) {
+		dm_zoned_reclaim_flush_buffer(dzt, dzone);
+		goto out;
+	}
+
+	/* Buffer zone and data zone need to be merged into a new data zone */
+	wzone = dm_zoned_reclaim_target(dzt, dzone);
+	if (!wzone) {
+		dm_zoned_dev_error(dzt, "Reclaim: No target zone available "
+				   "for merge reclaim\n");
+		goto out;
+	}
+
+	/* If the target zone is CMR, write the valid blocks of the data   */
+	/* zone into the buffer zone, then swap the buffer zone and the    */
+	/* new data zone. But only do this if it is less costly (fewer     */
+	/* blocks to move) than a regular merge.                           */
+	if (dm_zoned_zone_is_cmr(wzone) && bweight > dweight) {
+		dm_zoned_reclaim_remap_buffer(dzt, dzone, wzone);
+		goto out;
+	}
+
+	/* Otherwise, merge the valid blocks of the buffer zone and data   */
+	/* zone into a newly allocated SMR data zone. On success, the new  */
+	/* data zone is remapped to the chunk of the original data zone    */
+	dm_zoned_reclaim_merge_buffer(dzt, dzone, wzone);
+
+out:
+	dm_zoned_reclaim_unlock(dzt, dzone);
+}
+
+/**
+ * Reclaim buffer zone work.
+ */
+static void
+dm_zoned_reclaim_bzone_work(struct work_struct *work)
+{
+	struct dm_zoned_reclaim_zwork *rzwork = container_of(work,
+					struct dm_zoned_reclaim_zwork, work);
+	struct dm_zoned_target *dzt = rzwork->target;
+
+	dm_zoned_reclaim_bzone(dzt, rzwork->bzone);
+
+	kfree(rzwork);
+}
+
+/**
+ * Select a buffer zone candidate for reclaim.
+ */
+static struct dm_zoned_zone *
+dm_zoned_reclaim_bzone_candidate(struct dm_zoned_target *dzt)
+{
+	struct dm_zoned_zone *bzone;
+
+	/* Search for a buffer zone candidate to reclaim */
+	dm_zoned_map_lock(dzt);
+
+	if (list_empty(&dzt->bz_lru_list))
+		goto out;
+
+	bzone = list_first_entry(&dzt->bz_lru_list, struct dm_zoned_zone, link);
+	while (bzone) {
+		if (dm_zoned_reclaim_lock(dzt, bzone->bzone)) {
+			dm_zoned_map_unlock(dzt);
+			return bzone;
+		}
+		if (list_is_last(&bzone->link, &dzt->bz_lru_list))
+			break;
+		bzone = list_next_entry(bzone, link);
+	}
+
+out:
+	dm_zoned_map_unlock(dzt);
+
+	return NULL;
+
+}
+
+/**
+ * Start reclaim workers.
+ */
+static int
+dm_zoned_reclaim_bzones(struct dm_zoned_target *dzt)
+{
+	struct dm_zoned_zone *bzone = NULL;
+	struct dm_zoned_reclaim_zwork *rzwork;
+	unsigned int max_workers = 0, nr_free;
+	unsigned long start;
+	int n = 0;
+
+	/* Reclaim aggressively if the number of free buffer zones is  */
+	/* low or if BIOs are waiting for one; otherwise, reclaim one  */
+	/* buffer zone at a time when the disk is idle.                */
+	nr_free = atomic_read(&dzt->bz_nr_free);
+	if (nr_free < dzt->bz_nr_free_low)
+		max_workers = dzt->bz_nr_free_low - nr_free;
+	else if (atomic_read(&dzt->dz_nr_active_wait))
+		max_workers = atomic_read(&dzt->dz_nr_active_wait);
+	else if (dm_zoned_idle(dzt))
+		max_workers = 1;
+	max_workers = min(max_workers, (unsigned int)DM_ZONED_RECLAIM_MAX_WORKERS);
+
+	start = jiffies;
+	while (n < max_workers) {
+
+		bzone = dm_zoned_reclaim_bzone_candidate(dzt);
+		if (!bzone)
+			break;
+
+		if (max_workers == 1) {
+			/* Do it in this context */
+			dm_zoned_reclaim_bzone(dzt, bzone);
+		} else {
+			/* Start a zone reclaim work */
+			rzwork = kmalloc(sizeof(struct dm_zoned_reclaim_zwork), GFP_KERNEL);
+			if (unlikely(!rzwork))
+				break;
+			INIT_WORK(&rzwork->work, dm_zoned_reclaim_bzone_work);
+			rzwork->target = dzt;
+			rzwork->bzone = bzone;
+			queue_work(dzt->reclaim_zwq, &rzwork->work);
+		}
+
+		n++;
+
+	}
+
+	if (n) {
+		flush_workqueue(dzt->reclaim_zwq);
+		dm_zoned_flush(dzt);
+		dm_zoned_dev_debug(dzt, "Reclaim: %d bzones reclaimed in %u msecs\n",
+				   n,
+				   jiffies_to_msecs(jiffies - start));
+	}
+
+	return n;
+}
+
+/**
+ * Reclaim unbuffered data zones marked as empty.
+ */
+static int
+dm_zoned_reclaim_dzones(struct dm_zoned_target *dzt)
+{
+	struct dm_zoned_zone *dz, *dzone;
+	int ret;
+
+	dm_zoned_map_lock(dzt);
+
+	/* If not idle, do only CMR zones */
+	while (!list_empty(&dzt->dz_empty_list)) {
+
+		/* Search for a candidate to reclaim */
+		dzone = NULL;
+		list_for_each_entry(dz, &dzt->dz_empty_list, elink) {
+			if (!dm_zoned_idle(dzt) && !dm_zoned_zone_is_cmr(dz))
+				continue;
+			dzone = dz;
+			break;
+		}
+
+		if (!dzone || !dm_zoned_reclaim_lock(dzt, dzone))
+			break;
+
+		clear_bit_unlock(DM_ZONE_EMPTY, &dzone->flags);
+		smp_mb__after_atomic();
+		list_del_init(&dzone->elink);
+
+		dm_zoned_map_unlock(dzt);
+
+		if (dm_zoned_zone_weight(dzt, dzone) == 0)
+			dm_zoned_reclaim_empty(dzt, dzone);
+		dm_zoned_reclaim_unlock(dzt, dzone);
+
+		dm_zoned_map_lock(dzt);
+
+	}
+
+	ret = !list_empty(&dzt->dz_empty_list);
+
+	dm_zoned_map_unlock(dzt);
+
+	return ret;
+}
+
+/**
+ * Buffer zone reclaim work.
+ */
+void
+dm_zoned_reclaim_work(struct work_struct *work)
+{
+	struct dm_zoned_target *dzt = container_of(work,
+		struct dm_zoned_target, reclaim_work.work);
+	int have_empty_dzones;
+	int reclaimed_bzones;
+	unsigned long delay;
+
+	/* Try to reclaim buffer zones */
+	set_bit(DM_ZONED_RECLAIM_ACTIVE, &dzt->flags);
+	smp_mb__after_atomic();
+
+	dm_zoned_dev_debug(dzt, "Reclaim: %u/%u free bzones, disk %s, %d active zones (%d waiting)\n",
+			   atomic_read(&dzt->bz_nr_free),
+			   dzt->nr_buf_zones,
+			   (dm_zoned_idle(dzt) ? "idle" : "busy"),
+			   atomic_read(&dzt->dz_nr_active),
+			   atomic_read(&dzt->dz_nr_active_wait));
+
+	/* Reclaim empty data zones */
+	have_empty_dzones = dm_zoned_reclaim_dzones(dzt);
+
+	/* Reclaim buffer zones */
+	reclaimed_bzones = dm_zoned_reclaim_bzones(dzt);
+
+	if (atomic_read(&dzt->bz_nr_free) < dzt->nr_buf_zones ||
+	    have_empty_dzones) {
+		if (dm_zoned_idle(dzt)) {
+			delay = 0;
+		} else if (atomic_read(&dzt->dz_nr_active_wait) ||
+			 (atomic_read(&dzt->bz_nr_free) < dzt->bz_nr_free_low)) {
+			if (reclaimed_bzones)
+				delay = 0;
+			else
+				delay = HZ / 2;
+		} else
+			delay = DM_ZONED_RECLAIM_PERIOD;
+		dm_zoned_schedule_reclaim(dzt, delay);
+	}
+
+	clear_bit_unlock(DM_ZONED_RECLAIM_ACTIVE, &dzt->flags);
+	smp_mb__after_atomic();
+}
+
diff --git a/drivers/md/dm-zoned.h b/drivers/md/dm-zoned.h
new file mode 100644
index 0000000..ea0ee92
--- /dev/null
+++ b/drivers/md/dm-zoned.h
@@ -0,0 +1,687 @@
+/*
+ * (C) Copyright 2016 Western Digital.
+ *
+ * This software is distributed under the terms of the GNU Lesser General
+ * Public License version 2, or any later version, "as is," without technical
+ * support, and WITHOUT ANY WARRANTY, without even the implied warranty
+ * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ * Author: Damien Le Moal <damien.lemoal@hgst.com>
+ */
+#include <linux/types.h>
+#include <linux/blkdev.h>
+#include <linux/device-mapper.h>
+#include <linux/dm-io.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/rwsem.h>
+#include <linux/mutex.h>
+#include <linux/workqueue.h>
+#include <linux/buffer_head.h>
+
+/**
+ * Define to enable debug message support.
+ */
+#undef __DM_ZONED_DEBUG
+
+/**
+ * Version.
+ */
+#define DM_ZONED_VER_MAJ			0
+#define DM_ZONED_VER_MIN			1
+
+/**
+ * Zone type (high 4 bits of zone flags).
+ */
+#define DM_ZONE_META		0x10000000
+#define DM_ZONE_BUF		0x20000000
+#define DM_ZONE_DATA		0x30000000
+#define DM_ZONE_TYPE_MASK	0xF0000000
+
+/**
+ * Zone flags.
+ */
+enum {
+	DM_ZONE_ACTIVE,
+	DM_ZONE_ACTIVE_BIO,
+	DM_ZONE_ACTIVE_WAIT,
+	DM_ZONE_BUFFERED,
+	DM_ZONE_DIRTY,
+	DM_ZONE_EMPTY,
+	DM_ZONE_RECLAIM,
+};
+
+/**
+ * dm device emulates 4K blocks.
+ */
+#define DM_ZONED_BLOCK_SHIFT		12
+#define DM_ZONED_BLOCK_SIZE		(1 << DM_ZONED_BLOCK_SHIFT)
+#define DM_ZONED_BLOCK_MASK		(DM_ZONED_BLOCK_SIZE - 1)
+
+#define DM_ZONED_BLOCK_SHIFT_BITS	(DM_ZONED_BLOCK_SHIFT + 3)
+#define DM_ZONED_BLOCK_SIZE_BITS	(DM_ZONED_BLOCK_SIZE << 3)
+#define DM_ZONED_BLOCK_MASK_BITS	(DM_ZONED_BLOCK_SIZE_BITS - 1)
+
+#define DM_ZONED_BLOCK_SECTORS		(DM_ZONED_BLOCK_SIZE >> SECTOR_SHIFT)
+#define DM_ZONED_BLOCK_SECTORS_MASK	(DM_ZONED_BLOCK_SECTORS - 1)
+
+#define dm_zoned_block_to_sector(b) \
+	((b) << (DM_ZONED_BLOCK_SHIFT - SECTOR_SHIFT))
+#define dm_zoned_sector_to_block(s) \
+	((s) >> (DM_ZONED_BLOCK_SHIFT - SECTOR_SHIFT))
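+
+/*
+ * Example (illustrative only): with 4 KB blocks and 512 B sectors,
+ * DM_ZONED_BLOCK_SHIFT - SECTOR_SHIFT = 12 - 9 = 3, so
+ * dm_zoned_block_to_sector(10) == 10 << 3 == 80 and
+ * dm_zoned_sector_to_block(80) == 80 >> 3 == 10.
+ */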
+
+#define DM_ZONED_MIN_BIOS		128
+
+/**
+ * On-disk super block (sector 0 of the target device).
+ */
+#define DM_ZONED_MAGIC	((((unsigned int)('D')) << 24) | \
+			 (((unsigned int)('S')) << 16) | \
+			 (((unsigned int)('M')) <<  8) | \
+			 ((unsigned int)('R')))
+#define DM_ZONED_META_VER		1
+
+/**
+ * On-disk metadata:
+ *    - Block 0 stores the super block.
+ *    - From block 1, nr_map_blocks blocks of data zone mapping entries
+ *    - From block nr_map_blocks+1, nr_bitmap_blocks blocks of zone
+ *      block bitmap.
+ */
+
+/**
+ * Buffer zone mapping entry: each entry identifies a zone of the
+ * backend device used as a buffer zone (bzone_id) and the ID of the
+ * data zone being buffered (dzone_id).
+ * For unused buffer zones, the data zone ID is set to 0.
+ */
+struct dm_zoned_bz_map {
+	__le32			bzone_id;			/*    4 */
+	__le32			dzone_id;			/*    8 */
+};
+
+#define DM_ZONED_NR_BZONES		32
+#define DM_ZONED_NR_BZONES_MIN		4
+#define DM_ZONED_NR_BZONES_LOW		25
+#define DM_ZONED_NR_BZONES_LOW_MIN	2
+
+/**
+ * Buffer zone mapping entries are stored in the super block, after its
+ * 128 byte header. At most DM_ZONED_NR_BZONES_MAX (= 496) entries fit.
+ */
+#define DM_ZONED_NR_BZONES_MAX	((4096 - 128) / sizeof(struct dm_zoned_bz_map))
+
+struct dm_zoned_super {
+
+	__le32			magic;				/*    4 */
+	__le32			version;			/*    8 */
+
+	__le32			nr_buf_zones;			/*   12 */
+	__le32			nr_data_zones;			/*   16 */
+	__le32			nr_map_blocks;			/*   20 */
+	__le32			nr_bitmap_blocks;		/*   24 */
+
+	u8			reserved[104];			/*  128 */
+
+	struct dm_zoned_bz_map	bz_map[DM_ZONED_NR_BZONES_MAX];	/* 4096 */
+
+};
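+
+/*
+ * Layout note: the fixed fields above use 24 bytes and the reserved
+ * area 104 bytes, i.e. a 128 byte header, leaving 4096 - 128 = 3968
+ * bytes for 496 eight-byte buffer zone mapping entries.
+ */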
+
+/**
+ * Zone mapping table metadata.
+ */
+#define DM_ZONED_MAP_ENTRIES_PER_BLOCK	(DM_ZONED_BLOCK_SIZE / sizeof(u32))
+#define DM_ZONED_MAP_ENTRIES_SHIFT	(ilog2(DM_ZONED_MAP_ENTRIES_PER_BLOCK))
+#define DM_ZONED_MAP_ENTRIES_MASK	(DM_ZONED_MAP_ENTRIES_PER_BLOCK - 1)
+#define DM_ZONED_MAP_UNMAPPED		UINT_MAX
+
+#define DM_ZONE_WORK_MAX	128
+#define DM_ZONE_WORK_MAX_BIO	64
+
+struct dm_zoned_target;
+struct dm_zoned_zone;
+
+/**
+ * Zone work descriptor: this exists only
+ * for active zones.
+ */
+struct dm_zoned_zwork {
+	struct work_struct	work;
+
+	struct dm_zoned_target	*target;
+	struct dm_zoned_zone	*dzone;
+
+	struct list_head	link;
+
+	/* ref counts the number of BIOs pending  */
+	/* and executing, as well as the queueing */
+	/* status of the work_struct.             */
+	atomic_t		ref;
+	atomic_t		bio_count;
+	struct bio_list		bio_list;
+
+};
+
+/**
+ * Zone descriptor.
+ */
+struct dm_zoned_zone {
+	struct list_head	link;
+	struct list_head	elink;
+	struct blk_zone		*blkz;
+	struct dm_zoned_zwork	*zwork;
+	unsigned long		flags;
+	unsigned long		id;
+
+	/* For data zones, pointer to a write buffer zone (may be NULL)    */
+	/* For write buffer zones, pointer to the data zone being buffered */
+	struct dm_zoned_zone	*bzone;
+
+	/* For data zones: the logical chunk mapped, which */
+	/* is also the index of the entry for the zone in  */
+	/* the data zone mapping table.                    */
+	/* For buffer zones: the index of the entry for    */
+	/* zone in the buffer zone mapping table stored in */
+	/* the super block.                                */
+	sector_t		map;
+
+	/* The position of the zone write pointer,  */
+	/* relative to the first block of the zone. */
+	sector_t		wp_block;
+
+	/* Stats (to determine access pattern for reclaim) */
+	unsigned long		mtime;
+	unsigned long		wr_dir_blocks;
+	unsigned long		wr_buf_blocks;
+
+};
+
+extern struct kmem_cache *dm_zoned_zone_cache;
+
+#define dm_zoned_lock_zone(zone, flags) \
+	spin_lock_irqsave(&(zone)->blkz->lock, flags)
+#define dm_zoned_unlock_zone(zone, flags) \
+	spin_unlock_irqrestore(&(zone)->blkz->lock, flags)
+#define dm_zoned_zone_is_cmr(z) \
+	blk_zone_is_cmr((z)->blkz)
+#define dm_zoned_zone_is_smr(z) \
+	blk_zone_is_smr((z)->blkz)
+#define dm_zoned_zone_is_seqreq(z) \
+	((z)->blkz->type == BLK_ZONE_TYPE_SEQWRITE_REQ)
+#define dm_zoned_zone_is_seqpref(z) \
+	((z)->blkz->type == BLK_ZONE_TYPE_SEQWRITE_PREF)
+#define dm_zoned_zone_is_seq(z) \
+	(dm_zoned_zone_is_seqreq(z) || dm_zoned_zone_is_seqpref(z))
+#define dm_zoned_zone_is_rnd(z) \
+	(dm_zoned_zone_is_cmr(z) || dm_zoned_zone_is_seqpref(z))
+
+#define dm_zoned_zone_offline(z) \
+	((z)->blkz->state == BLK_ZONE_OFFLINE)
+#define dm_zoned_zone_readonly(z) \
+	((z)->blkz->state == BLK_ZONE_READONLY)
+
+#define dm_zoned_zone_start_sector(z) \
+	((z)->blkz->start)
+#define dm_zoned_zone_sectors(z) \
+	((z)->blkz->len)
+#define dm_zoned_zone_next_sector(z) \
+	(dm_zoned_zone_start_sector(z) + dm_zoned_zone_sectors(z))
+#define dm_zoned_zone_start_block(z) \
+	dm_zoned_sector_to_block(dm_zoned_zone_start_sector(z))
+#define dm_zoned_zone_next_block(z) \
+	dm_zoned_sector_to_block(dm_zoned_zone_next_sector(z))
+#define dm_zoned_zone_empty(z) \
+	((z)->wp_block == dm_zoned_zone_start_block(z))
+
+#define dm_zoned_chunk_sector(dzt, s) \
+	((s) & (dzt)->zone_nr_sectors_mask)
+#define dm_zoned_chunk_block(dzt, b) \
+	((b) & (dzt)->zone_nr_blocks_mask)
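+
+/*
+ * Example (illustrative only, assuming a power-of-two zone size of
+ * 256 MiB, i.e. 65536 blocks of 4 KB):
+ * dm_zoned_chunk_block(dzt, 70000) == 70000 & 65535 == 4464,
+ * the offset of the block within its chunk/zone.
+ */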
+
+#define dm_zoned_zone_type(z) \
+	((z)->flags & DM_ZONE_TYPE_MASK)
+#define dm_zoned_zone_meta(z) \
+	(dm_zoned_zone_type(z) == DM_ZONE_META)
+#define dm_zoned_zone_buf(z) \
+	(dm_zoned_zone_type(z) == DM_ZONE_BUF)
+#define dm_zoned_zone_data(z) \
+	(dm_zoned_zone_type(z) == DM_ZONE_DATA)
+
+#define dm_zoned_bio_sector(bio) \
+	((bio)->bi_iter.bi_sector)
+#define dm_zoned_bio_chunk_sector(dzt, bio) \
+	dm_zoned_chunk_sector((dzt), dm_zoned_bio_sector(bio))
+#define dm_zoned_bio_sectors(bio) \
+	bio_sectors(bio)
+#define dm_zoned_bio_block(bio) \
+	dm_zoned_sector_to_block(dm_zoned_bio_sector(bio))
+#define dm_zoned_bio_blocks(bio) \
+	dm_zoned_sector_to_block(dm_zoned_bio_sectors(bio))
+#define dm_zoned_bio_chunk(dzt, bio) \
+	(dm_zoned_bio_sector(bio) >> (dzt)->zone_nr_sectors_shift)
+#define dm_zoned_bio_chunk_block(dzt, bio) \
+	dm_zoned_chunk_block((dzt), dm_zoned_bio_block(bio))
+
+/**
+ * Reset a zone's stats.
+ */
+static inline void
+dm_zoned_zone_reset_stats(struct dm_zoned_zone *zone)
+{
+	zone->mtime = 0;
+	zone->wr_dir_blocks = 0;
+	zone->wr_buf_blocks = 0;
+}
+
+/**
+ * For buffer zone reclaim.
+ */
+#define DM_ZONED_RECLAIM_PERIOD_SECS	1UL /* Reclaim check period (seconds) */
+#define DM_ZONED_RECLAIM_PERIOD		(DM_ZONED_RECLAIM_PERIOD_SECS * HZ)
+#define DM_ZONED_RECLAIM_MAX_BLOCKS	1024 /* Max 4 KB blocks per reclaim I/O */
+#define DM_ZONED_RECLAIM_MAX_WORKERS	4 /* Maximum number of buffer zone reclaim works */
+
+struct dm_zoned_reclaim_zwork {
+	struct work_struct	work;
+	struct dm_zoned_target	*target;
+	struct dm_zoned_zone	*bzone;
+};
+
+/**
+ * Default maximum number of blocks used for
+ * aligning an SMR zone write pointer (WP) with WRITE SAME.
+ * (0 => WP alignment disabled)
+ */
+#define DM_ZONED_ALIGN_WP_MAX_BLOCK	0
+
+/**
+ * Target flags.
+ */
+enum {
+	DM_ZONED_DEBUG,
+	DM_ZONED_ALIGN_WP,
+	DM_ZONED_SUSPENDED,
+	DM_ZONED_RECLAIM_ACTIVE,
+};
+
+/**
+ * Target descriptor.
+ */
+struct dm_zoned_target {
+	struct dm_dev		*ddev;
+
+	/* Target zoned device information */
+	char			zbd_name[BDEVNAME_SIZE];
+	struct block_device	*zbd;
+	sector_t		zbd_capacity;
+	struct request_queue	*zbdq;
+	unsigned int		zbd_metablk_shift;
+	unsigned long		flags;
+	struct buffer_head	*sb_bh;
+
+	unsigned int		nr_zones;
+	unsigned int		nr_cmr_zones;
+	unsigned int		nr_smr_zones;
+	unsigned int		nr_rnd_zones;
+	unsigned int		nr_meta_zones;
+	unsigned int		nr_buf_zones;
+	unsigned int		nr_data_zones;
+	unsigned int		nr_cmr_data_zones;
+	unsigned int		nr_smr_data_zones;
+
+#ifdef __DM_ZONED_DEBUG
+	size_t			used_mem;
+#endif
+
+	sector_t		zone_nr_sectors;
+	unsigned int		zone_nr_sectors_shift;
+	sector_t		zone_nr_sectors_mask;
+
+	sector_t		zone_nr_blocks;
+	sector_t		zone_nr_blocks_shift;
+	sector_t		zone_nr_blocks_mask;
+	sector_t		zone_bitmap_size;
+	unsigned int		zone_nr_bitmap_blocks;
+
+	/* Zone mapping management lock */
+	struct mutex		map_lock;
+
+	/* Zone bitmaps */
+	sector_t		bitmap_block;
+	unsigned int		nr_bitmap_blocks;
+
+	/* Buffer zones */
+	struct dm_zoned_bz_map	*bz_map;
+	atomic_t		bz_nr_free;
+	unsigned int		bz_nr_free_low;
+	struct list_head	bz_free_list;
+	struct list_head	bz_lru_list;
+	struct list_head	bz_wait_list;
+
+	/* Data zones */
+	unsigned int		nr_map_blocks;
+	unsigned int		align_wp_max_blocks;
+	struct buffer_head	**dz_map_bh;
+	atomic_t		dz_nr_active;
+	atomic_t		dz_nr_active_wait;
+	unsigned int		dz_nr_unmap;
+	struct list_head	dz_unmap_cmr_list;
+	struct list_head	dz_map_cmr_list;
+	struct list_head	dz_unmap_smr_list;
+	struct list_head	dz_empty_list;
+
+	/* Internal I/Os */
+	struct bio_set		*bio_set;
+	struct workqueue_struct *zone_wq;
+	unsigned long		last_bio_time;
+
+	/* For flush */
+	spinlock_t		flush_lock;
+	struct bio_list		flush_list;
+	struct work_struct	flush_work;
+	struct workqueue_struct *flush_wq;
+
+	/* For reclaim */
+	struct dm_io_client	*reclaim_client;
+	struct delayed_work	reclaim_work;
+	struct workqueue_struct *reclaim_wq;
+	struct workqueue_struct *reclaim_zwq;
+
+};
+
+#define dm_zoned_map_lock(dzt)		mutex_lock(&(dzt)->map_lock)
+#define dm_zoned_map_unlock(dzt)	mutex_unlock(&(dzt)->map_lock)
+
+/**
+ * Number of seconds without any BIO before the
+ * device is considered idle.
+ */
+#define DM_ZONED_IDLE_SECS		2UL
+
+/**
+ * Test if the target device is idle.
+ */
+static inline int
+dm_zoned_idle(struct dm_zoned_target *dzt)
+{
+	return atomic_read(&(dzt)->dz_nr_active) == 0 &&
+		time_is_before_jiffies(dzt->last_bio_time
+				       + DM_ZONED_IDLE_SECS * HZ);
+}
+
+/**
+ * Target config passed as dmsetup arguments.
+ */
+struct dm_zoned_target_config {
+	char			*dev_path;
+	int			debug;
+	int			format;
+	unsigned long		align_wp;
+	unsigned long		nr_buf_zones;
+};
+
+/**
+ * Zone BIO context.
+ */
+struct dm_zoned_bioctx {
+	struct dm_zoned_target	*target;
+	struct dm_zoned_zone	*dzone;
+	struct bio		*bio;
+	atomic_t		ref;
+	int			error;
+};
+
+#define dm_zoned_info(format, args...)			\
+	printk(KERN_INFO "dm-zoned: " format, ## args)
+
+#define dm_zoned_dev_info(dzt, format, args...)	\
+	dm_zoned_info("(%s) " format,			\
+		      (dzt)->zbd_name, ## args)
+
+#define dm_zoned_error(format, args...)			\
+	printk(KERN_ERR	"dm-zoned: " format, ## args)
+
+#define dm_zoned_dev_error(dzt, format, args...)	\
+	dm_zoned_error("(%s) " format,			\
+		       (dzt)->zbd_name, ## args)
+
+#define dm_zoned_warning(format, args...)		\
+	printk(KERN_ALERT				\
+	       "dm-zoned: " format, ## args)
+
+#define dm_zoned_dev_warning(dzt, format, args...)	\
+	dm_zoned_warning("(%s) " format,		\
+			 (dzt)->zbd_name, ## args)
+
+#define dm_zoned_dump_stack()				\
+	do {						\
+		dm_zoned_warning("Start stack dump\n");	\
+		dump_stack();				\
+		dm_zoned_warning("End stack dump\n");	\
+	} while (0)
+
+#define dm_zoned_oops(format, args...)			\
+	do {						\
+		dm_zoned_warning(format, ## args);	\
+		dm_zoned_dump_stack();			\
+		BUG();					\
+	} while (0)
+
+#define dm_zoned_dev_oops(dzt, format, args...)			\
+	do {							\
+		dm_zoned_dev_warning(dzt, format, ## args);	\
+		dm_zoned_dump_stack();				\
+		BUG();						\
+	} while (0)
+
+#define dm_zoned_assert_cond(cond)	(unlikely(!(cond)))
+#define dm_zoned_assert(cond)					\
+	do {							\
+		if (dm_zoned_assert_cond(cond)) {		\
+			dm_zoned_oops("(%s/%d) "		\
+				      "Condition %s failed\n",	\
+				      __func__, __LINE__,	\
+				      # cond);			\
+		}						\
+	} while (0)
+
+#define dm_zoned_dev_assert(dzt, cond)					\
+	do {								\
+		if (dm_zoned_assert_cond(cond)) {			\
+			dm_zoned_dev_oops(dzt, "(%s/%d) "		\
+					  "Condition %s failed\n",	\
+					  __func__, __LINE__,		\
+					  # cond);			\
+		}							\
+	} while (0)
+
+#ifdef __DM_ZONED_DEBUG
+
+#define dm_zoned_dev_debug(dzt, format, args...)		\
+	do {							\
+		if (test_bit(DM_ZONED_DEBUG, &(dzt)->flags)) {	\
+			printk(KERN_INFO			\
+			       "dm-zoned: (%s) " format,	\
+			       (dzt)->zbd_name,	## args);	\
+		}						\
+	} while (0)
+
+
+#else
+
+#define dm_zoned_dev_debug(dzt, format, args...) \
+	do { } while (0)
+
+#endif /* __DM_ZONED_DEBUG */
+
+extern int
+dm_zoned_init_meta(struct dm_zoned_target *dzt,
+		   struct dm_zoned_target_config *conf);
+
+extern int
+dm_zoned_resume_meta(struct dm_zoned_target *dzt);
+
+extern void
+dm_zoned_cleanup_meta(struct dm_zoned_target *dzt);
+
+extern int
+dm_zoned_flush(struct dm_zoned_target *dzt);
+
+extern int
+dm_zoned_advance_zone_wp(struct dm_zoned_target *dzt,
+			 struct dm_zoned_zone *zone,
+			 sector_t nr_blocks);
+
+extern int
+dm_zoned_reset_zone_wp(struct dm_zoned_target *dzt,
+		       struct dm_zoned_zone *zone);
+
+extern struct dm_zoned_zone *
+dm_zoned_alloc_bzone(struct dm_zoned_target *dzt,
+		     struct dm_zoned_zone *dzone);
+
+extern void
+dm_zoned_free_bzone(struct dm_zoned_target *dzt,
+		    struct dm_zoned_zone *bzone);
+
+extern void
+dm_zoned_validate_bzone(struct dm_zoned_target *dzt,
+			struct dm_zoned_zone *dzone);
+
+/**
+ * Data zone allocation type hint.
+ */
+enum {
+	DM_DZONE_ANY,
+	DM_DZONE_SMR,
+	DM_DZONE_CMR
+};
+
+extern struct dm_zoned_zone *
+dm_zoned_alloc_dzone(struct dm_zoned_target *dzt,
+		     unsigned int chunk,
+		     unsigned int type_hint);
+
+extern void
+dm_zoned_free_dzone(struct dm_zoned_target *dzt,
+		    struct dm_zoned_zone *dzone);
+
+extern void
+dm_zoned_validate_dzone(struct dm_zoned_target *dzt,
+			struct dm_zoned_zone *dzone);
+
+extern void
+dm_zoned_remap_dzone(struct dm_zoned_target *dzt,
+		     struct dm_zoned_zone *from_dzone,
+		     struct dm_zoned_zone *to_dzone);
+
+extern void
+dm_zoned_remap_bzone(struct dm_zoned_target *dzt,
+		     struct dm_zoned_zone *bzone,
+		     struct dm_zoned_zone *new_bzone);
+
+extern struct dm_zoned_zone *
+dm_zoned_bio_map(struct dm_zoned_target *dzt,
+		 struct bio *bio);
+
+extern void
+dm_zoned_run_dzone(struct dm_zoned_target *dzt,
+		   struct dm_zoned_zone *dzone);
+
+extern void
+dm_zoned_put_dzone(struct dm_zoned_target *dzt,
+		   struct dm_zoned_zone *dzone);
+
+extern int
+dm_zoned_validate_blocks(struct dm_zoned_target *dzt,
+			 struct dm_zoned_zone *zone,
+			 sector_t chunk_block,
+			 unsigned int nr_blocks);
+
+extern int
+dm_zoned_invalidate_blocks(struct dm_zoned_target *dzt,
+			   struct dm_zoned_zone *zone,
+			   sector_t chunk_block,
+			   unsigned int nr_blocks);
+
+static inline int
+dm_zoned_invalidate_zone(struct dm_zoned_target *dzt,
+			 struct dm_zoned_zone *zone)
+{
+	return dm_zoned_invalidate_blocks(dzt, zone,
+					0, dzt->zone_nr_blocks);
+}
+
+extern int
+dm_zoned_block_valid(struct dm_zoned_target *dzt,
+		     struct dm_zoned_zone *zone,
+		     sector_t chunk_block);
+
+extern int
+dm_zoned_valid_blocks(struct dm_zoned_target *dzt,
+		      struct dm_zoned_zone *zone,
+		      sector_t chunk_block,
+		      unsigned int nr_blocks);
+
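+/**
+ * A zone's weight is its number of valid blocks. For a sequential write
+ * required zone, only blocks below the write pointer can be valid, so
+ * an empty sequential zone has a weight of 0.
+ */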
+static inline int
+dm_zoned_zone_weight(struct dm_zoned_target *dzt,
+		     struct dm_zoned_zone *zone)
+{
+	if (dm_zoned_zone_is_seqreq(zone)) {
+		if (dm_zoned_zone_empty(zone))
+			return 0;
+		return dm_zoned_valid_blocks(dzt, zone,
+				   0, zone->wp_block);
+	}
+
+	return dm_zoned_valid_blocks(dzt, zone,
+				   0, dzt->zone_nr_blocks);
+}
+
+/**
+ * Wait for a zone's write BIOs to complete.
+ */
+static inline void
+dm_zoned_wait_for_stable_zone(struct dm_zoned_zone *zone)
+{
+	if (test_bit(DM_ZONE_ACTIVE_BIO, &zone->flags))
+		wait_on_bit_io(&zone->flags, DM_ZONE_ACTIVE_BIO,
+			       TASK_UNINTERRUPTIBLE);
+}
+
+extern void
+dm_zoned_zone_work(struct work_struct *work);
+
+extern void
+dm_zoned_reclaim_work(struct work_struct *work);
+
+/**
+ * Schedule reclaim (delay in jiffies).
+ */
+static inline void
+dm_zoned_schedule_reclaim(struct dm_zoned_target *dzt,
+			  unsigned long delay)
+{
+	mod_delayed_work(dzt->reclaim_wq, &dzt->reclaim_work, delay);
+}
+
+/**
+ * Trigger reclaim.
+ */
+static inline void
+dm_zoned_trigger_reclaim(struct dm_zoned_target *dzt)
+{
+	dm_zoned_schedule_reclaim(dzt, 0);
+}
+
+#ifdef __DM_ZONED_DEBUG
+static inline void
+dm_zoned_account_mem(struct dm_zoned_target *dzt,
+		     size_t bytes)
+{
+	dzt->used_mem += bytes;
+}
+#else
+#define dm_zoned_account_mem(dzt, bytes) do { } while (0)
+#endif
-- 
1.8.5.6


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] block: add flag for single-threaded submission
  2016-07-19 14:02 ` [PATCH 1/3] block: add flag for single-threaded submission Hannes Reinecke
@ 2016-07-20  1:13     ` Damien Le Moal
  2016-07-21  5:54   ` Christoph Hellwig
  1 sibling, 0 replies; 15+ messages in thread
From: Damien Le Moal @ 2016-07-20  1:13 UTC (permalink / raw)
  To: Hannes Reinecke, Mike Snitzer
  Cc: dm-devel-redhat.com, linux-scsi, linux-block, Christoph Hellwig,
	Jens Axboe


On 7/19/16 23:02, Hannes Reinecke wrote:
> Some devices (most notably SMR drives) support only
> one I/O stream eg for ensuring ordered I/O submission.
> This patch adds a new block queue flag
> 'BLK_QUEUE_SINGLE' to support these devices.
>
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>  block/blk-core.c       | 2 ++
>  include/linux/blkdev.h | 2 ++
>  2 files changed, 4 insertions(+)

Reviewed-by: Damien Le Moal <damien.lemoal@hgst.com>
Tested-by: Damien Le Moal <damien.lemoal@hgst.com>

-- 
Damien Le Moal, Ph.D.
Sr. Manager, System Software Group, HGST Research,
HGST, a Western Digital brand
Damien.LeMoal@hgst.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa,
Kanagawa, 252-0888 Japan
www.hgst.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/3] sd: enable single-threaded I/O submission for zoned devices
  2016-07-19 14:02 ` [PATCH 2/3] sd: enable single-threaded I/O submission for zoned devices Hannes Reinecke
@ 2016-07-20  1:15     ` Damien Le Moal
  0 siblings, 0 replies; 15+ messages in thread
From: Damien Le Moal @ 2016-07-20  1:15 UTC (permalink / raw)
  To: Hannes Reinecke, Mike Snitzer
  Cc: linux-scsi, linux-block, Christoph Hellwig, Jens Axboe


On 7/19/16 23:02, Hannes Reinecke wrote:
> zoned devices require single-thread I/O submission to guarantee
> sequential I/O, so enable the block layer flag for it.
>
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>  drivers/scsi/sd.c | 1 +
>  1 file changed, 1 insertion(+)

Reviewed-by: Damien Le Moal <damien.lemoal@hgst.com>
Tested-by: Damien Le Moal <damien.lemoal@hgst.com>

-- 
Damien Le Moal, Ph.D.
Sr. Manager, System Software Group, HGST Research,
HGST, a Western Digital brand
Damien.LeMoal@hgst.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa,
Kanagawa, 252-0888 Japan
www.hgst.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] block: add flag for single-threaded submission
  2016-07-19 14:02 ` [PATCH 1/3] block: add flag for single-threaded submission Hannes Reinecke
  2016-07-20  1:13     ` Damien Le Moal
@ 2016-07-21  5:54   ` Christoph Hellwig
  2016-07-21  6:01       ` Hannes Reinecke
  1 sibling, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2016-07-21  5:54 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Mike Snitzer, dm-devel-redhat.com, Damien Le Moal, linux-scsi,
	linux-block, Jens Axboe

On Tue, Jul 19, 2016 at 04:02:56PM +0200, Hannes Reinecke wrote:
> Some devices (most notably SMR drives) support only
> one I/O stream eg for ensuring ordered I/O submission.
> This patch adds a new block queue flag
> 'BLK_QUEUE_SINGLE' to support these devices.

We'll need a blk-mq implementation of this flag as well before it can
be merged.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] block: add flag for single-threaded submission
  2016-07-21  5:54   ` Christoph Hellwig
@ 2016-07-21  6:01       ` Hannes Reinecke
  0 siblings, 0 replies; 15+ messages in thread
From: Hannes Reinecke @ 2016-07-21  6:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Mike Snitzer, dm-devel-redhat.com, Damien Le Moal, linux-scsi,
	linux-block, Jens Axboe

On 07/21/2016 07:54 AM, Christoph Hellwig wrote:
> On Tue, Jul 19, 2016 at 04:02:56PM +0200, Hannes Reinecke wrote:
>> Some devices (most notably SMR drives) support only
>> one I/O stream eg for ensuring ordered I/O submission.
>> This patch adds a new block queue flag
>> 'BLK_QUEUE_SINGLE' to support these devices.
> 
> We'll need a blk-mq implementation of this flag as well before it can
> be merged.
> 
Yes, indeed. Will be fixed for the next round of submissions.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] block: add flag for single-threaded submission
  2016-07-21  6:01       ` Hannes Reinecke
@ 2016-07-21  6:37         ` Hannes Reinecke
  -1 siblings, 0 replies; 15+ messages in thread
From: Hannes Reinecke @ 2016-07-21  6:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Mike Snitzer, dm-devel-redhat.com, Damien Le Moal, linux-scsi,
	linux-block, Jens Axboe

On 07/21/2016 08:01 AM, Hannes Reinecke wrote:
> On 07/21/2016 07:54 AM, Christoph Hellwig wrote:
>> On Tue, Jul 19, 2016 at 04:02:56PM +0200, Hannes Reinecke wrote:
>>> Some devices (most notably SMR drives) support only
>>> one I/O stream eg for ensuring ordered I/O submission.
>>> This patch adds a new block queue flag
>>> 'BLK_QUEUE_SINGLE' to support these devices.
>>
>> We'll need a blk-mq implementation of this flag as well before it can
>> be merged.
>>
> Yes, indeed. Will be fixed for the next round of submissions.
> 
Hmm.
Looking closer, do I _really_ need that for blk-mq?
From my understanding any hctx can only run on one dedicated cpu, to
which the hctx is bound.
So if we only have one CPU serving that hctx how can we have several
concurrent calls to queue_rq() here?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] block: add flag for single-threaded submission
  2016-07-21  6:37         ` Hannes Reinecke
  (?)
@ 2016-07-21  7:10         ` Damien Le Moal
  -1 siblings, 0 replies; 15+ messages in thread
From: Damien Le Moal @ 2016-07-21  7:10 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, Mike Snitzer, dm-devel-redhat.com, linux-scsi,
	linux-block, Jens Axboe

Hannes,

> On Jul 21, 2016, at 15:37, Hannes Reinecke <hare@suse.de> wrote:
> 
> On 07/21/2016 08:01 AM, Hannes Reinecke wrote:
>> On 07/21/2016 07:54 AM, Christoph Hellwig wrote:
>>> On Tue, Jul 19, 2016 at 04:02:56PM +0200, Hannes Reinecke wrote:
>>>> Some devices (most notably SMR drives) support only
>>>> one I/O stream eg for ensuring ordered I/O submission.
>>>> This patch adds a new block queue flag
>>>> 'BLK_QUEUE_SINGLE' to support these devices.
>>> 
>>> We'll need a blk-mq implementation of this flag as well before it can
>>> be merged.
>>> 
>> Yes, indeed. Will be fixed for the next round of submissions.
>> 
> Hmm.
> Looking closer do I _really_ need that for blk-mq?
> From my understanding any hctx can only run on one dedicated cpu, to
> which the hctx is bound.
> So if we only have one CPU serving that hctx how can we have several
> concurrent calls to queue_rq() here?

Ex: an application with 2 threads running on different CPUs, with both
threads writing to the same sequential zone with a proper mutual exclusion
to ensure sequential write order at the application level.

For this example, each thread's requests will end up in a different hw queue
(hctx), and a single CPU serving each hctx may not result in overall
in-order delivery to the disk.

So we need to ensure that all requests for a sequential zone end up in the
same hctx, which I think means simply that we need to allow only a single
queue for a ZBC device. If one day we get a fast SSD supporting ZBC/ZAC
commands, this may impact performance though...
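
For illustration, a minimal userspace sketch of that two-thread scenario
(the device path, zone offset and I/O size below are made up):

	#include <fcntl.h>
	#include <pthread.h>
	#include <unistd.h>

	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
	static int fd;
	static off_t wp;	/* write pointer within the zone */

	static void *writer(void *buf)
	{
		int i;

		for (i = 0; i < 128; i++) {
			pthread_mutex_lock(&lock);
			/* Sequential at the application level only: the */
			/* submitting CPU, and thus the hctx, may differ */
			/* from one call to the next.                    */
			if (pwrite(fd, buf, 4096, wp) == 4096)
				wp += 4096;
			pthread_mutex_unlock(&lock);
		}
		return NULL;
	}

	int main(void)
	{
		static char buf[4096];
		pthread_t t1, t2;

		/* Hypothetical SMR disk; O_DIRECT and aligned buffers */
		/* are omitted to keep the sketch short.               */
		fd = open("/dev/sdX", O_WRONLY);
		wp = 0;		/* assume the zone starts at offset 0 */

		pthread_create(&t1, NULL, writer, buf);
		pthread_create(&t2, NULL, writer, buf);
		pthread_join(t1, NULL);
		pthread_join(t2, NULL);
		return 0;
	}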

I may be completely wrong about this though.

Cheers.

------------------------
Damien Le Moal, Ph.D.
Sr. Manager, System Software Group, HGST Research,
HGST, a Western Digital brand
Damien.LeMoal@hgst.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa, 
Kanagawa, 252-0888 Japan
www.hgst.com 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] block: add flag for single-threaded submission
  2016-07-21  6:37         ` Hannes Reinecke
  (?)
  (?)
@ 2016-07-21 14:38         ` Christoph Hellwig
  -1 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2016-07-21 14:38 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, Mike Snitzer, dm-devel-redhat.com,
	Damien Le Moal, linux-scsi, linux-block, Jens Axboe

On Thu, Jul 21, 2016 at 08:37:10AM +0200, Hannes Reinecke wrote:
> Looking closer do I _really_ need that for blk-mq?
> From my understanding any hctx can only run on one dedicated cpu, to
> which the hctx is bound.

That's not the case.  A hctx exists for each hardware queue, and a CPU
must be mapped to a hctx.  So if you have fewer hctx than CPUs (e.g. just
one) you have multiple CPUs that could submit on a hctx.
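
For illustration only (my_register_queue, my_mq_ops and struct my_cmd are
made-up names), this is the single hardware queue situation: a tag set with
nr_hw_queues = 1 gives one hctx that every CPU maps to, so ->queue_rq() can
still be entered from several CPUs:

	#include <linux/blk-mq.h>

	static struct blk_mq_tag_set set;

	static int my_register_queue(void)
	{
		struct request_queue *q;
		int err;

		set.ops		 = &my_mq_ops;
		set.nr_hw_queues = 1;	/* one hctx shared by all CPUs */
		set.queue_depth	 = 64;
		set.numa_node	 = NUMA_NO_NODE;
		set.cmd_size	 = sizeof(struct my_cmd);
		set.flags	 = BLK_MQ_F_SHOULD_MERGE;

		err = blk_mq_alloc_tag_set(&set);
		if (err)
			return err;

		q = blk_mq_init_queue(&set);
		if (IS_ERR(q)) {
			blk_mq_free_tag_set(&set);
			return PTR_ERR(q);
		}

		/* q would normally be saved in the driver's private data */
		return 0;
	}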

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2016-07-21 14:39 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-19 14:02 [PATCH 0/3] dm-zoned target for ZBC devices Hannes Reinecke
2016-07-19 14:02 ` [PATCH 1/3] block: add flag for single-threaded submission Hannes Reinecke
2016-07-20  1:13   ` Damien Le Moal
2016-07-20  1:13     ` Damien Le Moal
2016-07-21  5:54   ` Christoph Hellwig
2016-07-21  6:01     ` Hannes Reinecke
2016-07-21  6:01       ` Hannes Reinecke
2016-07-21  6:37       ` Hannes Reinecke
2016-07-21  6:37         ` Hannes Reinecke
2016-07-21  7:10         ` Damien Le Moal
2016-07-21 14:38         ` Christoph Hellwig
2016-07-19 14:02 ` [PATCH 2/3] sd: enable single-threaded I/O submission for zoned devices Hannes Reinecke
2016-07-20  1:15   ` Damien Le Moal
2016-07-20  1:15     ` Damien Le Moal
2016-07-19 14:02 ` [PATCH 3/3] dm-zoned: New device mapper target for zoned block devices Hannes Reinecke
