* [PATCH 0/4] dm: zoned block device fixes
@ 2017-05-29 10:23 Damien Le Moal
  2017-05-29 10:23 ` [PATCH 1/4] dm: Fix mapping zone alignment check Damien Le Moal
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: Damien Le Moal @ 2017-05-29 10:23 UTC (permalink / raw)
  To: dm-devel, Mike Snitzer, Alasdair Kergon; +Cc: Bart Van Assche

Mike,

The first 3 patches of this series are incremental fixes for the zoned block
device support patches that you committed to the for-4.13/dm branch.

The first patch corrects the zone alignment check so that it is performed for
any device, regardless of the device LBA size (otherwise the check is skipped
for 512B LBA devices).

The second patch is a fix for commit baf844bf4ae3 "dm table: add zoned block
devices validation". In that commit, the stacked limits zoned model was not
set to the zoned model of the table target devices, so the DM device was
always exposed as a regular block device. With this fix, dm-flakey and
dm-linear work fine on top of host-managed zoned block devices.

The third patch fixes the zoned model validation again, to allow target types
that emulate a zoned model different from that of the table target devices,
e.g. dm-zoned.

The last patch is dm-zoned itself, with various fixes (mainly for crashes on
setup errors and for the handling of the metadata cache shrinker). Please use
this version for your review.

Thank you.

Best regards.

Damien Le Moal (4):
  dm: Fix mapping zone alignment check
  dm: Fix stacking limits for zoned block device
  dm: Fix zoned block device model validation
  dm-zoned: Drive-managed zoned block device target

 Documentation/device-mapper/dm-zoned.txt |  154 +++
 drivers/md/Kconfig                       |   17 +
 drivers/md/Makefile                      |    2 +
 drivers/md/dm-table.c                    |  190 +--
 drivers/md/dm-zoned-io.c                 | 1007 ++++++++++++++
 drivers/md/dm-zoned-metadata.c           | 2220 ++++++++++++++++++++++++++++++
 drivers/md/dm-zoned-reclaim.c            |  535 +++++++
 drivers/md/dm-zoned.h                    |  530 +++++++
 8 files changed, 4560 insertions(+), 95 deletions(-)
 create mode 100644 Documentation/device-mapper/dm-zoned.txt
 create mode 100644 drivers/md/dm-zoned-io.c
 create mode 100644 drivers/md/dm-zoned-metadata.c
 create mode 100644 drivers/md/dm-zoned-reclaim.c
 create mode 100644 drivers/md/dm-zoned.h

-- 
2.9.4

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/4] dm: Fix mapping zone alignment check
  2017-05-29 10:23 [PATCH 0/4] dm: zoned block device fixes Damien Le Moal
@ 2017-05-29 10:23 ` Damien Le Moal
  2017-05-29 10:23 ` [PATCH 2/4] dm: Fix stacking limits for zoned block device Damien Le Moal
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: Damien Le Moal @ 2017-05-29 10:23 UTC (permalink / raw)
  To: dm-devel, Mike Snitzer, Alasdair Kergon; +Cc: Bart Van Assche

In device_area_is_invalid(), check the device mapping zone alignment
before checking the LBA size, so that the check is also performed for
devices with a single-sector (512B) LBA size. Previously, the early
return for logical_block_size_sectors <= 1 caused the zone alignment
check to be skipped entirely for such devices.
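
Condensed, the fixed check ordering is (abbreviated from the diff below;
the DMWARN/return bodies are unchanged):

	/* 1) zone alignment of start and len (zoned devices only) */
	if (len & (zone_sectors - 1)) { ... }

	/* 2) only then the early return for 512B LBA devices */
	if (logical_block_size_sectors <= 1)
		return 0;

	/* 3) logical block size alignment of start and len */
	if (start & (logical_block_size_sectors - 1)) { ... }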

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
---
 drivers/md/dm-table.c | 51 ++++++++++++++++++++++++++++++---------------------
 1 file changed, 30 insertions(+), 21 deletions(-)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index b7b95d5..6545150 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -319,27 +319,6 @@ static int device_area_is_invalid(struct dm_target *ti, struct dm_dev *dev,
 		return 1;
 	}
 
-	if (logical_block_size_sectors <= 1)
-		return 0;
-
-	if (start & (logical_block_size_sectors - 1)) {
-		DMWARN("%s: start=%llu not aligned to h/w "
-		       "logical block size %u of %s",
-		       dm_device_name(ti->table->md),
-		       (unsigned long long)start,
-		       limits->logical_block_size, bdevname(bdev, b));
-		return 1;
-	}
-
-	if (len & (logical_block_size_sectors - 1)) {
-		DMWARN("%s: len=%llu not aligned to h/w "
-		       "logical block size %u of %s",
-		       dm_device_name(ti->table->md),
-		       (unsigned long long)len,
-		       limits->logical_block_size, bdevname(bdev, b));
-		return 1;
-	}
-
 	/*
 	 * If the target is mapped to zoned block device(s), check
 	 * that the zones are not partially mapped.
@@ -355,6 +334,15 @@ static int device_area_is_invalid(struct dm_target *ti, struct dm_dev *dev,
 			return 1;
 		}
 
+		/*
+		 * Note: The last zone of a zoned block device may be smaller
+		 * than other zones. So for a target mapping the end of a
+		 * zoned block device with such a zone, len would not be zone
+		 * aligned. We do not allow such last smaller zone to be part
+		 * of the mapping here to ensure that mappings with multiple
+		 * devices do not end up with a smaller zone in the middle of
+		 * the sector range.
+		 */
 		if (len & (zone_sectors - 1)) {
 			DMWARN("%s: len=%llu not aligned to h/w zone size %u of %s",
 			       dm_device_name(ti->table->md),
@@ -364,6 +352,27 @@ static int device_area_is_invalid(struct dm_target *ti, struct dm_dev *dev,
 		}
 	}
 
+	if (logical_block_size_sectors <= 1)
+		return 0;
+
+	if (start & (logical_block_size_sectors - 1)) {
+		DMWARN("%s: start=%llu not aligned to h/w "
+		       "logical block size %u of %s",
+		       dm_device_name(ti->table->md),
+		       (unsigned long long)start,
+		       limits->logical_block_size, bdevname(bdev, b));
+		return 1;
+	}
+
+	if (len & (logical_block_size_sectors - 1)) {
+		DMWARN("%s: len=%llu not aligned to h/w "
+		       "logical block size %u of %s",
+		       dm_device_name(ti->table->md),
+		       (unsigned long long)len,
+		       limits->logical_block_size, bdevname(bdev, b));
+		return 1;
+	}
+
 	return 0;
 }
 
-- 
2.9.4

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/4] dm: Fix stacking limits for zoned block device
  2017-05-29 10:23 [PATCH 0/4] dm: zoned block device fixes Damien Le Moal
  2017-05-29 10:23 ` [PATCH 1/4] dm: Fix mapping zone alignment check Damien Le Moal
@ 2017-05-29 10:23 ` Damien Le Moal
  2017-05-29 10:23 ` [PATCH 3/4] dm: Fix zoned block device model validation Damien Le Moal
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: Damien Le Moal @ 2017-05-29 10:23 UTC (permalink / raw)
  To: dm-devel, Mike Snitzer, Alasdair Kergon; +Cc: Bart Van Assche

Since blk_stack_limits() does not change the zoned model type, in
dm_calculate_queue_limits(), make sure to report the correct stacked
limits by setting the zoned model to the model of the underlying table
devices, after any override by the target .io_hints method. This allows
removing the local variables zoned_model and zone_sectors and
simplifying the validate_hardware_zoned_model() interface.
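
The key change, excerpted from the diff below, is that the stacked zoned
model is now updated inside the per-target loop of
dm_calculate_queue_limits():

	if (limits->zoned == BLK_ZONED_NONE &&
	    ti_limits.zoned != BLK_ZONED_NONE)
		limits->zoned = ti_limits.zoned;

so validate_hardware_zoned_model() can read the zoned model and zone size
(chunk_sectors) directly from the stacked limits.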

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
---
 drivers/md/dm-table.c | 46 +++++++++++++++++++---------------------------
 1 file changed, 19 insertions(+), 27 deletions(-)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 6545150..9ba58d3 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1438,16 +1438,17 @@ static bool dm_table_matches_zone_sectors(struct dm_table *t,
 }
 
 static int validate_hardware_zoned_model(struct dm_table *table,
-					 enum blk_zoned_model zoned_model,
-					 unsigned int zone_sectors)
+					 struct queue_limits *limits)
 {
-	if (!dm_table_supports_zoned_model(table, zoned_model)) {
+	if (!dm_table_supports_zoned_model(table, limits->zoned)) {
 		DMERR("%s: zoned model is inconsistent across all devices",
 		      dm_device_name(table->md));
 		return -EINVAL;
 	}
 
-	if (zoned_model != BLK_ZONED_NONE) {
+	if (limits->zoned != BLK_ZONED_NONE) {
+		unsigned int zone_sectors = limits->chunk_sectors;
+
 		/* Check zone size validity and compatibility */
 		if (!zone_sectors || !is_power_of_2(zone_sectors))
 			return -EINVAL;
@@ -1471,8 +1472,6 @@ int dm_calculate_queue_limits(struct dm_table *table,
 	struct dm_target *ti;
 	struct queue_limits ti_limits;
 	unsigned i;
-	enum blk_zoned_model zoned_model = BLK_ZONED_NONE;
-	unsigned int zone_sectors = 0;
 
 	blk_set_stacking_limits(limits);
 
@@ -1490,15 +1489,6 @@ int dm_calculate_queue_limits(struct dm_table *table,
 		ti->type->iterate_devices(ti, dm_set_device_limits,
 					  &ti_limits);
 
-		if (zoned_model == BLK_ZONED_NONE && ti_limits.zoned != BLK_ZONED_NONE) {
-			/*
-			 * After stacking all limits, validate all devices
-			 * in table support this zoned model and zone sectors.
-			 */
-			zoned_model = ti_limits.zoned;
-			zone_sectors = ti_limits.chunk_sectors;
-		}
-
 		/* Set I/O hints portion of queue limits */
 		if (ti->type->io_hints)
 			ti->type->io_hints(ti, &ti_limits);
@@ -1523,20 +1513,22 @@ int dm_calculate_queue_limits(struct dm_table *table,
 			       dm_device_name(table->md),
 			       (unsigned long long) ti->begin,
 			       (unsigned long long) ti->len);
+
+		if (limits->zoned == BLK_ZONED_NONE &&
+		    ti_limits.zoned != BLK_ZONED_NONE) {
+			/*
+			 * By default, the stacked limits zoned model is set to
+			 * BLK_ZONED_NONE in blk_set_stacking_limits(). Update
+			 * this model using the first target model reported
+			 * that is not BLK_ZONED_NONE. This will be either the
+			 * first target device zoned model or the model reported
+			 * by the target .io_hints.
+			 */
+			limits->zoned = ti_limits.zoned;
+		}
 	}
 
-	/*
-	 * Verify that the zoned model and zone sectors, as determined before
-	 * any .io_hints override, are the same across all devices in the table.
-	 * - but if limits->zoned is not BLK_ZONED_NONE validate match for it
-	 * - simillarly, check all devices conform to limits->chunk_sectors if
-	 *   .io_hints altered them
-	 */
-	if (limits->zoned != BLK_ZONED_NONE)
-		zoned_model = limits->zoned;
-	if (limits->chunk_sectors != zone_sectors)
-		zone_sectors = limits->chunk_sectors;
-	if (validate_hardware_zoned_model(table, zoned_model, zone_sectors))
+	if (validate_hardware_zoned_model(table, limits))
 		return -EINVAL;
 
 	return validate_hardware_logical_block_alignment(table, limits);
-- 
2.9.4

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 3/4] dm: Fix zoned block device model validation
  2017-05-29 10:23 [PATCH 0/4] dm: zoned block device fixes Damien Le Moal
  2017-05-29 10:23 ` [PATCH 1/4] dm: Fix mapping zone alignment check Damien Le Moal
  2017-05-29 10:23 ` [PATCH 2/4] dm: Fix stacking limits for zoned block device Damien Le Moal
@ 2017-05-29 10:23 ` Damien Le Moal
  2017-05-29 10:23 ` [PATCH 4/4] dm-zoned: Drive-managed zoned block device target Damien Le Moal
  2017-05-30 20:20 ` [PATCH 0/4] dm: zoned block device fixes Mike Snitzer
  4 siblings, 0 replies; 13+ messages in thread
From: Damien Le Moal @ 2017-05-29 10:23 UTC (permalink / raw)
  To: dm-devel, Mike Snitzer, Alasdair Kergon; +Cc: Bart Van Assche

A target type may implement a zoned model different from that of the
table devices. This means that the compatibility of the table devices'
zoned models and zone sizes should not be checked against the zoned
model and zone size indicated by the stacked limits.
Fix validate_hardware_zoned_model() so that zoned model and zone size
compatibility is checked only between the table devices, ignoring the
stacked limits values. dm-zoned (patch 4) is an example of such a
target: its .io_hints method reports BLK_ZONED_NONE while its table
device is a host-managed (or host-aware) zoned block device.
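
Concretely, the per-device iteration now establishes the reference zoned
model (and, similarly, the zone size) from the first device seen
(excerpt from device_matches_zoned_model() in the diff below):

	if (*zoned_model == -1) {
		*zoned_model = bdev_zoned_model(bdev);
		return 1;
	}

	return bdev_zoned_model(bdev) == *zoned_model;

so the comparison is done only between table devices, never against the
stacked limits.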

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
---
 drivers/md/dm-table.c | 99 +++++++++++++++++++++++++--------------------------
 1 file changed, 49 insertions(+), 50 deletions(-)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 9ba58d3..9849042 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1381,83 +1381,82 @@ bool dm_table_has_no_data_devices(struct dm_table *table)
 	return true;
 }
 
-static int device_is_zoned_model(struct dm_target *ti, struct dm_dev *dev,
-				 sector_t start, sector_t len, void *data)
+static int device_matches_zoned_model(struct dm_target *ti, struct dm_dev *dev,
+				      sector_t start, sector_t len, void *data)
 {
-	struct request_queue *q = bdev_get_queue(dev->bdev);
+	struct block_device *bdev = dev->bdev;
 	enum blk_zoned_model *zoned_model = data;
 
-	return q && blk_queue_zoned_model(q) == *zoned_model;
-}
-
-static bool dm_table_supports_zoned_model(struct dm_table *t,
-					  enum blk_zoned_model zoned_model)
-{
-	struct dm_target *ti;
-	unsigned i;
-
-	for (i = 0; i < dm_table_get_num_targets(t); i++) {
-		ti = dm_table_get_target(t, i);
-
-		if (zoned_model == BLK_ZONED_HM &&
-		    !dm_target_supports_zoned_hm(ti->type))
-			return false;
+	if (bdev_zoned_model(bdev) == BLK_ZONED_HM &&
+	    !dm_target_supports_zoned_hm(ti->type))
+		return 0;
 
-		if (!ti->type->iterate_devices ||
-		    !ti->type->iterate_devices(ti, device_is_zoned_model, &zoned_model))
-			return false;
-	}
+	if (*zoned_model == -1) {
+		*zoned_model = bdev_zoned_model(bdev);
+		return 1;
+	}
 
-	return true;
+	return bdev_zoned_model(bdev) == *zoned_model;
 }
 
 static int device_matches_zone_sectors(struct dm_target *ti, struct dm_dev *dev,
 				       sector_t start, sector_t len, void *data)
 {
-	struct request_queue *q = bdev_get_queue(dev->bdev);
+	struct block_device *bdev = dev->bdev;
 	unsigned int *zone_sectors = data;
 
-	return q && blk_queue_zone_sectors(q) == *zone_sectors;
-}
-
-static bool dm_table_matches_zone_sectors(struct dm_table *t,
-					  unsigned int zone_sectors)
-{
-	struct dm_target *ti;
-	unsigned i;
-
-	for (i = 0; i < dm_table_get_num_targets(t); i++) {
-		ti = dm_table_get_target(t, i);
-
-		if (!ti->type->iterate_devices ||
-		    !ti->type->iterate_devices(ti, device_matches_zone_sectors, &zone_sectors))
-			return false;
+	if (*zone_sectors == -1) {
+		*zone_sectors = bdev_zone_sectors(bdev);
+		return 1;
 	}
 
-	return true;
+	return bdev_zone_sectors(bdev) == *zone_sectors;
 }
 
 static int validate_hardware_zoned_model(struct dm_table *table,
 					 struct queue_limits *limits)
 {
-	if (!dm_table_supports_zoned_model(table, limits->zoned)) {
-		DMERR("%s: zoned model is inconsistent across all devices",
-		      dm_device_name(table->md));
-		return -EINVAL;
-	}
+	enum blk_zoned_model zoned_model = -1;
+	sector_t zone_sectors = -1;
+	struct dm_target *ti;
+	unsigned i;
 
 	if (limits->zoned != BLK_ZONED_NONE) {
-		unsigned int zone_sectors = limits->chunk_sectors;
-
-		/* Check zone size validity and compatibility */
+		/* Check stacked limits zone size validity */
+		zone_sectors = limits->chunk_sectors;
 		if (!zone_sectors || !is_power_of_2(zone_sectors))
 			return -EINVAL;
+	}
 
-		if (!dm_table_matches_zone_sectors(table, zone_sectors)) {
-			DMERR("%s: zone sectors is inconsistent across all devices",
+	/*
+	 * Check table devices zoned model and zone size compatibility.
+	 * This is done for any type of stacked limits zoned model since the
+	 * table targets may have overriden the zoned model of the devices
+	 * (e.g. the target type is doing emulation of a different zoned model).
+	 */
+	for (i = 0; i < dm_table_get_num_targets(table); i++) {
+		ti = dm_table_get_target(table, i);
+
+		if (limits->zoned != BLK_ZONED_NONE &&
+		    !ti->type->iterate_devices)
+			/* Cannot check */
+			return -EINVAL;
+
+		/* Check device zoned model compatibility */
+		if (!ti->type->iterate_devices(ti, device_matches_zoned_model,
+					       &zoned_model)) {
+			DMERR("%s: zoned model is inconsistent across all devices",
 			      dm_device_name(table->md));
 			return -EINVAL;
 		}
+
+		/* Check device zone size compatibility */
+		if (!ti->type->iterate_devices(ti, device_matches_zone_sectors,
+					       &zone_sectors)) {
+			DMERR("%s: zone sectors is inconsistent across all devices",
+			      dm_device_name(table->md));
+			return -EINVAL;
+		}
 	}
 
 	return 0;
-- 
2.9.4

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 4/4] dm-zoned: Drive-managed zoned block device target
  2017-05-29 10:23 [PATCH 0/4] dm: zoned block device fixes Damien Le Moal
                   ` (2 preceding siblings ...)
  2017-05-29 10:23 ` [PATCH 3/4] dm: Fix zoned block device model validation Damien Le Moal
@ 2017-05-29 10:23 ` Damien Le Moal
  2017-05-30 20:20 ` [PATCH 0/4] dm: zoned block device fixes Mike Snitzer
  4 siblings, 0 replies; 13+ messages in thread
From: Damien Le Moal @ 2017-05-29 10:23 UTC (permalink / raw)
  To: dm-devel, Mike Snitzer, Alasdair Kergon; +Cc: Bart Van Assche

The dm-zoned device mapper target provides transparent write access
to zoned block devices (ZBC and ZAC compliant block devices).
dm-zoned hides from the device user (a file system or an application
doing raw block device accesses) any constraint imposed on write
requests by the device, equivalent to a drive-managed zoned block
device model.

Write requests are processed using a combination of on-disk buffering
in the device conventional zones and direct in-place processing for
requests aligned to a sequential zone write pointer position.
A background reclaim process implemented using dm_kcopyd_copy ensures
that conventional zones are always available for executing unaligned
write requests. The reclaim process overhead is minimized by managing
buffer zones in least-recently-written order and targeting the oldest
buffer zones first. This way, blocks under regular write access (such
as the metadata blocks of a file system) remain stored in conventional
zones, resulting in no apparent overhead.
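
Condensed from dmz_handle_write() in this patch, the per-write decision
is:

	if (dmz_is_rnd(zone) || chunk_block == zone->wp_block)
		/* conventional zone, or aligned to the write pointer */
		return dmz_handle_direct_write(dmz, zone, bio,
					       chunk_block, nr_blocks);

	/* unaligned write to a sequential zone: go through a buffer zone */
	return dmz_handle_buffered_write(dmz, zone, bio,
					 chunk_block, nr_blocks);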

The dm-zoned implementation focuses on simplicity and on minimizing
overhead (CPU, memory and storage overhead). For a 10TB host-managed
disk with 256 MB zones, dm-zoned memory usage per disk instance is at
most 4.5 MB and as little as 5 zones will be used internally for storing
metadata and performing buffer zone reclaim operations. This is achieved
using zone level indirection rather than a full block indirection system
for managing block movement between zones.
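
As a rough illustration of the difference (approximate figures, not taken
from the patch):

	10 TB / 256 MB zones ~= 40,000 chunks to map (zone level indirection)
	10 TB / 4 KB blocks  ~= 2,400,000,000 entries (block level indirection)

so a per-chunk map stays small enough to keep in memory, while a full
block-level map would be several orders of magnitude larger.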

dm-zoned primarily targets host-managed zoned block devices, but it can
also be used with host-aware device models to mitigate potential
device-side performance degradation due to excessive random writing.

dm-zoned target devices can be formatted and checked using the dmzadm
utility available at:

https://github.com/hgst/dm-zoned-tools

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
---
 Documentation/device-mapper/dm-zoned.txt |  154 +++
 drivers/md/Kconfig                       |   17 +
 drivers/md/Makefile                      |    2 +
 drivers/md/dm-zoned-io.c                 | 1007 ++++++++++++++
 drivers/md/dm-zoned-metadata.c           | 2220 ++++++++++++++++++++++++++++++
 drivers/md/dm-zoned-reclaim.c            |  535 +++++++
 drivers/md/dm-zoned.h                    |  530 +++++++
 7 files changed, 4465 insertions(+)
 create mode 100644 Documentation/device-mapper/dm-zoned.txt
 create mode 100644 drivers/md/dm-zoned-io.c
 create mode 100644 drivers/md/dm-zoned-metadata.c
 create mode 100644 drivers/md/dm-zoned-reclaim.c
 create mode 100644 drivers/md/dm-zoned.h

diff --git a/Documentation/device-mapper/dm-zoned.txt b/Documentation/device-mapper/dm-zoned.txt
new file mode 100644
index 0000000..d41f597
--- /dev/null
+++ b/Documentation/device-mapper/dm-zoned.txt
@@ -0,0 +1,154 @@
+dm-zoned
+========
+
+The dm-zoned device mapper exposes a zoned block device (ZBC and ZAC compliant
+devices) as a regular block device without any write pattern constraint. In
+effect, it implements a drive-managed zoned block device which hides from the
+user (a file system or an application doing raw block device accesses) the
+sequential write constraints of host-managed zoned block devices and can
+mitigate the potential device-side performance degradation due to excessive
+random writes on host-aware zoned block devices.
+
+For a more detailed description of the zoned block device models and
+their constraints see (for SCSI devices):
+
+http://www.t10.org/drafts.htm#ZBC_Family
+
+and (for ATA devices):
+
+http://www.t13.org/Documents/UploadedDocuments/docs2015/
+di537r05-Zoned_Device_ATA_Command_Set_ZAC.pdf
+
+The dm-zoned implementation is simple and minimizes system overhead (CPU and
+memory usage as well as storage capacity loss). For a 10TB host-managed disk
+with 256 MB zones, dm-zoned memory usage per disk instance is at most 4.5 MB
+and as little as 5 zones will be used internally for storing metadata and
+performing reclaim operations.
+
+dm-zoned target devices can be formatted and checked using the dmzadm utility
+available at:
+
+https://github.com/hgst/dm-zoned-tools
+
+Algorithm
+=========
+
+dm-zoned implements an on-disk buffering scheme to handle non-sequential write
+accesses to the sequential zones of a zoned block device. Conventional zones
+are used for caching as well as for storing internal metadata.
+
+The zones of the device are separated into 2 types:
+
+1) Metadata zones: these are conventional zones used to store metadata.
+Metadata zones are not reported as usable capacity to the user.
+
+2) Data zones: all remaining zones, the vast majority of which will be
+sequential zones used exclusively to store user data. The conventional zones
+of the device may also be used for buffering user random writes. Data in these
+zones may be directly mapped to the conventional zone, but later moved to a
+sequential zone so that the conventional zone can be reused for buffering
+incoming random writes.
+
+dm-zoned exposes a logical device with a sector size of 4096 bytes,
+irrespective of the physical sector size of the backend zoned block device
+being used. This allows reducing the amount of metadata needed to manage valid
+blocks (blocks written).
+
+The on-disk metadata format is as follows:
+
+1) The first block of the first conventional zone found contains the
+super block which describes the amount and position on disk of metadata blocks.
+
+2) Following the super block, a set of blocks is used to describe the mapping
+of the logical device blocks. The mapping is done per chunk of blocks, with
+the chunk size equal to the zone size of the zoned block device. The mapping
+table is indexed by chunk number and each mapping entry indicates the zone
+number of the device storing the chunk of data. Each mapping entry may also
+indicate the zone number of a conventional zone used to buffer random
+modifications to the data zone.
+
+3) A set of blocks used to store bitmaps indicating the validity of blocks in
+the data zones follows the mapping table. A valid block is defined as a block
+that was written and not discarded. For a buffered data chunk, a block is
+valid either in the data zone mapping the chunk or in the buffer zone of the
+chunk, but never in both.
+
+For a logical chunk mapped to a conventional zone, all write operations are
+processed by directly writing to the zone. If the mapping zone is a
+sequential zone, the write operation is processed directly if and only if
+the write offset within the logical chunk is equal to the write pointer offset
+within the sequential data zone (i.e. the write operation is aligned on the
+zone write pointer). Otherwise, write operations are processed indirectly
+using a buffer zone. In that case, an unused conventional zone is allocated
+and assigned to the chunk being accessed. Writing a block to the buffer zone
+of a chunk will automatically invalidate the same block in the sequential zone
+mapping the chunk. If all blocks of the sequential zone become invalid, the
+zone is freed and the chunk buffer zone becomes the primary zone mapping the
+chunk, resulting in native random write performance similar to that of a
+regular block device.
+
+Read operations are processed according to the block validity information
+provided by the bitmaps. Valid blocks are read either from the sequential zone
+mapping a chunk, or if the chunk is buffered, from the buffer zone assigned.
+If the accessed chunk has no mapping, or the accessed blocks are invalid, the
+read buffer is zeroed and the read operation terminated.
+
+After some time, the limited number of conventional zones available may be
+exhausted (all used to map chunks or buffer sequential zones) and unaligned
+writes to unbuffered chunks become impossible. To avoid such a situation, a
+reclaim process regularly scans used conventional zones and tries to reclaim
+the least recently used ones by copying the valid blocks of the buffer zone
+to a free sequential zone. Once the copy completes, the chunk mapping is
+updated to point to the sequential zone and the buffer zone is freed for reuse.
+
+Metadata Protection
+===================
+
+To protect metadata against corruption in case of sudden power loss or system
+crash, 2 sets of metadata zones are used. One set, the primary set, is used as
+the main metadata region, while the secondary set is used as a staging area.
+Modified metadata is first written to the secondary set and validated by
+updating the super block in the secondary set, using a generation counter to
+indicate that this set contains the newest metadata. Once this operation
+completes, updates in place of metadata blocks can be done in the primary
+metadata set, ensuring that one of the sets is always consistent (all
+modifications committed or none at all). Flush operations are used as a commit
+point. Upon reception of a flush request, metadata modification activity is
+temporarily blocked (for both incoming BIO processing and the reclaim process)
+and all dirty metadata blocks are staged and updated. Normal operation is then
+resumed. A metadata flush thus only temporarily delays write and discard
+requests. Read requests can be processed concurrently with a metadata flush.
+
+Usage
+=====
+
+A zoned block device must first be formatted using the dmzadm tool. This will
+analyze the device zone configuration, determine where to place the metadata
+sets on the device and initialize the metadata sets.
+
+Ex:
+
+dmzadm --format /dev/sdxx
+
+For a formatted device, the target can be created normally with the dmsetup
+utility. The only parameter that dm-zoned requires is the device name.
+
+Example scripts
+===============
+
+[[
+#!/bin/sh
+
+if [ $# -ne 1 ]; then
+	echo "Usage: $0 <Zoned device path>"
+	exit 1
+fi
+
+dev="${1}"
+shift
+
+modprobe dm-zoned
+
+echo "0 `blockdev --getsize ${dev}` dm-zoned ${dev}" | dmsetup create dmz-`basename ${dev}`
+]]
+
diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 906103c..a081b3c 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -521,6 +521,23 @@ config DM_INTEGRITY
 	  To compile this code as a module, choose M here: the module will
 	  be called dm-integrity.
 
+config DM_ZONED
+	tristate "Drive-managed zoned block device target support"
+	depends on BLK_DEV_DM
+	depends on BLK_DEV_ZONED
+	---help---
+	  This device-mapper target takes a host-managed or host-aware zoned
+	  block device and exposes most of its capacity as a regular block
+	  device (drive-managed zoned block device) without any write
+	  constraint. This is mainly intended for use with file systems that
+	  do not natively support zoned block devices but still want to
+	  benefit from the increased capacity offered by SMR disks. Other
+	  uses by applications using raw block devices (for example object
+	  stores) are also possible.
+
+	  To compile this code as a module, choose M here: the module will
+	  be called dm-zoned.
+
 	  If unsure, say N.
 
 endif # MD
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 913720b..f77cc3b 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -20,6 +20,7 @@ dm-era-y	+= dm-era-target.o
 dm-verity-y	+= dm-verity-target.o
 md-mod-y	+= md.o bitmap.o
 raid456-y	+= raid5.o raid5-cache.o raid5-ppl.o
+dm-zoned-y	+= dm-zoned-io.o dm-zoned-metadata.o dm-zoned-reclaim.o
 
 # Note: link order is important.  All raid personalities
 # and must come before md.o, as they each initialise 
@@ -60,6 +61,7 @@ obj-$(CONFIG_DM_CACHE_SMQ)	+= dm-cache-smq.o
 obj-$(CONFIG_DM_ERA)		+= dm-era.o
 obj-$(CONFIG_DM_LOG_WRITES)	+= dm-log-writes.o
 obj-$(CONFIG_DM_INTEGRITY)	+= dm-integrity.o
+obj-$(CONFIG_DM_ZONED)		+= dm-zoned.o
 
 ifeq ($(CONFIG_DM_UEVENT),y)
 dm-mod-objs			+= dm-uevent.o
diff --git a/drivers/md/dm-zoned-io.c b/drivers/md/dm-zoned-io.c
new file mode 100644
index 0000000..be7c806
--- /dev/null
+++ b/drivers/md/dm-zoned-io.c
@@ -0,0 +1,1007 @@
+/*
+ * Drive-managed zoned block device target
+ * Copyright (C) 2017 Western Digital Corporation or its affiliates.
+ *
+ * Written by: Damien Le Moal <damien.lemoal@wdc.com>
+ *
+ * This software is distributed under the terms of the GNU General Public
+ * License version 2, or any later version, "as is," without technical
+ * support, and WITHOUT ANY WARRANTY, without even the implied warranty
+ * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#include <linux/module.h>
+
+#include "dm-zoned.h"
+
+/*
+ * Target BIO completion.
+ */
+static inline void dmz_bio_end(struct bio *bio, int err)
+{
+	struct dmz_bioctx *bioctx =
+		dm_per_bio_data(bio, sizeof(struct dmz_bioctx));
+
+	if (atomic_dec_and_test(&bioctx->ref)) {
+		struct dmz_target *dmz = bioctx->target;
+
+		/* User BIO Completed */
+		if (bioctx->zone)
+			dmz_deactivate_zone(dmz, bioctx->zone);
+		atomic_dec(&dmz->bio_count);
+		bio->bi_error = bioctx->error;
+		bio_endio(bio);
+	}
+}
+
+/*
+ * Partial/internal BIO completion callback.
+ * This terminates the user target BIO when there
+ * are no more references to its context.
+ */
+static void dmz_bio_end_io(struct bio *bio)
+{
+	struct dmz_bioctx *bioctx = bio->bi_private;
+	int err = bio->bi_error;
+
+	if (err) {
+		struct dm_zone *zone = bioctx->zone;
+
+		bioctx->error = err;
+		if (bio_op(bio) == REQ_OP_WRITE &&
+		    dmz_is_seq(zone))
+			set_bit(DMZ_SEQ_WRITE_ERR, &zone->flags);
+	}
+
+	dmz_bio_end(bioctx->bio, err);
+
+	bio_put(bio);
+
+}
+
+/*
+ * Issue a BIO to a zone. The BIO may only partially process the
+ * original target BIO.
+ */
+static int dmz_submit_bio(struct dmz_target *dmz, struct dm_zone *zone,
+			  struct bio *dmz_bio,
+			  sector_t chunk_block, unsigned int nr_blocks)
+{
+	struct dmz_bioctx *bioctx
+		= dm_per_bio_data(dmz_bio, sizeof(struct dmz_bioctx));
+	unsigned int nr_sectors = dmz_blk2sect(nr_blocks);
+	unsigned int size = nr_sectors << SECTOR_SHIFT;
+	struct bio *clone;
+
+	clone = bio_clone_fast(dmz_bio, GFP_NOIO, dmz->bio_set);
+	if (!clone)
+		return -ENOMEM;
+
+	/* Setup the clone */
+	clone->bi_bdev = dmz->zbd;
+	clone->bi_opf = dmz_bio->bi_opf;
+	clone->bi_iter.bi_sector =
+		dmz_start_sect(dmz, zone) + dmz_blk2sect(chunk_block);
+	clone->bi_iter.bi_size = size;
+	clone->bi_end_io = dmz_bio_end_io;
+	clone->bi_private = bioctx;
+
+	bio_advance(dmz_bio, size);
+
+	/* Submit the clone */
+	atomic_inc(&bioctx->ref);
+	generic_make_request(clone);
+
+	return 0;
+}
+
+/*
+ * Zero out pages of discarded blocks accessed by a read BIO.
+ */
+static void dmz_handle_read_zero(struct dmz_target *dmz, struct bio *bio,
+				 sector_t chunk_block, unsigned int nr_blocks)
+{
+	unsigned int size = nr_blocks << DMZ_BLOCK_SHIFT;
+
+	dmz_dev_debug(dmz,
+		      "=> ZERO READ chunk %llu -> block %llu, %u blocks\n",
+		      (unsigned long long)dmz_bio_chunk(dmz, bio),
+		      (unsigned long long)chunk_block,
+		      nr_blocks);
+
+	/* Clear nr_blocks */
+	swap(bio->bi_iter.bi_size, size);
+	zero_fill_bio(bio);
+	swap(bio->bi_iter.bi_size, size);
+
+	bio_advance(bio, size);
+}
+
+/*
+ * Process a read BIO.
+ */
+static int dmz_handle_read(struct dmz_target *dmz, struct dm_zone *zone,
+			   struct bio *bio)
+{
+	sector_t block = dmz_bio_block(bio);
+	unsigned int nr_blocks = dmz_bio_blocks(bio);
+	sector_t chunk_block = dmz_chunk_block(dmz, block);
+	sector_t end_block = chunk_block + nr_blocks;
+	struct dm_zone *rzone, *bzone;
+	int ret;
+
+	/* Reads into unmapped chunks only need to zero the BIO buffer */
+	if (!zone) {
+		dmz_handle_read_zero(dmz, bio, chunk_block, nr_blocks);
+		return 0;
+	}
+
+	dmz_dev_debug(dmz, "READ %s zone %u, block %llu, %u blocks\n",
+		      (dmz_is_rnd(zone) ? "RND" : "SEQ"), dmz_id(dmz, zone),
+		      (unsigned long long)chunk_block, nr_blocks);
+
+	/* Check block validity to determine the read location */
+	bzone = zone->bzone;
+	while (chunk_block < end_block) {
+
+		nr_blocks = 0;
+		if (dmz_is_rnd(zone)
+		    || chunk_block < zone->wp_block) {
+			/* Test block validity in the data zone */
+			ret = dmz_block_valid(dmz, zone, chunk_block);
+			if (ret < 0)
+				return ret;
+			if (ret > 0) {
+				/* Read data zone blocks */
+				nr_blocks = ret;
+				rzone = zone;
+			}
+		}
+
+		/*
+		 * No valid blocks found in the data zone.
+		 * Check the buffer zone, if there is one.
+		 */
+		if (!nr_blocks && bzone) {
+			ret = dmz_block_valid(dmz, bzone, chunk_block);
+			if (ret < 0)
+				return ret;
+			if (ret > 0) {
+				/* Read buffer zone blocks */
+				nr_blocks = ret;
+				rzone = bzone;
+			}
+		}
+
+		if (nr_blocks) {
+
+			/* Valid blocks found: read them */
+			nr_blocks = min_t(unsigned int, nr_blocks,
+					  end_block - chunk_block);
+
+			dmz_dev_debug(dmz,
+				"=> %s READ zone %u, block %llu, %u blocks\n",
+				(dmz_is_buf(rzone) ? "BUF" : "DATA"),
+				dmz_id(dmz, rzone),
+				(unsigned long long)chunk_block,
+				nr_blocks);
+
+			ret = dmz_submit_bio(dmz, rzone, bio,
+					     chunk_block, nr_blocks);
+			if (ret)
+				return ret;
+			chunk_block += nr_blocks;
+
+		} else {
+
+			/* No valid block: zeroout the current BIO block */
+			dmz_handle_read_zero(dmz, bio, chunk_block, 1);
+			chunk_block++;
+
+		}
+
+	}
+
+	return 0;
+}
+
+/*
+ * Write blocks directly in a data zone, at the write pointer.
+ * If a buffer zone is assigned, invalidate the blocks written
+ * in place.
+ */
+static int dmz_handle_direct_write(struct dmz_target *dmz,
+				   struct dm_zone *zone, struct bio *bio,
+				   sector_t chunk_block,
+				   unsigned int nr_blocks)
+{
+	struct dm_zone *bzone = zone->bzone;
+	int ret;
+
+	dmz_dev_debug(dmz, "WRITE %s zone %u, block %llu, %u blocks\n",
+		      (dmz_is_rnd(zone) ? "RND" : "SEQ"), dmz_id(dmz, zone),
+		      (unsigned long long)chunk_block, nr_blocks);
+
+	if (dmz_is_readonly(zone))
+		return -EROFS;
+
+	/* Submit write */
+	ret = dmz_submit_bio(dmz, zone, bio, chunk_block, nr_blocks);
+	if (ret)
+		return -EIO;
+
+	if (dmz_is_seq(zone))
+		zone->wp_block += nr_blocks;
+
+	/*
+	 * Validate the blocks in the data zone and invalidate
+	 * in the buffer zone, if there is one.
+	 */
+	ret = dmz_validate_blocks(dmz, zone, chunk_block, nr_blocks);
+	if (ret == 0 && bzone)
+		ret = dmz_invalidate_blocks(dmz, bzone, chunk_block, nr_blocks);
+
+	return ret;
+}
+
+/*
+ * Write blocks in the buffer zone of @zone.
+ * If no buffer zone is assigned yet, get one.
+ * Called with @zone write locked.
+ */
+static int dmz_handle_buffered_write(struct dmz_target *dmz,
+				     struct dm_zone *zone, struct bio *bio,
+				     sector_t chunk_block,
+				     unsigned int nr_blocks)
+{
+	struct dm_zone *bzone = zone->bzone;
+	int ret;
+
+	if (!bzone) {
+		/* Get a buffer zone */
+		bzone = dmz_get_chunk_buffer(dmz, zone);
+		if (!bzone)
+			return -ENOSPC;
+	}
+
+	dmz_dev_debug(dmz, "WRITE BUF zone %u, block %llu, %u blocks\n",
+		      dmz_id(dmz, bzone), (unsigned long long)chunk_block,
+		      nr_blocks);
+
+	if (dmz_is_readonly(bzone))
+		return -EROFS;
+
+	/* Submit write */
+	ret = dmz_submit_bio(dmz, bzone, bio, chunk_block, nr_blocks);
+	if (ret)
+		return -EIO;
+
+	/*
+	 * Validate the blocks in the buffer zone
+	 * and invalidate in the data zone.
+	 */
+	ret = dmz_validate_blocks(dmz, bzone, chunk_block, nr_blocks);
+	if (ret == 0 && chunk_block < zone->wp_block)
+		ret = dmz_invalidate_blocks(dmz, zone, chunk_block, nr_blocks);
+
+	return ret;
+}
+
+/*
+ * Process a write BIO.
+ */
+static int dmz_handle_write(struct dmz_target *dmz, struct dm_zone *zone,
+			    struct bio *bio)
+{
+	sector_t block = dmz_bio_block(bio);
+	unsigned int nr_blocks = dmz_bio_blocks(bio);
+	sector_t chunk_block = dmz_chunk_block(dmz, block);
+
+	if (!zone)
+		return -ENOSPC;
+
+	if (dmz_is_rnd(zone) ||
+	    chunk_block == zone->wp_block)
+		/*
+		 * zone is a random zone, or it is a sequential zone
+		 * and the BIO is aligned to the zone write pointer:
+		 * direct write the zone.
+		 */
+		return dmz_handle_direct_write(dmz, zone, bio,
+					       chunk_block, nr_blocks);
+
+	/*
+	 * This is an unaligned write in a sequential zone:
+	 * use buffered write.
+	 */
+	return dmz_handle_buffered_write(dmz, zone, bio,
+					 chunk_block, nr_blocks);
+}
+
+/*
+ * Process a discard BIO.
+ */
+static int dmz_handle_discard(struct dmz_target *dmz, struct dm_zone *zone,
+			      struct bio *bio)
+{
+	sector_t block = dmz_bio_block(bio);
+	unsigned int nr_blocks = dmz_bio_blocks(bio);
+	sector_t chunk_block = dmz_chunk_block(dmz, block);
+	int ret = 0;
+
+	/* For unmapped chunks, there is nothing to do */
+	if (!zone)
+		return 0;
+
+	if (dmz_is_readonly(zone))
+		return -EROFS;
+
+	dmz_dev_debug(dmz,
+		      "DISCARD chunk %llu -> zone %u, block %llu, %u blocks\n",
+		      (unsigned long long)dmz_bio_chunk(dmz, bio),
+		      dmz_id(dmz, zone),
+		      (unsigned long long)chunk_block, nr_blocks);
+
+	/*
+	 * Invalidate blocks in the data zone and its
+	 * buffer zone if one is mapped.
+	 */
+	if (dmz_is_rnd(zone) ||
+	    chunk_block < zone->wp_block)
+		ret = dmz_invalidate_blocks(dmz, zone,
+					    chunk_block, nr_blocks);
+	if (ret == 0 && zone->bzone)
+		ret = dmz_invalidate_blocks(dmz, zone->bzone,
+					    chunk_block, nr_blocks);
+
+	return ret;
+}
+
+/*
+ * Process a BIO.
+ */
+static void dmz_handle_bio(struct dmz_target *dmz, struct dm_chunk_work *cw,
+			   struct bio *bio)
+{
+	struct dmz_bioctx *bioctx =
+		dm_per_bio_data(bio, sizeof(struct dmz_bioctx));
+	struct dm_zone *zone;
+	int ret;
+
+	down_read(&dmz->mblk_sem);
+
+	/*
+	 * Get the data zone mapping the chunk. There may be no
+	 * mapping for read and discard. If a mapping is obtained,
+	 * the zone returned will be set to active state.
+	 */
+	zone = dmz_get_chunk_mapping(dmz, dmz_bio_chunk(dmz, bio),
+				     bio_op(bio));
+	if (IS_ERR(zone)) {
+		dmz_bio_end(bio, PTR_ERR(zone));
+		goto out;
+	}
+
+	/* Process the BIO */
+	if (zone) {
+		dmz_activate_zone(dmz, zone);
+		bioctx->zone = zone;
+	}
+
+	switch (bio_op(bio)) {
+	case REQ_OP_READ:
+		ret = dmz_handle_read(dmz, zone, bio);
+		break;
+	case REQ_OP_WRITE:
+		ret = dmz_handle_write(dmz, zone, bio);
+		break;
+	case REQ_OP_DISCARD:
+	case REQ_OP_WRITE_ZEROES:
+		ret = dmz_handle_discard(dmz, zone, bio);
+		break;
+	default:
+		dmz_dev_err(dmz,
+			    "Unsupported BIO operation 0x%x\n",
+			    bio_op(bio));
+		ret = -EIO;
+	}
+
+	dmz_bio_end(bio, ret);
+
+	/*
+	 * Release the chunk mapping. This will check that the mapping
+	 * is still valid, that is, that the zone used still has valid blocks.
+	 */
+	if (zone)
+		dmz_put_chunk_mapping(dmz, zone);
+
+out:
+	up_read(&dmz->mblk_sem);
+}
+
+/*
+ * Increment a chunk reference counter.
+ */
+static inline void dmz_get_chunk_work(struct dm_chunk_work *cw)
+{
+	atomic_inc(&cw->refcount);
+}
+
+/*
+ * Decrement a chunk work reference count and
+ * free it if it becomes 0.
+ */
+static void dmz_put_chunk_work(struct dm_chunk_work *cw)
+{
+	if (atomic_dec_and_test(&cw->refcount)) {
+		atomic_dec(&cw->target->nr_active_chunks);
+		radix_tree_delete(&cw->target->chunk_rxtree, cw->chunk);
+		kfree(cw);
+	}
+}
+
+/*
+ * Chunk BIO work function.
+ */
+static void dmz_chunk_work(struct work_struct *work)
+{
+	struct dm_chunk_work *cw =
+		container_of(work, struct dm_chunk_work, work);
+	struct dmz_target *dmz = cw->target;
+	struct bio *bio;
+
+	mutex_lock(&dmz->chunk_lock);
+
+	/* Process the chunk BIOs */
+	while ((bio = bio_list_pop(&cw->bio_list))) {
+
+		mutex_unlock(&dmz->chunk_lock);
+		dmz_handle_bio(dmz, cw, bio);
+		mutex_lock(&dmz->chunk_lock);
+
+		dmz_put_chunk_work(cw);
+
+	}
+
+	/*
+	 * Queueing the work added one to the work refcount.
+	 * So drop this here.
+	 */
+	dmz_put_chunk_work(cw);
+
+	mutex_unlock(&dmz->chunk_lock);
+}
+
+/*
+ * Flush work.
+ */
+static void dmz_flush_work(struct work_struct *work)
+{
+	struct dmz_target *dmz =
+		container_of(work, struct dmz_target, flush_work.work);
+	struct bio *bio;
+	int ret;
+
+	/* Flush metablocks */
+	ret = dmz_flush_mblocks(dmz);
+
+	/* Process queued flush requests */
+	while (1) {
+
+		spin_lock(&dmz->flush_lock);
+		bio = bio_list_pop(&dmz->flush_list);
+		spin_unlock(&dmz->flush_lock);
+
+		if (!bio)
+			break;
+
+		/* Do flush */
+		dmz_bio_end(bio, ret);
+
+	}
+
+	queue_delayed_work(dmz->flush_wq, &dmz->flush_work,
+			   DMZ_FLUSH_PERIOD);
+}
+
+/*
+ * Get a chunk work and start it to process a new BIO.
+ * If the BIO chunk has no work yet, create one.
+ */
+static void dmz_queue_chunk_work(struct dmz_target *dmz,
+				 struct bio *bio)
+{
+	unsigned int chunk = dmz_bio_chunk(dmz, bio);
+	struct dm_chunk_work *cw;
+
+	mutex_lock(&dmz->chunk_lock);
+
+	/* Get the BIO chunk work. If one is not active yet, create one */
+	cw = radix_tree_lookup(&dmz->chunk_rxtree, chunk);
+	if (!cw) {
+		int ret;
+
+		/* Create a new chunk work */
+		cw = kmalloc(sizeof(struct dm_chunk_work), GFP_NOFS);
+		if (!cw)
+			goto out;
+
+		INIT_WORK(&cw->work, dmz_chunk_work);
+		atomic_set(&cw->refcount, 0);
+		cw->target = dmz;
+		cw->chunk = chunk;
+		bio_list_init(&cw->bio_list);
+
+		ret = radix_tree_insert(&dmz->chunk_rxtree, chunk, cw);
+		if (unlikely(ret != 0)) {
+			kfree(cw);
+			cw = NULL;
+			goto out;
+		}
+
+		atomic_inc(&dmz->nr_active_chunks);
+	}
+
+	bio_list_add(&cw->bio_list, bio);
+	dmz_get_chunk_work(cw);
+
+	if (queue_work(dmz->chunk_wq, &cw->work))
+		dmz_get_chunk_work(cw);
+
+out:
+	mutex_unlock(&dmz->chunk_lock);
+}
+
+/*
+ * Process a new BIO.
+ */
+static int dmz_map(struct dm_target *ti, struct bio *bio)
+{
+	struct dmz_target *dmz = ti->private;
+	struct dmz_bioctx *bioctx
+		= dm_per_bio_data(bio, sizeof(struct dmz_bioctx));
+	sector_t sector = bio->bi_iter.bi_sector;
+	unsigned int nr_sectors = bio_sectors(bio);
+	sector_t chunk_sector;
+
+	dmz_dev_debug(dmz,
+		"BIO sector %llu + %u => chunk %llu, block %llu, %u blocks\n",
+		(u64)sector, nr_sectors,
+		(u64)dmz_bio_chunk(dmz, bio),
+		(u64)dmz_chunk_block(dmz, dmz_bio_block(bio)),
+		(unsigned int)dmz_bio_blocks(bio));
+
+	bio->bi_bdev = dmz->zbd;
+
+	if (!nr_sectors &&
+	    (bio_op(bio) != REQ_OP_FLUSH) &&
+	    (bio_op(bio) != REQ_OP_WRITE)) {
+		bio->bi_bdev = dmz->zbd;
+		return DM_MAPIO_REMAPPED;
+	}
+
+	/* The BIO should be block aligned */
+	if ((nr_sectors & DMZ_BLOCK_SECTORS_MASK) ||
+	    (sector & DMZ_BLOCK_SECTORS_MASK)) {
+		dmz_dev_err(dmz,
+			    "Unaligned BIO sector %llu, len %u\n",
+			    (u64)sector,
+			    nr_sectors);
+		return -EIO;
+	}
+
+	/* Initialize the BIO context */
+	bioctx->target = dmz;
+	bioctx->zone = NULL;
+	bioctx->bio = bio;
+	atomic_set(&bioctx->ref, 1);
+	bioctx->error = 0;
+
+	atomic_inc(&dmz->bio_count);
+	dmz->atime = jiffies;
+
+	/* Set the BIO pending in the flush list */
+	if (bio_op(bio) == REQ_OP_FLUSH ||
+	    (!nr_sectors && bio_op(bio) == REQ_OP_WRITE)) {
+		spin_lock(&dmz->flush_lock);
+		bio_list_add(&dmz->flush_list, bio);
+		spin_unlock(&dmz->flush_lock);
+		dmz_trigger_flush(dmz);
+		return DM_MAPIO_SUBMITTED;
+	}
+
+	/* Split zone BIOs to fit entirely into a zone */
+	chunk_sector = dmz_chunk_sector(dmz, sector);
+	if (chunk_sector + nr_sectors > dmz->zone_nr_sectors)
+		dm_accept_partial_bio(bio,
+				      dmz->zone_nr_sectors - chunk_sector);
+
+	/* Now ready to handle this BIO */
+	dmz_queue_chunk_work(dmz, bio);
+
+	return DM_MAPIO_SUBMITTED;
+}
+
+/*
+ * Setup target.
+ */
+static int dmz_ctr(struct dm_target *ti, unsigned int argc, char **argv)
+{
+	struct dmz_target *dmz;
+	int ret;
+
+	/* Check arguments */
+	if (argc != 1) {
+		ti->error = "Invalid argument count";
+		return -EINVAL;
+	}
+
+	/* Allocate and initialize the target descriptor */
+	dmz = kzalloc(sizeof(struct dmz_target), GFP_KERNEL);
+	if (!dmz) {
+		ti->error = "Allocate target descriptor failed";
+		return -ENOMEM;
+	}
+
+	dmz->mblk_rbtree = RB_ROOT;
+	init_rwsem(&dmz->mblk_sem);
+	spin_lock_init(&dmz->mblk_lock);
+	INIT_LIST_HEAD(&dmz->mblk_lru_list);
+	INIT_LIST_HEAD(&dmz->mblk_dirty_list);
+
+	mutex_init(&dmz->map_lock);
+	atomic_set(&dmz->dz_unmap_nr_rnd, 0);
+	INIT_LIST_HEAD(&dmz->dz_unmap_rnd_list);
+	INIT_LIST_HEAD(&dmz->dz_map_rnd_list);
+
+	atomic_set(&dmz->dz_unmap_nr_seq, 0);
+	INIT_LIST_HEAD(&dmz->dz_unmap_seq_list);
+	INIT_LIST_HEAD(&dmz->dz_map_seq_list);
+
+	init_waitqueue_head(&dmz->dz_free_wq);
+
+	atomic_set(&dmz->nr_reclaim_seq_zones, 0);
+	INIT_LIST_HEAD(&dmz->reclaim_seq_zones_list);
+
+	/* Get the target device */
+	ret = dm_get_device(ti, argv[0], dm_table_get_mode(ti->table),
+			    &dmz->ddev);
+	if (ret != 0) {
+		ti->error = "Get target device failed";
+		dmz->ddev = NULL;
+		goto err_dev;
+	}
+
+	dmz->zbd = dmz->ddev->bdev;
+	(void)bdevname(dmz->zbd, dmz->zbd_name);
+	dmz->zbdq = bdev_get_queue(dmz->zbd);
+
+	if (bdev_zoned_model(dmz->zbd) == BLK_ZONED_NONE) {
+		ti->error = "Not a zoned block device";
+		ret = -EINVAL;
+		goto err;
+	}
+
+	dmz->zbd_capacity = i_size_read(dmz->zbd->bd_inode) >> SECTOR_SHIFT;
+	if (ti->begin || (ti->len != dmz->zbd_capacity)) {
+		ti->error = "Partial mapping not supported";
+		ret = -EINVAL;
+		goto err;
+	}
+
+	ret = dmz_init_meta(dmz);
+	if (ret != 0) {
+		ti->error = "Metadata initialization failed";
+		goto err;
+	}
+
+	/* Set target (no write same support) */
+	ti->private = dmz;
+	ti->max_io_len = dmz->zone_nr_sectors << 9;
+	ti->num_flush_bios = 1;
+	ti->num_discard_bios = 1;
+	ti->num_write_zeroes_bios = 1;
+	ti->per_io_data_size = sizeof(struct dmz_bioctx);
+	ti->flush_supported = true;
+	ti->discards_supported = true;
+	ti->split_discard_bios = true;
+
+	/* The exposed capacity is the number of chunks that can be mapped */
+	ti->len = dmz->nr_chunks * dmz->zone_nr_sectors;
+
+	/* Zone BIO */
+	atomic_set(&dmz->bio_count, 0);
+	dmz->atime = jiffies;
+	dmz->bio_set = bioset_create_nobvec(DMZ_MIN_BIOS, 0);
+	if (!dmz->bio_set) {
+		ti->error = "Create BIO set failed";
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	/* Chunk BIO work */
+	mutex_init(&dmz->chunk_lock);
+	atomic_set(&dmz->nr_active_chunks, 0);
+	INIT_RADIX_TREE(&dmz->chunk_rxtree, GFP_NOFS);
+	dmz->chunk_wq = alloc_workqueue("dmz_cwq_%s",
+					WQ_MEM_RECLAIM | WQ_UNBOUND,
+					0, dmz->zbd_name);
+	if (!dmz->chunk_wq) {
+		ti->error = "Create chunk workqueue failed";
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	/* Flush work */
+	spin_lock_init(&dmz->flush_lock);
+	bio_list_init(&dmz->flush_list);
+	INIT_DELAYED_WORK(&dmz->flush_work, dmz_flush_work);
+	dmz->flush_wq = alloc_ordered_workqueue("dmz_fwq_%s", WQ_MEM_RECLAIM,
+						dmz->zbd_name);
+	if (!dmz->flush_wq) {
+		ti->error = "Create flush workqueue failed";
+		ret = -ENOMEM;
+		goto err;
+	}
+	mod_delayed_work(dmz->flush_wq, &dmz->flush_work, DMZ_FLUSH_PERIOD);
+
+	/* Reclaim kcopyd client */
+	dmz->reclaim_kc = dm_kcopyd_client_create(&dmz->reclaim_throttle);
+	if (IS_ERR(dmz->reclaim_kc)) {
+		ti->error = "Create kcopyd client failed";
+		ret = PTR_ERR(dmz->reclaim_kc);
+		dmz->reclaim_kc = NULL;
+		goto err;
+	}
+
+	/* Reclaim work */
+	INIT_DELAYED_WORK(&dmz->reclaim_work, dmz_reclaim_work);
+	dmz->reclaim_wq = alloc_ordered_workqueue("dmz_rwq_%s", WQ_MEM_RECLAIM,
+						  dmz->zbd_name);
+	if (!dmz->reclaim_wq) {
+		ti->error = "Create reclaim workqueue failed";
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	/* Metadata cache shrinker */
+	ret = register_shrinker(&dmz->mblk_shrinker);
+	if (ret) {
+		ti->error = "Register metadata cache shrinker failed";
+		goto err;
+	}
+
+	dmz_dev_info(dmz,
+		"Target device: %llu 512-byte logical sectors (%llu blocks)\n",
+		(unsigned long long)ti->len,
+		(unsigned long long)dmz_sect2blk(ti->len));
+
+	dmz_trigger_reclaim(dmz);
+
+	return 0;
+
+err:
+	if (dmz->reclaim_wq)
+		destroy_workqueue(dmz->reclaim_wq);
+	if (dmz->reclaim_kc)
+		dm_kcopyd_client_destroy(dmz->reclaim_kc);
+	if (dmz->flush_wq)
+		destroy_workqueue(dmz->flush_wq);
+	if (dmz->chunk_wq)
+		destroy_workqueue(dmz->chunk_wq);
+	if (dmz->bio_set)
+		bioset_free(dmz->bio_set);
+	dmz_cleanup_meta(dmz);
+
+err_dev:
+	if (dmz->ddev)
+		dm_put_device(ti, dmz->ddev);
+	kfree(dmz);
+
+	return ret;
+}
+
+/*
+ * Cleanup target.
+ */
+static void dmz_dtr(struct dm_target *ti)
+{
+	struct dmz_target *dmz = ti->private;
+
+	dmz_dev_info(dmz, "Removing target device\n");
+
+	unregister_shrinker(&dmz->mblk_shrinker);
+
+	flush_workqueue(dmz->chunk_wq);
+	destroy_workqueue(dmz->chunk_wq);
+
+	cancel_delayed_work_sync(&dmz->reclaim_work);
+	destroy_workqueue(dmz->reclaim_wq);
+	dm_kcopyd_client_destroy(dmz->reclaim_kc);
+
+	cancel_delayed_work_sync(&dmz->flush_work);
+	destroy_workqueue(dmz->flush_wq);
+
+	dmz_flush_mblocks(dmz);
+
+	bioset_free(dmz->bio_set);
+
+	dmz_cleanup_meta(dmz);
+
+	dm_put_device(ti, dmz->ddev);
+
+	kfree(dmz);
+}
+
+/*
+ * Setup target request queue limits.
+ */
+static void dmz_io_hints(struct dm_target *ti,
+			 struct queue_limits *limits)
+{
+	struct dmz_target *dmz = ti->private;
+	unsigned int chunk_sectors = dmz->zone_nr_sectors;
+
+	limits->logical_block_size = DMZ_BLOCK_SIZE;
+	limits->physical_block_size = DMZ_BLOCK_SIZE;
+
+	blk_limits_io_min(limits, DMZ_BLOCK_SIZE);
+	blk_limits_io_opt(limits, DMZ_BLOCK_SIZE);
+
+	limits->discard_alignment = DMZ_BLOCK_SIZE;
+	limits->discard_granularity = DMZ_BLOCK_SIZE;
+	limits->max_discard_sectors = chunk_sectors;
+	limits->max_hw_discard_sectors = chunk_sectors;
+	limits->max_write_zeroes_sectors = chunk_sectors;
+
+	/* FS hint to try to align to the device zone size */
+	limits->chunk_sectors = chunk_sectors;
+	limits->max_sectors = chunk_sectors;
+
+	/* We are exposing a drive-managed zoned block device */
+	limits->zoned = BLK_ZONED_NONE;
+}
+
+/*
+ * Pass on ioctl to the backend device.
+ */
+static int dmz_prepare_ioctl(struct dm_target *ti,
+			     struct block_device **bdev, fmode_t *mode)
+{
+	struct dmz_target *dmz = ti->private;
+
+	*bdev = dmz->zbd;
+
+	return 0;
+}
+
+/*
+ * Stop reclaim before suspend.
+ */
+static void dmz_presuspend(struct dm_target *ti)
+{
+	struct dmz_target *dmz = ti->private;
+
+	dmz_dev_debug(dmz, "Pre-suspend\n");
+
+	/* Enter suspend state */
+	set_bit(DMZ_SUSPENDED, &dmz->flags);
+	smp_mb__after_atomic();
+
+	/* Stop reclaim */
+	cancel_delayed_work_sync(&dmz->reclaim_work);
+}
+
+/*
+ * Restart reclaim if suspend failed.
+ */
+static void dmz_presuspend_undo(struct dm_target *ti)
+{
+	struct dmz_target *dmz = ti->private;
+
+	dmz_dev_debug(dmz, "Pre-suspend undo\n");
+
+	/* Clear suspend state */
+	clear_bit_unlock(DMZ_SUSPENDED, &dmz->flags);
+	smp_mb__after_atomic();
+
+	/* Restart reclaim */
+	mod_delayed_work(dmz->reclaim_wq, &dmz->reclaim_work, 0);
+}
+
+/*
+ * Stop works and flush on suspend.
+ */
+static void dmz_postsuspend(struct dm_target *ti)
+{
+	struct dmz_target *dmz = ti->private;
+
+	dmz_dev_debug(dmz, "Post-suspend\n");
+
+	/* Stop works */
+	flush_workqueue(dmz->chunk_wq);
+	flush_workqueue(dmz->flush_wq);
+}
+
+/*
+ * Refresh zone information before resuming.
+ */
+static int dmz_preresume(struct dm_target *ti)
+{
+	struct dmz_target *dmz = ti->private;
+
+	if (!test_bit(DMZ_SUSPENDED, &dmz->flags))
+		return 0;
+
+	dmz_dev_debug(dmz, "Pre-resume\n");
+
+	/* Refresh zone information */
+	return dmz_resume_meta(dmz);
+}
+
+/*
+ * Resume.
+ */
+static void dmz_resume(struct dm_target *ti)
+{
+	struct dmz_target *dmz = ti->private;
+
+	if (!test_bit(DMZ_SUSPENDED, &dmz->flags))
+		return;
+
+	dmz_dev_debug(dmz, "Resume\n");
+
+	/* Clear suspend state */
+	clear_bit_unlock(DMZ_SUSPENDED, &dmz->flags);
+	smp_mb__after_atomic();
+
+	/* Restart reclaim */
+	mod_delayed_work(dmz->reclaim_wq, &dmz->reclaim_work, 0);
+}
+
+static int dmz_iterate_devices(struct dm_target *ti,
+			       iterate_devices_callout_fn fn, void *data)
+{
+	struct dmz_target *dmz = ti->private;
+	sector_t offset = dmz->zbd_capacity -
+		((sector_t)dmz->nr_chunks * dmz->zone_nr_sectors);
+
+	return fn(ti, dmz->ddev, offset, ti->len, data);
+}
+
+static struct target_type dmz_type = {
+	.name		 = "dm-zoned",
+	.version	 = {1, 0, 0},
+	.features	 = DM_TARGET_SINGLETON | DM_TARGET_ZONED_HM,
+	.module		 = THIS_MODULE,
+	.ctr		 = dmz_ctr,
+	.dtr		 = dmz_dtr,
+	.map		 = dmz_map,
+	.io_hints	 = dmz_io_hints,
+	.prepare_ioctl	 = dmz_prepare_ioctl,
+	.presuspend	 = dmz_presuspend,
+	.presuspend_undo = dmz_presuspend_undo,
+	.postsuspend	 = dmz_postsuspend,
+	.preresume	 = dmz_preresume,
+	.resume		 = dmz_resume,
+	.iterate_devices = dmz_iterate_devices,
+};
+
+static int __init dmz_init(void)
+{
+	dmz_info("Zoned block device target (C) Western Digital\n");
+
+	return dm_register_target(&dmz_type);
+}
+
+static void __exit dmz_exit(void)
+{
+	dm_unregister_target(&dmz_type);
+}
+
+module_init(dmz_init);
+module_exit(dmz_exit);
+
+MODULE_DESCRIPTION(DM_NAME " target for zoned block devices");
+MODULE_AUTHOR("Damien Le Moal <damien.lemoal@wdc.com>");
+MODULE_LICENSE("GPL");
diff --git a/drivers/md/dm-zoned-metadata.c b/drivers/md/dm-zoned-metadata.c
new file mode 100644
index 0000000..04f040d
--- /dev/null
+++ b/drivers/md/dm-zoned-metadata.c
@@ -0,0 +1,2220 @@
+/*
+ * Drive-managed zoned block device target
+ * Copyright (C) 2017 Western Digital Corporation or its affiliates.
+ *
+ * Written by: Damien Le Moal <damien.lemoal@wdc.com>
+ *
+ * This software is distributed under the terms of the GNU General Public
+ * License version 2, or any later version, "as is," without technical
+ * support, and WITHOUT ANY WARRANTY, without even the implied warranty
+ * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#include <linux/module.h>
+#include <linux/crc32.h>
+
+#include "dm-zoned.h"
+
+/*
+ * Allocate a metadata block.
+ */
+static struct dmz_mblock *dmz_alloc_mblock(struct dmz_target *dmz,
+					   sector_t mblk_no)
+{
+	struct dmz_mblock *mblk = NULL;
+
+	/* See if we can reuse cached blocks */
+	if (dmz->max_nr_mblks &&
+	    atomic_read(&dmz->nr_mblks) > dmz->max_nr_mblks) {
+
+		spin_lock(&dmz->mblk_lock);
+
+		if (list_empty(&dmz->mblk_lru_list) &&
+		    !list_empty(&dmz->mblk_dirty_list))
+			/* Cleanup dirty blocks */
+			dmz_trigger_flush(dmz);
+
+		mblk = list_first_entry_or_null(&dmz->mblk_lru_list,
+						struct dmz_mblock, link);
+		if (mblk) {
+			list_del_init(&mblk->link);
+			rb_erase(&mblk->node, &dmz->mblk_rbtree);
+			mblk->no = mblk_no;
+		}
+
+		spin_unlock(&dmz->mblk_lock);
+
+		if (mblk)
+			return mblk;
+	}
+
+	/* Allocate a new block */
+	mblk = kmalloc(sizeof(struct dmz_mblock), GFP_NOIO);
+	if (!mblk)
+		return NULL;
+
+	mblk->page = alloc_page(GFP_NOIO);
+	if (!mblk->page) {
+		kfree(mblk);
+		return NULL;
+	}
+
+	RB_CLEAR_NODE(&mblk->node);
+	INIT_LIST_HEAD(&mblk->link);
+	atomic_set(&mblk->ref, 0);
+	mblk->state = 0;
+	mblk->no = mblk_no;
+	mblk->data = page_address(mblk->page);
+
+	atomic_inc(&dmz->nr_mblks);
+
+	return mblk;
+}
+
+/*
+ * Free a metadata block.
+ */
+static void dmz_free_mblock(struct dmz_target *dmz, struct dmz_mblock *mblk)
+{
+	__free_pages(mblk->page, 0);
+	kfree(mblk);
+
+	atomic_dec(&dmz->nr_mblks);
+}
+
+/*
+ * Insert a metadata block in the rbtree.
+ */
+static void dmz_insert_mblock(struct dmz_target *dmz,
+			      struct dmz_mblock *mblk)
+{
+	struct rb_root *root = &dmz->mblk_rbtree;
+	struct rb_node **new = &(root->rb_node), *parent = NULL;
+	struct dmz_mblock *b;
+
+	/* Figure out where to put the new node */
+	while (*new) {
+		b = container_of(*new, struct dmz_mblock, node);
+		parent = *new;
+		new = (b->no < mblk->no) ?
+			&((*new)->rb_left) : &((*new)->rb_right);
+	}
+
+	/* Add new node and rebalance tree */
+	rb_link_node(&mblk->node, parent, new);
+	rb_insert_color(&mblk->node, root);
+}
+
+/*
+ * Lookup a metadata block in the rbtree.
+ */
+static struct dmz_mblock *dmz_lookup_mblock(struct dmz_target *dmz,
+					    sector_t mblk_no)
+{
+	struct rb_root *root = &dmz->mblk_rbtree;
+	struct rb_node *node = root->rb_node;
+	struct dmz_mblock *mblk;
+
+	while (node) {
+		mblk = container_of(node, struct dmz_mblock, node);
+		if (mblk->no == mblk_no)
+			return mblk;
+		node = (mblk->no < mblk_no) ? node->rb_left : node->rb_right;
+	}
+
+	return NULL;
+}
+
+/*
+ * Metadata block BIO end callback.
+ */
+static void dmz_mblock_bio_end_io(struct bio *bio)
+{
+	struct dmz_mblock *mblk = bio->bi_private;
+	int flag;
+
+	if (bio->bi_error)
+		set_bit(DMZ_META_ERROR, &mblk->state);
+
+	if (bio_op(bio) == REQ_OP_WRITE)
+		flag = DMZ_META_WRITING;
+	else
+		flag = DMZ_META_READING;
+
+	clear_bit_unlock(flag, &mblk->state);
+	smp_mb__after_atomic();
+	wake_up_bit(&mblk->state, flag);
+
+	bio_put(bio);
+}
+
+/*
+ * Read a metadata block from disk.
+ */
+static struct dmz_mblock *dmz_fetch_mblock(struct dmz_target *dmz,
+					   sector_t mblk_no)
+{
+	struct dmz_mblock *mblk;
+	sector_t block = dmz->sb[dmz->mblk_primary].block + mblk_no;
+	struct bio *bio;
+
+	/* Get block and insert it */
+	mblk = dmz_alloc_mblock(dmz, mblk_no);
+	if (!mblk)
+		return NULL;
+
+	spin_lock(&dmz->mblk_lock);
+	atomic_inc(&mblk->ref);
+	set_bit(DMZ_META_READING, &mblk->state);
+	dmz_insert_mblock(dmz, mblk);
+	spin_unlock(&dmz->mblk_lock);
+
+	bio = bio_alloc(GFP_NOIO, 1);
+	if (!bio) {
+		dmz_free_mblock(dmz, mblk);
+		return NULL;
+	}
+
+	bio->bi_iter.bi_sector = dmz_blk2sect(block);
+	bio->bi_bdev = dmz->zbd;
+	bio->bi_private = mblk;
+	bio->bi_end_io = dmz_mblock_bio_end_io;
+	bio_set_op_attrs(bio, REQ_OP_READ, REQ_META | REQ_PRIO);
+	bio_add_page(bio, mblk->page, DMZ_BLOCK_SIZE, 0);
+	submit_bio(bio);
+
+	return mblk;
+}
+
+/*
+ * Free metadata blocks.
+ */
+static unsigned long dmz_shrink_mblock_cache(struct dmz_target *dmz,
+					     unsigned long limit)
+{
+	struct dmz_mblock *mblk;
+	unsigned long count = 0;
+
+	if (!dmz->max_nr_mblks)
+		return 0;
+
+	while (!list_empty(&dmz->mblk_lru_list) &&
+	       atomic_read(&dmz->nr_mblks) > dmz->min_nr_mblks &&
+	       count < limit) {
+		mblk = list_first_entry(&dmz->mblk_lru_list,
+					struct dmz_mblock, link);
+		list_del_init(&mblk->link);
+		rb_erase(&mblk->node, &dmz->mblk_rbtree);
+		dmz_free_mblock(dmz, mblk);
+		count++;
+	}
+
+	return count;
+}
+
+/*
+ * For mblock shrinker: get the number of unused metadata blocks in the cache.
+ */
+static unsigned long dmz_mblock_shrinker_count(struct shrinker *shrink,
+					       struct shrink_control *sc)
+{
+	struct dmz_target *dmz =
+		container_of(shrink, struct dmz_target, mblk_shrinker);
+
+	return atomic_read(&dmz->nr_mblks);
+}
+
+/*
+ * For mblock shrinker: scan unused metadata blocks and shrink the cache.
+ */
+static unsigned long dmz_mblock_shrinker_scan(struct shrinker *shrink,
+					      struct shrink_control *sc)
+{
+	struct dmz_target *dmz =
+		container_of(shrink, struct dmz_target, mblk_shrinker);
+	unsigned long count;
+
+	spin_lock(&dmz->mblk_lock);
+	count = dmz_shrink_mblock_cache(dmz, sc->nr_to_scan);
+	spin_unlock(&dmz->mblk_lock);
+
+	return count ? count : SHRINK_STOP;
+}
+
+/*
+ * Release a metadata block.
+ */
+static void dmz_release_mblock(struct dmz_target *dmz, struct dmz_mblock *mblk)
+{
+
+	if (!mblk)
+		return;
+
+	spin_lock(&dmz->mblk_lock);
+
+	if (atomic_dec_and_test(&mblk->ref)) {
+		if (test_bit(DMZ_META_ERROR, &mblk->state)) {
+			rb_erase(&mblk->node, &dmz->mblk_rbtree);
+			dmz_free_mblock(dmz, mblk);
+		} else if (!test_bit(DMZ_META_DIRTY, &mblk->state)) {
+			list_add_tail(&mblk->link, &dmz->mblk_lru_list);
+			dmz_shrink_mblock_cache(dmz, 1);
+		}
+	}
+
+	spin_unlock(&dmz->mblk_lock);
+}
+
+/*
+ * Get a metadata block from the rbtree. If the block
+ * is not present, read it from disk.
+ */
+static struct dmz_mblock *dmz_get_mblock(struct dmz_target *dmz,
+					 sector_t mblk_no)
+{
+	struct dmz_mblock *mblk;
+
+	/* Check rbtree */
+	spin_lock(&dmz->mblk_lock);
+	mblk = dmz_lookup_mblock(dmz, mblk_no);
+	if (mblk) {
+		/* Cache hit: remove block from LRU list */
+		if (atomic_inc_return(&mblk->ref) == 1 &&
+		    !test_bit(DMZ_META_DIRTY, &mblk->state))
+			list_del_init(&mblk->link);
+	}
+	spin_unlock(&dmz->mblk_lock);
+
+	if (!mblk) {
+		/* Cache miss: read the block from disk */
+		mblk = dmz_fetch_mblock(dmz, mblk_no);
+		if (!mblk)
+			return ERR_PTR(-ENOMEM);
+	}
+
+	/* Wait for on-going read I/O and check for error */
+	wait_on_bit_io(&mblk->state, DMZ_META_READING,
+		       TASK_UNINTERRUPTIBLE);
+	if (test_bit(DMZ_META_ERROR, &mblk->state)) {
+		dmz_release_mblock(dmz, mblk);
+		return ERR_PTR(-EIO);
+	}
+
+	return mblk;
+}
+
+/*
+ * Mark a metadata block dirty.
+ */
+static void dmz_dirty_mblock(struct dmz_target *dmz, struct dmz_mblock *mblk)
+{
+	spin_lock(&dmz->mblk_lock);
+	if (!test_and_set_bit(DMZ_META_DIRTY, &mblk->state))
+		list_add_tail(&mblk->link, &dmz->mblk_dirty_list);
+	spin_unlock(&dmz->mblk_lock);
+}
+
+/*
+ * Issue a metadata block write BIO.
+ */
+static void dmz_write_mblock(struct dmz_target *dmz, struct dmz_mblock *mblk,
+			     unsigned int set)
+{
+	sector_t block = dmz->sb[set].block + mblk->no;
+	struct bio *bio;
+
+	bio = bio_alloc(GFP_NOIO, 1);
+	if (!bio) {
+		set_bit(DMZ_META_ERROR, &mblk->state);
+		return;
+	}
+
+	set_bit(DMZ_META_WRITING, &mblk->state);
+
+	bio->bi_iter.bi_sector = dmz_blk2sect(block);
+	bio->bi_bdev = dmz->zbd;
+	bio->bi_private = mblk;
+	bio->bi_end_io = dmz_mblock_bio_end_io;
+	bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_META | REQ_PRIO);
+	bio_add_page(bio, mblk->page, DMZ_BLOCK_SIZE, 0);
+	submit_bio(bio);
+}
+
+/*
+ * Sync read/write a block.
+ */
+static int dmz_rdwr_block_sync(struct dmz_target *dmz, int op, sector_t block,
+			       struct page *page)
+{
+	struct bio *bio;
+	int ret;
+
+	bio = bio_alloc(GFP_NOIO, 1);
+	if (!bio)
+		return -ENOMEM;
+
+	bio->bi_iter.bi_sector = dmz_blk2sect(block);
+	bio->bi_bdev = dmz->zbd;
+	bio_set_op_attrs(bio, op, REQ_SYNC | REQ_META | REQ_PRIO);
+	bio_add_page(bio, page, DMZ_BLOCK_SIZE, 0);
+	ret = submit_bio_wait(bio);
+	bio_put(bio);
+
+	return ret;
+}
+
+/*
+ * Write super block of the specified metadata set.
+ */
+static int dmz_write_sb(struct dmz_target *dmz, unsigned int set)
+{
+	sector_t block = dmz->sb[set].block;
+	struct dmz_mblock *mblk = dmz->sb[set].mblk;
+	struct dmz_super *sb = dmz->sb[set].sb;
+	u64 sb_gen = dmz->sb_gen + 1;
+	int ret;
+
+	sb->magic = cpu_to_le32(DMZ_MAGIC);
+	sb->version = cpu_to_le32(DMZ_META_VER);
+
+	sb->gen = cpu_to_le64(sb_gen);
+
+	sb->sb_block = cpu_to_le64(block);
+	sb->nr_meta_blocks = cpu_to_le32(dmz->nr_meta_blocks);
+	sb->nr_reserved_seq = cpu_to_le32(dmz->nr_reserved_seq);
+	sb->nr_chunks = cpu_to_le32(dmz->nr_chunks);
+
+	sb->nr_map_blocks = cpu_to_le32(dmz->nr_map_blocks);
+	sb->nr_bitmap_blocks = cpu_to_le32(dmz->nr_bitmap_blocks);
+
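+	/*
+	 * The CRC covers the whole block with the crc field zeroed,
+	 * using the new generation number as the seed.
+	 */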
+	sb->crc = 0;
+	sb->crc = cpu_to_le32(crc32_le(sb_gen,
+				       (unsigned char *)sb, DMZ_BLOCK_SIZE));
+
+	ret = dmz_rdwr_block_sync(dmz, REQ_OP_WRITE, block, mblk->page);
+	if (ret == 0)
+		ret = blkdev_issue_flush(dmz->zbd, GFP_KERNEL, NULL);
+
+	return ret;
+}
+
+/*
+ * Write dirty metadata blocks to the specified set.
+ */
+static int dmz_write_dirty_mblocks(struct dmz_target *dmz,
+				   struct list_head *write_list,
+				   unsigned int set)
+{
+	struct dmz_mblock *mblk;
+	struct blk_plug plug;
+	int ret = 0;
+
+	/* Issue writes */
+	blk_start_plug(&plug);
+	list_for_each_entry(mblk, write_list, link)
+		dmz_write_mblock(dmz, mblk, set);
+	blk_finish_plug(&plug);
+
+	/* Wait for completion */
+	list_for_each_entry(mblk, write_list, link) {
+		wait_on_bit_io(&mblk->state, DMZ_META_WRITING,
+			       TASK_UNINTERRUPTIBLE);
+		if (test_bit(DMZ_META_ERROR, &mblk->state)) {
+			dmz_dev_err(dmz, "Write metablock %u/%llu failed\n",
+				    set, (u64)mblk->no);
+			clear_bit(DMZ_META_ERROR, &mblk->state);
+			ret = -EIO;
+		}
+	}
+
+	/* Flush drive cache (this will also sync data) */
+	if (ret == 0)
+		ret = blkdev_issue_flush(dmz->zbd, GFP_KERNEL, NULL);
+
+	return ret;
+}
+
+/*
+ * Log dirty metadata blocks.
+ */
+static int dmz_log_dirty_mblocks(struct dmz_target *dmz,
+				 struct list_head *write_list)
+{
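+	/* The log set is whichever of the two metadata sets is not primary */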
+	unsigned int log_set = dmz->mblk_primary ^ 0x1;
+	int ret;
+
+	dmz_dev_debug(dmz, "Log metadata to set %u, gen %llu\n",
+		      log_set, dmz->sb_gen + 1);
+
+	/* Write dirty blocks to the log */
+	ret = dmz_write_dirty_mblocks(dmz, write_list, log_set);
+	if (ret)
+		return ret;
+
+	/*
+	 * No error so far: now validate the log by updating the
+	 * log index super block generation.
+	 */
+	ret = dmz_write_sb(dmz, log_set);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
+/*
+ * Flush dirty metadata blocks.
+ */
+int dmz_flush_mblocks(struct dmz_target *dmz)
+{
+	struct dmz_mblock *mblk;
+	struct list_head write_list;
+	int ret;
+
+	INIT_LIST_HEAD(&write_list);
+
+	/*
+	 * Prevent BIOs to zones and reclaim. This ensures exclusive
+	 * access to metadata.
+	 */
+	down_write(&dmz->mblk_sem);
+
+	/* If there are no dirty metadata blocks, just flush the device cache */
+	if (list_empty(&dmz->mblk_dirty_list)) {
+		ret = blkdev_issue_flush(dmz->zbd, GFP_KERNEL, NULL);
+		goto out;
+	}
+
+	/*
+	 * The primary metadata set is still clean. Keep it this way until
+	 * all updates are successful in the secondary set. That is, use
+	 * the secondary set as a log.
+	 */
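+	/*
+	 * Commit sequence (sketch): dirty blocks and a new-generation super
+	 * block go to the secondary set first; the primary set is updated
+	 * in place only after that succeeds, so a failure in between leaves
+	 * a complete, newer copy that dmz_load_sb() selects by generation.
+	 */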
+	list_splice_init(&dmz->mblk_dirty_list, &write_list);
+	ret = dmz_log_dirty_mblocks(dmz, &write_list);
+	if (ret)
+		goto out;
+
+	/*
+	 * The log is on disk. It is now safe to update in place
+	 * in the primary metadata set.
+	 */
+	dmz_dev_debug(dmz, "Commit metadata to set %u, gen %llu\n",
+		      dmz->mblk_primary, dmz->sb_gen + 1);
+	ret = dmz_write_dirty_mblocks(dmz, &write_list, dmz->mblk_primary);
+	if (ret)
+		goto out;
+
+	ret = dmz_write_sb(dmz, dmz->mblk_primary);
+	if (ret)
+		goto out;
+
+	while (!list_empty(&write_list)) {
+		mblk = list_first_entry(&write_list,
+					struct dmz_mblock, link);
+		list_del_init(&mblk->link);
+
+		clear_bit(DMZ_META_DIRTY, &mblk->state);
+		if (atomic_read(&mblk->ref) == 0)
+			list_add_tail(&mblk->link, &dmz->mblk_lru_list);
+
+	}
+
+	dmz->sb_gen++;
+
+out:
+	if (ret && !list_empty(&write_list))
+		list_splice(&write_list, &dmz->mblk_dirty_list);
+
+	up_write(&dmz->mblk_sem);
+
+	return ret;
+}
+
+/*
+ * Check super block.
+ */
+static int dmz_check_sb(struct dmz_target *dmz, struct dmz_super *sb)
+{
+	unsigned int nr_meta_zones, nr_data_zones;
+	u32 crc, stored_crc;
+	u64 gen;
+
+	gen = le64_to_cpu(sb->gen);
+	stored_crc = le32_to_cpu(sb->crc);
+	sb->crc = 0;
+	crc = crc32_le(gen, (unsigned char *)sb, DMZ_BLOCK_SIZE);
+	if (crc != stored_crc) {
+		dmz_dev_err(dmz,
+			    "Invalid checksum (needed 0x%08x, got 0x%08x)\n",
+			    crc, stored_crc);
+		return -ENXIO;
+	}
+
+	if (le32_to_cpu(sb->magic) != DMZ_MAGIC) {
+		dmz_dev_err(dmz,
+			    "Invalid meta magic (need 0x%08x, got 0x%08x)\n",
+			    DMZ_MAGIC, le32_to_cpu(sb->magic));
+		return -ENXIO;
+	}
+
+	if (le32_to_cpu(sb->version) != DMZ_META_VER) {
+		dmz_dev_err(dmz, "Invalid meta version (need %d, got %d)\n",
+			    DMZ_META_VER, le32_to_cpu(sb->version));
+		return -ENXIO;
+	}
+
+	nr_meta_zones =
+		(le32_to_cpu(sb->nr_meta_blocks) + dmz->zone_nr_blocks - 1)
+		>> dmz->zone_nr_blocks_shift;
+	if (!nr_meta_zones ||
+	    nr_meta_zones >= dmz->nr_rnd_zones) {
+		dmz_dev_err(dmz, "Invalid number of metadata blocks\n");
+		return -ENXIO;
+	}
+
+	if (!le32_to_cpu(sb->nr_reserved_seq) ||
+	    le32_to_cpu(sb->nr_reserved_seq) >=
+	    (dmz->nr_useable_zones - nr_meta_zones)) {
+		dmz_dev_err(dmz,
+			    "Invalid number of reserved sequential zones\n");
+		return -ENXIO;
+	}
+
+	nr_data_zones = dmz->nr_useable_zones -
+		(nr_meta_zones * 2 + le32_to_cpu(sb->nr_reserved_seq));
+	if (le32_to_cpu(sb->nr_chunks) > nr_data_zones) {
+		dmz_dev_err(dmz, "Invalid number of chunks %u / %u\n",
+			    le32_to_cpu(sb->nr_chunks), nr_data_zones);
+		return -ENXIO;
+	}
+
+	/* OK */
+	dmz->nr_meta_blocks = le32_to_cpu(sb->nr_meta_blocks);
+	dmz->nr_reserved_seq = le32_to_cpu(sb->nr_reserved_seq);
+	dmz->nr_chunks = le32_to_cpu(sb->nr_chunks);
+	dmz->nr_map_blocks = le32_to_cpu(sb->nr_map_blocks);
+	dmz->nr_bitmap_blocks = le32_to_cpu(sb->nr_bitmap_blocks);
+	dmz->nr_meta_zones = nr_meta_zones;
+	dmz->nr_data_zones = nr_data_zones;
+
+	return 0;
+}
+
+/*
+ * Read the first or second super block from disk.
+ */
+static int dmz_read_sb(struct dmz_target *dmz, unsigned int set)
+{
+	return dmz_rdwr_block_sync(dmz, REQ_OP_READ,
+				   dmz->sb[set].block,
+				   dmz->sb[set].mblk->page);
+}
+
+/*
+ * Determine the position of the secondary super block on disk.
+ * This is used only if a corruption of the primary super block
+ * is detected.
+ */
+static int dmz_lookup_secondary_sb(struct dmz_target *dmz)
+{
+	struct dmz_mblock *mblk;
+	int i;
+
+	/* Allocate a block */
+	mblk = dmz_alloc_mblock(dmz, 0);
+	if (!mblk)
+		return -ENOMEM;
+
+	dmz->sb[1].mblk = mblk;
+	dmz->sb[1].sb = mblk->data;
+
+	/* Bad first super block: search for the second one */
+	dmz->sb[1].block = dmz->sb[0].block + dmz->zone_nr_blocks;
+	for (i = 0; i < dmz->nr_rnd_zones - 1; i++) {
+		if (dmz_read_sb(dmz, 1) != 0)
+			break;
+		if (le32_to_cpu(dmz->sb[1].sb->magic) == DMZ_MAGIC)
+			return 0;
+		dmz->sb[1].block += dmz->zone_nr_blocks;
+	}
+
+	dmz_free_mblock(dmz, mblk);
+	dmz->sb[1].mblk = NULL;
+
+	return -EIO;
+}
+
+/*
+ * Read the first or second super block from disk.
+ */
+static int dmz_get_sb(struct dmz_target *dmz, unsigned int set)
+{
+	struct dmz_mblock *mblk;
+	int ret;
+
+	/* Allocate a block */
+	mblk = dmz_alloc_mblock(dmz, 0);
+	if (!mblk)
+		return -ENOMEM;
+
+	dmz->sb[set].mblk = mblk;
+	dmz->sb[set].sb = mblk->data;
+
+	/* Read super block */
+	ret = dmz_read_sb(dmz, set);
+	if (ret) {
+		dmz_free_mblock(dmz, mblk);
+		dmz->sb[set].mblk = NULL;
+		return ret;
+	}
+
+	return 0;
+}
+
+/*
+ * Recover a metadata set.
+ */
+static int dmz_recover_mblocks(struct dmz_target *dmz, unsigned int dst_set)
+{
+	unsigned int src_set = dst_set ^ 0x1;
+	struct page *page;
+	int i, ret;
+
+	dmz_dev_warn(dmz, "Metadata set %u invalid: recovering\n",
+		     dst_set);
+
+	if (dst_set == 0)
+		dmz->sb[0].block = dmz_start_block(dmz, dmz->sb_zone);
+	else
+		dmz->sb[1].block = dmz->sb[0].block +
+			(dmz->nr_meta_zones * dmz->zone_nr_blocks);
+
+	page = alloc_page(GFP_KERNEL);
+	if (!page)
+		return -ENOMEM;
+
+	/* Copy metadata blocks */
+	for (i = 1; i < dmz->nr_meta_blocks; i++) {
+		ret = dmz_rdwr_block_sync(dmz, REQ_OP_READ,
+					  dmz->sb[src_set].block + i,
+					  page);
+		if (ret)
+			goto out;
+		ret = dmz_rdwr_block_sync(dmz, REQ_OP_WRITE,
+					  dmz->sb[dst_set].block + i,
+					  page);
+		if (ret)
+			goto out;
+	}
+
+	/* Finalize with the super block */
+	if (!dmz->sb[dst_set].mblk) {
+		dmz->sb[dst_set].mblk = dmz_alloc_mblock(dmz, 0);
+		if (!dmz->sb[dst_set].mblk) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		dmz->sb[dst_set].sb = dmz->sb[dst_set].mblk->data;
+	}
+
+	ret = dmz_write_sb(dmz, dst_set);
+
+out:
+	__free_pages(page, 0);
+
+	return ret;
+}
+
+/*
+ * Get super block from disk.
+ */
+static int dmz_load_sb(struct dmz_target *dmz)
+{
+	bool sb_good[2] = {false, false};
+	u64 sb_gen[2] = {0, 0};
+	int ret;
+
+	/* Read and check the primary super block */
+	dmz->sb[0].block = dmz_start_block(dmz, dmz->sb_zone);
+	ret = dmz_get_sb(dmz, 0);
+	if (ret) {
+		dmz_dev_err(dmz, "Read primary super block failed\n");
+		return ret;
+	}
+
+	ret = dmz_check_sb(dmz, dmz->sb[0].sb);
+
+	/* Read and check secondary super block */
+	if (ret == 0) {
+		sb_good[0] = true;
+		dmz->sb[1].block = dmz->sb[0].block +
+			(dmz->nr_meta_zones * dmz->zone_nr_blocks);
+		ret = dmz_get_sb(dmz, 1);
+	} else {
+		ret = dmz_lookup_secondary_sb(dmz);
+	}
+	if (ret) {
+		dmz_dev_err(dmz, "Read secondary super block failed\n");
+		return ret;
+	}
+
+	ret = dmz_check_sb(dmz, dmz->sb[1].sb);
+	if (ret == 0)
+		sb_good[1] = true;
+
+	/* Use highest generation sb first */
+	if (!sb_good[0] && !sb_good[1]) {
+		dmz_dev_err(dmz, "No valid super block found\n");
+		return -EIO;
+	}
+
+	if (sb_good[0])
+		sb_gen[0] = le64_to_cpu(dmz->sb[0].sb->gen);
+	else
+		ret = dmz_recover_mblocks(dmz, 0);
+
+	if (sb_good[1])
+		sb_gen[1] = le64_to_cpu(dmz->sb[1].sb->gen);
+	else
+		ret = dmz_recover_mblocks(dmz, 1);
+
+	if (ret) {
+		dmz_dev_err(dmz, "Recovery failed\n");
+		return -EIO;
+	}
+
+	if (sb_gen[0] >= sb_gen[1]) {
+		dmz->sb_gen = sb_gen[0];
+		dmz->mblk_primary = 0;
+	} else {
+		dmz->sb_gen = sb_gen[1];
+		dmz->mblk_primary = 1;
+	}
+
+	dmz_dev_debug(dmz, "Using super block %u (gen %llu)\n",
+		      dmz->mblk_primary, dmz->sb_gen);
+
+	return 0;
+}
+
+/*
+ * Initialize a zone descriptor.
+ */
+static int dmz_init_zone(struct dmz_target *dmz, struct dm_zone *zone,
+			 struct blk_zone *blkz)
+{
+
+	/* Ignore a possible smaller last (runt) zone */
+	if (blkz->len != dmz->zone_nr_sectors) {
+		if (blkz->start + blkz->len == dmz->zbd_capacity)
+			return 0;
+		return -ENXIO;
+	}
+
+	INIT_LIST_HEAD(&zone->link);
+	atomic_set(&zone->refcount, 0);
+	zone->chunk = DMZ_MAP_UNMAPPED;
+
+	if (blkz->type == BLK_ZONE_TYPE_CONVENTIONAL) {
+		set_bit(DMZ_RND, &zone->flags);
+		dmz->nr_rnd_zones++;
+	} else if (blkz->type == BLK_ZONE_TYPE_SEQWRITE_REQ ||
+		   blkz->type == BLK_ZONE_TYPE_SEQWRITE_PREF) {
+		set_bit(DMZ_SEQ, &zone->flags);
+	} else {
+		return -ENXIO;
+	}
+
+	if (blkz->cond == BLK_ZONE_COND_OFFLINE)
+		set_bit(DMZ_OFFLINE, &zone->flags);
+	else if (blkz->cond == BLK_ZONE_COND_READONLY)
+		set_bit(DMZ_READ_ONLY, &zone->flags);
+
+	if (dmz_is_rnd(zone))
+		zone->wp_block = 0;
+	else
+		zone->wp_block = dmz_sect2blk(blkz->wp - blkz->start);
+
+	if (!dmz_is_offline(zone) && !dmz_is_readonly(zone)) {
+		dmz->nr_useable_zones++;
+		if (dmz_is_rnd(zone)) {
+			dmz->nr_rnd_zones++;
+			if (!dmz->sb_zone) {
+				/* Super block zone */
+				dmz->sb_zone = zone;
+			}
+		}
+	}
+
+	return 0;
+}
+
+/*
+ * Free the zone descriptors.
+ */
+static void dmz_drop_zones(struct dmz_target *dmz)
+{
+	kfree(dmz->zones);
+	dmz->zones = NULL;
+}
+
+/*
+ * Allocate and initialize zone descriptors using the zone
+ * information from disk.
+ */
+static int dmz_init_zones(struct dmz_target *dmz)
+{
+	struct dm_zone *zone;
+	struct blk_zone *blkz;
+	unsigned int nr_blkz;
+	sector_t sector = 0;
+	int i, ret = 0;
+
+	/* Init */
+	dmz->zone_nr_sectors = dmz->zbdq->limits.chunk_sectors;
+	dmz->zone_nr_sectors_shift = ilog2(dmz->zone_nr_sectors);
+
+	dmz->zone_nr_blocks = dmz_sect2blk(dmz->zone_nr_sectors);
+	dmz->zone_nr_blocks_shift = ilog2(dmz->zone_nr_blocks);
+
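+	/*
+	 * One validity bit per block: assuming 4 KB blocks and 256 MB zones,
+	 * a zone has 65536 blocks and its bitmap needs 8 KB, i.e. 2 blocks.
+	 */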
+	dmz->zone_bitmap_size = dmz->zone_nr_blocks >> 3;
+	dmz->zone_nr_bitmap_blocks =
+		dmz->zone_bitmap_size >> DMZ_BLOCK_SHIFT;
+
+	dmz->nr_zones = (dmz->zbd_capacity + dmz->zone_nr_sectors - 1)
+		>> dmz->zone_nr_sectors_shift;
+
+	/* Allocate zone array */
+	dmz->zones = kcalloc(dmz->nr_zones, sizeof(struct dm_zone), GFP_KERNEL);
+	if (!dmz->zones)
+		return -ENOMEM;
+
+	dmz_dev_info(dmz, "Using %zu B for zone information\n",
+		     sizeof(struct dm_zone) * dmz->nr_zones);
+
+	/* Get zone information */
+	nr_blkz = DMZ_REPORT_NR_ZONES;
+	blkz = kcalloc(nr_blkz, sizeof(struct blk_zone), GFP_KERNEL);
+	if (!blkz) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	/*
+	 * Get zone information and initialize zone descriptors.
+	 * At the same time, determine where the super block
+	 * should be: first block of the first randomly writable
+	 * zone.
+	 */
+	zone = dmz->zones;
+	while (sector < dmz->zbd_capacity) {
+
+		/* Get zone information */
+		nr_blkz = DMZ_REPORT_NR_ZONES;
+		ret = blkdev_report_zones(dmz->zbd, sector,
+					  blkz, &nr_blkz,
+					  GFP_KERNEL);
+		if (ret) {
+			dmz_dev_err(dmz, "Report zones failed %d\n", ret);
+			goto out;
+		}
+
+		/* Process report */
+		for (i = 0; i < nr_blkz; i++) {
+			ret = dmz_init_zone(dmz, zone, &blkz[i]);
+			if (ret)
+				goto out;
+			sector += dmz->zone_nr_sectors;
+			zone++;
+		}
+
+	}
+
+	/* The entire zone configuration of the disk should now be known */
+	if (sector < dmz->zbd_capacity) {
+		dmz_dev_err(dmz, "Failed to get correct zone information\n");
+		ret = -ENXIO;
+	}
+
+out:
+	kfree(blkz);
+
+	if (ret)
+		dmz_drop_zones(dmz);
+
+	return ret;
+}
+
+/*
+ * Update a zone's information.
+ */
+static int dmz_update_zone(struct dmz_target *dmz, struct dm_zone *zone)
+{
+	unsigned int nr_blkz = 1;
+	struct blk_zone blkz;
+	int ret;
+
+	/* Get zone information from disk */
+	ret = blkdev_report_zones(dmz->zbd, dmz_start_sect(dmz, zone),
+				  &blkz, &nr_blkz, GFP_KERNEL);
+	if (ret) {
+		dmz_dev_err(dmz, "Get zone %u report failed\n",
+			    dmz_id(dmz, zone));
+		return ret;
+	}
+
+	clear_bit(DMZ_OFFLINE, &zone->flags);
+	clear_bit(DMZ_READ_ONLY, &zone->flags);
+	if (blkz.cond == BLK_ZONE_COND_OFFLINE)
+		set_bit(DMZ_OFFLINE, &zone->flags);
+	else if (blkz.cond == BLK_ZONE_COND_READONLY)
+		set_bit(DMZ_READ_ONLY, &zone->flags);
+
+	if (dmz_is_seq(zone))
+		zone->wp_block = dmz_sect2blk(blkz.wp - blkz.start);
+	else
+		zone->wp_block = 0;
+
+	return 0;
+}
+
+/*
+ * Check a zone write pointer position when the zone is marked
+ * with the sequential write error flag.
+ */
+static int dmz_handle_seq_write_err(struct dmz_target *dmz,
+				    struct dm_zone *zone)
+{
+	unsigned int wp = 0;
+	int ret = 0;
+
+	wp = zone->wp_block;
+	ret = dmz_update_zone(dmz, zone);
+	if (ret != 0)
+		return ret;
+
+	dmz_dev_warn(dmz, "Processing zone %u write error (zone wp %u/%u)\n",
+		     dmz_id(dmz, zone), zone->wp_block, wp);
+
+	if (zone->wp_block < wp)
+		dmz_invalidate_blocks(dmz, zone,
+				      zone->wp_block,
+				      wp - zone->wp_block);
+
+	return 0;
+}
+
+/*
+ * Check zone information after a resume.
+ */
+static int dmz_check_zones(struct dmz_target *dmz)
+{
+	struct dm_zone *zone;
+	sector_t wp_block;
+	unsigned int i;
+	int ret;
+
+	/* Check zones */
+	for (i = 0; i < dmz->nr_zones; i++) {
+
+		zone = dmz_get(dmz, i);
+		if (!zone) {
+			dmz_dev_err(dmz, "Unable to get zone %u\n", i);
+			return -EIO;
+		}
+
+		wp_block = zone->wp_block;
+
+		ret = dmz_update_zone(dmz, zone);
+		if (ret) {
+			dmz_dev_err(dmz, "Broken zone %u\n", i);
+			return ret;
+		}
+
+		if (dmz_is_offline(zone)) {
+			dmz_dev_warn(dmz, "Zone %u is offline\n", i);
+			continue;
+		}
+
+		/* Check write pointer */
+		if (!dmz_is_seq(zone))
+			zone->wp_block = 0;
+		else if (zone->wp_block != wp_block) {
+			dmz_dev_err(dmz, "Zone %u: Invalid wp (%llu / %llu)\n",
+				    i, (u64)zone->wp_block, (u64)wp_block);
+			zone->wp_block = wp_block;
+			dmz_invalidate_blocks(dmz, zone, zone->wp_block,
+					dmz->zone_nr_blocks - zone->wp_block);
+		}
+
+	}
+
+	return 0;
+}
+
+/*
+ * Reset a zone write pointer.
+ */
+static int dmz_reset_zone(struct dmz_target *dmz, struct dm_zone *zone)
+{
+	int ret;
+
+	/*
+	 * Ignore offline zones, read only zones,
+	 * and conventional zones.
+	 */
+	if (dmz_is_offline(zone) ||
+	    dmz_is_readonly(zone) ||
+	    dmz_is_rnd(zone))
+		return 0;
+
+	if (!dmz_is_empty(zone) || dmz_seq_write_err(zone)) {
+		ret = blkdev_reset_zones(dmz->zbd,
+					 dmz_start_sect(dmz, zone),
+					 dmz->zone_nr_sectors,
+					 GFP_KERNEL);
+		if (ret) {
+			dmz_dev_err(dmz, "Reset zone %u failed %d\n",
+				    dmz_id(dmz, zone), ret);
+			return ret;
+		}
+	}
+
+	/* Clear write error bit and rewind write pointer position */
+	clear_bit(DMZ_SEQ_WRITE_ERR, &zone->flags);
+	zone->wp_block = 0;
+
+	return 0;
+}
+
+static void dmz_get_zone_weight(struct dmz_target *dmz, struct dm_zone *zone);
+
+/*
+ * Initialize chunk mapping.
+ */
+static int dmz_load_mapping(struct dmz_target *dmz)
+{
+	struct dm_zone *dzone, *bzone;
+	struct dmz_mblock *dmap_mblk = NULL;
+	struct dmz_map *dmap;
+	unsigned int i = 0, e = 0, chunk = 0;
+	unsigned int dzone_id;
+	unsigned int bzone_id;
+
+	/* Metadata block array for the chunk mapping table */
+	dmz->dz_map_mblk = kcalloc(dmz->nr_map_blocks,
+				   sizeof(struct dmz_mblock *),
+				   GFP_KERNEL);
+	if (!dmz->dz_map_mblk)
+		return -ENOMEM;
+
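+	/*
+	 * Each mapping block holds DMZ_MAP_ENTRIES entries: entry e of
+	 * block i describes chunk i * DMZ_MAP_ENTRIES + e, giving its data
+	 * zone ID and, if the chunk is buffered, its buffer zone ID.
+	 */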
+	/* Get chunk mapping table blocks and initialize zone mapping */
+	while (chunk < dmz->nr_chunks) {
+
+		if (!dmap_mblk) {
+			/* Get mapping block */
+			dmap_mblk = dmz_get_mblock(dmz, i + 1);
+			if (IS_ERR(dmap_mblk))
+				return PTR_ERR(dmap_mblk);
+			dmz->dz_map_mblk[i] = dmap_mblk;
+			dmap = (struct dmz_map *) dmap_mblk->data;
+			i++;
+			e = 0;
+		}
+
+		/* Check data zone */
+		dzone_id = le32_to_cpu(dmap[e].dzone_id);
+		if (dzone_id == DMZ_MAP_UNMAPPED)
+			goto next;
+
+		if (dzone_id >= dmz->nr_zones) {
+			dmz_dev_err(dmz,
+				"Chunk %u mapping: invalid data zone ID %u\n",
+				chunk, dzone_id);
+			return -EIO;
+		}
+
+		dzone = dmz_get(dmz, dzone_id);
+		set_bit(DMZ_DATA, &dzone->flags);
+		dzone->chunk = chunk;
+		dmz_get_zone_weight(dmz, dzone);
+
+		if (dmz_is_rnd(dzone))
+			list_add_tail(&dzone->link, &dmz->dz_map_rnd_list);
+		else
+			list_add_tail(&dzone->link, &dmz->dz_map_seq_list);
+
+		/* Check buffer zone */
+		bzone_id = le32_to_cpu(dmap[e].bzone_id);
+		if (bzone_id == DMZ_MAP_UNMAPPED)
+			goto next;
+
+		if (bzone_id >= dmz->nr_zones) {
+			dmz_dev_err(dmz,
+				"Chunk %u mapping: invalid buffer zone ID %u\n",
+				chunk, bzone_id);
+			return -EIO;
+		}
+
+		bzone = dmz_get(dmz, bzone_id);
+		if (!dmz_is_rnd(bzone)) {
+			dmz_dev_err(dmz,
+				"Chunk %u mapping: invalid buffer zone %u\n",
+				chunk, bzone_id);
+			return -EIO;
+		}
+
+		set_bit(DMZ_DATA, &bzone->flags);
+		set_bit(DMZ_BUF, &bzone->flags);
+		bzone->chunk = chunk;
+		bzone->bzone = dzone;
+		dzone->bzone = bzone;
+		dmz_get_zone_weight(dmz, bzone);
+		list_add_tail(&bzone->link, &dmz->dz_map_rnd_list);
+
+next:
+		chunk++;
+		e++;
+		if (e >= DMZ_MAP_ENTRIES)
+			dmap_mblk = NULL;
+
+	}
+
+	/*
+	 * At this point, only meta zones and mapped data zones have been
+	 * fully initialized. All remaining zones are unmapped data
+	 * zones. Finish initializing those here.
+	 */
+	for (i = 0; i < dmz->nr_zones; i++) {
+
+		dzone = dmz_get(dmz, i);
+		if (dmz_is_meta(dzone))
+			continue;
+
+		if (dmz_is_rnd(dzone))
+			dmz->dz_nr_rnd++;
+		else
+			dmz->dz_nr_seq++;
+
+		if (dmz_is_data(dzone))
+			/* Already initialized */
+			continue;
+
+		/* Unmapped data zone */
+		set_bit(DMZ_DATA, &dzone->flags);
+		dzone->chunk = DMZ_MAP_UNMAPPED;
+		if (dmz_is_rnd(dzone)) {
+			list_add_tail(&dzone->link,
+				      &dmz->dz_unmap_rnd_list);
+			atomic_inc(&dmz->dz_unmap_nr_rnd);
+		} else if (atomic_read(&dmz->nr_reclaim_seq_zones) <
+			   dmz->nr_reserved_seq) {
+			list_add_tail(&dzone->link,
+				      &dmz->reclaim_seq_zones_list);
+			atomic_inc(&dmz->nr_reclaim_seq_zones);
+			dmz->dz_nr_seq--;
+		} else {
+			list_add_tail(&dzone->link,
+				      &dmz->dz_unmap_seq_list);
+			atomic_inc(&dmz->dz_unmap_nr_seq);
+		}
+	}
+
+	return 0;
+}
+
+/*
+ * Set a data chunk mapping.
+ */
+static void dmz_set_chunk_mapping(struct dmz_target *dmz,
+				  unsigned int chunk,
+				  unsigned int dzone_id,
+				  unsigned int bzone_id)
+{
+	struct dmz_mblock *dmap_mblk =
+		dmz->dz_map_mblk[chunk >> DMZ_MAP_ENTRIES_SHIFT];
+	struct dmz_map *dmap = (struct dmz_map *) dmap_mblk->data;
+	int map_idx = chunk & DMZ_MAP_ENTRIES_MASK;
+
+	dmap[map_idx].dzone_id = cpu_to_le32(dzone_id);
+	dmap[map_idx].bzone_id = cpu_to_le32(bzone_id);
+	dmz_dirty_mblock(dmz, dmap_mblk);
+}
+
+/*
+ * The list of mapped zones is maintained in LRU order.
+ * This rotates a zone to the end of its map list.
+ */
+static void __dmz_lru_zone(struct dmz_target *dmz,
+			   struct dm_zone *zone)
+{
+	if (list_empty(&zone->link))
+		return;
+
+	list_del_init(&zone->link);
+	if (dmz_is_seq(zone))
+		/* LRU rotate sequential zone */
+		list_add_tail(&zone->link, &dmz->dz_map_seq_list);
+	else
+		/* LRU rotate random zone */
+		list_add_tail(&zone->link, &dmz->dz_map_rnd_list);
+}
+
+/*
+ * Rotate a mapped zone, and its buffer zone if any, to the end
+ * of their respective LRU map lists.
+ */
+static void dmz_lru_zone(struct dmz_target *dmz,
+			 struct dm_zone *zone)
+{
+	__dmz_lru_zone(dmz, zone);
+	if (zone->bzone)
+		__dmz_lru_zone(dmz, zone->bzone);
+}
+
+/*
+ * Wait for any zone to be freed.
+ */
+static void dmz_wait_for_free_zones(struct dmz_target *dmz)
+{
+	DEFINE_WAIT(wait);
+
+	dmz_trigger_reclaim(dmz);
+
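+	/*
+	 * Drop the metadata and map locks while waiting so that
+	 * reclaim can make progress and free a zone.
+	 */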
+	prepare_to_wait(&dmz->dz_free_wq, &wait, TASK_UNINTERRUPTIBLE);
+	dmz_unlock_map(dmz);
+	up_read(&dmz->mblk_sem);
+
+	io_schedule_timeout(HZ);
+
+	down_read(&dmz->mblk_sem);
+	dmz_lock_map(dmz);
+	finish_wait(&dmz->dz_free_wq, &wait);
+}
+
+/*
+ * Wait for a zone reclaim to complete.
+ */
+static void dmz_wait_for_reclaim(struct dmz_target *dmz,
+				 struct dm_zone *zone)
+{
+	dmz_unlock_map(dmz);
+	wait_on_bit_timeout(&zone->flags, DMZ_RECLAIM,
+			    TASK_UNINTERRUPTIBLE,
+			    HZ);
+	dmz_lock_map(dmz);
+}
+
+/*
+ * Activate a zone (increment its reference count).
+ */
+void dmz_activate_zone(struct dmz_target *dmz, struct dm_zone *zone)
+{
+	set_bit(DMZ_ACTIVE, &zone->flags);
+	atomic_inc(&zone->refcount);
+}
+
+/*
+ * Deactivate a zone. This decrements the zone reference counter
+ * and clears the active state of the zone once the count reaches 0,
+ * indicating that all BIOs to the zone have completed.
+ */
+void dmz_deactivate_zone(struct dmz_target *dmz, struct dm_zone *zone)
+{
+	if (atomic_dec_and_test(&zone->refcount)) {
+		WARN_ON(!test_bit(DMZ_ACTIVE, &zone->flags));
+		clear_bit_unlock(DMZ_ACTIVE, &zone->flags);
+		smp_mb__after_atomic();
+	}
+}
+
+/*
+ * Get the zone mapping a chunk, if the chunk is mapped already.
+ * If no mapping exists and the operation is a write, a zone is
+ * allocated and used to map the chunk.
+ * The zone returned will be set to the active state.
+ */
+struct dm_zone *dmz_get_chunk_mapping(struct dmz_target *dmz,
+				      unsigned int chunk, int op)
+{
+	struct dmz_mblock *dmap_mblk =
+		dmz->dz_map_mblk[chunk >> DMZ_MAP_ENTRIES_SHIFT];
+	struct dmz_map *dmap = (struct dmz_map *) dmap_mblk->data;
+	int dmap_idx = chunk & DMZ_MAP_ENTRIES_MASK;
+	unsigned int dzone_id;
+	struct dm_zone *dzone = NULL;
+	int ret = 0;
+
+	dmz_lock_map(dmz);
+
+again:
+
+	/* Get the chunk mapping */
+	dzone_id = le32_to_cpu(dmap[dmap_idx].dzone_id);
+	if (dzone_id == DMZ_MAP_UNMAPPED) {
+
+		/*
+		 * Reads or discards in unmapped chunks are fine. But for
+		 * writes, we need a mapping, so get one.
+		 */
+		if (op != REQ_OP_WRITE)
+			goto out;
+
+		/* Allocate a random zone */
+		dzone = dmz_alloc_zone(dmz, DMZ_ALLOC_RND);
+		if (!dzone) {
+			dmz_wait_for_free_zones(dmz);
+			goto again;
+		}
+
+		dmz_map_zone(dmz, dzone, chunk);
+
+	} else {
+
+		/* The chunk is already mapped: get the mapping zone */
+		dzone = dmz_get(dmz, dzone_id);
+		if (dzone->chunk != chunk) {
+			dzone = ERR_PTR(-EIO);
+			goto out;
+		}
+
+		/* Repair write pointer if the sequential dzone has error */
+		if (dmz_seq_write_err(dzone)) {
+			ret = dmz_handle_seq_write_err(dmz, dzone);
+			if (ret) {
+				dzone = ERR_PTR(-EIO);
+				goto out;
+			}
+			clear_bit(DMZ_SEQ_WRITE_ERR, &dzone->flags);
+		}
+	}
+
+	/*
+	 * If the zone is being reclaimed, the chunk mapping may change
+	 * to a different zone. So wait for reclaim and retry. Otherwise,
+	 * activate the zone (this will prevent reclaim from touching it).
+	 */
+	if (dmz_in_reclaim(dzone)) {
+		dmz_wait_for_reclaim(dmz, dzone);
+		goto again;
+	}
+	dmz_activate_zone(dmz, dzone);
+	dmz_lru_zone(dmz, dzone);
+
+out:
+	dmz_unlock_map(dmz);
+
+	return dzone;
+}
+
+/*
+ * Write and discard change the block validity of data zones and their buffer
+ * zones. Check here that valid blocks are still present. If all blocks are
+ * invalid, the zones can be unmapped on the fly without waiting for reclaim
+ * to do it.
+ */
+void dmz_put_chunk_mapping(struct dmz_target *dmz, struct dm_zone *dzone)
+{
+	struct dm_zone *bzone;
+
+	dmz_lock_map(dmz);
+
+	bzone = dzone->bzone;
+	if (bzone) {
+		if (dmz_weight(bzone)) {
+			dmz_lru_zone(dmz, bzone);
+		} else {
+			/* Empty buffer zone: reclaim it */
+			dmz_unmap_zone(dmz, bzone);
+			dmz_free_zone(dmz, bzone);
+			bzone = NULL;
+		}
+	}
+
+	/* Deactivate the data zone */
+	dmz_deactivate_zone(dmz, dzone);
+	if (dmz_is_active(dzone) || bzone || dmz_weight(dzone)) {
+		dmz_lru_zone(dmz, dzone);
+	} else {
+		/* Unbuffered inactive empty data zone: reclaim it */
+		dmz_unmap_zone(dmz, dzone);
+		dmz_free_zone(dmz, dzone);
+	}
+
+	dmz_unlock_map(dmz);
+}
+
+/*
+ * Allocate and map a random zone to buffer a chunk
+ * already mapped to a sequential zone.
+ */
+struct dm_zone *dmz_get_chunk_buffer(struct dmz_target *dmz,
+				     struct dm_zone *dzone)
+{
+	struct dm_zone *bzone;
+	unsigned int chunk;
+
+	dmz_lock_map(dmz);
+
+	chunk = dzone->chunk;
+
+	/* Allocate a random zone */
+	do {
+		bzone = dmz_alloc_zone(dmz, DMZ_ALLOC_RND);
+		if (!bzone)
+			dmz_wait_for_free_zones(dmz);
+	} while (!bzone);
+
+	/* Update the chunk mapping */
+	dmz_set_chunk_mapping(dmz, chunk,
+			      dmz_id(dmz, dzone),
+			      dmz_id(dmz, bzone));
+
+	set_bit(DMZ_BUF, &bzone->flags);
+	bzone->chunk = chunk;
+	bzone->bzone = dzone;
+	dzone->bzone = bzone;
+	list_add_tail(&bzone->link, &dmz->dz_map_rnd_list);
+
+	dmz_unlock_map(dmz);
+
+	return bzone;
+}
+
+/*
+ * Get an unmapped (free) zone.
+ * This must be called with the mapping lock held.
+ */
+struct dm_zone *dmz_alloc_zone(struct dmz_target *dmz, unsigned long flags)
+{
+	struct list_head *list;
+	struct dm_zone *zone;
+
+	if (flags & DMZ_ALLOC_RND)
+		list = &dmz->dz_unmap_rnd_list;
+	else
+		list = &dmz->dz_unmap_seq_list;
+
+again:
+	if (list_empty(list)) {
+
+		/*
+		 * No free zone: if this is for reclaim, allow using the
+		 * reserved sequential zones.
+		 */
+		if (!(flags & DMZ_ALLOC_RECLAIM) ||
+		    list_empty(&dmz->reclaim_seq_zones_list))
+			return NULL;
+
+		zone = list_first_entry(&dmz->reclaim_seq_zones_list,
+					struct dm_zone, link);
+		list_del_init(&zone->link);
+		atomic_dec(&dmz->nr_reclaim_seq_zones);
+		return zone;
+
+	}
+
+	zone = list_first_entry(list, struct dm_zone, link);
+	list_del_init(&zone->link);
+
+	if (dmz_is_rnd(zone))
+		atomic_dec(&dmz->dz_unmap_nr_rnd);
+	else
+		atomic_dec(&dmz->dz_unmap_nr_seq);
+
+	if (dmz_is_offline(zone)) {
+		dmz_dev_warn(dmz, "Zone %u is offline\n",
+			     dmz_id(dmz, zone));
+		zone = NULL;
+		goto again;
+	}
+
+	if (dmz_should_reclaim(dmz))
+		dmz_trigger_reclaim(dmz);
+
+	return zone;
+}
+
+/*
+ * Free a zone.
+ * This must be called with the mapping lock held.
+ */
+void dmz_free_zone(struct dmz_target *dmz, struct dm_zone *zone)
+{
+	/* If this is a sequential zone, reset it */
+	if (dmz_is_seq(zone))
+		dmz_reset_zone(dmz, zone);
+
+	/* Return the zone to its type unmap list */
+	if (dmz_is_rnd(zone)) {
+		list_add_tail(&zone->link, &dmz->dz_unmap_rnd_list);
+		atomic_inc(&dmz->dz_unmap_nr_rnd);
+	} else if (atomic_read(&dmz->nr_reclaim_seq_zones) <
+		   dmz->nr_reserved_seq) {
+		list_add_tail(&zone->link, &dmz->reclaim_seq_zones_list);
+		atomic_inc(&dmz->nr_reclaim_seq_zones);
+	} else {
+		list_add_tail(&zone->link, &dmz->dz_unmap_seq_list);
+		atomic_inc(&dmz->dz_unmap_nr_seq);
+	}
+
+	wake_up_all(&dmz->dz_free_wq);
+}
+
+/*
+ * Map a chunk to a zone.
+ * This must be called with the mapping lock held.
+ */
+void dmz_map_zone(struct dmz_target *dmz, struct dm_zone *dzone,
+		  unsigned int chunk)
+{
+
+	/* Set the chunk mapping */
+	dmz_set_chunk_mapping(dmz, chunk,
+			      dmz_id(dmz, dzone),
+			      DMZ_MAP_UNMAPPED);
+	dzone->chunk = chunk;
+	if (dmz_is_rnd(dzone))
+		list_add_tail(&dzone->link, &dmz->dz_map_rnd_list);
+	else
+		list_add_tail(&dzone->link, &dmz->dz_map_seq_list);
+}
+
+/*
+ * Unmap a zone.
+ * This must be called with the mapping lock held.
+ */
+void dmz_unmap_zone(struct dmz_target *dmz, struct dm_zone *zone)
+{
+	unsigned int chunk = zone->chunk;
+	unsigned int dzone_id;
+
+	if (chunk == DMZ_MAP_UNMAPPED)
+		/* Already unmapped */
+		return;
+
+	if (test_and_clear_bit(DMZ_BUF, &zone->flags)) {
+
+		/*
+		 * Unmapping the chunk buffer zone: clear only
+		 * the chunk buffer mapping
+		 */
+		dzone_id = dmz_id(dmz, zone->bzone);
+		zone->bzone->bzone = NULL;
+		zone->bzone = NULL;
+
+	} else {
+
+		/*
+		 * Unmapping the chunk data zone: the zone must
+		 * not be buffered.
+		 */
+		if (WARN_ON(zone->bzone)) {
+			zone->bzone->bzone = NULL;
+			zone->bzone = NULL;
+		}
+		dzone_id = DMZ_MAP_UNMAPPED;
+
+	}
+
+	dmz_set_chunk_mapping(dmz, chunk, dzone_id,
+			      DMZ_MAP_UNMAPPED);
+
+	zone->chunk = DMZ_MAP_UNMAPPED;
+	list_del_init(&zone->link);
+}
+
+/*
+ * Set @nr_bits bits in @bitmap starting from @bit.
+ * Return the number of bits changed from 0 to 1.
+ */
+static unsigned int dmz_set_bits(unsigned long *bitmap,
+				 unsigned int bit, unsigned int nr_bits)
+{
+	unsigned long *addr;
+	unsigned int end = bit + nr_bits;
+	unsigned int n = 0;
+
+	while (bit < end) {
+
+		if (((bit & (BITS_PER_LONG - 1)) == 0) &&
+		    ((end - bit) >= BITS_PER_LONG)) {
+			/* Try to set the whole word at once */
+			addr = bitmap + BIT_WORD(bit);
+			if (*addr == 0) {
+				*addr = ULONG_MAX;
+				n += BITS_PER_LONG;
+				bit += BITS_PER_LONG;
+				continue;
+			}
+		}
+
+		if (!test_and_set_bit(bit, bitmap))
+			n++;
+		bit++;
+
+	}
+
+	return n;
+
+}
+
+/*
+ * Get the bitmap block storing the bit for chunk_block in zone.
+ */
+static struct dmz_mblock *dmz_get_bitmap(struct dmz_target *dmz,
+					 struct dm_zone *zone,
+					 sector_t chunk_block)
+{
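+	/*
+	 * Per-set layout: one super block, then the chunk mapping blocks,
+	 * then the per-zone block bitmaps in zone ID order.
+	 */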
+	sector_t bitmap_block = 1 + dmz->nr_map_blocks
+		+ (sector_t)(dmz_id(dmz, zone) * dmz->zone_nr_bitmap_blocks)
+		+ (chunk_block >> DMZ_BLOCK_SHIFT_BITS);
+
+	return dmz_get_mblock(dmz, bitmap_block);
+}
+
+/*
+ * Copy the bitmap of from_zone to the bitmap of to_zone.
+ */
+int dmz_valid_copy(struct dmz_target *dmz, struct dm_zone *from_zone,
+		   struct dm_zone *to_zone)
+{
+	struct dmz_mblock *from_mblk, *to_mblk;
+	sector_t chunk_block = 0;
+
+	/* Get the zones bitmap blocks */
+	while (chunk_block < dmz->zone_nr_blocks) {
+
+		from_mblk = dmz_get_bitmap(dmz, from_zone, chunk_block);
+		if (IS_ERR(from_mblk))
+			return PTR_ERR(from_mblk);
+		to_mblk = dmz_get_bitmap(dmz, to_zone, chunk_block);
+		if (IS_ERR(to_mblk)) {
+			dmz_release_mblock(dmz, from_mblk);
+			return PTR_ERR(to_mblk);
+		}
+
+		memcpy(to_mblk->data, from_mblk->data, DMZ_BLOCK_SIZE);
+		dmz_dirty_mblock(dmz, to_mblk);
+
+		dmz_release_mblock(dmz, to_mblk);
+		dmz_release_mblock(dmz, from_mblk);
+
+		chunk_block += DMZ_BLOCK_SIZE_BITS;
+
+	}
+
+	to_zone->weight = from_zone->weight;
+
+	return 0;
+}
+
+/*
+ * Merge the valid blocks of from_zone into the bitmap of to_zone.
+ */
+int dmz_valid_merge(struct dmz_target *dmz, struct dm_zone *from_zone,
+		    struct dm_zone *to_zone, sector_t chunk_block)
+{
+	unsigned int nr_blocks;
+	int ret;
+
+	/* Get the zones bitmap blocks */
+	while (chunk_block < dmz->zone_nr_blocks) {
+
+		/* Get a valid region from the source zone */
+		ret = dmz_first_valid_block(dmz, from_zone, &chunk_block);
+		if (ret < 0)
+			return ret;
+
+		/* Are we done ? */
+		nr_blocks = ret;
+		if (!nr_blocks)
+			return 0;
+
+		ret = dmz_validate_blocks(dmz, to_zone, chunk_block, nr_blocks);
+		if (ret != 0)
+			return ret;
+
+		chunk_block += nr_blocks;
+
+	}
+
+	return 0;
+}
+
+/*
+ * Validate all the blocks in the range [block..block+nr_blocks-1].
+ */
+int dmz_validate_blocks(struct dmz_target *dmz, struct dm_zone *zone,
+			sector_t chunk_block, unsigned int nr_blocks)
+{
+	unsigned int count, bit, nr_bits;
+	struct dmz_mblock *mblk;
+	unsigned int n = 0;
+
+	dmz_dev_debug(dmz, "=> VALIDATE zone %u, block %llu, %u blocks\n",
+		      dmz_id(dmz, zone), (u64)chunk_block, nr_blocks);
+
+	WARN_ON(chunk_block + nr_blocks > dmz->zone_nr_blocks);
+
+	while (nr_blocks) {
+
+		/* Get bitmap block */
+		mblk = dmz_get_bitmap(dmz, zone, chunk_block);
+		if (IS_ERR(mblk))
+			return PTR_ERR(mblk);
+
+		/* Set bits */
+		bit = chunk_block & DMZ_BLOCK_MASK_BITS;
+		nr_bits = min(nr_blocks, DMZ_BLOCK_SIZE_BITS - bit);
+
+		count = dmz_set_bits((unsigned long *) mblk->data,
+				     bit, nr_bits);
+		if (count) {
+			dmz_dirty_mblock(dmz, mblk);
+			n += count;
+		}
+		dmz_release_mblock(dmz, mblk);
+
+		nr_blocks -= nr_bits;
+		chunk_block += nr_bits;
+
+	}
+
+	if (likely(zone->weight + n <= dmz->zone_nr_blocks)) {
+		zone->weight += n;
+	} else {
+		dmz_dev_warn(dmz, "Zone %u: weight %u should be <= %llu\n",
+			     dmz_id(dmz, zone), zone->weight,
+			     (u64)dmz->zone_nr_blocks - n);
+		zone->weight = dmz->zone_nr_blocks;
+	}
+
+	return 0;
+}
+
+/*
+ * Clear nr_bits bits in bitmap starting from bit.
+ * Return the number of bits cleared.
+ */
+static int dmz_clear_bits(unsigned long *bitmap, int bit, int nr_bits)
+{
+	unsigned long *addr;
+	int end = bit + nr_bits;
+	int n = 0;
+
+	while (bit < end) {
+
+		if (((bit & (BITS_PER_LONG - 1)) == 0) &&
+		    ((end - bit) >= BITS_PER_LONG)) {
+			/* Try to clear whole word at once */
+			addr = bitmap + BIT_WORD(bit);
+			if (*addr == ULONG_MAX) {
+				*addr = 0;
+				n += BITS_PER_LONG;
+				bit += BITS_PER_LONG;
+				continue;
+			}
+		}
+
+		if (test_and_clear_bit(bit, bitmap))
+			n++;
+		bit++;
+
+	}
+
+	return n;
+
+}
+
+/*
+ * Invalidate all the blocks in the range [block..block+nr_blocks-1].
+ */
+int dmz_invalidate_blocks(struct dmz_target *dmz, struct dm_zone *zone,
+			  sector_t chunk_block, unsigned int nr_blocks)
+{
+	unsigned int count, bit, nr_bits;
+	struct dmz_mblock *mblk;
+	unsigned int n = 0;
+
+	dmz_dev_debug(dmz, "=> INVALIDATE zone %u, block %llu, %u blocks\n",
+		      dmz_id(dmz, zone), (u64)chunk_block, nr_blocks);
+
+	WARN_ON(chunk_block + nr_blocks > dmz->zone_nr_blocks);
+
+	while (nr_blocks) {
+
+		/* Get bitmap block */
+		mblk = dmz_get_bitmap(dmz, zone, chunk_block);
+		if (IS_ERR(mblk))
+			return PTR_ERR(mblk);
+
+		/* Clear bits */
+		bit = chunk_block & DMZ_BLOCK_MASK_BITS;
+		nr_bits = min(nr_blocks, DMZ_BLOCK_SIZE_BITS - bit);
+
+		count = dmz_clear_bits((unsigned long *) mblk->data,
+				       bit, nr_bits);
+		if (count) {
+			dmz_dirty_mblock(dmz, mblk);
+			n += count;
+		}
+		dmz_release_mblock(dmz, mblk);
+
+		nr_blocks -= nr_bits;
+		chunk_block += nr_bits;
+
+	}
+
+	if (zone->weight >= n) {
+		zone->weight -= n;
+	} else {
+		dmz_dev_warn(dmz, "Zone %u: weight %u should be >= %u\n",
+			     dmz_id(dmz, zone), zone->weight, n);
+		zone->weight = 0;
+	}
+
+	return 0;
+}
+
+/*
+ * Get a block bit value.
+ */
+static int dmz_test_block(struct dmz_target *dmz, struct dm_zone *zone,
+			  sector_t chunk_block)
+{
+	struct dmz_mblock *mblk;
+	int ret;
+
+	WARN_ON(chunk_block >= dmz->zone_nr_blocks);
+
+	/* Get bitmap block */
+	mblk = dmz_get_bitmap(dmz, zone, chunk_block);
+	if (IS_ERR(mblk))
+		return PTR_ERR(mblk);
+
+	/* Get offset */
+	ret = test_bit(chunk_block & DMZ_BLOCK_MASK_BITS,
+		       (unsigned long *) mblk->data) != 0;
+
+	dmz_release_mblock(dmz, mblk);
+
+	return ret;
+}
+
+/*
+ * Return the number of blocks from chunk_block to the first block with a bit
+ * value specified by set. Search at most nr_blocks blocks from chunk_block.
+ */
+static int dmz_to_next_set_block(struct dmz_target *dmz, struct dm_zone *zone,
+				 sector_t chunk_block, unsigned int nr_blocks,
+				 int set)
+{
+	struct dmz_mblock *mblk;
+	unsigned int bit, set_bit, nr_bits;
+	unsigned long *bitmap;
+	int n = 0;
+
+	WARN_ON(chunk_block + nr_blocks > dmz->zone_nr_blocks);
+
+	while (nr_blocks) {
+
+		/* Get bitmap block */
+		mblk = dmz_get_bitmap(dmz, zone, chunk_block);
+		if (IS_ERR(mblk))
+			return PTR_ERR(mblk);
+
+		/* Get offset */
+		bitmap = (unsigned long *) mblk->data;
+		bit = chunk_block & DMZ_BLOCK_MASK_BITS;
+		nr_bits = min(nr_blocks, DMZ_BLOCK_SIZE_BITS - bit);
+		if (set)
+			set_bit = find_next_bit(bitmap,
+						DMZ_BLOCK_SIZE_BITS,
+						bit);
+		else
+			set_bit = find_next_zero_bit(bitmap,
+						     DMZ_BLOCK_SIZE_BITS,
+						     bit);
+		dmz_release_mblock(dmz, mblk);
+
+		n += set_bit - bit;
+		if (set_bit < DMZ_BLOCK_SIZE_BITS)
+			break;
+
+		nr_blocks -= nr_bits;
+		chunk_block += nr_bits;
+
+	}
+
+	return n;
+}
+
+/*
+ * Test if chunk_block is valid. If it is, the number of consecutive
+ * valid blocks from chunk_block will be returned.
+ */
+int dmz_block_valid(struct dmz_target *dmz, struct dm_zone *zone,
+		    sector_t chunk_block)
+{
+	int valid;
+
+	/* Test block */
+	valid = dmz_test_block(dmz, zone, chunk_block);
+	if (valid <= 0)
+		return valid;
+
+	/* The block is valid: get the number of valid blocks from block */
+	return dmz_to_next_set_block(dmz, zone, chunk_block,
+				     dmz->zone_nr_blocks - chunk_block,
+				     0);
+}
+
+/*
+ * Find the first valid block from @chunk_block in @zone.
+ * If such a block is found, its number is returned using
+ * @chunk_block and the total number of valid blocks from @chunk_block
+ * is returned.
+ */
+int dmz_first_valid_block(struct dmz_target *dmz, struct dm_zone *zone,
+			  sector_t *chunk_block)
+{
+	sector_t start_block = *chunk_block;
+	int ret;
+
+	ret = dmz_to_next_set_block(dmz, zone, start_block,
+				    dmz->zone_nr_blocks - start_block, 1);
+	if (ret < 0)
+		return ret;
+
+	start_block += ret;
+	*chunk_block = start_block;
+
+	return dmz_to_next_set_block(dmz, zone, start_block,
+				     dmz->zone_nr_blocks - start_block, 0);
+}
+
+/*
+ * Count the number of bits set starting from bit up to bit + nr_bits - 1.
+ */
+static int dmz_count_bits(void *bitmap, int bit, int nr_bits)
+{
+	unsigned long *addr;
+	int end = bit + nr_bits;
+	int n = 0;
+
+	while (bit < end) {
+
+		if (((bit & (BITS_PER_LONG - 1)) == 0) &&
+		    ((end - bit) >= BITS_PER_LONG)) {
+			addr = (unsigned long *)bitmap + BIT_WORD(bit);
+			if (*addr == ULONG_MAX) {
+				n += BITS_PER_LONG;
+				bit += BITS_PER_LONG;
+				continue;
+			}
+		}
+
+		if (test_bit(bit, bitmap))
+			n++;
+		bit++;
+
+	}
+
+	return n;
+
+}
+
+/*
+ * Get a zone weight.
+ */
+static void dmz_get_zone_weight(struct dmz_target *dmz, struct dm_zone *zone)
+{
+	struct dmz_mblock *mblk;
+	sector_t chunk_block = 0;
+	unsigned int bit, nr_bits;
+	unsigned int nr_blocks = dmz->zone_nr_blocks;
+	void *bitmap;
+	int n = 0;
+
+	while (nr_blocks) {
+
+		/* Get bitmap block */
+		mblk = dmz_get_bitmap(dmz, zone, chunk_block);
+		if (IS_ERR(mblk)) {
+			n = 0;
+			break;
+		}
+
+		/* Count bits in this block */
+		bitmap = mblk->data;
+		bit = chunk_block & DMZ_BLOCK_MASK_BITS;
+		nr_bits = min(nr_blocks, DMZ_BLOCK_SIZE_BITS - bit);
+		n += dmz_count_bits(bitmap, bit, nr_bits);
+
+		dmz_release_mblock(dmz, mblk);
+
+		nr_blocks -= nr_bits;
+		chunk_block += nr_bits;
+
+	}
+
+	zone->weight = n;
+}
+
+/*
+ * Initialize the target metadata.
+ */
+int dmz_init_meta(struct dmz_target *dmz)
+{
+	unsigned int i, zid;
+	struct dm_zone *zone;
+	int ret;
+
+	/* Initialize zone descriptors */
+	ret = dmz_init_zones(dmz);
+	if (ret)
+		return ret;
+
+	/* Get super block */
+	ret = dmz_load_sb(dmz);
+	if (ret)
+		goto out;
+
+	/* Set metadata zones starting from sb_zone */
+	zid = dmz_id(dmz, dmz->sb_zone);
+	for (i = 0; i < dmz->nr_meta_zones << 1; i++) {
+		zone = dmz_get(dmz, zid + i);
+		if (!dmz_is_rnd(zone))
+			return -ENXIO;
+		set_bit(DMZ_META, &zone->flags);
+	}
+
+	/* Load mapping table */
+	ret = dmz_load_mapping(dmz);
+	if (ret)
+		goto out;
+
+	/*
+	 * Cache size boundaries: allow at least 2 super blocks, the chunk map
+	 * blocks and enough blocks to be able to cache the bitmap blocks of
+	 * up to 16 zones when idle (min_nr_mblks). Otherwise, if busy, allow
+	 * the cache to add 512 more metadata blocks.
+	 */
+	dmz->min_nr_mblks = 2 + dmz->nr_map_blocks +
+		dmz->zone_nr_bitmap_blocks * 16;
+	dmz->max_nr_mblks = dmz->min_nr_mblks + 512;
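+	/*
+	 * Example: with 2 bitmap blocks per zone (4 KB blocks, 256 MB zones
+	 * assumed), min_nr_mblks is nr_map_blocks + 34.
+	 */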
+	dmz->mblk_shrinker.count_objects = dmz_mblock_shrinker_count;
+	dmz->mblk_shrinker.scan_objects = dmz_mblock_shrinker_scan;
+	dmz->mblk_shrinker.seeks = DEFAULT_SEEKS;
+
+	dmz_dev_info(dmz, "Host-%s zoned block device\n",
+		     bdev_zoned_model(dmz->zbd) == BLK_ZONED_HA ?
+		     "aware" : "managed");
+	dmz_dev_info(dmz, "  %llu 512-byte logical sectors\n",
+		     (u64)dmz->nr_zones
+		     << dmz->zone_nr_sectors_shift);
+	dmz_dev_info(dmz, "  %u zones of %llu 512-byte logical sectors\n",
+		     dmz->nr_zones,
+		     (u64)dmz->zone_nr_sectors);
+	dmz_dev_info(dmz, "  %u metadata zones\n",
+		     dmz->nr_meta_zones * 2);
+	dmz_dev_info(dmz, "  %u data zones for %u chunks\n",
+		     dmz->nr_data_zones,
+		     dmz->nr_chunks);
+	dmz_dev_info(dmz, "    %u random zones (%u unmapped)\n",
+		     dmz->dz_nr_rnd,
+		     atomic_read(&dmz->dz_unmap_nr_rnd));
+	dmz_dev_info(dmz, "    %u sequential zones (%u unmapped)\n",
+		     dmz->dz_nr_seq,
+		     atomic_read(&dmz->dz_unmap_nr_seq));
+	dmz_dev_info(dmz, "  %u reserved sequential data zones\n",
+		     dmz->nr_reserved_seq);
+
+	dmz_dev_debug(dmz, "Format:\n");
+	dmz_dev_debug(dmz, "%u metadata blocks per set (%u max cache)\n",
+		      dmz->nr_meta_blocks,
+		      dmz->max_nr_mblks);
+	dmz_dev_debug(dmz, "  %u data zone mapping blocks\n",
+		      dmz->nr_map_blocks);
+	dmz_dev_debug(dmz, "  %u bitmap blocks\n",
+		      dmz->nr_bitmap_blocks);
+
+out:
+	if (ret)
+		dmz_cleanup_meta(dmz);
+
+	return ret;
+}
+
+/*
+ * Cleanup the target metadata resources.
+ */
+void dmz_cleanup_meta(struct dmz_target *dmz)
+{
+	struct rb_root *root = &dmz->mblk_rbtree;
+	struct dmz_mblock *mblk, *next;
+	int i;
+
+	/* Release zone mapping resources */
+	if (dmz->dz_map_mblk) {
+		for (i = 0; i < dmz->nr_map_blocks; i++)
+			dmz_release_mblock(dmz, dmz->dz_map_mblk[i]);
+		kfree(dmz->dz_map_mblk);
+		dmz->dz_map_mblk = NULL;
+	}
+
+	/* Release super blocks */
+	for (i = 0; i < 2; i++) {
+		if (dmz->sb[i].mblk) {
+			dmz_free_mblock(dmz, dmz->sb[i].mblk);
+			dmz->sb[i].mblk = NULL;
+		}
+	}
+
+	/* Free cached blocks */
+	while (!list_empty(&dmz->mblk_dirty_list)) {
+		mblk = list_first_entry(&dmz->mblk_dirty_list,
+					struct dmz_mblock, link);
+		dmz_dev_warn(dmz, "mblock %llu still in dirty list (ref %u)\n",
+			     (u64)mblk->no,
+			     atomic_read(&mblk->ref));
+		list_del_init(&mblk->link);
+		rb_erase(&mblk->node, &dmz->mblk_rbtree);
+		dmz_free_mblock(dmz, mblk);
+	}
+
+	while (!list_empty(&dmz->mblk_lru_list)) {
+		mblk = list_first_entry(&dmz->mblk_lru_list,
+					struct dmz_mblock, link);
+		list_del_init(&mblk->link);
+		rb_erase(&mblk->node, &dmz->mblk_rbtree);
+		dmz_free_mblock(dmz, mblk);
+	}
+
+	/* Sanity checks: the mblock rbtree should now be empty */
+	rbtree_postorder_for_each_entry_safe(mblk, next, root, node) {
+		dmz_dev_warn(dmz, "mblock %llu ref %u still in rbtree\n",
+			     (u64)mblk->no,
+			     atomic_read(&mblk->ref));
+		atomic_set(&mblk->ref, 0);
+		dmz_free_mblock(dmz, mblk);
+	}
+
+	/* Free the zone descriptors */
+	dmz_drop_zones(dmz);
+}
+
+/*
+ * Check metadata on resume.
+ */
+int dmz_resume_meta(struct dmz_target *dmz)
+{
+	return dmz_check_zones(dmz);
+}
+
diff --git a/drivers/md/dm-zoned-reclaim.c b/drivers/md/dm-zoned-reclaim.c
new file mode 100644
index 0000000..aa76692
--- /dev/null
+++ b/drivers/md/dm-zoned-reclaim.c
@@ -0,0 +1,535 @@
+/*
+ * Drive-managed zoned block device target
+ * Copyright (C) 2017 Western Digital Corporation or its affiliates.
+ *
+ * Written by: Damien Le Moal <damien.lemoal@wdc.com>
+ *
+ * This software is distributed under the terms of the GNU General Public
+ * License version 2, or any later version, "as is," without technical
+ * support, and WITHOUT ANY WARRANTY, without even the implied warranty
+ * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#include <linux/module.h>
+
+#include "dm-zoned.h"
+
+/*
+ * Align a sequential zone write pointer to chunk_block.
+ */
+static int dmz_reclaim_align_wp(struct dmz_target *dmz, struct dm_zone *zone,
+				sector_t chunk_block)
+{
+	sector_t wp_block = zone->wp_block;
+	unsigned int nr_blocks;
+	int ret;
+
+	if (wp_block > chunk_block)
+		return -EIO;
+
+	/*
+	 * Zero out the space between the write
+	 * pointer and the requested position.
+	 */
+	nr_blocks = chunk_block - zone->wp_block;
+	if (!nr_blocks)
+		return 0;
+
+	ret = blkdev_issue_zeroout(dmz->zbd,
+			dmz_start_sect(dmz, zone) + dmz_blk2sect(wp_block),
+			dmz_blk2sect(nr_blocks),
+			GFP_NOFS, false);
+	if (ret) {
+		dmz_dev_err(dmz,
+			    "Align zone %u wp %llu to +%u blocks failed %d\n",
+			    dmz_id(dmz, zone),
+			    (unsigned long long)wp_block,
+			    nr_blocks,
+			    ret);
+		return ret;
+	}
+
+	zone->wp_block += nr_blocks;
+
+	return 0;
+}
+
+/*
+ * dm_kcopyd_copy notification.
+ */
+static void dmz_reclaim_copy_end(int read_err, unsigned long write_err,
+				 void *context)
+{
+	struct dmz_target *dmz = context;
+
+	if (read_err || write_err)
+		dmz->reclaim_err = -EIO;
+	else
+		dmz->reclaim_err = 0;
+
+	clear_bit_unlock(DMZ_RECLAIM_COPY, &dmz->flags);
+	smp_mb__after_atomic();
+	wake_up_bit(&dmz->flags, DMZ_RECLAIM_COPY);
+}
+
+/*
+ * Copy valid blocks of src_zone into dst_zone.
+ */
+static int dmz_reclaim_copy(struct dmz_target *dmz,
+			    struct dm_zone *src_zone, struct dm_zone *dst_zone)
+{
+	struct dm_io_region src, dst;
+	sector_t block = 0, end_block;
+	sector_t nr_blocks;
+	sector_t src_zone_block;
+	sector_t dst_zone_block;
+	unsigned long flags = 0;
+	int ret;
+
+	if (dmz_is_seq(src_zone))
+		end_block = src_zone->wp_block;
+	else
+		end_block = dmz->zone_nr_blocks;
+	src_zone_block = dmz_start_block(dmz, src_zone);
+	dst_zone_block = dmz_start_block(dmz, dst_zone);
+
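+	/* Writes to a sequential destination zone must be issued in order */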
+	if (dmz_is_seq(dst_zone))
+		set_bit(DM_KCOPYD_WRITE_SEQ, &flags);
+
+	while (block < end_block) {
+
+		/* Get a valid region from the source zone */
+		ret = dmz_first_valid_block(dmz, src_zone, &block);
+		if (ret < 0)
+			return ret;
+
+		/* Are we done ? */
+		nr_blocks = ret;
+		if (!nr_blocks)
+			return 0;
+
+		/*
+		 * If we are writing to a sequential zone, writes must be
+		 * issued in sequence, so zero out any hole between the
+		 * previous write and this one.
+		 */
+		if (dmz_is_seq(dst_zone)) {
+			ret = dmz_reclaim_align_wp(dmz, dst_zone, block);
+			if (ret)
+				return ret;
+		}
+
+		src.bdev = dmz->zbd;
+		src.sector = dmz_blk2sect(src_zone_block + block);
+		src.count = dmz_blk2sect(nr_blocks);
+
+		dst.bdev = dmz->zbd;
+		dst.sector = dmz_blk2sect(dst_zone_block + block);
+		dst.count = src.count;
+
+		dmz_dev_debug(dmz,
+			      "Reclaim: Copy %s zone %u, block %llu+%llu to %s zone %u\n",
+			      dmz_is_rnd(src_zone) ? "RND" : "SEQ",
+			      dmz_id(dmz, src_zone),
+			      (unsigned long long)block,
+			      (unsigned long long)nr_blocks,
+			      dmz_is_rnd(dst_zone) ? "RND" : "SEQ",
+			      dmz_id(dmz, dst_zone));
+
+		/* Copy the valid region */
+		set_bit(DMZ_RECLAIM_COPY, &dmz->flags);
+		ret = dm_kcopyd_copy(dmz->reclaim_kc, &src, 1, &dst, flags,
+				     dmz_reclaim_copy_end, dmz);
+		if (ret != 0)
+			return ret;
+
+		/* Wait for copy to complete */
+		wait_on_bit_io(&dmz->flags, DMZ_RECLAIM_COPY,
+			       TASK_UNINTERRUPTIBLE);
+		if (dmz->reclaim_err)
+			return dmz->reclaim_err;
+
+		if (dmz_is_seq(dst_zone))
+			dst_zone->wp_block += nr_blocks;
+
+		block += nr_blocks;
+
+	}
+
+	return 0;
+}
+
+/*
+ * Clear a zone reclaim flag.
+ */
+static inline void dmz_reclaim_put_zone(struct dmz_target *dmz,
+					struct dm_zone *zone)
+{
+	WARN_ON(dmz_is_active(zone));
+	WARN_ON(!dmz_in_reclaim(zone));
+
+	clear_bit_unlock(DMZ_RECLAIM, &zone->flags);
+	smp_mb__after_atomic();
+	wake_up_bit(&zone->flags, DMZ_RECLAIM);
+}
+
+/*
+ * Move valid blocks of dzone's buffer zone into dzone (after its write pointer)
+ * and free the buffer zone.
+ */
+static int dmz_reclaim_buf(struct dmz_target *dmz, struct dm_zone *dzone)
+{
+	struct dm_zone *bzone = dzone->bzone;
+	sector_t chunk_block = dzone->wp_block;
+	int ret;
+
+	dmz_dev_debug(dmz,
+		      "Chunk %u, move buf zone %u (weight %u) to data zone %u (weight %u)\n",
+		      dzone->chunk, dmz_id(dmz, bzone), dmz_weight(bzone),
+		      dmz_id(dmz, dzone), dmz_weight(dzone));
+
+	/* Flush the buffer zone into the data zone */
+	ret = dmz_reclaim_copy(dmz, bzone, dzone);
+	if (ret < 0)
+		return ret;
+
+	down_read(&dmz->mblk_sem);
+
+	/* Validate copied blocks */
+	ret = dmz_valid_merge(dmz, bzone, dzone, chunk_block);
+	if (ret == 0) {
+		/* Free the buffer zone */
+		dmz_invalidate_zone(dmz, bzone);
+		dmz_lock_map(dmz);
+		dmz_unmap_zone(dmz, bzone);
+		dmz_reclaim_put_zone(dmz, dzone);
+		dmz_free_zone(dmz, bzone);
+		dmz_unlock_map(dmz);
+	}
+
+	up_read(&dmz->mblk_sem);
+
+	return ret;
+}
+
+/*
+ * Merge valid blocks of dzone into its buffer zone and free dzone.
+ */
+static int dmz_reclaim_seq_data(struct dmz_target *dmz, struct dm_zone *dzone)
+{
+	unsigned int chunk = dzone->chunk;
+	struct dm_zone *bzone = dzone->bzone;
+	int ret = 0;
+
+	dmz_dev_debug(dmz,
+		      "Chunk %u, move data zone %u (weight %u) to buf zone %u (weight %u)\n",
+		      chunk, dmz_id(dmz, dzone), dmz_weight(dzone),
+		      dmz_id(dmz, bzone), dmz_weight(bzone));
+
+	/* Flush data zone into the buffer zone */
+	ret = dmz_reclaim_copy(dmz, dzone, bzone);
+	if (ret < 0)
+		return ret;
+
+	down_read(&dmz->mblk_sem);
+
+	/* Validate copied blocks */
+	ret = dmz_valid_merge(dmz, dzone, bzone, 0);
+	if (ret == 0) {
+		/*
+		 * Free the data zone and remap the chunk to
+		 * the buffer zone.
+		 */
+		dmz_invalidate_zone(dmz, dzone);
+		dmz_lock_map(dmz);
+		dmz_unmap_zone(dmz, bzone);
+		dmz_unmap_zone(dmz, dzone);
+		dmz_reclaim_put_zone(dmz, dzone);
+		dmz_free_zone(dmz, dzone);
+		dmz_map_zone(dmz, bzone, chunk);
+		dmz_unlock_map(dmz);
+	}
+
+	up_read(&dmz->mblk_sem);
+
+	return ret;
+}
+
+/*
+ * Move valid blocks of the random data zone dzone into a free sequential zone.
+ * Once blocks are moved, remap the chunk to the sequential zone.
+ */
+static int dmz_reclaim_rnd_data(struct dmz_target *dmz, struct dm_zone *dzone)
+{
+	unsigned int chunk = dzone->chunk;
+	struct dm_zone *szone = NULL;
+	int ret;
+
+	/* Get a free sequential zone */
+	dmz_lock_map(dmz);
+	szone = dmz_alloc_zone(dmz, DMZ_ALLOC_RECLAIM);
+	dmz_unlock_map(dmz);
+	if (!szone)
+		return -ENOSPC;
+
+	dmz_dev_debug(dmz,
+		      "Chunk %u, move rnd zone %u (weight %u) to seq zone %u\n",
+		      chunk, dmz_id(dmz, dzone), dmz_weight(dzone),
+		      dmz_id(dmz, szone));
+
+	/* Flush the random data zone into the sequential zone */
+	ret = dmz_reclaim_copy(dmz, dzone, szone);
+
+	down_read(&dmz->mblk_sem);
+
+	if (ret == 0)
+		/* Validate copied blocks */
+		ret = dmz_valid_copy(dmz, dzone, szone);
+
+	if (ret) {
+		/* Free the sequential zone */
+		dmz_lock_map(dmz);
+		dmz_free_zone(dmz, szone);
+		dmz_unlock_map(dmz);
+	} else {
+		/* Free the data zone and remap the chunk */
+		dmz_invalidate_zone(dmz, dzone);
+		dmz_lock_map(dmz);
+		dmz_unmap_zone(dmz, dzone);
+		dmz_reclaim_put_zone(dmz, dzone);
+		dmz_free_zone(dmz, dzone);
+		dmz_map_zone(dmz, szone, chunk);
+		dmz_unlock_map(dmz);
+	}
+
+	up_read(&dmz->mblk_sem);
+
+	return ret;
+}
+
+/*
+ * Reclaim an empty zone.
+ */
+static void dmz_reclaim_empty(struct dmz_target *dmz, struct dm_zone *dzone)
+{
+	down_read(&dmz->mblk_sem);
+	dmz_lock_map(dmz);
+	dmz_unmap_zone(dmz, dzone);
+	dmz_reclaim_put_zone(dmz, dzone);
+	dmz_free_zone(dmz, dzone);
+	dmz_unlock_map(dmz);
+	up_read(&dmz->mblk_sem);
+}
+
+/*
+ * Lock a zone for reclaim. Returns 0 if the zone cannot be locked or if it is
+ * already locked and 1 otherwise.
+ */
+static inline int dmz_reclaim_lock_zone(struct dmz_target *dmz,
+					struct dm_zone *zone)
+{
+	/* Active zones cannot be reclaimed */
+	if (dmz_is_active(zone))
+		return 0;
+
+	return !test_and_set_bit(DMZ_RECLAIM, &zone->flags);
+}
+
+/*
+ * Select a random zone for reclaim.
+ */
+static struct dm_zone *dmz_reclaim_get_rnd_zone(struct dmz_target *dmz)
+{
+	struct dm_zone *dzone = NULL;
+	struct dm_zone *zone;
+
+	if (list_empty(&dmz->dz_map_rnd_list))
+		return NULL;
+
+	list_for_each_entry(zone, &dmz->dz_map_rnd_list, link) {
+		if (dmz_is_buf(zone))
+			dzone = zone->bzone;
+		else
+			dzone = zone;
+		if (dmz_reclaim_lock_zone(dmz, dzone))
+			return dzone;
+	}
+
+	return NULL;
+}
+
+/*
+ * Select a buffered sequential zone for reclaim.
+ */
+static struct dm_zone *dmz_reclaim_get_seq_zone(struct dmz_target *dmz)
+{
+	struct dm_zone *zone;
+
+	if (list_empty(&dmz->dz_map_seq_list))
+		return NULL;
+
+	list_for_each_entry(zone, &dmz->dz_map_seq_list, link) {
+		if (!zone->bzone)
+			continue;
+		if (dmz_reclaim_lock_zone(dmz, zone))
+			return zone;
+	}
+
+	return NULL;
+}
+
+/*
+ * Select a zone for reclaim.
+ */
+static struct dm_zone *dmz_reclaim_get_zone(struct dmz_target *dmz)
+{
+	struct dm_zone *zone = NULL;
+
+	/*
+	 * Search for a zone candidate to reclaim: 2 cases are possible.
+	 * (1) There are no free sequential zones. Then a random data zone
+	 *     cannot be reclaimed. So choose a sequential zone to reclaim so
+	 *     that afterward a random zone can be reclaimed.
+	 * (2) At least one free sequential zone is available, then choose
+	 *     the oldest random zone (data or buffer) that can be locked.
+	 */
+	dmz_lock_map(dmz);
+	if (list_empty(&dmz->reclaim_seq_zones_list))
+		zone = dmz_reclaim_get_seq_zone(dmz);
+	else
+		zone = dmz_reclaim_get_rnd_zone(dmz);
+	dmz_unlock_map(dmz);
+
+	return zone;
+}
+
+/*
+ * Find a reclaim candidate zone and reclaim it.
+ */
+static void dmz_reclaim(struct dmz_target *dmz)
+{
+	struct dm_zone *dzone;
+	struct dm_zone *rzone;
+	unsigned long start;
+	int ret;
+
+	/* Get a data zone */
+	dzone = dmz_reclaim_get_zone(dmz);
+	if (!dzone)
+		return;
+
+	start = jiffies;
+
+	if (dmz_is_rnd(dzone)) {
+
+		rzone = dzone;
+		if (!dmz_weight(dzone)) {
+			/* Empty zone */
+			dmz_reclaim_empty(dmz, dzone);
+			ret = 0;
+		} else {
+			/*
+			 * Reclaim the random data zone by moving its
+			 * valid data blocks to a free sequential zone.
+			 */
+			ret = dmz_reclaim_rnd_data(dmz, dzone);
+		}
+
+	} else {
+
+		struct dm_zone *bzone = dzone->bzone;
+		sector_t chunk_block = 0;
+
+		ret = dmz_first_valid_block(dmz, bzone, &chunk_block);
+		if (ret < 0)
+			goto out;
+
+		if (chunk_block >= dzone->wp_block) {
+			/*
+			 * Valid blocks in the buffer zone are after
+			 * the data zone write pointer: copy them there.
+			 */
+			ret = dmz_reclaim_buf(dmz, dzone);
+			rzone = bzone;
+		} else {
+			/*
+			 * Reclaim the data zone by merging it into the
+			 * buffer zone so that the buffer zone itself can
+			 * be later reclaimed.
+			 */
+			ret = dmz_reclaim_seq_data(dmz, dzone);
+			rzone = dzone;
+		}
+
+	}
+
+out:
+	if (ret) {
+		dmz_reclaim_put_zone(dmz, dzone);
+		return;
+	}
+
+	dmz_dev_debug(dmz, "Reclaimed zone %u in %u ms\n",
+		      dmz_id(dmz, rzone), jiffies_to_msecs(jiffies - start));
+
+	dmz_trigger_flush(dmz);
+}
+
+/*
+ * Zone reclaim work.
+ */
+void dmz_reclaim_work(struct work_struct *work)
+{
+	struct dmz_target *dmz =
+		container_of(work, struct dmz_target, reclaim_work.work);
+	unsigned long next_reclaim = DMZ_RECLAIM_PERIOD;
+	unsigned int unmap_nr_rnd = atomic_read(&dmz->dz_unmap_nr_rnd);
+	unsigned int throttle, unmap_perc;
+
+	/* If there are still plenty of free random zones, do not reclaim */
+	unmap_perc = unmap_nr_rnd * 100 / dmz->dz_nr_rnd;
+	if (unmap_perc >= DMZ_RECLAIM_HIGH_FREE_RND)
+		goto out;
+
+	/*
+	 * If we are not idle and still have unmapped random zones,
+	 * do not reclaim.
+	 */
+	if (!dmz_idle(dmz) && unmap_perc > DMZ_RECLAIM_LOW_FREE_RND)
+		goto out;
+
+	/*
+	 * We need to start reclaiming random zones: set up zone copy
+	 * throttling to go fast if we are very low on free random zones, and
+	 * slower if there are still some free random zones, so that the
+	 * impact on the user workload is kept as low as possible.
+	 */
+	if (dmz_idle(dmz) ||
+	    unmap_nr_rnd < atomic_read(&dmz->nr_active_chunks))
+		/* Idle or very low: go fast */
+		throttle = 100;
+	else
+		/* Busy but we still have some random zone: go slower */
+		throttle = min(75U, 100U - unmap_perc / 2);
+	dmz->reclaim_throttle.throttle = throttle;
+
+	dmz_dev_debug(dmz,
+		      "Reclaim (%u): %s (%u BIOs, %u active chunks), %u%% free rnd zones (%u/%u)\n",
+		      dmz->reclaim_throttle.throttle,
+		      (dmz_idle(dmz) ? "Idle" : "Busy"),
+		      atomic_read(&dmz->bio_count),
+		      atomic_read(&dmz->nr_active_chunks),
+		      unmap_perc,
+		      unmap_nr_rnd, dmz->dz_nr_rnd);
+
+	dmz_reclaim(dmz);
+
+	if ((dmz_should_reclaim(dmz)
+	     && atomic_read(&dmz->nr_reclaim_seq_zones)))
+		/* Run again immediately */
+		next_reclaim = 0;
+
+out:
+	dmz_schedule_reclaim(dmz, next_reclaim);
+}
+
diff --git a/drivers/md/dm-zoned.h b/drivers/md/dm-zoned.h
new file mode 100644
index 0000000..bdcd538
--- /dev/null
+++ b/drivers/md/dm-zoned.h
@@ -0,0 +1,530 @@
+/*
+ * Drive-managed zoned block device target
+ * Copyright (C) 2017 Western Digital Corporation or its affiliates.
+ *
+ * This software is distributed under the terms of the GNU General Public
+ * License version 2, or any later version, "as is," without technical
+ * support, and WITHOUT ANY WARRANTY, without even the implied warranty
+ * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+#include <linux/types.h>
+#include <linux/blkdev.h>
+#include <linux/device-mapper.h>
+#include <linux/dm-kcopyd.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+#include <linux/workqueue.h>
+#include <linux/rwsem.h>
+#include <linux/rbtree.h>
+#include <linux/radix-tree.h>
+#include <linux/shrinker.h>
+
+#ifndef __DM_ZONED_H__
+#define __DM_ZONED_H__
+
+/*
+ * Metadata version.
+ */
+#define DMZ_META_VER	1
+
+/*
+ * On-disk super block magic.
+ */
+#define DMZ_MAGIC	((((unsigned int)('D')) << 24) | \
+			 (((unsigned int)('Z')) << 16) | \
+			 (((unsigned int)('B')) <<  8) | \
+			 ((unsigned int)('D')))
+
+/*
+ * On-disk super block.
+ * The super block uses only 512 B but occupies a full 4KB block on disk.
+ * It is followed on disk by the chunk-to-zone mapping table and the bitmap
+ * blocks indicating zone block validity.
+ * The overall resulting metadata format is:
+ *    (1) Super block (1 block)
+ *    (2) Chunk mapping table (nr_map_blocks)
+ *    (3) Bitmap blocks (nr_bitmap_blocks)
+ * All metadata blocks are stored in conventional zones, starting from
+ * the first conventional zone found on disk.
+ */
+struct dmz_super {
+
+	/* Magic number */
+	__le32		magic;			/*   4 */
+
+	/* Metadata version number */
+	__le32		version;		/*   8 */
+
+	/* Generation number */
+	__le64		gen;			/*  16 */
+
+	/* This block number */
+	__le64		sb_block;		/*  24 */
+
+	/* The number of metadata blocks, including this super block */
+	__le32		nr_meta_blocks;		/*  28 */
+
+	/* The number of sequential zones reserved for reclaim */
+	__le32		nr_reserved_seq;	/*  32 */
+
+	/* The number of entries in the mapping table */
+	__le32		nr_chunks;		/*  36 */
+
+	/* The number of blocks used for the chunk mapping table */
+	__le32		nr_map_blocks;		/*  40 */
+
+	/* The number of blocks used for the block bitmaps */
+	__le32		nr_bitmap_blocks;	/*  44 */
+
+	/* Checksum */
+	__le32		crc;			/*  48 */
+
+	/* Padding to full 512B sector */
+	u8		reserved[464];		/* 512 */
+
+};
+
+/*
+ * Chunk mapping entry: entries are indexed by chunk number
+ * and give the zone ID (dzone_id) mapping the chunk on disk.
+ * This zone may be sequential or random. If it is a sequential
+ * zone, a second zone (bzone_id) used as a write buffer may
+ * also be specified. This second zone will always be a randomly
+ * writeable zone.
+ */
+struct dmz_map {
+	__le32			dzone_id;
+	__le32			bzone_id;
+};
+
+/*
+ * dm-zoned creates block devices with 4KB blocks, always.
+ */
+#define DMZ_BLOCK_SHIFT		12
+#define DMZ_BLOCK_SIZE		(1 << DMZ_BLOCK_SHIFT)
+#define DMZ_BLOCK_MASK		(DMZ_BLOCK_SIZE - 1)
+
+#define DMZ_BLOCK_SHIFT_BITS	(DMZ_BLOCK_SHIFT + 3)
+#define DMZ_BLOCK_SIZE_BITS	(1 << DMZ_BLOCK_SHIFT_BITS)
+#define DMZ_BLOCK_MASK_BITS	(DMZ_BLOCK_SIZE_BITS - 1)
+
+#define DMZ_BLOCK_SECTORS_SHIFT	(DMZ_BLOCK_SHIFT - SECTOR_SHIFT)
+#define DMZ_BLOCK_SECTORS	(DMZ_BLOCK_SIZE >> SECTOR_SHIFT)
+#define DMZ_BLOCK_SECTORS_MASK	(DMZ_BLOCK_SECTORS - 1)
+
+/*
+ * Chunk mapping table metadata: 512 8-byte entries per 4KB block.
+ */
+#define DMZ_MAP_ENTRIES		(DMZ_BLOCK_SIZE / sizeof(struct dmz_map))
+#define DMZ_MAP_ENTRIES_SHIFT	(ilog2(DMZ_MAP_ENTRIES))
+#define DMZ_MAP_ENTRIES_MASK	(DMZ_MAP_ENTRIES - 1)
+#define DMZ_MAP_UNMAPPED	UINT_MAX
+
+/*
+ * Block <-> 512B sector conversion.
+ */
+#define dmz_blk2sect(b)		((b) << DMZ_BLOCK_SECTORS_SHIFT)
+#define dmz_sect2blk(s)		((s) >> DMZ_BLOCK_SECTORS_SHIFT)
+
+#define DMZ_MIN_BIOS		8192
+
+/*
+ * The size of a zone report in number of zones.
+ * This results in 4096*64B=256KB report zones commands.
+ */
+#define DMZ_REPORT_NR_ZONES	4096
+
+/*
+ * Zone flags.
+ */
+enum {
+
+	/* Zone write type */
+	DMZ_RND,
+	DMZ_SEQ,
+
+	/* Zone critical condition */
+	DMZ_OFFLINE,
+	DMZ_READ_ONLY,
+
+	/* How the zone is being used */
+	DMZ_META,
+	DMZ_DATA,
+	DMZ_BUF,
+
+	/* Zone internal state */
+	DMZ_ACTIVE,
+	DMZ_RECLAIM,
+	DMZ_SEQ_WRITE_ERR,
+
+};
+
+/*
+ * Zone descriptor.
+ */
+struct dm_zone {
+
+	/* For listing the zone depending on its state */
+	struct list_head	link;
+
+	/* Zone type and state */
+	unsigned long		flags;
+
+	/* Zone activation reference count */
+	atomic_t		refcount;
+
+	/* Zone write pointer block (relative to the zone start block) */
+	unsigned int		wp_block;
+
+	/* Zone weight (number of valid blocks in the zone) */
+	unsigned int		weight;
+
+	/* The chunk that the zone maps */
+	unsigned int		chunk;
+
+	/*
+	 * For a sequential data zone, pointer to the random zone
+	 * used as a buffer for processing unaligned writes.
+	 * For a buffer zone, this points back to the data zone.
+	 */
+	struct dm_zone		*bzone;
+
+};
+
+/*
+ * Metadata block descriptor (for cached metadata blocks).
+ */
+struct dmz_mblock {
+
+	struct rb_node		node;
+	struct list_head	link;
+	sector_t		no;
+	atomic_t		ref;
+	unsigned long		state;
+	struct page		*page;
+	void			*data;
+
+};
+
+/*
+ * Super block information (one per metadata set).
+ */
+struct dmz_sb {
+	sector_t		block;
+	struct dmz_mblock	*mblk;
+	struct dmz_super	*sb;
+};
+
+/*
+ * Metadata block state flags.
+ */
+enum {
+	DMZ_META_DIRTY,
+	DMZ_META_READING,
+	DMZ_META_WRITING,
+	DMZ_META_ERROR,
+};
+
+/*
+ * Target flags.
+ */
+enum {
+	DMZ_RECLAIM_COPY,
+	DMZ_SUSPENDED,
+};
+
+/*
+ * Target descriptor.
+ */
+struct dmz_target {
+
+	struct dm_dev		*ddev;
+
+	/* Zoned block device information */
+	char			zbd_name[BDEVNAME_SIZE];
+	struct block_device	*zbd;
+	sector_t		zbd_capacity;
+	struct request_queue	*zbdq;
+	unsigned long		flags;
+
+	unsigned int		nr_zones;
+	unsigned int		nr_useable_zones;
+	unsigned int		nr_meta_blocks;
+	unsigned int		nr_meta_zones;
+	unsigned int		nr_data_zones;
+	unsigned int		nr_rnd_zones;
+	unsigned int		nr_reserved_seq;
+	unsigned int		nr_chunks;
+
+	sector_t		zone_nr_sectors;
+	unsigned int		zone_nr_sectors_shift;
+
+	sector_t		zone_nr_blocks;
+	sector_t		zone_nr_blocks_shift;
+
+	sector_t		zone_bitmap_size;
+	unsigned int		zone_nr_bitmap_blocks;
+
+	unsigned int		nr_bitmap_blocks;
+	unsigned int		nr_map_blocks;
+
+	/* Zone information array */
+	struct dm_zone		*zones;
+
+	/* For metadata handling */
+	struct dm_zone		*sb_zone;
+	struct dmz_sb		sb[2];
+	unsigned int		mblk_primary;
+	u64			sb_gen;
+	unsigned int		min_nr_mblks;
+	unsigned int		max_nr_mblks;
+	atomic_t		nr_mblks;
+	struct rw_semaphore	mblk_sem;
+	spinlock_t		mblk_lock;
+	struct rb_root		mblk_rbtree;
+	struct list_head	mblk_lru_list;
+	struct list_head	mblk_dirty_list;
+	struct shrinker		mblk_shrinker;
+
+	/* Zone allocation management */
+	struct mutex		map_lock;
+	struct dmz_mblock	**dz_map_mblk;
+	unsigned int		dz_nr_rnd;
+	atomic_t		dz_unmap_nr_rnd;
+	struct list_head	dz_unmap_rnd_list;
+	struct list_head	dz_map_rnd_list;
+
+	unsigned int		dz_nr_seq;
+	atomic_t		dz_unmap_nr_seq;
+	struct list_head	dz_unmap_seq_list;
+	struct list_head	dz_map_seq_list;
+
+	wait_queue_head_t	dz_free_wq;
+
+	/* For chunk work */
+	struct mutex		chunk_lock;
+	struct radix_tree_root	chunk_rxtree;
+	struct workqueue_struct *chunk_wq;
+	atomic_t		nr_active_chunks;
+
+	/* For chunk BIOs to zones */
+	struct bio_set		*bio_set;
+	atomic_t		bio_count;
+	unsigned long		atime;
+
+	/* For flush */
+	spinlock_t		flush_lock;
+	struct bio_list		flush_list;
+	struct delayed_work	flush_work;
+	struct workqueue_struct *flush_wq;
+
+	/* For reclaim */
+	struct delayed_work	reclaim_work;
+	struct workqueue_struct *reclaim_wq;
+	atomic_t		nr_reclaim_seq_zones;
+	struct list_head	reclaim_seq_zones_list;
+	struct dm_kcopyd_client	*reclaim_kc;
+	struct dm_kcopyd_throttle reclaim_throttle;
+	int			reclaim_err;
+
+};
+
+/*
+ * Chunk work descriptor.
+ */
+struct dm_chunk_work {
+	struct work_struct	work;
+	atomic_t		refcount;
+	struct dmz_target	*target;
+	unsigned int		chunk;
+	struct bio_list		bio_list;
+};
+
+#define dmz_id(dmz, z)		((unsigned int)((z) - (dmz)->zones))
+#define dmz_get(dmz, z)		(&(dmz)->zones[z])
+#define dmz_start_sect(dmz, z)	(dmz_id(dmz, z) << (dmz)->zone_nr_sectors_shift)
+#define dmz_start_block(dmz, z)	(dmz_id(dmz, z) << (dmz)->zone_nr_blocks_shift)
+#define dmz_is_rnd(z)		test_bit(DMZ_RND, &(z)->flags)
+#define dmz_is_seq(z)		test_bit(DMZ_SEQ, &(z)->flags)
+#define dmz_is_empty(z)		((z)->wp_block == 0)
+#define dmz_is_offline(z)	test_bit(DMZ_OFFLINE, &(z)->flags)
+#define dmz_is_readonly(z)	test_bit(DMZ_READ_ONLY, &(z)->flags)
+#define dmz_is_active(z)	test_bit(DMZ_ACTIVE, &(z)->flags)
+#define dmz_in_reclaim(z)	test_bit(DMZ_RECLAIM, &(z)->flags)
+#define dmz_seq_write_err(z)	test_bit(DMZ_SEQ_WRITE_ERR, &(z)->flags)
+
+#define dmz_is_meta(z)		test_bit(DMZ_META, &(z)->flags)
+#define dmz_is_buf(z)		test_bit(DMZ_BUF, &(z)->flags)
+#define dmz_is_data(z)		test_bit(DMZ_DATA, &(z)->flags)
+
+#define dmz_weight(z)		((z)->weight)
+
+#define dmz_chunk_sector(dmz, s) ((s) & ((dmz)->zone_nr_sectors - 1))
+#define dmz_chunk_block(dmz, b)	((b) & ((dmz)->zone_nr_blocks - 1))
+
+#define dmz_bio_block(bio)	dmz_sect2blk((bio)->bi_iter.bi_sector)
+#define dmz_bio_blocks(bio)	dmz_sect2blk(bio_sectors(bio))
+#define dmz_bio_chunk(dmz, bio)	((bio)->bi_iter.bi_sector >> \
+				 (dmz)->zone_nr_sectors_shift)
+
+#define dmz_lock_map(dmz)	mutex_lock(&(dmz)->map_lock)
+#define dmz_unlock_map(dmz)	mutex_unlock(&(dmz)->map_lock)
+
+/*
+ * Flush intervals (seconds).
+ */
+#define DMZ_FLUSH_PERIOD	(10 * HZ)
+
+/*
+ * Trigger flush.
+ */
+static inline void dmz_trigger_flush(struct dmz_target *dmz)
+{
+	mod_delayed_work(dmz->flush_wq, &dmz->flush_work, 0);
+}
+
+/*
+ * Number of seconds without BIO to consider the target device idle.
+ */
+#define DMZ_IDLE_PERIOD		(10UL * HZ)
+
+/*
+ * Zone reclaim check period.
+ */
+#define DMZ_RECLAIM_PERIOD	(HZ)
+
+/*
+ * Percentage of unmapped (free) random zones below which reclaim starts
+ * even if the device is not idle.
+ */
+#define DMZ_RECLAIM_LOW_FREE_RND	50
+
+/*
+ * Percentage of unmapped (free) random zones above which reclaim stops
+ * even if the device is idle.
+ */
+#define DMZ_RECLAIM_HIGH_FREE_RND	75
+
+/*
+ * Test if the target device is idle.
+ */
+static inline int dmz_idle(struct dmz_target *dmz)
+{
+	return atomic_read(&(dmz)->bio_count) == 0 &&
+		time_is_before_jiffies(dmz->atime + DMZ_IDLE_PERIOD);
+}
+
+/*
+ * Test if triggering reclaim is necessary.
+ */
+static inline bool dmz_should_reclaim(struct dmz_target *dmz)
+{
+	unsigned int unmap_rnd = atomic_read(&dmz->dz_unmap_nr_rnd);
+
+	if (dmz_idle(dmz) && unmap_rnd < dmz->dz_nr_rnd)
+		return true;
+
+	/* Is the percentage of unmapped random zones low? */
+	return ((unmap_rnd * 100) / dmz->dz_nr_rnd) <= DMZ_RECLAIM_LOW_FREE_RND;
+}
+
+/*
+ * Schedule reclaim (delay in jiffies).
+ */
+static inline void dmz_schedule_reclaim(struct dmz_target *dmz,
+					unsigned long delay)
+{
+	mod_delayed_work(dmz->reclaim_wq, &dmz->reclaim_work, delay);
+}
+
+/*
+ * Trigger reclaim.
+ */
+static inline void dmz_trigger_reclaim(struct dmz_target *dmz)
+{
+	dmz_schedule_reclaim(dmz, 0);
+}
+
+extern void dmz_reclaim_work(struct work_struct *work);
+
+/*
+ * Zone BIO context.
+ */
+struct dmz_bioctx {
+	struct dmz_target	*target;
+	struct dm_zone		*zone;
+	struct bio		*bio;
+	atomic_t		ref;
+	int			error;
+};
+
+#define dmz_info(format, args...)		\
+	pr_info("dm-zoned: " format,		\
+	## args)
+
+#define dmz_dev_info(dmz, format, args...)	\
+	pr_info("dm-zoned (%s): " format,	\
+	       (dmz)->zbd_name, ## args)
+
+#define dmz_dev_err(dmz, format, args...)	\
+	pr_err("dm-zoned (%s): " format,	\
+	       (dmz)->zbd_name, ## args)
+
+#define dmz_dev_warn(dmz, format, args...)	\
+	pr_warn("dm-zoned (%s): " format,	\
+		(dmz)->zbd_name, ## args)
+
+#define dmz_dev_debug(dmz, format, args...)	\
+	pr_debug("dm-zoned (%s): " format,	\
+		 (dmz)->zbd_name, ## args)
+
+extern int dmz_init_meta(struct dmz_target *dmz);
+extern int dmz_resume_meta(struct dmz_target *dmz);
+extern void dmz_cleanup_meta(struct dmz_target *dmz);
+
+extern int dmz_flush_mblocks(struct dmz_target *dmz);
+
+#define DMZ_ALLOC_RND		0x01
+#define DMZ_ALLOC_RECLAIM	0x02
+
+struct dm_zone *dmz_alloc_zone(struct dmz_target *dmz, unsigned long flags);
+extern void dmz_free_zone(struct dmz_target *dmz, struct dm_zone *zone);
+
+extern void dmz_map_zone(struct dmz_target *dmz, struct dm_zone *zone,
+			 unsigned int chunk);
+extern void dmz_unmap_zone(struct dmz_target *dmz, struct dm_zone *zone);
+
+extern void dmz_activate_zone(struct dmz_target *dmz, struct dm_zone *zone);
+extern void dmz_deactivate_zone(struct dmz_target *dmz, struct dm_zone *zone);
+
+extern struct dm_zone *dmz_get_chunk_mapping(struct dmz_target *dmz,
+					     unsigned int chunk, int op);
+extern void dmz_put_chunk_mapping(struct dmz_target *dmz,
+				  struct dm_zone *zone);
+
+extern struct dm_zone *dmz_get_chunk_buffer(struct dmz_target *dmz,
+					    struct dm_zone *dzone);
+
+extern int dmz_valid_copy(struct dmz_target *dmz, struct dm_zone *from_zone,
+			  struct dm_zone *to_zone);
+extern int dmz_valid_merge(struct dmz_target *dmz, struct dm_zone *from_zone,
+			   struct dm_zone *to_zone, sector_t chunk_block);
+
+extern int dmz_validate_blocks(struct dmz_target *dmz, struct dm_zone *zone,
+			       sector_t chunk_block, unsigned int nr_blocks);
+extern int dmz_invalidate_blocks(struct dmz_target *dmz, struct dm_zone *zone,
+				 sector_t chunk_block, unsigned int nr_blocks);
+static inline int dmz_invalidate_zone(struct dmz_target *dmz,
+				      struct dm_zone *zone)
+{
+	return dmz_invalidate_blocks(dmz, zone, 0, dmz->zone_nr_blocks);
+}
+
+extern int dmz_block_valid(struct dmz_target *dmz, struct dm_zone *zone,
+			   sector_t chunk_block);
+
+extern int dmz_first_valid_block(struct dmz_target *dmz, struct dm_zone *zone,
+				 sector_t *chunk_block);
+
+#endif /* __DM_ZONED_H__ */
-- 
2.9.4

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/4] dm: zoned block device fixes
  2017-05-29 10:23 [PATCH 0/4] dm: zoned block device fixes Damien Le Moal
                   ` (3 preceding siblings ...)
  2017-05-29 10:23 ` [PATCH 4/4] dm-zoned: Drive-managed zoned block device target Damien Le Moal
@ 2017-05-30 20:20 ` Mike Snitzer
  2017-05-31  4:29   ` Damien Le Moal
  4 siblings, 1 reply; 13+ messages in thread
From: Mike Snitzer @ 2017-05-30 20:20 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: Bart Van Assche, dm-devel, Alasdair Kergon

On Mon, May 29 2017 at  6:23P -0400,
Damien Le Moal <damien.lemoal@wdc.com> wrote:

> Mike,
> 
> The first 3 patches of this series are incremental fixes for the zoned block
> device support patches that you committed to the for-4.13/dm branch.
> 
> The first patch correct the zone alignement checks so that the check is
> performed for any device, regardless of the device LBA size (it is skipped for
> 512B LBA devices otherwise).

I folded this first patch into the original commit (baf844bf4ae3).

> The second patch is a fix for commit baf844bf4ae3 "dm table: add zoned block
> devices validation". In that commit, the stacked limits zoned model was not
> set to the zoned model of the table target devices, leading to the exposed
> device always being exposed as a regular block device. With this fix, dm-flaky
> and dm-linear work fine on top of host-managed zoned block devices.
> 
> The third patch fixes zoned model validation again to allow for target types
> emulating a different zoned model than the model of the table target devices,
> e.g. dm-zoned.

The 2nd and 3rd seem over-done to me.  After spending more time than
ideal, the following patch would seem to be the equivalent to what
you've done in patches 2 and 3 (sans the "cleanup" of passing limits to
validate_hardware_zoned_model):

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 6545150..a39bcd9 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1523,19 +1524,39 @@ int dm_calculate_queue_limits(struct dm_table *table,
 			       dm_device_name(table->md),
 			       (unsigned long long) ti->begin,
 			       (unsigned long long) ti->len);
+
+		/*
+		 * FIXME: this should likely be moved to blk_stack_limits(), would
+		 * also eliminate limits->zoned stacking hack in dm_set_device_limits()
+		 */
+		if (limits->zoned == BLK_ZONED_NONE && ti_limits.zoned != BLK_ZONED_NONE) {
+			/*
+			 * By default, the stacked limits zoned model is set to
+			 * BLK_ZONED_NONE in blk_set_stacking_limits(). Update
+			 * this model using the first target model reported
+			 * that is not BLK_ZONED_NONE. This will be either the
+			 * first target device zoned model or the model reported
+			 * by the target .io_hints.
+			 */
+			limits->zoned = ti_limits.zoned;
+		}
 	}
 
 	/*
 	 * Verify that the zoned model and zone sectors, as determined before
 	 * any .io_hints override, are the same across all devices in the table.
-	 * - but if limits->zoned is not BLK_ZONED_NONE validate match for it
-	 * - simillarly, check all devices conform to limits->chunk_sectors if
-	 *   .io_hints altered them
+	 * - this is especially relevant if .io_hints is emulating a disk-managed
+	 *   zoned model (aka BLK_ZONED_NONE) on host-managed zoned block devices.
+	 * BUT...
 	 */
-	if (limits->zoned != BLK_ZONED_NONE)
+	if (limits->zoned != BLK_ZONED_NONE) {
+		/*
+		 * ...IF the above limits stacking determined a zoned model
+		 * validate that all of the table's devices conform to it.
+		 */
 		zoned_model = limits->zoned;
-	if (limits->chunk_sectors != zone_sectors)
 		zone_sectors = limits->chunk_sectors;
+	}
 	if (validate_hardware_zoned_model(table, zoned_model, zone_sectors))
 		return -EINVAL;
 
Anyway, I've folded this into the original commit too.  If you could
verify it works with your scenarios I'd appreciate it.

FYI, any additional cosmetic cleanup can wait (I happen to think this
code is clearer than how you relied on the matches functions to
initialize a temporary value).

I also folded in a validate_hardware_zoned_model() optimization to
return early if zoned_model == BLK_ZONED_NONE, please see/test the
rolled-up end result here:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-4.13/dm&id=9a6b54360f147c2d25fba7debc31a3251b804cc2

Also, please note that I've forcibly rebased linux-dm.git's
'for-4.13/dm' and staged it in 'for-next'.

> The last patch is dm-zoned with various fixes (mainly crashes on setup error
> and handling of the metadata cache shrinker). For your review, please use this
> version.

Will do, thanks.

Mike

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/4] dm: zoned block device fixes
  2017-05-30 20:20 ` [PATCH 0/4] dm: zoned block device fixes Mike Snitzer
@ 2017-05-31  4:29   ` Damien Le Moal
  2017-05-31 14:39     ` Mike Snitzer
  0 siblings, 1 reply; 13+ messages in thread
From: Damien Le Moal @ 2017-05-31  4:29 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Bart Van Assche, dm-devel, Alasdair Kergon

Mike,

On 5/31/17 05:20, Mike Snitzer wrote:
> On Mon, May 29 2017 at  6:23P -0400,
> Damien Le Moal <damien.lemoal@wdc.com> wrote:
> 
>> Mike,
>>
>> The first 3 patches of this series are incremental fixes for the zoned block
>> device support patches that you committed to the for-4.13/dm branch.
>>
>> The first patch correct the zone alignement checks so that the check is
>> performed for any device, regardless of the device LBA size (it is skipped for
>> 512B LBA devices otherwise).
> 
> I folded this first patch into the original commit (baf844bf4ae3).

Great. Thanks.

>> The second patch is a fix for commit baf844bf4ae3 "dm table: add zoned block
>> devices validation". In that commit, the stacked limits zoned model was not
>> set to the zoned model of the table target devices, leading to the exposed
>> device always being exposed as a regular block device. With this fix, dm-flaky
>> and dm-linear work fine on top of host-managed zoned block devices.
>>
>> The third patch fixes zoned model validation again to allow for target types
>> emulating a different zoned model than the model of the table target devices,
>> e.g. dm-zoned.
> 
> The 2nd and 3rd seem over-done to me.  After spending more time than
> ideal, the following patch would seem to be the equivalent to what
> you've done in patches 2 and 3 (sans the "cleanup" of passing limits to
> validate_hardware_zoned_model):
> 
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index 6545150..a39bcd9 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -1523,19 +1524,39 @@ int dm_calculate_queue_limits(struct dm_table *table,
>  			       dm_device_name(table->md),
>  			       (unsigned long long) ti->begin,
>  			       (unsigned long long) ti->len);
> +
> +		/*
> +		 * FIXME: this should likely be moved to blk_stack_limits(), would
> +		 * also eliminate limits->zoned stacking hack in dm_set_device_limits()
> +		 */
> +		if (limits->zoned == BLK_ZONED_NONE && ti_limits.zoned != BLK_ZONED_NONE) {
> +			/*
> +			 * By default, the stacked limits zoned model is set to
> +			 * BLK_ZONED_NONE in blk_set_stacking_limits(). Update
> +			 * this model using the first target model reported
> +			 * that is not BLK_ZONED_NONE. This will be either the
> +			 * first target device zoned model or the model reported
> +			 * by the target .io_hints.
> +			 */
> +			limits->zoned = ti_limits.zoned;
> +		}
>  	}
>  
>  	/*
>  	 * Verify that the zoned model and zone sectors, as determined before
>  	 * any .io_hints override, are the same across all devices in the table.
> -	 * - but if limits->zoned is not BLK_ZONED_NONE validate match for it
> -	 * - simillarly, check all devices conform to limits->chunk_sectors if
> -	 *   .io_hints altered them
> +	 * - this is especially relevant if .io_hints is emulating a disk-managed
> +	 *   zoned model (aka BLK_ZONED_NONE) on host-managed zoned block devices.
> +	 * BUT...
>  	 */
> -	if (limits->zoned != BLK_ZONED_NONE)
> +	if (limits->zoned != BLK_ZONED_NONE) {
> +		/*
> +		 * ...IF the above limits stacking determined a zoned model
> +		 * validate that all of the table's devices conform to it.
> +		 */
>  		zoned_model = limits->zoned;
> -	if (limits->chunk_sectors != zone_sectors)
>  		zone_sectors = limits->chunk_sectors;
> +	}
>  	if (validate_hardware_zoned_model(table, zoned_model, zone_sectors))
>  		return -EINVAL;
>  
> Anyway, I've folded this into the original commit too.  If you could
> verify it works with your scenarios I'd appreciate it.

I tested with dm-linear, dm-flakey and dm-zoned. No problems detected,
the end device zone model and zone size were always correct. I also tried
all the invalid setups I could generate and all were properly caught.

There is however one case that will not work: an HM (or HA) emulating
target on top of a regular (NONE) block device. In that case, we will
end up checking that the underlying devices are compatible HM/HA, which
will fail. But since none of the existing targets currently do this, I
guess the code is OK as is. What do you think?

> FYI, any additional cosmetic cleanup can wait (I happen to think this
> code is clearer than how you relied on the matches functions to
> initialize a temporary value).

OK. No problem.

> I also folded in an validate_hardware_zoned_model() optimization to
> return early if zoned_model == BLK_ZONED_NONE, please see/test the
> rolled-up end result here:
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-4.13/dm&id=9a6b54360f147c2d25fba7debc31a3251b804cc2
> 
> Also, please note that I've forcibly rebased linux-dm.git's
> 'for-4.13/dm' and staged it in 'for-next'.

I tested this tree unmodified. No problem detected.
Thank you.

Best regards.

-- 
Damien Le Moal,
Western Digital Research

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/4] dm: zoned block device fixes
  2017-05-31  4:29   ` Damien Le Moal
@ 2017-05-31 14:39     ` Mike Snitzer
  2017-06-02  0:36       ` Mike Snitzer
  0 siblings, 1 reply; 13+ messages in thread
From: Mike Snitzer @ 2017-05-31 14:39 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: Bart Van Assche, dm-devel, Alasdair Kergon

On Wed, May 31 2017 at 12:29am -0400,
Damien Le Moal <damien.lemoal@wdc.com> wrote:

> Mike,
> 
> On 5/31/17 05:20, Mike Snitzer wrote:
> > On Mon, May 29 2017 at  6:23P -0400,
> > Damien Le Moal <damien.lemoal@wdc.com> wrote:
> > 
> >> Mike,
> >>
> >> The first 3 patches of this series are incremental fixes for the zoned block
> >> device support patches that you committed to the for-4.13/dm branch.
> >>
> >> The first patch correct the zone alignement checks so that the check is
> >> performed for any device, regardless of the device LBA size (it is skipped for
> >> 512B LBA devices otherwise).
> > 
> > I folded this first patch into the original commit (baf844bf4ae3).
> 
> Great. Thanks.
> 
> >> The second patch is a fix for commit baf844bf4ae3 "dm table: add zoned block
> >> devices validation". In that commit, the stacked limits zoned model was not
> >> set to the zoned model of the table target devices, leading to the exposed
> >> device always being exposed as a regular block device. With this fix, dm-flaky
> >> and dm-linear work fine on top of host-managed zoned block devices.
> >>
> >> The third patch fixes zoned model validation again to allow for target types
> >> emulating a different zoned model than the model of the table target devices,
> >> e.g. dm-zoned.
> > 
> > The 2nd and 3rd seem over-done to me.  After spending more time than
> > ideal, the following patch would seem to be the equivalent to what
> > you've done in patches 2 and 3 (sans the "cleanup" of passing limits to
> > validate_hardware_zoned_model):
> > 
> > diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> > index 6545150..a39bcd9 100644
> > --- a/drivers/md/dm-table.c
> > +++ b/drivers/md/dm-table.c
> > @@ -1523,19 +1524,39 @@ int dm_calculate_queue_limits(struct dm_table *table,
> >  			       dm_device_name(table->md),
> >  			       (unsigned long long) ti->begin,
> >  			       (unsigned long long) ti->len);
> > +
> > +		/*
> > +		 * FIXME: this should likely be moved to blk_stack_limits(), would
> > +		 * also eliminate limits->zoned stacking hack in dm_set_device_limits()
> > +		 */
> > +		if (limits->zoned == BLK_ZONED_NONE && ti_limits.zoned != BLK_ZONED_NONE) {
> > +			/*
> > +			 * By default, the stacked limits zoned model is set to
> > +			 * BLK_ZONED_NONE in blk_set_stacking_limits(). Update
> > +			 * this model using the first target model reported
> > +			 * that is not BLK_ZONED_NONE. This will be either the
> > +			 * first target device zoned model or the model reported
> > +			 * by the target .io_hints.
> > +			 */
> > +			limits->zoned = ti_limits.zoned;
> > +		}
> >  	}
> >  
> >  	/*
> >  	 * Verify that the zoned model and zone sectors, as determined before
> >  	 * any .io_hints override, are the same across all devices in the table.
> > -	 * - but if limits->zoned is not BLK_ZONED_NONE validate match for it
> > -	 * - simillarly, check all devices conform to limits->chunk_sectors if
> > -	 *   .io_hints altered them
> > +	 * - this is especially relevant if .io_hints is emulating a disk-managed
> > +	 *   zoned model (aka BLK_ZONED_NONE) on host-managed zoned block devices.
> > +	 * BUT...
> >  	 */
> > -	if (limits->zoned != BLK_ZONED_NONE)
> > +	if (limits->zoned != BLK_ZONED_NONE) {
> > +		/*
> > +		 * ...IF the above limits stacking determined a zoned model
> > +		 * validate that all of the table's devices conform to it.
> > +		 */
> >  		zoned_model = limits->zoned;
> > -	if (limits->chunk_sectors != zone_sectors)
> >  		zone_sectors = limits->chunk_sectors;
> > +	}
> >  	if (validate_hardware_zoned_model(table, zoned_model, zone_sectors))
> >  		return -EINVAL;
> >  
> > Anyway, I've folded this into the original commit too.  If you could
> > verify it works with your scenarios I'd appreciate it.
> 
> I tested with dm-linear, dm-flakey and dm-zoned. No problems detected,
> the end device zone model and zone size was always correct. I also tried
> all invalid setup I could generate and all were properly caught.
> 
> There is however one case that will not work: an HM (or HA) emulating
> target on top of a regular (NONE) block device. In that case, we will
> end up checking that the underlying devices are compatible HM/HA, which
> will fail. But since none of the existing targets currently do this, I
> guess the code is OK as is. What do you think ?

Yeah, I think it best to fix that if/when there is a need.

> > FYI, any additional cosmetic cleanup can wait (I happen to think this
> > code is clearer than how you relied on the matches functions to
> > initialize a temporary value).
> 
> OK. No problem.
> 
> > I also folded in an validate_hardware_zoned_model() optimization to
> > return early if zoned_model == BLK_ZONED_NONE, please see/test the
> > rolled-up end result here:
> > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-4.13/dm&id=9a6b54360f147c2d25fba7debc31a3251b804cc2
> > 
> > Also, please note that I've forcibly rebased linux-dm.git's
> > 'for-4.13/dm' and staged it in 'for-next'.
> 
> I tested this tree unmodified. No problem detected.
> Thank you.

Good news, thanks for the review/testing.

FYI: my review of dm-zoned will be focused on DM target correctness
(suspend/resume quirks, no allocations in the IO path that aren't backed
by a mempool, coding style nits, etc).  I don't know enough about zoned
block devices to weigh-in on those details.  Ultimately I'll be
deferring to you, others on your team, and others in the community that
are more invested in zoned block devices to steer and stabilize this
target.

Anyway, hopefully my review will be fairly quick and I can get dm-zoned
staged for 4.13 by end of day tomorrow.

Mike

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/4] dm: zoned block device fixes
  2017-05-31 14:39     ` Mike Snitzer
@ 2017-06-02  0:36       ` Mike Snitzer
  2017-06-05 10:48         ` Damien Le Moal
  0 siblings, 1 reply; 13+ messages in thread
From: Mike Snitzer @ 2017-06-02  0:36 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: Bart Van Assche, dm-devel, Alasdair Kergon

On Wed, May 31 2017 at 10:39am -0400,
Mike Snitzer <snitzer@redhat.com> wrote:
 
> FYI: my review of dm-zoned will be focused on DM target correctness
> (suspend/resume quirks, no allocations in the IO path that aren't backed
> by a mempool, coding style nits, etc).  I don't know enough about zoned
> block devices to weigh-in on those details.  Ultimately I'll be
> deferring to you, others on your team, and others in the community that
> are more invested in zoned block devices to steer and stabilize this
> target.
> 
> Anyway, hopefully my review will be fairly quick and I can get dm-zoned
> staged for 4.13 by end of day tomorrow.

I made a go of it but I'm getting hung up on quite a lot of code that
doesn't conform to, what I'd like to think is, the cleaner nature of how
DM targets that are split across multiple files should be structured.

You basically slammed everything into 'struct dmz_target' and passed dmz
everywhere.  I tried to split out a 'struct dmz_metadata' (and got quite
far!) but finally gave up because affecting that churn was killing me
slowly.  Anyway, here is where I left off:
https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/log/?h=dm-zoned

In hindsight, maybe I should've just responded with the laundry list of
things I saw so that you could fix them.  But if you see changes that
you like in that branch feel free to pull them in to a new version of
dm-zoned that you resubmit.

As for splitting out a 'struct dmz_metadata'.. I'd really prefer _some_
separation but there is little point with doing so if we're going to
just half-ass it and add in a back-pointer to the 'struct dmz_target' to
access certain members.  I was left unhappy with my attempt.. again, was
a shit-show of churn.

I think this target needs a more critical eye on the various places IO
is being submitted and where allocations are occurring.  I allowed myself
to get hung up on code movement when I should've focused on more
constructive design choices you made.

Mike

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/4] dm: zoned block device fixes
  2017-06-02  0:36       ` Mike Snitzer
@ 2017-06-05 10:48         ` Damien Le Moal
  2017-06-06 14:18           ` Mike Snitzer
  2017-06-08 20:21           ` Mike Snitzer
  0 siblings, 2 replies; 13+ messages in thread
From: Damien Le Moal @ 2017-06-05 10:48 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Bart Van Assche, dm-devel, Alasdair Kergon

Mike,

Thank you very much for the very detailed review and all the cleanup.

On 6/2/17 09:36, Mike Snitzer wrote:
> On Wed, May 31 2017 at 10:39am -0400,
> Mike Snitzer <snitzer@redhat.com> wrote:
>  
>> FYI: my review of dm-zoned will be focused on DM target correctness
>> (suspend/resume quirks, no allocations in the IO path that aren't backed
>> by a mempool, coding style nits, etc).  I don't know enough about zoned
>> block devices to weigh-in on those details.  Ultimately I'll be
>> deferring to you, others on your team, and others in the community that
>> are more invested in zoned block devices to steer and stabilize this
>> target.
>>
>> Anyway, hopefully my review will be fairly quick and I can get dm-zoned
>> staged for 4.13 by end of day tomorrow.
> 
> I made a go of it but I'm getting hung up on quite a lot of code that
> doesn't conform to, what I'd like to think is, the cleaner nature of how
> DM targets that are split across multiple files should be.
> 
> You basically slammed everything into 'struct dmz_target' and passed dmz
> everywhere.  I tried to split out a 'struct dmz_metadata' (and got quite
> far!) but finally gave up because affecting that churn was killing me
> slowly.  Anyway, here is where I left off:
> https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/log/?h=dm-zoned
> 
> In hindsight, maybe I should've just responded with the laundry list of
> things I saw so that you could fix them.  But if you see changes that
> you like in that branch feel free to pull them in to a new version of
> dm-zoned that you resubmit.
> 
> As for splitting out a 'struct dmz_metadata'.. I'd really prefer _some_
> separation but there is little point with doing so if we're going to
> just half-ass it and add in a back-pointer to the 'struct dmz_target' to
> access certain members.  I was left unhappy with my attempt.. again, was
> a shit-show of churn.

I continued and finished the separation. There is no back-pointer
anymore. All functions in dm-zoned-metadata.c take the metadata struct
pointer as argument. Other functions in dm-zoned-target.c and
dm-zoned-reclaim.c take the dmz_target pointer as argument. To do this,
I had to add a "struct dmz_dev" to store the static information about
the device being used (bdev, name, capacity, number of zones, zone
size). This further cleans up the initialization and tear-down paths.
I also moved around and renamed some functions to further clean up the
code and make it easier to read. At this point, I think it would be easy
to also separate all fields needed for reclaim from the dmz_target
structure. But I have not pushed changes this far as the amount of data
needed for reclaim is rather small.
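
Roughly, the intent is a small structure along these lines (just a
sketch; the field names here are illustrative only):

	struct dmz_dev {
		struct block_device	*bdev;
		char			name[BDEVNAME_SIZE];
		sector_t		capacity;	/* in 512B sectors */
		unsigned int		nr_zones;
		sector_t		zone_nr_sectors;
	};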

> I think this target needs a more critical eye on the various places IO
> is being submitted and where allocations are occuring.  I allowed myself
> to get hung up on code movement when I should've focused on more
> constructive design choices you made.

I did address some of the "FIXME" notes you added. The main one is the
BIO cloning in the I/O path. I removed most of that and added a .end_io
method for completion processing. The only place where I do not see how
to remove the call to bio_clone() is during read BIO processing: since a
read BIO may end up being split between a buffer zone, a sequential zone
and a simple zero-out of unwritten blocks, fragmentation of the read BIO
is sometimes necessary, and so a clone is needed.

There is one "FIXME" that I did not address: the allocation of metadata
blocks on cache miss. This is in the I/O path, but called only from the
context of the chunk workers, so a different context than the BIO submit
one. I do not see a problem with this. Please let me know if you would
prefer another solution.

I am running tests against the new version created with all these
changes. If everything goes well, I will send it out tomorrow.

Best regards.

-- 
Damien Le Moal,
Western Digital

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/4] dm: zoned block device fixes
  2017-06-05 10:48         ` Damien Le Moal
@ 2017-06-06 14:18           ` Mike Snitzer
  2017-06-08 20:21           ` Mike Snitzer
  1 sibling, 0 replies; 13+ messages in thread
From: Mike Snitzer @ 2017-06-06 14:18 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: Bart Van Assche, dm-devel, Alasdair Kergon

On Mon, Jun 05 2017 at  6:48am -0400,
Damien Le Moal <damien.lemoal@wdc.com> wrote:

> Mike,
> 
> Thank you very much for the very detailed review and all the cleanup.
> 
> On 6/2/17 09:36, Mike Snitzer wrote:
> > On Wed, May 31 2017 at 10:39am -0400,
> > Mike Snitzer <snitzer@redhat.com> wrote:
> >  
> >> FYI: my review of dm-zoned will be focused on DM target correctness
> >> (suspend/resume quirks, no allocations in the IO path that aren't backed
> >> by a mempool, coding style nits, etc).  I don't know enough about zoned
> >> block devices to weigh-in on those details.  Ultimately I'll be
> >> deferring to you, others on your team, and others in the community that
> >> are more invested in zoned block devices to steer and stabilize this
> >> target.
> >>
> >> Anyway, hopefully my review will be fairly quick and I can get dm-zoned
> >> staged for 4.13 by end of day tomorrow.
> > 
> > I made a go of it but I'm getting hung up on quite a lot of code that
> > doesn't conform to, what I'd like to think is, the cleaner nature of how
> > DM targets that are split across multiple files should be.
> > 
> > You basically slammed everything into 'struct dmz_target' and passed dmz
> > everywhere.  I tried to split out a 'struct dmz_metadata' (and got quite
> > far!) but finally gave up because affecting that churn was killing me
> > slowly.  Anyway, here is where I left off:
> > https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/log/?h=dm-zoned
> > 
> > In hindsight, maybe I should've just responded with the laundry list of
> > things I saw so that you could fix them.  But if you see changes that
> > you like in that branch feel free to pull them in to a new version of
> > dm-zoned that you resubmit.
> > 
> > As for splitting out a 'struct dmz_metadata'.. I'd really prefer _some_
> > separation but there is little point with doing so if we're going to
> > just half-ass it and add in a back-pointer to the 'struct dmz_target' to
> > access certain members.  I was left unhappy with my attempt.. again, was
> > a shit-show of churn.
> 
> I continued and finished the separation. There is no back-pointer
> anymore. All functions in dm-zoned-metadata.c take the metadata struct
> pointer as argument. Other functions in dm-zoned-target.c and
> dm-zoned-reclaim.c take the dmz_target pointer as argument. To do this,
> I had to add a "struct dmz_dev" to store the static information about
> the device being used (bdev, name, capacity, number of zones, zone
> size). This does cleanup further the initialization and tear-down path.
> I also moved around and renamed some functions to further cleanup the
> code and make it easier to read. At this point, I think it would be easy
> to also separate all fields needed for reclaim from the dmz_target
> structure. But I have not pushed changes this far as the amount of data
> needed for reclaim is rather small.

Sounds good.

> > I think this target needs a more critical eye on the various places IO
> > is being submitted and where allocations are occuring.  I allowed myself
> > to get hung up on code movement when I should've focused on more
> > constructive design choices you made.
> 
> I did address some of the "FIXME" notes you added. The main one is the
> BIO cloning in the I/O path. I removed most of that and added a .end_io
> method for completion processing. The only place were I do not see how
> to remove the call to bio_clone() is during read BIO processing: since a
> read BIO may end up being split between buffer zone, sequential zone and
> simple buffer zero-out, fragmentation of the read BIO is sometimes
> necessary and so need a clone.
> 
> There is one "FIXME" that I did not address: the allocation of metadata
> blocks on cache miss. This is in the I/O path, but called only from the
> context of the chunk workers, so a different context than the BIO submit
> one. I do not see a problem with this. Please let me know if you would
> prefer another solution.
> 
> I am running tests against the new version created with all these
> changes. If everything goes well, I will send it out tomorrow.

Great, thanks for carrying it forward.

Mike

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/4] dm: zoned block device fixes
  2017-06-05 10:48         ` Damien Le Moal
  2017-06-06 14:18           ` Mike Snitzer
@ 2017-06-08 20:21           ` Mike Snitzer
  2017-06-09  4:25             ` Damien Le Moal
  1 sibling, 1 reply; 13+ messages in thread
From: Mike Snitzer @ 2017-06-08 20:21 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: Bart Van Assche, dm-devel, Alasdair Kergon

On Mon, Jun 05 2017 at  6:48am -0400,
Damien Le Moal <damien.lemoal@wdc.com> wrote:
 
> I did address some of the "FIXME" notes you added. The main one is the
> BIO cloning in the I/O path. I removed most of that and added a .end_io
> method for completion processing. The only place were I do not see how
> to remove the call to bio_clone() is during read BIO processing: since a
> read BIO may end up being split between buffer zone, sequential zone and
> simple buffer zero-out, fragmentation of the read BIO is sometimes
> necessary and so need a clone.

So shouldn't it be possible to not allow a given bio to cross zone
boundaries by using dm_accept_partial_bio()?
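
I.e. something along these lines in the .map method (just a sketch;
dmz_sectors_to_zone_end() is a made-up helper returning the number of
sectors left in the bio's zone):

	sector_t max_sectors = dmz_sectors_to_zone_end(dmz, bio);

	if (bio_sectors(bio) > max_sectors)
		dm_accept_partial_bio(bio, max_sectors);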

Like you're already doing in dmz_map() actually... so why do you need to
account for crossing zone boundaries on read later on in
dmz_submit_read_bio()?

Is it that these zones aren't easily known up front (in dmz_map)?

Mike

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/4] dm: zoned block device fixes
  2017-06-08 20:21           ` Mike Snitzer
@ 2017-06-09  4:25             ` Damien Le Moal
  0 siblings, 0 replies; 13+ messages in thread
From: Damien Le Moal @ 2017-06-09  4:25 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Bart Van Assche, dm-devel, Alasdair Kergon

Mike,

On 6/9/17 05:21, Mike Snitzer wrote:
> On Mon, Jun 05 2017 at  6:48am -0400,
> Damien Le Moal <damien.lemoal@wdc.com> wrote:
>  
>> I did address some of the "FIXME" notes you added. The main one is the
>> BIO cloning in the I/O path. I removed most of that and added a .end_io
>> method for completion processing. The only place were I do not see how
>> to remove the call to bio_clone() is during read BIO processing: since a
>> read BIO may end up being split between buffer zone, sequential zone and
>> simple buffer zero-out, fragmentation of the read BIO is sometimes
>> necessary and so need a clone.
> 
> So shouldn't it be possible to not allow a given bio to cross zone
> boundaries by using dm_accept_partial_bio()?
> 
> Like you're already doing in dmz_map() actually... so why do you need to
> account for crossing zone boundaries on read later on in
> dmz_submit_read_bio()?

It is not crossing of zone or chunk boundaries that is being dealt with
here. When the read BIO is being processed, we already know that it does
not cross chunk boundaries thanks to dmz_map(). Since chunks are mapped
to entire zones, the BIO does not cross a zone boundary either.

But the blocks to read within the chunk may be (1) invalid if they were
never written, (2) valid in the chunk data zone or (3) valid in the
chunk write buffer zone (this case exists only if the chunk is mapped to
a sequential zone).
So we need to examine the zone block bitmaps to discover this and
split the BIO into several fragments if needed. Hence the need
for bio_clone().
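
Roughly, the idea is a per-fragment loop along these lines (a sketch
only; dmz_block_state() and dmz_submit_fragment() are hypothetical
helpers standing in for the actual bitmap lookup and clone submission):

	while (chunk_block < end_block) {
		/* How many consecutive blocks share the same state? */
		nr_blocks = dmz_block_state(dmz, dzone, chunk_block, &state);
		/* Invalid -> zero-fill, valid -> clone to the data or buffer zone */
		dmz_submit_fragment(dmz, bio, state, chunk_block, nr_blocks);
		chunk_block += nr_blocks;
	}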

If this processing was done within the context of dmz_map(), we could of
course use dm_accept_partial_bio() for splitting the BIO. But I avoided
this method as access to the zone bitmap may trigger metadata I/Os,
which is not very nice in the context of the user BIO submission path.
There is also the fact that dm-zoned does not have per-zone locks to deal
with concurrent accesses to the same chunk. All BIO processing for a chunk
gets serialized in the chunk work. This is mandatory to ensure sequential
writes to sequential zones whenever possible.

> Is it that these zones aren't easily known up front (in dmz_map)?

dmz_map() does the mapping discovery using the mapping table that is
pinned down in memory. So no metadata I/O is triggered. But doing more than
that could trigger metadata I/Os, which I wanted to avoid.
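
For illustration, the lookup boils down to something like this
(simplified, ignoring locking and the optional buffer zone mapping):

	struct dmz_map *dmap;
	unsigned int dzone_id;

	/* The chunk mapping table blocks are pinned in dz_map_mblk[] */
	dmap = dmz->dz_map_mblk[chunk >> DMZ_MAP_ENTRIES_SHIFT]->data;
	dzone_id = le32_to_cpu(dmap[chunk & DMZ_MAP_ENTRIES_MASK].dzone_id);
	if (dzone_id == DMZ_MAP_UNMAPPED) {
		/* Chunk never written: nothing mapped, no metadata I/O */
	}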

On another note, I just posted an additional small patch fixing two
problems: an overflow in the target length calculation and a mistake
in the suspend method setup that caused a compilation
error. My apologies for these mistakes. They were not in my local tree
when I tested. I think I messed up with the merging of all my local patches.

I checked your tree again and can confirm now that I am in sync. I will
keep testing anyway to make sure I did not do anything else wrong.

Thank you.

Best regards.

-- 
Damien Le Moal, Ph.D.
Sr Manager, System Software Group,
Western Digital Research
Damien.LeMoal@wdc.com
Tel: (+81) 0466-98-3593 (Ext. 51-3593)
1 kirihara-cho, Fujisawa, Kanagawa, 252-0888 Japan
www.wdc.com, www.hgst.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2017-06-09  4:25 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-29 10:23 [PATCH 0/4] dm: zoned block device fixes Damien Le Moal
2017-05-29 10:23 ` [PATCH 1/4] dm: Fix mapping zone alignment check Damien Le Moal
2017-05-29 10:23 ` [PATCH 2/4] dm: Fix staking limits for zoned block device Damien Le Moal
2017-05-29 10:23 ` [PATCH 3/4] dm: Fix zoned block device model validation Damien Le Moal
2017-05-29 10:23 ` [PATCH 4/4] dm-zoned: Drive-managed zoned block device target Damien Le Moal
2017-05-30 20:20 ` [PATCH 0/4] dm: zoned block device fixes Mike Snitzer
2017-05-31  4:29   ` Damien Le Moal
2017-05-31 14:39     ` Mike Snitzer
2017-06-02  0:36       ` Mike Snitzer
2017-06-05 10:48         ` Damien Le Moal
2017-06-06 14:18           ` Mike Snitzer
2017-06-08 20:21           ` Mike Snitzer
2017-06-09  4:25             ` Damien Le Moal
