linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/6] Support DAX for device-mapper dm-linear devices
@ 2016-06-13 22:21 Toshi Kani
  2016-06-13 22:21 ` [PATCH 1/6] genhd: Add GENHD_FL_DAX to gendisk flags Toshi Kani
                   ` (7 more replies)
  0 siblings, 8 replies; 40+ messages in thread
From: Toshi Kani @ 2016-06-13 22:21 UTC (permalink / raw)
  To: agk, snitzer, dan.j.williams
  Cc: ross.zwisler, viro, axboe, toshi.kani, linux-nvdimm, dm-devel,
	linux-raid, linux-kernel

This patch-set adds DAX support to device-mapper dm-linear devices
used by LVM.  It works with LVM commands as follows:
 - Creation of a logical volume with all DAX capable devices (such
   as pmem) sets the logical volume DAX capable as well.
 - Once a logical volume is set to DAX capable, the volume may not
   be extended with non-DAX capable devices.

The direct_access interface is added to dm and dm-linear to map
a request to a target device.

 - Patch 1-2 introduce GENHD_FL_DAX flag to indicate DAX capability.
 - Patch 3-4 add direct_access functions to dm and dm-linear.
 - Patch 5-6 set GENHD_FL_DAX to dm when all targets are DAX capable.

---
Toshi Kani (6):
 1/6 genhd: Add GENHD_FL_DAX to gendisk flags
 2/6 block: Check GENHD_FL_DAX for DAX capability
 3/6 dm: Add dm_blk_direct_access() for mapped device
 4/6 dm-linear: Add linear_direct_access()
 5/6 dm, dm-linear: Add dax_supported to dm_target
 6/6 dm: Enable DAX support for mapper device

---
 drivers/block/brd.c           |  2 +-
 drivers/md/dm-linear.c        | 19 +++++++++++++++++++
 drivers/md/dm-table.c         | 12 +++++++++---
 drivers/md/dm.c               | 38 ++++++++++++++++++++++++++++++++++++--
 drivers/md/dm.h               |  7 +++++++
 drivers/nvdimm/pmem.c         |  2 +-
 drivers/s390/block/dcssblk.c  |  1 +
 fs/block_dev.c                |  5 +++--
 include/linux/device-mapper.h | 15 +++++++++++++++
 include/linux/genhd.h         |  2 +-
 10 files changed, 93 insertions(+), 10 deletions(-)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 1/6] genhd: Add GENHD_FL_DAX to gendisk flags
  2016-06-13 22:21 [PATCH 0/6] Support DAX for device-mapper dm-linear devices Toshi Kani
@ 2016-06-13 22:21 ` Toshi Kani
  2016-06-13 22:21 ` [PATCH 2/6] block: Check GENHD_FL_DAX for DAX capability Toshi Kani
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 40+ messages in thread
From: Toshi Kani @ 2016-06-13 22:21 UTC (permalink / raw)
  To: agk, snitzer, dan.j.williams
  Cc: ross.zwisler, viro, axboe, toshi.kani, linux-nvdimm, dm-devel,
	linux-raid, linux-kernel, Martin Schwidefsky, Heiko Carstens,
	linux-s390

Currently, presence of direct_access() in block_device_operations
indicates support of DAX on its block device.  Because
block_device_operations is instantiated with 'const', this DAX
capablity may not be enabled conditinally.

In preparation for supporting DAX to device-mapper devices, add
GENHD_FL_DAX to gendisk flags to indicate that this block device
supports DAX.  This will allow to set the DAX capability based on
how mapped device is composed.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: <linux-s390@vger.kernel.org>
---
 drivers/block/brd.c          |    2 +-
 drivers/nvdimm/pmem.c        |    2 +-
 drivers/s390/block/dcssblk.c |    1 +
 include/linux/genhd.h        |    2 +-
 4 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index c04bd9b..43a6765 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -518,7 +518,7 @@ static struct brd_device *brd_alloc(int i)
 	disk->fops		= &brd_fops;
 	disk->private_data	= brd;
 	disk->queue		= brd->brd_queue;
-	disk->flags		= GENHD_FL_EXT_DEVT;
+	disk->flags		= GENHD_FL_EXT_DEVT | GENHD_FL_DAX;
 	sprintf(disk->disk_name, "ram%d", i);
 	set_capacity(disk, rd_size * 2);
 
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 608fc44..3847449 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -295,7 +295,7 @@ static int pmem_attach_disk(struct device *dev,
 
 	disk->fops		= &pmem_fops;
 	disk->queue		= q;
-	disk->flags		= GENHD_FL_EXT_DEVT;
+	disk->flags		= GENHD_FL_EXT_DEVT | GENHD_FL_DAX;
 	nvdimm_namespace_disk_name(ndns, disk->disk_name);
 	disk->driverfs_dev = dev;
 	set_capacity(disk, (pmem->size - pmem->pfn_pad - pmem->data_offset)
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index bed53c4..60b1857 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -616,6 +616,7 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char
 	dev_info->gd->queue = dev_info->dcssblk_queue;
 	dev_info->gd->private_data = dev_info;
 	dev_info->gd->driverfs_dev = &dev_info->dev;
+	dev_info->gd->flags = GENHD_FL_DAX;
 	blk_queue_make_request(dev_info->dcssblk_queue, dcssblk_make_request);
 	blk_queue_logical_block_size(dev_info->dcssblk_queue, 4096);
 
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 359a8e4..9dc3a3d 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -131,7 +131,7 @@ struct hd_struct {
 };
 
 #define GENHD_FL_REMOVABLE			1
-/* 2 is unused */
+#define GENHD_FL_DAX				2
 #define GENHD_FL_MEDIA_CHANGE_NOTIFY		4
 #define GENHD_FL_CD				8
 #define GENHD_FL_UP				16

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 2/6] block: Check GENHD_FL_DAX for DAX capability
  2016-06-13 22:21 [PATCH 0/6] Support DAX for device-mapper dm-linear devices Toshi Kani
  2016-06-13 22:21 ` [PATCH 1/6] genhd: Add GENHD_FL_DAX to gendisk flags Toshi Kani
@ 2016-06-13 22:21 ` Toshi Kani
  2016-06-13 22:21 ` [PATCH 3/6] dm: Add dm_blk_direct_access() for mapped device Toshi Kani
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 40+ messages in thread
From: Toshi Kani @ 2016-06-13 22:21 UTC (permalink / raw)
  To: agk, snitzer, dan.j.williams
  Cc: ross.zwisler, viro, axboe, toshi.kani, linux-nvdimm, dm-devel,
	linux-raid, linux-kernel, linux-fsdevel

Now that GENHD_FL_DAX is set to all drivers supporting DAX,
change bdev_direct_access() and __blkdev_get() to check this
GENHD_FL_DAX flag.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: <linux-fsdevel@vger.kernel.org>
---
 fs/block_dev.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 71ccab1..61935ee 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -493,7 +493,7 @@ long bdev_direct_access(struct block_device *bdev, struct blk_dax_ctl *dax)
 
 	if (size < 0)
 		return size;
-	if (!ops->direct_access)
+	if (!(bdev->bd_disk->flags & GENHD_FL_DAX) || !ops->direct_access)
 		return -EOPNOTSUPP;
 	if ((sector + DIV_ROUND_UP(size, 512)) >
 					part_nr_sects_read(bdev->bd_part))
@@ -1287,7 +1287,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 		bdev->bd_disk = disk;
 		bdev->bd_queue = disk->queue;
 		bdev->bd_contains = bdev;
-		if (IS_ENABLED(CONFIG_BLK_DEV_DAX) && disk->fops->direct_access)
+		if (IS_ENABLED(CONFIG_BLK_DEV_DAX) &&
+		    disk->flags & GENHD_FL_DAX)
 			bdev->bd_inode->i_flags = S_DAX;
 		else
 			bdev->bd_inode->i_flags = 0;

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 3/6] dm: Add dm_blk_direct_access() for mapped device
  2016-06-13 22:21 [PATCH 0/6] Support DAX for device-mapper dm-linear devices Toshi Kani
  2016-06-13 22:21 ` [PATCH 1/6] genhd: Add GENHD_FL_DAX to gendisk flags Toshi Kani
  2016-06-13 22:21 ` [PATCH 2/6] block: Check GENHD_FL_DAX for DAX capability Toshi Kani
@ 2016-06-13 22:21 ` Toshi Kani
  2016-06-13 22:21 ` [PATCH 4/6] dm-linear: Add linear_direct_access() Toshi Kani
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 40+ messages in thread
From: Toshi Kani @ 2016-06-13 22:21 UTC (permalink / raw)
  To: agk, snitzer, dan.j.williams
  Cc: ross.zwisler, viro, axboe, toshi.kani, linux-nvdimm, dm-devel,
	linux-raid, linux-kernel

Change mapped device to implement direct_access function,
dm_blk_direct_access(), which calls a target direct_access
function.  'struct target_type' is extended to have target
direct_access interface.  This function limits direct
accessible size to the dm_target's limit with max_io_len().

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 drivers/md/dm.c               |   29 +++++++++++++++++++++++++++++
 include/linux/device-mapper.h |   10 ++++++++++
 2 files changed, 39 insertions(+)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 1b2f962..6e9f958 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1473,6 +1473,34 @@ int dm_set_target_max_io_len(struct dm_target *ti, sector_t len)
 }
 EXPORT_SYMBOL_GPL(dm_set_target_max_io_len);
 
+static long dm_blk_direct_access(struct block_device *bdev, sector_t sector,
+	void __pmem **kaddr, pfn_t *pfn, long size)
+{
+	struct mapped_device *md = bdev->bd_disk->private_data;
+	struct dm_table *map;
+	struct dm_target *ti;
+	int srcu_idx;
+	long len, ret = -EIO;
+
+	map = dm_get_live_table(md, &srcu_idx);
+	if (!map)
+		return ret;
+
+	ti = dm_table_find_target(map, sector);
+	if (!dm_target_is_valid(ti))
+		goto out;
+
+	len = max_io_len(sector, ti) << SECTOR_SHIFT;
+	size = min(len, size);
+
+	if (ti->type->direct_access)
+		ret = ti->type->direct_access(ti, sector, kaddr, pfn, size);
+
+out:
+	dm_put_live_table(md, srcu_idx);
+	return min(ret, size);
+}
+
 /*
  * A target may call dm_accept_partial_bio only from the map routine.  It is
  * allowed for all bio types except REQ_FLUSH.
@@ -3721,6 +3749,7 @@ static const struct block_device_operations dm_blk_dops = {
 	.open = dm_blk_open,
 	.release = dm_blk_close,
 	.ioctl = dm_blk_ioctl,
+	.direct_access = dm_blk_direct_access,
 	.getgeo = dm_blk_getgeo,
 	.pr_ops = &dm_pr_ops,
 	.owner = THIS_MODULE
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 0830c9e..16e6c8c 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -116,6 +116,15 @@ typedef void (*dm_io_hints_fn) (struct dm_target *ti,
  */
 typedef int (*dm_busy_fn) (struct dm_target *ti);
 
+/*
+ * Returns:
+ *  < 0 : error
+ * >= 0 : the number of bytes accessible at the address
+ */
+typedef long (*dm_direct_access_fn) (struct dm_target *ti, sector_t sector,
+				     void __pmem **kaddr, pfn_t *pfn,
+				     long size);
+
 void dm_error(const char *message);
 
 struct dm_dev {
@@ -162,6 +171,7 @@ struct target_type {
 	dm_busy_fn busy;
 	dm_iterate_devices_fn iterate_devices;
 	dm_io_hints_fn io_hints;
+	dm_direct_access_fn direct_access;
 
 	/* For internal device-mapper use. */
 	struct list_head list;

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 4/6] dm-linear: Add linear_direct_access()
  2016-06-13 22:21 [PATCH 0/6] Support DAX for device-mapper dm-linear devices Toshi Kani
                   ` (2 preceding siblings ...)
  2016-06-13 22:21 ` [PATCH 3/6] dm: Add dm_blk_direct_access() for mapped device Toshi Kani
@ 2016-06-13 22:21 ` Toshi Kani
  2016-06-13 22:21 ` [PATCH 5/6] dm, dm-linear: Add dax_supported to dm_target Toshi Kani
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 40+ messages in thread
From: Toshi Kani @ 2016-06-13 22:21 UTC (permalink / raw)
  To: agk, snitzer, dan.j.williams
  Cc: ross.zwisler, viro, axboe, toshi.kani, linux-nvdimm, dm-devel,
	linux-raid, linux-kernel

Change dm-linear to implement direct_access function,
linear_direct_access(), which maps sector and calls
direct_access function of its target device.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 drivers/md/dm-linear.c |   17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 05c35aa..49bd7d2 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -141,6 +141,22 @@ static int linear_iterate_devices(struct dm_target *ti,
 	return fn(ti, lc->dev, lc->start, ti->len, data);
 }
 
+static long linear_direct_access(struct dm_target *ti, sector_t sector,
+		void __pmem **kaddr, pfn_t *pfn, long size)
+{
+	struct linear_c *lc;
+	struct block_device *tbdev;
+	const struct block_device_operations *tops;
+	sector_t tsector;
+
+	lc = ti->private;
+	tbdev = lc->dev->bdev;
+	tops = tbdev->bd_disk->fops;
+	tsector = linear_map_sector(ti, sector);
+
+	return tops->direct_access(tbdev, tsector, kaddr, pfn, size);
+}
+
 static struct target_type linear_target = {
 	.name   = "linear",
 	.version = {1, 2, 1},
@@ -151,6 +167,7 @@ static struct target_type linear_target = {
 	.status = linear_status,
 	.prepare_ioctl = linear_prepare_ioctl,
 	.iterate_devices = linear_iterate_devices,
+	.direct_access = linear_direct_access,
 };
 
 int __init dm_linear_init(void)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 5/6] dm, dm-linear: Add dax_supported to dm_target
  2016-06-13 22:21 [PATCH 0/6] Support DAX for device-mapper dm-linear devices Toshi Kani
                   ` (3 preceding siblings ...)
  2016-06-13 22:21 ` [PATCH 4/6] dm-linear: Add linear_direct_access() Toshi Kani
@ 2016-06-13 22:21 ` Toshi Kani
  2016-06-13 22:21 ` [PATCH 6/6] dm: Enable DAX support for mapper device Toshi Kani
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 40+ messages in thread
From: Toshi Kani @ 2016-06-13 22:21 UTC (permalink / raw)
  To: agk, snitzer, dan.j.williams
  Cc: ross.zwisler, viro, axboe, toshi.kani, linux-nvdimm, dm-devel,
	linux-raid, linux-kernel

Extend 'struct dm_target' to have dax_supported bit, which allows
dm-table to check if a dm-target supports DAX.

Change dm-linear to set this bit when its target physical device
supports DAX.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 drivers/md/dm-linear.c        |    2 ++
 include/linux/device-mapper.h |    5 +++++
 2 files changed, 7 insertions(+)

diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 49bd7d2..6fdbbc8 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -59,6 +59,8 @@ static int linear_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 	ti->num_flush_bios = 1;
 	ti->num_discard_bios = 1;
 	ti->num_write_same_bios = 1;
+	if (lc->dev->bdev->bd_disk->flags & GENHD_FL_DAX)
+		ti->dax_supported = 1;
 	ti->private = lc;
 	return 0;
 
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 16e6c8c..9b72989 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -290,6 +290,11 @@ struct dm_target {
 	 * Set if this target does not return zeroes on discarded blocks.
 	 */
 	bool discard_zeroes_data_unsupported:1;
+
+	/*
+	 * Set if the target supports DAX (direct access).
+	 */
+	bool dax_supported:1;
 };
 
 /* Each target can link one of these into the table */

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 6/6] dm: Enable DAX support for mapper device
  2016-06-13 22:21 [PATCH 0/6] Support DAX for device-mapper dm-linear devices Toshi Kani
                   ` (4 preceding siblings ...)
  2016-06-13 22:21 ` [PATCH 5/6] dm, dm-linear: Add dax_supported to dm_target Toshi Kani
@ 2016-06-13 22:21 ` Toshi Kani
  2016-06-13 22:57 ` [PATCH 0/6] Support DAX for device-mapper dm-linear devices Mike Snitzer
  2016-06-13 23:18 ` Dan Williams
  7 siblings, 0 replies; 40+ messages in thread
From: Toshi Kani @ 2016-06-13 22:21 UTC (permalink / raw)
  To: agk, snitzer, dan.j.williams
  Cc: ross.zwisler, viro, axboe, toshi.kani, linux-nvdimm, dm-devel,
	linux-raid, linux-kernel

Add a new dm type, DM_TYPE_DAX_BIO_BASED, which indicates
that mapped device supports DAX and is bio based.  This new
type is used to assure that all target devices have DAX support
and remain that way after GENHD_FL_DAX is set to mapped device.

At initial table load, GENHD_FL_DAX is set to mapped device
when setting DM_TYPE_DAX_BIO_BASED to the type.  Any subsequent
table load to the mapped device must have the same type, or else
it fails per the check in table_load().

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 drivers/md/dm-table.c |   12 +++++++++---
 drivers/md/dm.c       |    9 +++++++--
 drivers/md/dm.h       |    7 +++++++
 3 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 626a5ec..81138e7 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -833,7 +833,7 @@ static bool __table_type_request_based(unsigned table_type)
 static int dm_table_set_type(struct dm_table *t)
 {
 	unsigned i;
-	unsigned bio_based = 0, request_based = 0, hybrid = 0;
+	unsigned bio_based = 0, request_based = 0, hybrid = 0, num_dax = 0;
 	bool use_blk_mq = false;
 	struct dm_target *tgt;
 	struct dm_dev_internal *dd;
@@ -849,6 +849,9 @@ static int dm_table_set_type(struct dm_table *t)
 		else
 			bio_based = 1;
 
+		if (tgt->dax_supported)
+			num_dax++;
+
 		if (bio_based && request_based) {
 			DMWARN("Inconsistent table: different target types"
 			       " can't be mixed up");
@@ -870,7 +873,10 @@ static int dm_table_set_type(struct dm_table *t)
 
 	if (bio_based) {
 		/* We must use this table as bio-based */
-		t->type = DM_TYPE_BIO_BASED;
+		if (num_dax && num_dax == t->num_targets)
+			t->type = DM_TYPE_DAX_BIO_BASED;
+		else
+			t->type = DM_TYPE_BIO_BASED;
 		return 0;
 	}
 
@@ -978,7 +984,7 @@ static int dm_table_alloc_md_mempools(struct dm_table *t, struct mapped_device *
 		return -EINVAL;
 	}
 
-	if (type == DM_TYPE_BIO_BASED)
+	if (dm_type_bio_based(type))
 		for (i = 0; i < t->num_targets; i++) {
 			tgt = t->targets + i;
 			per_io_data_size = max(per_io_data_size, tgt->per_io_data_size);
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 6e9f958..41b8912 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2492,7 +2492,7 @@ static void __bind_mempools(struct mapped_device *md, struct dm_table *t)
 
 	if (md->bs) {
 		/* The md already has necessary mempools. */
-		if (dm_table_get_type(t) == DM_TYPE_BIO_BASED) {
+		if (dm_type_bio_based(dm_table_get_type(t))) {
 			/*
 			 * Reload bioset because front_pad may have changed
 			 * because a different table was loaded.
@@ -2659,6 +2659,9 @@ void dm_set_md_type(struct mapped_device *md, unsigned type)
 {
 	BUG_ON(!mutex_is_locked(&md->type_lock));
 	md->type = type;
+
+	if (type == DM_TYPE_DAX_BIO_BASED)
+		dm_disk(md)->flags |= GENHD_FL_DAX;
 }
 
 unsigned dm_get_md_type(struct mapped_device *md)
@@ -2837,7 +2840,7 @@ out_kfree_tag_set:
 
 static unsigned filter_md_type(unsigned type, struct mapped_device *md)
 {
-	if (type == DM_TYPE_BIO_BASED)
+	if (dm_type_bio_based(type))
 		return type;
 
 	return !md->use_blk_mq ? DM_TYPE_REQUEST_BASED : DM_TYPE_MQ_REQUEST_BASED;
@@ -2867,6 +2870,7 @@ int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t)
 		}
 		break;
 	case DM_TYPE_BIO_BASED:
+	case DM_TYPE_DAX_BIO_BASED:
 		dm_init_normal_md_queue(md);
 		blk_queue_make_request(md->queue, dm_make_request);
 		/*
@@ -3573,6 +3577,7 @@ struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, unsigned t
 
 	switch (type) {
 	case DM_TYPE_BIO_BASED:
+	case DM_TYPE_DAX_BIO_BASED:
 		cachep = _io_cache;
 		pool_size = dm_get_reserved_bio_based_ios();
 		front_pad = roundup(per_io_data_size, __alignof__(struct dm_target_io)) + offsetof(struct dm_target_io, clone);
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index 13a758e..6d22667 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -39,6 +39,13 @@
 #define DM_TYPE_BIO_BASED		1
 #define DM_TYPE_REQUEST_BASED		2
 #define DM_TYPE_MQ_REQUEST_BASED	3
+#define DM_TYPE_DAX_BIO_BASED		4
+
+/*
+ * To check whether the DM type is bio-based or not.
+ */
+#define dm_type_bio_based(type) (((type) == DM_TYPE_BIO_BASED) || \
+				 ((type) == DM_TYPE_DAX_BIO_BASED))
 
 /*
  * List of devices that a metadevice uses and should open/close.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/6] Support DAX for device-mapper dm-linear devices
  2016-06-13 22:21 [PATCH 0/6] Support DAX for device-mapper dm-linear devices Toshi Kani
                   ` (5 preceding siblings ...)
  2016-06-13 22:21 ` [PATCH 6/6] dm: Enable DAX support for mapper device Toshi Kani
@ 2016-06-13 22:57 ` Mike Snitzer
  2016-06-20 18:00   ` Mike Snitzer
  2016-06-13 23:18 ` Dan Williams
  7 siblings, 1 reply; 40+ messages in thread
From: Mike Snitzer @ 2016-06-13 22:57 UTC (permalink / raw)
  To: Toshi Kani
  Cc: agk, dan.j.williams, ross.zwisler, viro, axboe, linux-nvdimm,
	dm-devel, linux-raid, linux-kernel

On Mon, Jun 13 2016 at  6:21pm -0400,
Toshi Kani <toshi.kani@hpe.com> wrote:

> This patch-set adds DAX support to device-mapper dm-linear devices
> used by LVM.  It works with LVM commands as follows:
>  - Creation of a logical volume with all DAX capable devices (such
>    as pmem) sets the logical volume DAX capable as well.
>  - Once a logical volume is set to DAX capable, the volume may not
>    be extended with non-DAX capable devices.
> 
> The direct_access interface is added to dm and dm-linear to map
> a request to a target device.
> 
>  - Patch 1-2 introduce GENHD_FL_DAX flag to indicate DAX capability.
>  - Patch 3-4 add direct_access functions to dm and dm-linear.
>  - Patch 5-6 set GENHD_FL_DAX to dm when all targets are DAX capable.
> 
> ---
> Toshi Kani (6):
>  1/6 genhd: Add GENHD_FL_DAX to gendisk flags
>  2/6 block: Check GENHD_FL_DAX for DAX capability
>  3/6 dm: Add dm_blk_direct_access() for mapped device
>  4/6 dm-linear: Add linear_direct_access()
>  5/6 dm, dm-linear: Add dax_supported to dm_target
>  6/6 dm: Enable DAX support for mapper device

Thanks a lot for doing this.  I recently added it to my TODO so your
patches come at a great time.

I'll try to get to reviewing/testing your work by the end of this week.

Mike

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/6] Support DAX for device-mapper dm-linear devices
  2016-06-13 22:21 [PATCH 0/6] Support DAX for device-mapper dm-linear devices Toshi Kani
                   ` (6 preceding siblings ...)
  2016-06-13 22:57 ` [PATCH 0/6] Support DAX for device-mapper dm-linear devices Mike Snitzer
@ 2016-06-13 23:18 ` Dan Williams
  2016-06-13 23:59   ` Kani, Toshimitsu
  7 siblings, 1 reply; 40+ messages in thread
From: Dan Williams @ 2016-06-13 23:18 UTC (permalink / raw)
  To: Toshi Kani
  Cc: Alasdair Kergon, Mike Snitzer, Ross Zwisler, Al Viro, Jens Axboe,
	linux-nvdimm@lists.01.org, dm-devel, linux-raid, linux-kernel

Thanks Toshi!

On Mon, Jun 13, 2016 at 3:21 PM, Toshi Kani <toshi.kani@hpe.com> wrote:
> This patch-set adds DAX support to device-mapper dm-linear devices
> used by LVM.  It works with LVM commands as follows:
>  - Creation of a logical volume with all DAX capable devices (such
>    as pmem) sets the logical volume DAX capable as well.
>  - Once a logical volume is set to DAX capable, the volume may not
>    be extended with non-DAX capable devices.

I don't mind this, but it seems a policy decision that the kernel does
not need to make.  A sufficiently sophisticated user could cope with
DAX being available at varying LBAs.  Would it be sufficient to move
this policy decision to userspace tooling?

> The direct_access interface is added to dm and dm-linear to map
> a request to a target device.

I had dm-linear and md-raid0 support on my list of things to look at,
did you have raid0 in your plans?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/6] Support DAX for device-mapper dm-linear devices
  2016-06-13 23:18 ` Dan Williams
@ 2016-06-13 23:59   ` Kani, Toshimitsu
  2016-06-14  0:02     ` Dan Williams
  2016-06-14 13:50     ` Jeff Moyer
  0 siblings, 2 replies; 40+ messages in thread
From: Kani, Toshimitsu @ 2016-06-13 23:59 UTC (permalink / raw)
  To: dan.j.williams
  Cc: linux-kernel, agk, linux-raid, snitzer, viro, axboe,
	linux-nvdimm@lists.01.org, ross.zwisler, dm-devel

On Mon, 2016-06-13 at 16:18 -0700, Dan Williams wrote:
> Thanks Toshi!
> 
> On Mon, Jun 13, 2016 at 3:21 PM, Toshi Kani <toshi.kani@hpe.com> wrote:
> > 
> > This patch-set adds DAX support to device-mapper dm-linear devices
> > used by LVM.  It works with LVM commands as follows:
> >  - Creation of a logical volume with all DAX capable devices (such
> >    as pmem) sets the logical volume DAX capable as well.
> >  - Once a logical volume is set to DAX capable, the volume may not
> >    be extended with non-DAX capable devices.
>
> I don't mind this, but it seems a policy decision that the kernel does
> not need to make.  A sufficiently sophisticated user could cope with
> DAX being available at varying LBAs.  Would it be sufficient to move
> this policy decision to userspace tooling?

I think this is a kernel restriction.  When a block device is declared as
DAX capable, it should mean that the whole device is DAX capable.  So, I
think we need to assure the same to a mapped device.

In LVM level, a volume group may contain both DAX and non-DAX capable
devices.  There is no restriction for creating/extending a volume group.

> > The direct_access interface is added to dm and dm-linear to map
> > a request to a target device.
>
> I had dm-linear and md-raid0 support on my list of things to look at,
> did you have raid0 in your plans?

Yes, I hope to extend further and raid0 is a good candidate.   

Thanks,
-Toshi

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/6] Support DAX for device-mapper dm-linear devices
  2016-06-13 23:59   ` Kani, Toshimitsu
@ 2016-06-14  0:02     ` Dan Williams
  2016-06-14  7:30       ` Dan Williams
  2016-06-14 13:50     ` Jeff Moyer
  1 sibling, 1 reply; 40+ messages in thread
From: Dan Williams @ 2016-06-14  0:02 UTC (permalink / raw)
  To: Kani, Toshimitsu
  Cc: linux-kernel, agk, linux-raid, snitzer, viro, axboe,
	linux-nvdimm@lists.01.org, ross.zwisler, dm-devel

On Mon, Jun 13, 2016 at 4:59 PM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> On Mon, 2016-06-13 at 16:18 -0700, Dan Williams wrote:
>> Thanks Toshi!
>>
>> On Mon, Jun 13, 2016 at 3:21 PM, Toshi Kani <toshi.kani@hpe.com> wrote:
>> >
>> > This patch-set adds DAX support to device-mapper dm-linear devices
>> > used by LVM.  It works with LVM commands as follows:
>> >  - Creation of a logical volume with all DAX capable devices (such
>> >    as pmem) sets the logical volume DAX capable as well.
>> >  - Once a logical volume is set to DAX capable, the volume may not
>> >    be extended with non-DAX capable devices.
>>
>> I don't mind this, but it seems a policy decision that the kernel does
>> not need to make.  A sufficiently sophisticated user could cope with
>> DAX being available at varying LBAs.  Would it be sufficient to move
>> this policy decision to userspace tooling?
>
> I think this is a kernel restriction.  When a block device is declared as
> DAX capable, it should mean that the whole device is DAX capable.  So, I
> think we need to assure the same to a mapped device.

Hmm, but we already violate this with badblocks.  The device is DAX
capable, but certain LBAs will return an error if direct_access is
attempted.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/6] Support DAX for device-mapper dm-linear devices
  2016-06-14  0:02     ` Dan Williams
@ 2016-06-14  7:30       ` Dan Williams
  0 siblings, 0 replies; 40+ messages in thread
From: Dan Williams @ 2016-06-14  7:30 UTC (permalink / raw)
  To: Kani, Toshimitsu
  Cc: linux-kernel, agk, linux-raid, snitzer, viro, axboe,
	linux-nvdimm@lists.01.org, ross.zwisler, dm-devel

On Mon, Jun 13, 2016 at 5:02 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Mon, Jun 13, 2016 at 4:59 PM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
>> On Mon, 2016-06-13 at 16:18 -0700, Dan Williams wrote:
>>> Thanks Toshi!
>>>
>>> On Mon, Jun 13, 2016 at 3:21 PM, Toshi Kani <toshi.kani@hpe.com> wrote:
>>> >
>>> > This patch-set adds DAX support to device-mapper dm-linear devices
>>> > used by LVM.  It works with LVM commands as follows:
>>> >  - Creation of a logical volume with all DAX capable devices (such
>>> >    as pmem) sets the logical volume DAX capable as well.
>>> >  - Once a logical volume is set to DAX capable, the volume may not
>>> >    be extended with non-DAX capable devices.
>>>
>>> I don't mind this, but it seems a policy decision that the kernel does
>>> not need to make.  A sufficiently sophisticated user could cope with
>>> DAX being available at varying LBAs.  Would it be sufficient to move
>>> this policy decision to userspace tooling?
>>
>> I think this is a kernel restriction.  When a block device is declared as
>> DAX capable, it should mean that the whole device is DAX capable.  So, I
>> think we need to assure the same to a mapped device.
>
> Hmm, but we already violate this with badblocks.  The device is DAX
> capable, but certain LBAs will return an error if direct_access is
> attempted.

Nevermind, for this to be useful we would need to fallback to regular
mmap for a portion of the linear span.  That's different than the
badblocks case.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/6] Support DAX for device-mapper dm-linear devices
  2016-06-13 23:59   ` Kani, Toshimitsu
  2016-06-14  0:02     ` Dan Williams
@ 2016-06-14 13:50     ` Jeff Moyer
  2016-06-14 15:41       ` Mike Snitzer
  2016-06-14 15:53       ` Kani, Toshimitsu
  1 sibling, 2 replies; 40+ messages in thread
From: Jeff Moyer @ 2016-06-14 13:50 UTC (permalink / raw)
  To: Kani, Toshimitsu
  Cc: dan.j.williams, linux-kernel, agk, linux-raid, snitzer, viro,
	axboe, linux-nvdimm@lists.01.org, ross.zwisler, dm-devel

"Kani, Toshimitsu" <toshi.kani@hpe.com> writes:

>> I had dm-linear and md-raid0 support on my list of things to look at,
>> did you have raid0 in your plans?
>
> Yes, I hope to extend further and raid0 is a good candidate.   

dm-flakey would allow more xfstests test cases to run.  I'd say that's
more important than linear or raid0.  ;-)

Also, the next step in this work is to then decide how to determine on
what numa node an LBA resides.  We had discussed this at a prior
plumbers conference, and I think the consensus was to use xattrs.
Toshi, do you also plan to do that work?

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/6] Support DAX for device-mapper dm-linear devices
  2016-06-14 13:50     ` Jeff Moyer
@ 2016-06-14 15:41       ` Mike Snitzer
  2016-06-14 18:00         ` Kani, Toshimitsu
  2016-06-14 20:19         ` Jeff Moyer
  2016-06-14 15:53       ` Kani, Toshimitsu
  1 sibling, 2 replies; 40+ messages in thread
From: Mike Snitzer @ 2016-06-14 15:41 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Kani, Toshimitsu, axboe, linux-nvdimm@lists.01.org, linux-kernel,
	linux-raid, dm-devel, viro, dan.j.williams, ross.zwisler, agk

On Tue, Jun 14 2016 at  9:50am -0400,
Jeff Moyer <jmoyer@redhat.com> wrote:

> "Kani, Toshimitsu" <toshi.kani@hpe.com> writes:
> 
> >> I had dm-linear and md-raid0 support on my list of things to look at,
> >> did you have raid0 in your plans?
> >
> > Yes, I hope to extend further and raid0 is a good candidate.   
> 
> dm-flakey would allow more xfstests test cases to run.  I'd say that's
> more important than linear or raid0.  ;-)

Regardless of which target(s) grow DAX support the most pressing initial
concern is getting the DM device stacking correct.  And verifying that
IO that cross pmem device boundaries are being properly split by DM
core (via drivers/md/dm.c:__split_and_process_non_flush()'s call to
max_io_len).

My hope is to nail down the DM core and its dependencies in block etc.
Doing so in terms of dm-linear doesn't seem like wasted effort
considering you told me it'd be useful to have for pmem devices.
 
> Also, the next step in this work is to then decide how to determine on
> what numa node an LBA resides.  We had discussed this at a prior
> plumbers conference, and I think the consensus was to use xattrs.
> Toshi, do you also plan to do that work?

How does the associated NUMA node relate to this?  Does the
DM requests_queue need to be setup to only allocate from the NUMA node
the pmem device is attached to?  I recently added support for this to
DM.  But there will likely be some code need to propagate the NUMA node
id accordingly.

Mike

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/6] Support DAX for device-mapper dm-linear devices
  2016-06-14 13:50     ` Jeff Moyer
  2016-06-14 15:41       ` Mike Snitzer
@ 2016-06-14 15:53       ` Kani, Toshimitsu
  1 sibling, 0 replies; 40+ messages in thread
From: Kani, Toshimitsu @ 2016-06-14 15:53 UTC (permalink / raw)
  To: jmoyer
  Cc: linux-kernel, linux-nvdimm, agk, linux-raid, snitzer, viro,
	dan.j.williams, axboe, ross.zwisler, dm-devel

On Tue, 2016-06-14 at 09:50 -0400, Jeff Moyer wrote:
> "Kani, Toshimitsu" <toshi.kani@hpe.com> writes:
> > > I had dm-linear and md-raid0 support on my list of things to look at,
> > > did you have raid0 in your plans?
> >
> > Yes, I hope to extend further and raid0 is a good candidate. 
>
> dm-flakey would allow more xfstests test cases to run.  I'd say that's
> more important than linear or raid0.  ;-)

That's an interesting one.  We can emulate badblocks by failing
direct_access with -EIO, but I do not think we can emulate "drop_writes" and
"corrupt_bio_byte" features with DAX...

> Also, the next step in this work is to then decide how to determine on
> what numa node an LBA resides.  We had discussed this at a prior
> plumbers conference, and I think the consensus was to use xattrs.
> Toshi, do you also plan to do that work?

No, it's not my plan at this point.

Thanks,
-Toshi

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/6] Support DAX for device-mapper dm-linear devices
  2016-06-14 15:41       ` Mike Snitzer
@ 2016-06-14 18:00         ` Kani, Toshimitsu
  2016-06-14 20:19         ` Jeff Moyer
  1 sibling, 0 replies; 40+ messages in thread
From: Kani, Toshimitsu @ 2016-06-14 18:00 UTC (permalink / raw)
  To: jmoyer, snitzer
  Cc: linux-kernel, linux-nvdimm, agk, linux-raid, viro,
	dan.j.williams, axboe, ross.zwisler, dm-devel

On Tue, 2016-06-14 at 11:41 -0400, Mike Snitzer wrote:
> On Tue, Jun 14 2016 at  9:50am -0400,
> Jeff Moyer <jmoyer@redhat.com> wrote:
> > "Kani, Toshimitsu" <toshi.kani@hpe.com> writes:
> > > > I had dm-linear and md-raid0 support on my list of things to look
> > > > at, did you have raid0 in your plans?
> > >
> > > Yes, I hope to extend further and raid0 is a good candidate. 
> >   
> > dm-flakey would allow more xfstests test cases to run.  I'd say that's
> > more important than linear or raid0.  ;-)
>
> Regardless of which target(s) grow DAX support the most pressing initial
> concern is getting the DM device stacking correct.  And verifying that
> IO that cross pmem device boundaries are being properly split by DM
> core (via drivers/md/dm.c:__split_and_process_non_flush()'s call to
> max_io_len).

Agreed. I've briefly tested stacking and it seems working fine.  As for IO
crossing pmem device boundaries, __split_and_process_non_flush() is used
when the device is mounted without DAX option.  With DAX, this case is
handled by dm_blk_direct_access() that limits return size.  This leads the
caller to iterate (read/write) or fallback to a smaller size (mmap pfault).

> My hope is to nail down the DM core and its dependencies in block etc.
> Doing so in terms of dm-linear doesn't seem like wasted effort
> considering you told me it'd be useful to have for pmem devices.

Yes, I think dm-linear is useful as it gives more flexibility, ex. it allows
creating a large device with multiple pmem devices.

> > Also, the next step in this work is to then decide how to determine on
> > what numa node an LBA resides.  We had discussed this at a prior
> > plumbers conference, and I think the consensus was to use xattrs.
> > Toshi, do you also plan to do that work?
>
> How does the associated NUMA node relate to this?  Does the
> DM requests_queue need to be setup to only allocate from the NUMA node
> the pmem device is attached to?  I recently added support for this to
> DM.  But there will likely be some code need to propagate the NUMA node
> id accordingly.

Each pmem device has sysfs "numa_node" so that tools like numactl can be
used to bind application to run on the same locality as pmem device (since
CPU directly accesses to pmem).  This won't work well with mapped device
since it can be composed with multiple localities.  Locality info would need
to be managed file-basis as Jeff mentioned.

Thanks,
-Toshi

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/6] Support DAX for device-mapper dm-linear devices
  2016-06-14 15:41       ` Mike Snitzer
  2016-06-14 18:00         ` Kani, Toshimitsu
@ 2016-06-14 20:19         ` Jeff Moyer
  2016-06-15  1:46           ` Mike Snitzer
  1 sibling, 1 reply; 40+ messages in thread
From: Jeff Moyer @ 2016-06-14 20:19 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Kani, Toshimitsu, axboe, linux-nvdimm@lists.01.org, linux-kernel,
	linux-raid, dm-devel, viro, dan.j.williams, ross.zwisler, agk

Mike Snitzer <snitzer@redhat.com> writes:

> On Tue, Jun 14 2016 at  9:50am -0400,
> Jeff Moyer <jmoyer@redhat.com> wrote:
>
>> "Kani, Toshimitsu" <toshi.kani@hpe.com> writes:
>> 
>> >> I had dm-linear and md-raid0 support on my list of things to look at,
>> >> did you have raid0 in your plans?
>> >
>> > Yes, I hope to extend further and raid0 is a good candidate.   
>> 
>> dm-flakey would allow more xfstests test cases to run.  I'd say that's
>> more important than linear or raid0.  ;-)
>
> Regardless of which target(s) grow DAX support the most pressing initial
> concern is getting the DM device stacking correct.  And verifying that
> IO that cross pmem device boundaries are being properly split by DM
> core (via drivers/md/dm.c:__split_and_process_non_flush()'s call to
> max_io_len).

That was a tongue-in-cheek comment.  You're reading way too much into
it.

>> Also, the next step in this work is to then decide how to determine on
>> what numa node an LBA resides.  We had discussed this at a prior
>> plumbers conference, and I think the consensus was to use xattrs.
>> Toshi, do you also plan to do that work?
>
> How does the associated NUMA node relate to this?  Does the
> DM requests_queue need to be setup to only allocate from the NUMA node
> the pmem device is attached to?  I recently added support for this to
> DM.  But there will likely be some code need to propagate the NUMA node
> id accordingly.

I assume you mean allocate memory (the volatile kind).  That should work
the same between pmem and regular block devices, no?

What I was getting at was that applications may want to know on which
node their data resides.  Right now, it's easy to tell because a single
device cannot span numa nodes, or, if it does, it does so via an
interleave, so numa information isn't interesting.  However, once data
on a single file system can be placed on multiple different numa nodes,
applications may want to query and/or control that placement.

Here's a snippet from a blog post I never finished:

There are two essential questions that need to be answered regarding
persistent memory and NUMA: first, would an application benefit from
being able to query the NUMA locality of its data, and second, would
an application benefit from being able to specify a placement policy
for its data?  This article is an attempt to summarize the current
state of hardware and software in order to consider the above two
questions.  We begin with a short list of use cases for these
interfaces, which will frame the discussion.

First, let's consider an interface that allows an application to query
the NUMA placement of existing data.  With such information, an
application may want to perform the following actions:

- relocate application processes to the same NUMA node as their data.
  (Interfaces for moving a process are readily available.)
- specify a memory (RAM) allocation policy so that memory allocations
  come from the same NUMA node as the data.

Second, we consider an interface that allows an application to specify
a placement policy for new data.  Using this interface, an application
may:

- ensure data is stored on the same NUMA node as the one on which the
  application is running
- ensure data is stored on the same NUMA node as an I/O adapter such
  as a network card, that is a producer of data stored to NVM.
- ensure data is stored on a different NUMA node:
  - so that the data is stored on the same NUMA node as related data
  - because the data does not need the faster access afforded by local
    NUMA placement.  Presumably this is a trade-off, and other data
    will require local placement to meet the performance goals of the
    application.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/6] Support DAX for device-mapper dm-linear devices
  2016-06-14 20:19         ` Jeff Moyer
@ 2016-06-15  1:46           ` Mike Snitzer
  2016-06-15  2:07             ` Dan Williams
  0 siblings, 1 reply; 40+ messages in thread
From: Mike Snitzer @ 2016-06-15  1:46 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Kani, Toshimitsu, axboe, linux-nvdimm@lists.01.org, linux-kernel,
	linux-raid, dm-devel, viro, dan.j.williams, ross.zwisler, agk

On Tue, Jun 14 2016 at  4:19pm -0400,
Jeff Moyer <jmoyer@redhat.com> wrote:

> Mike Snitzer <snitzer@redhat.com> writes:
> 
> > On Tue, Jun 14 2016 at  9:50am -0400,
> > Jeff Moyer <jmoyer@redhat.com> wrote:
> >
> >> "Kani, Toshimitsu" <toshi.kani@hpe.com> writes:
> >> 
> >> >> I had dm-linear and md-raid0 support on my list of things to look at,
> >> >> did you have raid0 in your plans?
> >> >
> >> > Yes, I hope to extend further and raid0 is a good candidate.   
> >> 
> >> dm-flakey would allow more xfstests test cases to run.  I'd say that's
> >> more important than linear or raid0.  ;-)
> >
> > Regardless of which target(s) grow DAX support the most pressing initial
> > concern is getting the DM device stacking correct.  And verifying that
> > IO that cross pmem device boundaries are being properly split by DM
> > core (via drivers/md/dm.c:__split_and_process_non_flush()'s call to
> > max_io_len).
> 
> That was a tongue-in-cheek comment.  You're reading way too much into
> it.
> 
> >> Also, the next step in this work is to then decide how to determine on
> >> what numa node an LBA resides.  We had discussed this at a prior
> >> plumbers conference, and I think the consensus was to use xattrs.
> >> Toshi, do you also plan to do that work?
> >
> > How does the associated NUMA node relate to this?  Does the
> > DM requests_queue need to be setup to only allocate from the NUMA node
> > the pmem device is attached to?  I recently added support for this to
> > DM.  But there will likely be some code need to propagate the NUMA node
> > id accordingly.
> 
> I assume you mean allocate memory (the volatile kind).  That should work
> the same between pmem and regular block devices, no?

This is the commit I made to train DM to be numa node aware:
115485e83f497fdf9b4 ("dm: add 'dm_numa_node' module parameter")

As is the DM code is focused on memory allocations.  But I think blk-mq
may use the NUMA node for via tag_set->numa_node.  But that is moot
given pmem is bio-based right?

Steps could be taken to make all threads DM creates for a a given device
get pinned to the specified NUMA node too.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/6] Support DAX for device-mapper dm-linear devices
  2016-06-15  1:46           ` Mike Snitzer
@ 2016-06-15  2:07             ` Dan Williams
  2016-06-15  2:35               ` Mike Snitzer
  0 siblings, 1 reply; 40+ messages in thread
From: Dan Williams @ 2016-06-15  2:07 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jeff Moyer, Kani, Toshimitsu, axboe, linux-nvdimm@lists.01.org,
	linux-kernel, linux-raid, dm-devel, viro, ross.zwisler, agk

On Tue, Jun 14, 2016 at 6:46 PM, Mike Snitzer <snitzer@redhat.com> wrote:
> On Tue, Jun 14 2016 at  4:19pm -0400,
> Jeff Moyer <jmoyer@redhat.com> wrote:
>
>> Mike Snitzer <snitzer@redhat.com> writes:
>>
>> > On Tue, Jun 14 2016 at  9:50am -0400,
>> > Jeff Moyer <jmoyer@redhat.com> wrote:
>> >
>> >> "Kani, Toshimitsu" <toshi.kani@hpe.com> writes:
>> >>
>> >> >> I had dm-linear and md-raid0 support on my list of things to look at,
>> >> >> did you have raid0 in your plans?
>> >> >
>> >> > Yes, I hope to extend further and raid0 is a good candidate.
>> >>
>> >> dm-flakey would allow more xfstests test cases to run.  I'd say that's
>> >> more important than linear or raid0.  ;-)
>> >
>> > Regardless of which target(s) grow DAX support the most pressing initial
>> > concern is getting the DM device stacking correct.  And verifying that
>> > IO that cross pmem device boundaries are being properly split by DM
>> > core (via drivers/md/dm.c:__split_and_process_non_flush()'s call to
>> > max_io_len).
>>
>> That was a tongue-in-cheek comment.  You're reading way too much into
>> it.
>>
>> >> Also, the next step in this work is to then decide how to determine on
>> >> what numa node an LBA resides.  We had discussed this at a prior
>> >> plumbers conference, and I think the consensus was to use xattrs.
>> >> Toshi, do you also plan to do that work?
>> >
>> > How does the associated NUMA node relate to this?  Does the
>> > DM requests_queue need to be setup to only allocate from the NUMA node
>> > the pmem device is attached to?  I recently added support for this to
>> > DM.  But there will likely be some code need to propagate the NUMA node
>> > id accordingly.
>>
>> I assume you mean allocate memory (the volatile kind).  That should work
>> the same between pmem and regular block devices, no?
>
> This is the commit I made to train DM to be numa node aware:
> 115485e83f497fdf9b4 ("dm: add 'dm_numa_node' module parameter")

Hmm, but this is global for all DM device instances.

> As is the DM code is focused on memory allocations.  But I think blk-mq
> may use the NUMA node for via tag_set->numa_node.  But that is moot
> given pmem is bio-based right?

Right.

>
> Steps could be taken to make all threads DM creates for a a given device
> get pinned to the specified NUMA node too.

I think it would be useful if a DM instance inherited the numa node
from the component devices by default (assuming they're all from the
same node).  A "dev_to_node(disk_to_dev(disk))" conversion works for
pmem devices.

As far as I understand, Jeff wants to go further and have a linear
span across component devices from different nodes with an interface
to do an LBA-to-numa-node conversion.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/6] Support DAX for device-mapper dm-linear devices
  2016-06-15  2:07             ` Dan Williams
@ 2016-06-15  2:35               ` Mike Snitzer
  0 siblings, 0 replies; 40+ messages in thread
From: Mike Snitzer @ 2016-06-15  2:35 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jeff Moyer, Kani, Toshimitsu, axboe, linux-nvdimm@lists.01.org,
	linux-kernel, linux-raid, dm-devel, viro, ross.zwisler, agk

On Tue, Jun 14 2016 at 10:07pm -0400,
Dan Williams <dan.j.williams@intel.com> wrote:

> On Tue, Jun 14, 2016 at 6:46 PM, Mike Snitzer <snitzer@redhat.com> wrote:
> > On Tue, Jun 14 2016 at  4:19pm -0400,
> > Jeff Moyer <jmoyer@redhat.com> wrote:
> >
> >> Mike Snitzer <snitzer@redhat.com> writes:
> >>
> >> > On Tue, Jun 14 2016 at  9:50am -0400,
> >> > Jeff Moyer <jmoyer@redhat.com> wrote:
> >> >
> >> >> "Kani, Toshimitsu" <toshi.kani@hpe.com> writes:
> >> >>
> >> >> >> I had dm-linear and md-raid0 support on my list of things to look at,
> >> >> >> did you have raid0 in your plans?
> >> >> >
> >> >> > Yes, I hope to extend further and raid0 is a good candidate.
> >> >>
> >> >> dm-flakey would allow more xfstests test cases to run.  I'd say that's
> >> >> more important than linear or raid0.  ;-)
> >> >
> >> > Regardless of which target(s) grow DAX support the most pressing initial
> >> > concern is getting the DM device stacking correct.  And verifying that
> >> > IO that cross pmem device boundaries are being properly split by DM
> >> > core (via drivers/md/dm.c:__split_and_process_non_flush()'s call to
> >> > max_io_len).
> >>
> >> That was a tongue-in-cheek comment.  You're reading way too much into
> >> it.
> >>
> >> >> Also, the next step in this work is to then decide how to determine on
> >> >> what numa node an LBA resides.  We had discussed this at a prior
> >> >> plumbers conference, and I think the consensus was to use xattrs.
> >> >> Toshi, do you also plan to do that work?
> >> >
> >> > How does the associated NUMA node relate to this?  Does the
> >> > DM requests_queue need to be setup to only allocate from the NUMA node
> >> > the pmem device is attached to?  I recently added support for this to
> >> > DM.  But there will likely be some code need to propagate the NUMA node
> >> > id accordingly.
> >>
> >> I assume you mean allocate memory (the volatile kind).  That should work
> >> the same between pmem and regular block devices, no?
> >
> > This is the commit I made to train DM to be numa node aware:
> > 115485e83f497fdf9b4 ("dm: add 'dm_numa_node' module parameter")
> 
> Hmm, but this is global for all DM device instances.

Right, only because I didn't have a convenient way to allow the user to
specify it on a per-device level.  But I'll defer skinning that cat for
now since in this pmem case we'd inherit from the underlying device(s)

> > As is the DM code is focused on memory allocations.  But I think blk-mq
> > may use the NUMA node for via tag_set->numa_node.  But that is moot
> > given pmem is bio-based right?
> 
> Right.
> 
> >
> > Steps could be taken to make all threads DM creates for a a given device
> > get pinned to the specified NUMA node too.
> 
> I think it would be useful if a DM instance inherited the numa node
> from the component devices by default (assuming they're all from the
> same node).  A "dev_to_node(disk_to_dev(disk))" conversion works for
> pmem devices.

OK, I can look to make that happen.
 
> As far as I understand, Jeff wants to go further and have a linear
> span across component devices from different nodes with an interface
> to do an LBA-to-numa-node conversion.

All that variability makes DM's ability to do anything sane with it
close to impossible considering memory pools, threads, etc are all
pinned during the first activation of the DM device.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/6] Support DAX for device-mapper dm-linear devices
  2016-06-13 22:57 ` [PATCH 0/6] Support DAX for device-mapper dm-linear devices Mike Snitzer
@ 2016-06-20 18:00   ` Mike Snitzer
  2016-06-20 18:31     ` Kani, Toshimitsu
  0 siblings, 1 reply; 40+ messages in thread
From: Mike Snitzer @ 2016-06-20 18:00 UTC (permalink / raw)
  To: Toshi Kani
  Cc: axboe, linux-nvdimm, linux-kernel, linux-raid, dm-devel, viro,
	dan.j.williams, ross.zwisler, agk

On Mon, Jun 13 2016 at  6:57pm -0400,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Mon, Jun 13 2016 at  6:21pm -0400,
> Toshi Kani <toshi.kani@hpe.com> wrote:
> 
> > This patch-set adds DAX support to device-mapper dm-linear devices
> > used by LVM.  It works with LVM commands as follows:
> >  - Creation of a logical volume with all DAX capable devices (such
> >    as pmem) sets the logical volume DAX capable as well.
> >  - Once a logical volume is set to DAX capable, the volume may not
> >    be extended with non-DAX capable devices.
> > 
> > The direct_access interface is added to dm and dm-linear to map
> > a request to a target device.
> > 
> >  - Patch 1-2 introduce GENHD_FL_DAX flag to indicate DAX capability.
> >  - Patch 3-4 add direct_access functions to dm and dm-linear.
> >  - Patch 5-6 set GENHD_FL_DAX to dm when all targets are DAX capable.
> > 
> > ---
> > Toshi Kani (6):
> >  1/6 genhd: Add GENHD_FL_DAX to gendisk flags
> >  2/6 block: Check GENHD_FL_DAX for DAX capability
> >  3/6 dm: Add dm_blk_direct_access() for mapped device
> >  4/6 dm-linear: Add linear_direct_access()
> >  5/6 dm, dm-linear: Add dax_supported to dm_target
> >  6/6 dm: Enable DAX support for mapper device
> 
> Thanks a lot for doing this.  I recently added it to my TODO so your
> patches come at a great time.
> 
> I'll try to get to reviewing/testing your work by the end of this week.

I rebased your patches on linux-dm.git's 'for-next' (which includes what
I've already staged for the 4.8 merge window).  And I folded/changed
some of the DM patches so that there are only 2 now (1 for DM core and 1
for dm-linear).  Please see the 4 topmost commits in my 'wip' here:

http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip

Feel free to pick these patches up to use as the basis for continued
work or re-posting of this set.. either that or I could post them as v2
on your behalf.

As for testing, I've verified that basic IO works to a pmem-based DM
linear device and that mixed table types are rejected as expected.

Mike

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/6] Support DAX for device-mapper dm-linear devices
  2016-06-20 18:00   ` Mike Snitzer
@ 2016-06-20 18:31     ` Kani, Toshimitsu
  2016-06-20 19:40       ` Mike Snitzer
  0 siblings, 1 reply; 40+ messages in thread
From: Kani, Toshimitsu @ 2016-06-20 18:31 UTC (permalink / raw)
  To: snitzer
  Cc: linux-kernel, linux-nvdimm, agk, linux-raid, viro,
	dan.j.williams, axboe, ross.zwisler, dm-devel

On Mon, 2016-06-20 at 14:00 -0400, Mike Snitzer wrote:
> On Mon, Jun 13 2016 at  6:57pm -0400,
> Mike Snitzer <snitzer@redhat.com> wrote:
> 
> > 
> > On Mon, Jun 13 2016 at  6:21pm -0400,
> > Toshi Kani <toshi.kani@hpe.com> wrote:
> > 
> > > 
> > > This patch-set adds DAX support to device-mapper dm-linear devices
> > > used by LVM.  It works with LVM commands as follows:
> > >  - Creation of a logical volume with all DAX capable devices (such
> > >    as pmem) sets the logical volume DAX capable as well.
> > >  - Once a logical volume is set to DAX capable, the volume may not
> > >    be extended with non-DAX capable devices.
> > > 
> > > The direct_access interface is added to dm and dm-linear to map
> > > a request to a target device.
> > > 
> > >  - Patch 1-2 introduce GENHD_FL_DAX flag to indicate DAX capability.
> > >  - Patch 3-4 add direct_access functions to dm and dm-linear.
> > >  - Patch 5-6 set GENHD_FL_DAX to dm when all targets are DAX capable.
> > > 
> > > ---
> > > Toshi Kani (6):
> > >  1/6 genhd: Add GENHD_FL_DAX to gendisk flags
> > >  2/6 block: Check GENHD_FL_DAX for DAX capability
> > >  3/6 dm: Add dm_blk_direct_access() for mapped device
> > >  4/6 dm-linear: Add linear_direct_access()
> > >  5/6 dm, dm-linear: Add dax_supported to dm_target
> > >  6/6 dm: Enable DAX support for mapper device
> > Thanks a lot for doing this.  I recently added it to my TODO so your
> > patches come at a great time.
> > 
> > I'll try to get to reviewing/testing your work by the end of this week.
>
> I rebased your patches on linux-dm.git's 'for-next' (which includes what
> I've already staged for the 4.8 merge window).  And I folded/changed
> some of the DM patches so that there are only 2 now (1 for DM core and 1
> for dm-linear).  Please see the 4 topmost commits in my 'wip' here:
> 
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip
> 
> Feel free to pick these patches up to use as the basis for continued
> work or re-posting of this set.. either that or I could post them as v2
> on your behalf.
> 
> As for testing, I've verified that basic IO works to a pmem-based DM
> linear device and that mixed table types are rejected as expected.

Great! I will send additional patch, add DAX support to dm-stripe, on top of
these once I finish my testing.

Thanks,
-Toshi

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/6] Support DAX for device-mapper dm-linear devices
  2016-06-20 18:31     ` Kani, Toshimitsu
@ 2016-06-20 19:40       ` Mike Snitzer
  2016-06-20 19:52         ` Mike Snitzer
  0 siblings, 1 reply; 40+ messages in thread
From: Mike Snitzer @ 2016-06-20 19:40 UTC (permalink / raw)
  To: Kani, Toshimitsu
  Cc: linux-kernel, linux-nvdimm, agk, linux-raid, viro,
	dan.j.williams, axboe, ross.zwisler, dm-devel, sandeen

On Mon, Jun 20 2016 at  2:31pm -0400,
Kani, Toshimitsu <toshi.kani@hpe.com> wrote:

> On Mon, 2016-06-20 at 14:00 -0400, Mike Snitzer wrote:
> >
> > I rebased your patches on linux-dm.git's 'for-next' (which includes what
> > I've already staged for the 4.8 merge window).  And I folded/changed
> > some of the DM patches so that there are only 2 now (1 for DM core and 1
> > for dm-linear).  Please see the 4 topmost commits in my 'wip' here:
> > 
> > http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip
> > 
> > Feel free to pick these patches up to use as the basis for continued
> > work or re-posting of this set.. either that or I could post them as v2
> > on your behalf.
> > 
> > As for testing, I've verified that basic IO works to a pmem-based DM
> > linear device and that mixed table types are rejected as expected.
> 
> Great! I will send additional patch, add DAX support to dm-stripe, on top of
> these once I finish my testing.

I did some further testing and am seeing some XFS corruption when
testing a DM linear device that spans multiple pmem devices.  I created
2 partitions ontop of /dev/pmem0 (which I created using the howto from
https://nvdimm.wiki.kernel.org).  Then I did:

# pvcreate /dev/pmem0p1
# pvcreate /dev/pmem0p2
# vgcreate pmem /dev/pmem0p1 /dev/pmem0p2
# lvcreate -L 2.9G -n lv pmem

# lsblk /dev/pmem0
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
pmem0       259:0    0    6G  0 disk
├─pmem0p1   259:1    0    1G  0 part
│ └─pmem-lv 253:4    0  2.9G  0 lvm
└─pmem0p2   259:2    0    2G  0 part
  └─pmem-lv 253:4    0  2.9G  0 lvm

# dmsetup table pmem-lv
0 4186112 linear 259:2 2048
4186112 1900544 linear 259:1 2048

# mkfs.xfs /dev/pmem/lv
# mount -o dax -t xfs /dev/pmem/lv /mnt/dax
[11452.212034] XFS (dm-4): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
[11452.220323] XFS (dm-4): Mounting V4 Filesystem
[11452.226526] XFS (dm-4): Ending clean mount

# dd if=/dev/zero of=/mnt/dax/meh bs=1024K oflag=direct
[11729.754671] XFS (dm-4): Metadata corruption detected at xfs_agf_read_verify+0x70/0x120 [xfs], xfs_agf block 0x45a808
[11729.766423] XFS (dm-4): Unmount and run xfs_repair
[11729.771774] XFS (dm-4): First 64 bytes of corrupted metadata buffer:
[11729.778869] ffff8800b8038000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[11729.788582] ffff8800b8038010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[11729.798293] ffff8800b8038020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[11729.808002] ffff8800b8038030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[11729.817715] XFS (dm-4): metadata I/O error: block 0x45a808 ("xfs_trans_read_buf_map") error 117 numblks 8

When this XFS corruption occurs corruption then also manifests in lvm2's
metadata:

# vgremove pmem
Do you really want to remove volume group "pmem" containing 1 logical volumes? [y/n]: y
Do you really want to remove active logical volume lv? [y/n]: y
  Incorrect metadata area header checksum on /dev/pmem0p1 at offset 4096
  WARNING: Failed to write an MDA of VG pmem.
  Incorrect metadata area header checksum on /dev/pmem0p2 at offset 4096
  WARNING: Failed to write an MDA of VG pmem.
  Failed to write VG pmem.
  Incorrect metadata area header checksum on /dev/pmem0p2 at offset 4096
  Incorrect metadata area header checksum on /dev/pmem0p1 at offset 4096

If I don't use XFS, and only issue IO directly to the /dev/pmem/lv, I
don't see this corruption.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/6] Support DAX for device-mapper dm-linear devices
  2016-06-20 19:40       ` Mike Snitzer
@ 2016-06-20 19:52         ` Mike Snitzer
  2016-06-20 20:11           ` Kani, Toshimitsu
  0 siblings, 1 reply; 40+ messages in thread
From: Mike Snitzer @ 2016-06-20 19:52 UTC (permalink / raw)
  To: Kani, Toshimitsu
  Cc: linux-kernel, linux-nvdimm, agk, linux-raid, viro,
	dan.j.williams, axboe, ross.zwisler, dm-devel, sandeen

On Mon, Jun 20 2016 at  3:40pm -0400,
Mike Snitzer <snitzer@redhat.com> wrote:
 
> # dd if=/dev/zero of=/mnt/dax/meh bs=1024K oflag=direct
> [11729.754671] XFS (dm-4): Metadata corruption detected at xfs_agf_read_verify+0x70/0x120 [xfs], xfs_agf block 0x45a808
> [11729.766423] XFS (dm-4): Unmount and run xfs_repair
> [11729.771774] XFS (dm-4): First 64 bytes of corrupted metadata buffer:
> [11729.778869] ffff8800b8038000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> [11729.788582] ffff8800b8038010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> [11729.798293] ffff8800b8038020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> [11729.808002] ffff8800b8038030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> [11729.817715] XFS (dm-4): metadata I/O error: block 0x45a808 ("xfs_trans_read_buf_map") error 117 numblks 8
> 
> When this XFS corruption occurs corruption then also manifests in lvm2's
> metadata:
> 
> # vgremove pmem
> Do you really want to remove volume group "pmem" containing 1 logical volumes? [y/n]: y
> Do you really want to remove active logical volume lv? [y/n]: y
>   Incorrect metadata area header checksum on /dev/pmem0p1 at offset 4096
>   WARNING: Failed to write an MDA of VG pmem.
>   Incorrect metadata area header checksum on /dev/pmem0p2 at offset 4096
>   WARNING: Failed to write an MDA of VG pmem.
>   Failed to write VG pmem.
>   Incorrect metadata area header checksum on /dev/pmem0p2 at offset 4096
>   Incorrect metadata area header checksum on /dev/pmem0p1 at offset 4096
> 
> If I don't use XFS, and only issue IO directly to the /dev/pmem/lv, I
> don't see this corruption.

I did the same test with ext4 instead of xfs and it resulted in the same
type of systemic corruption (lvm2 metadata corrupted too):

[12816.407147] EXT4-fs (dm-4): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
[12816.416123] EXT4-fs (dm-4): mounted filesystem with ordered data mode. Opts: dax
[12816.766855] EXT4-fs error (device dm-4): ext4_mb_generate_buddy:758: group 9, block bitmap and bg descriptor inconsistent: 32768 vs 32395 free clusters
[12816.782016] EXT4-fs error (device dm-4): ext4_mb_generate_buddy:758: group 10, block bitmap and bg descriptor inconsistent: 32768 vs 16384 free clusters
[12816.797491] JBD2: Spotted dirty metadata buffer (dev = dm-4, blocknr = 0). There's a risk of filesystem corruption in case of system crash.

# vgremove pmem
Do you really want to remove volume group "pmem" containing 1 logical volumes? [y/n]: y
Do you really want to remove active logical volume lv? [y/n]: y
  Incorrect metadata area header checksum on /dev/pmem0p1 at offset 4096
  WARNING: Failed to write an MDA of VG pmem.
  Incorrect metadata area header checksum on /dev/pmem0p2 at offset 4096
  WARNING: Failed to write an MDA of VG pmem.
  Failed to write VG pmem.
  Incorrect metadata area header checksum on /dev/pmem0p2 at offset 4096
  Incorrect metadata area header checksum on /dev/pmem0p1 at offset 4096

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/6] Support DAX for device-mapper dm-linear devices
  2016-06-20 19:52         ` Mike Snitzer
@ 2016-06-20 20:11           ` Kani, Toshimitsu
  2016-06-20 21:28             ` Kani, Toshimitsu
  0 siblings, 1 reply; 40+ messages in thread
From: Kani, Toshimitsu @ 2016-06-20 20:11 UTC (permalink / raw)
  To: snitzer
  Cc: linux-kernel, sandeen, linux-nvdimm, agk, linux-raid, viro,
	axboe, dan.j.williams, ross.zwisler, dm-devel

On Mon, 2016-06-20 at 15:52 -0400, Mike Snitzer wrote:
> On Mon, Jun 20 2016 at  3:40pm -0400,
> Mike Snitzer <snitzer@redhat.com> wrote:
>  
> > 
> > # dd if=/dev/zero of=/mnt/dax/meh bs=1024K oflag=direct
> > [11729.754671] XFS (dm-4): Metadata corruption detected at
> > xfs_agf_read_verify+0x70/0x120 [xfs], xfs_agf block 0x45a808
> > [11729.766423] XFS (dm-4): Unmount and run xfs_repair
> > [11729.771774] XFS (dm-4): First 64 bytes of corrupted metadata buffer:
> > [11729.778869] ffff8800b8038000: 00 00 00 00 00 00 00 00 00 00 00 00 00
> > 00 00 00  ................
> > [11729.788582] ffff8800b8038010: 00 00 00 00 00 00 00 00 00 00 00 00 00
> > 00 00 00  ................
> > [11729.798293] ffff8800b8038020: 00 00 00 00 00 00 00 00 00 00 00 00 00
> > 00 00 00  ................
> > [11729.808002] ffff8800b8038030: 00 00 00 00 00 00 00 00 00 00 00 00 00
> > 00 00 00  ................
> > [11729.817715] XFS (dm-4): metadata I/O error: block 0x45a808
> > ("xfs_trans_read_buf_map") error 117 numblks 8
> > 
> > When this XFS corruption occurs corruption then also manifests in lvm2's
> > metadata:
> > 
> > # vgremove pmem
> > Do you really want to remove volume group "pmem" containing 1 logical
> > volumes? [y/n]: y
> > Do you really want to remove active logical volume lv? [y/n]: y
> >   Incorrect metadata area header checksum on /dev/pmem0p1 at offset 4096
> >   WARNING: Failed to write an MDA of VG pmem.
> >   Incorrect metadata area header checksum on /dev/pmem0p2 at offset 4096
> >   WARNING: Failed to write an MDA of VG pmem.
> >   Failed to write VG pmem.
> >   Incorrect metadata area header checksum on /dev/pmem0p2 at offset 4096
> >   Incorrect metadata area header checksum on /dev/pmem0p1 at offset 4096
> > 
> > If I don't use XFS, and only issue IO directly to the /dev/pmem/lv, I
> > don't see this corruption.
> I did the same test with ext4 instead of xfs and it resulted in the same
> type of systemic corruption (lvm2 metadata corrupted too):
> 
> [12816.407147] EXT4-fs (dm-4): DAX enabled. Warning: EXPERIMENTAL, use at
> your own risk
> [12816.416123] EXT4-fs (dm-4): mounted filesystem with ordered data mode.
> Opts: dax
> [12816.766855] EXT4-fs error (device dm-4): ext4_mb_generate_buddy:758:
> group 9, block bitmap and bg descriptor inconsistent: 32768 vs 32395 free
> clusters
> [12816.782016] EXT4-fs error (device dm-4): ext4_mb_generate_buddy:758:
> group 10, block bitmap and bg descriptor inconsistent: 32768 vs 16384 free
> clusters
> [12816.797491] JBD2: Spotted dirty metadata buffer (dev = dm-4, blocknr =
> 0). There's a risk of filesystem corruption in case of system crash.
> 
> # vgremove pmem
> Do you really want to remove volume group "pmem" containing 1 logical
> volumes? [y/n]: y
> Do you really want to remove active logical volume lv? [y/n]: y
>   Incorrect metadata area header checksum on /dev/pmem0p1 at offset 4096
>   WARNING: Failed to write an MDA of VG pmem.
>   Incorrect metadata area header checksum on /dev/pmem0p2 at offset 4096
>   WARNING: Failed to write an MDA of VG pmem.
>   Failed to write VG pmem.
>   Incorrect metadata area header checksum on /dev/pmem0p2 at offset 4096
>   Incorrect metadata area header checksum on /dev/pmem0p1 at offset 4096

I will look into the issue.

Thanks for the testing!
-Toshi

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/6] Support DAX for device-mapper dm-linear devices
  2016-06-20 20:11           ` Kani, Toshimitsu
@ 2016-06-20 21:28             ` Kani, Toshimitsu
  2016-06-20 22:22               ` Mike Snitzer
  0 siblings, 1 reply; 40+ messages in thread
From: Kani, Toshimitsu @ 2016-06-20 21:28 UTC (permalink / raw)
  To: snitzer
  Cc: linux-kernel, sandeen, linux-nvdimm, agk, linux-raid, viro,
	axboe, dan.j.williams, ross.zwisler, dm-devel

On Mon, 2016-06-20 at 14:01 -0600, Kani, Toshimitsu wrote:
> On Mon, 2016-06-20 at 15:52 -0400, Mike Snitzer wrote:
> > 
> > On Mon, Jun 20 2016 at  3:40pm -0400,
> > Mike Snitzer <snitzer@redhat.com> wrote:
> >  
 :
> > > If I don't use XFS, and only issue IO directly to the /dev/pmem/lv, I
> > > don't see this corruption.
> >
> > I did the same test with ext4 instead of xfs and it resulted in the same
> > type of systemic corruption (lvm2 metadata corrupted too):
> > 
 :
> I will look into the issue.

Hi Mike,

Can you fold the following patch to the dm-linear patch?

Thanks,
-Tsohi

------
Subject: [PATCH] dm-linear: Fix partition handling for DAX

Partition handling was missing in linear_direct_access().
Call bdev_direct_access(), instead of directly calling
target direct_access function.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
---
 drivers/md/dm-linear.c |   14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 325aa06..38323e4 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -148,10 +148,16 @@ static long linear_direct_access(struct dm_target *ti,
sector_t sector,
 {
 	struct linear_c *lc = ti->private;
 	struct block_device *bdev = lc->dev->bdev;
-	const struct block_device_operations *bd_ops = bdev->bd_disk->fops;
-
-	return bd_ops->direct_access(bdev, linear_map_sector(ti, sector),
-				     kaddr, pfn, size);
+	struct blk_dax_ctl dax = {
+		.sector = linear_map_sector(ti, sector),
+		.size = size,
+	};
+	long ret;
+
+	ret = bdev_direct_access(bdev, &dax);
+	*kaddr = dax.addr;
+	*pfn = dax.pfn;
+	return ret;
 }
 
 static struct target_type linear_target = {

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/6] Support DAX for device-mapper dm-linear devices
  2016-06-20 21:28             ` Kani, Toshimitsu
@ 2016-06-20 22:22               ` Mike Snitzer
  2016-06-21 13:41                 ` Mike Snitzer
  0 siblings, 1 reply; 40+ messages in thread
From: Mike Snitzer @ 2016-06-20 22:22 UTC (permalink / raw)
  To: Kani, Toshimitsu
  Cc: axboe, sandeen, linux-nvdimm, linux-kernel, linux-raid, dm-devel,
	viro, dan.j.williams, ross.zwisler, agk

On Mon, Jun 20 2016 at  5:28pm -0400,
Kani, Toshimitsu <toshi.kani@hpe.com> wrote:

> On Mon, 2016-06-20 at 14:01 -0600, Kani, Toshimitsu wrote:
> > On Mon, 2016-06-20 at 15:52 -0400, Mike Snitzer wrote:
> > > 
> > > On Mon, Jun 20 2016 at  3:40pm -0400,
> > > Mike Snitzer <snitzer@redhat.com> wrote:
> > >  
>  :
> > > > If I don't use XFS, and only issue IO directly to the /dev/pmem/lv, I
> > > > don't see this corruption.
> > >
> > > I did the same test with ext4 instead of xfs and it resulted in the same
> > > type of systemic corruption (lvm2 metadata corrupted too):
> > > 
>  :
> > I will look into the issue.
> 
> Hi Mike,
> 
> Can you fold the following patch to the dm-linear patch?
> 
> Thanks,
> -Tsohi
> 
> ------
> Subject: [PATCH] dm-linear: Fix partition handling for DAX
> 
> Partition handling was missing in linear_direct_access().
> Call bdev_direct_access(), instead of directly calling
> target direct_access function.
> 
> Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
> ---
>  drivers/md/dm-linear.c |   14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
> index 325aa06..38323e4 100644
> --- a/drivers/md/dm-linear.c
> +++ b/drivers/md/dm-linear.c
> @@ -148,10 +148,16 @@ static long linear_direct_access(struct dm_target *ti,
> sector_t sector,
>  {
>  	struct linear_c *lc = ti->private;
>  	struct block_device *bdev = lc->dev->bdev;
> -	const struct block_device_operations *bd_ops = bdev->bd_disk->fops;
> -
> -	return bd_ops->direct_access(bdev, linear_map_sector(ti, sector),
> -				     kaddr, pfn, size);
> +	struct blk_dax_ctl dax = {
> +		.sector = linear_map_sector(ti, sector),
> +		.size = size,
> +	};
> +	long ret;
> +
> +	ret = bdev_direct_access(bdev, &dax);
> +	*kaddr = dax.addr;
> +	*pfn = dax.pfn;
> +	return ret;
>  }
>  
>  static struct target_type linear_target = {

Looks good, I folded it in and tested it to work.  Pushed to my 'wip'
branch.

No longer seeing any corruption in my test that was using partitions to
span pmem devices with a dm-linear device.

Jens, any chance you'd be open to picking up the first 2 patches in this
series?  Or would you like to see them folded or something different?

Mike

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/6] Support DAX for device-mapper dm-linear devices
  2016-06-20 22:22               ` Mike Snitzer
@ 2016-06-21 13:41                 ` Mike Snitzer
  2016-06-21 15:44                   ` Kani, Toshimitsu
  0 siblings, 1 reply; 40+ messages in thread
From: Mike Snitzer @ 2016-06-21 13:41 UTC (permalink / raw)
  To: Kani, Toshimitsu, axboe
  Cc: axboe, sandeen, linux-nvdimm, linux-kernel, linux-raid, dm-devel,
	viro, dan.j.williams, ross.zwisler, agk

On Mon, Jun 20 2016 at  6:22pm -0400,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Mon, Jun 20 2016 at  5:28pm -0400,
> Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> 
> > 
> > Hi Mike,
> > 
> > Can you fold the following patch to the dm-linear patch?
> > 
> > Thanks,
> > -Tsohi
> > 
> > ------
> > Subject: [PATCH] dm-linear: Fix partition handling for DAX
> > 
> > Partition handling was missing in linear_direct_access().
> > Call bdev_direct_access(), instead of directly calling
> > target direct_access function.
> > 
> > Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
> > ---
> >  drivers/md/dm-linear.c |   14 ++++++++++----
> >  1 file changed, 10 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
> > index 325aa06..38323e4 100644
> > --- a/drivers/md/dm-linear.c
> > +++ b/drivers/md/dm-linear.c
> > @@ -148,10 +148,16 @@ static long linear_direct_access(struct dm_target *ti,
> > sector_t sector,
> >  {
> >  	struct linear_c *lc = ti->private;
> >  	struct block_device *bdev = lc->dev->bdev;
> > -	const struct block_device_operations *bd_ops = bdev->bd_disk->fops;
> > -
> > -	return bd_ops->direct_access(bdev, linear_map_sector(ti, sector),
> > -				     kaddr, pfn, size);
> > +	struct blk_dax_ctl dax = {
> > +		.sector = linear_map_sector(ti, sector),
> > +		.size = size,
> > +	};
> > +	long ret;
> > +
> > +	ret = bdev_direct_access(bdev, &dax);
> > +	*kaddr = dax.addr;
> > +	*pfn = dax.pfn;
> > +	return ret;
> >  }
> >  
> >  static struct target_type linear_target = {
> 
> Looks good, I folded it in and tested it to work.  Pushed to my 'wip'
> branch.
> 
> No longer seeing any corruption in my test that was using partitions to
> span pmem devices with a dm-linear device.
> 
> Jens, any chance you'd be open to picking up the first 2 patches in this
> series?  Or would you like to see them folded or something different?

I'm now wondering if we'd be better off setting a new QUEUE_FLAG_DAX
rather than establish GENHD_FL_DAX on the genhd?

It'd be quite a bit easier to allow upper layers (e.g. XFS and ext4) to
check for a queue flag.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/6] Support DAX for device-mapper dm-linear devices
  2016-06-21 13:41                 ` Mike Snitzer
@ 2016-06-21 15:44                   ` Kani, Toshimitsu
  2016-06-21 15:50                     ` Kani, Toshimitsu
                                       ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Kani, Toshimitsu @ 2016-06-21 15:44 UTC (permalink / raw)
  To: axboe, snitzer
  Cc: linux-kernel, sandeen, linux-nvdimm, agk, linux-raid, viro,
	dan.j.williams, axboe, ross.zwisler, dm-devel

On Tue, 2016-06-21 at 09:41 -0400, Mike Snitzer wrote:
> On Mon, Jun 20 2016 at  6:22pm -0400,
> Mike Snitzer <snitzer@redhat.com> wrote:
> > 
> > On Mon, Jun 20 2016 at  5:28pm -0400,
> > Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> > 
 :
> > Looks good, I folded it in and tested it to work.  Pushed to my 'wip'
> > branch.
> > 
> > No longer seeing any corruption in my test that was using partitions to
> > span pmem devices with a dm-linear device.
> > 
> > Jens, any chance you'd be open to picking up the first 2 patches in this
> > series?  Or would you like to see them folded or something different?
>
> I'm now wondering if we'd be better off setting a new QUEUE_FLAG_DAX
> rather than establish GENHD_FL_DAX on the genhd?
> 
> It'd be quite a bit easier to allow upper layers (e.g. XFS and ext4) to
> check for a queue flag.

I think GENHD_FL_DAX is more appropriate since DAX does not use a request
queue, except for protecting the underlining device being disabled while
direct_access() is called (b2e0d1625e19).  

About protecting direct_access, this patch assumes that the underlining
device cannot be disabled until dtr() is called.  Is this correct?  If not,
I will need to call dax_map_atomic().

Thanks,
-Toshi 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/6] Support DAX for device-mapper dm-linear devices
  2016-06-21 15:44                   ` Kani, Toshimitsu
@ 2016-06-21 15:50                     ` Kani, Toshimitsu
  2016-06-21 16:25                     ` Dan Williams
  2016-06-21 18:17                     ` Mike Snitzer
  2 siblings, 0 replies; 40+ messages in thread
From: Kani, Toshimitsu @ 2016-06-21 15:50 UTC (permalink / raw)
  To: axboe, snitzer
  Cc: linux-kernel, sandeen, linux-nvdimm, agk, linux-raid, viro,
	dan.j.williams, axboe, ross.zwisler, dm-devel

On Tue, 2016-06-21 at 09:34 -0600, Kani, Toshimitsu wrote:
> On Tue, 2016-06-21 at 09:41 -0400, Mike Snitzer wrote:
> > 
> > On Mon, Jun 20 2016 at  6:22pm -0400,
> > Mike Snitzer <snitzer@redhat.com> wrote:
> > > 
> > > On Mon, Jun 20 2016 at  5:28pm -0400,
> > > Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> > > 
>  :
> > > 
> > > Looks good, I folded it in and tested it to work.  Pushed to my 'wip'
> > > branch.
> > > 
> > > No longer seeing any corruption in my test that was using partitions
> > > to span pmem devices with a dm-linear device.
> > > 
> > > Jens, any chance you'd be open to picking up the first 2 patches in
> > > this series?  Or would you like to see them folded or something
> > > different?
> >
> > I'm now wondering if we'd be better off setting a new QUEUE_FLAG_DAX
> > rather than establish GENHD_FL_DAX on the genhd?
> > 
> > It'd be quite a bit easier to allow upper layers (e.g. XFS and ext4) to
> > check for a queue flag.
>
> I think GENHD_FL_DAX is more appropriate since DAX does not use a request
> queue, except for protecting the underlining device being disabled while
> direct_access() is called (b2e0d1625e19).  

Forgot to mention that there are bdev_dax_supported() and bdev_dax_capable()
interfaces that can be called from upper layers.  They both call
bdev_direct_access() which checks GENHD_FL_DAX.

Thanks,
-Toshi

> About protecting direct_access, this patch assumes that the underlining
> device cannot be disabled until dtr() is called.  Is this correct?  If
> not, I will need to call dax_map_atomic().
> 
> Thanks,
> -Toshi 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/6] Support DAX for device-mapper dm-linear devices
  2016-06-21 15:44                   ` Kani, Toshimitsu
  2016-06-21 15:50                     ` Kani, Toshimitsu
@ 2016-06-21 16:25                     ` Dan Williams
  2016-06-21 16:35                       ` Kani, Toshimitsu
  2016-06-21 18:17                     ` Mike Snitzer
  2 siblings, 1 reply; 40+ messages in thread
From: Dan Williams @ 2016-06-21 16:25 UTC (permalink / raw)
  To: Kani, Toshimitsu
  Cc: axboe, snitzer, linux-kernel, sandeen, linux-nvdimm, agk,
	linux-raid, viro, axboe, ross.zwisler, dm-devel

On Tue, Jun 21, 2016 at 8:44 AM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> On Tue, 2016-06-21 at 09:41 -0400, Mike Snitzer wrote:
>> On Mon, Jun 20 2016 at  6:22pm -0400,
>> Mike Snitzer <snitzer@redhat.com> wrote:
>> >
>> > On Mon, Jun 20 2016 at  5:28pm -0400,
>> > Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
>> >
>  :
>> > Looks good, I folded it in and tested it to work.  Pushed to my 'wip'
>> > branch.
>> >
>> > No longer seeing any corruption in my test that was using partitions to
>> > span pmem devices with a dm-linear device.
>> >
>> > Jens, any chance you'd be open to picking up the first 2 patches in this
>> > series?  Or would you like to see them folded or something different?
>>
>> I'm now wondering if we'd be better off setting a new QUEUE_FLAG_DAX
>> rather than establish GENHD_FL_DAX on the genhd?
>>
>> It'd be quite a bit easier to allow upper layers (e.g. XFS and ext4) to
>> check for a queue flag.
>
> I think GENHD_FL_DAX is more appropriate since DAX does not use a request
> queue, except for protecting the underlining device being disabled while
> direct_access() is called (b2e0d1625e19).
>
> About protecting direct_access, this patch assumes that the underlining
> device cannot be disabled until dtr() is called.  Is this correct?  If not,
> I will need to call dax_map_atomic().

Kernel internal usages of dax should be using dax_map_atomic() to
safely resolve device removal races.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/6] Support DAX for device-mapper dm-linear devices
  2016-06-21 16:25                     ` Dan Williams
@ 2016-06-21 16:35                       ` Kani, Toshimitsu
  2016-06-21 16:45                         ` Dan Williams
  0 siblings, 1 reply; 40+ messages in thread
From: Kani, Toshimitsu @ 2016-06-21 16:35 UTC (permalink / raw)
  To: dan.j.williams
  Cc: linux-kernel, sandeen, linux-nvdimm, agk, linux-raid, snitzer,
	viro, axboe, axboe, ross.zwisler, dm-devel

On Tue, 2016-06-21 at 09:25 -0700, Dan Williams wrote:
> On Tue, Jun 21, 2016 at 8:44 AM, Kani, Toshimitsu <toshi.kani@hpe.com>
> wrote:
> > 
> > On Tue, 2016-06-21 at 09:41 -0400, Mike Snitzer wrote:
> > > 
> > > On Mon, Jun 20 2016 at  6:22pm -0400,
> > > Mike Snitzer <snitzer@redhat.com> wrote:
> > > > 
> > > > On Mon, Jun 20 2016 at  5:28pm -0400,
> > > > Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> >  :
> > > > Looks good, I folded it in and tested it to work.  Pushed to my 'wip'
> > > > branch.
> > > > 
> > > > No longer seeing any corruption in my test that was using partitions
> > > > to span pmem devices with a dm-linear device.
> > > > 
> > > > Jens, any chance you'd be open to picking up the first 2 patches in
> > > > this series?  Or would you like to see them folded or something
> > > > different?
> > >
> > > I'm now wondering if we'd be better off setting a new QUEUE_FLAG_DAX
> > > rather than establish GENHD_FL_DAX on the genhd?
> > > 
> > > It'd be quite a bit easier to allow upper layers (e.g. XFS and ext4) to
> > > check for a queue flag.
> >
> > I think GENHD_FL_DAX is more appropriate since DAX does not use a request
> > queue, except for protecting the underlining device being disabled while
> > direct_access() is called (b2e0d1625e19).
> > 
> > About protecting direct_access, this patch assumes that the underlining
> > device cannot be disabled until dtr() is called.  Is this correct?  If
> > not, I will need to call dax_map_atomic().
>
> Kernel internal usages of dax should be using dax_map_atomic() to
> safely resolve device removal races.

Will do.  In such case, shall I move dax_[un]map_atomic() to block_dev.c and
rename them to bdev_dax_[un]map_atomic()?

Thanks,
-Toshi

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/6] Support DAX for device-mapper dm-linear devices
  2016-06-21 16:35                       ` Kani, Toshimitsu
@ 2016-06-21 16:45                         ` Dan Williams
  2016-06-21 16:56                           ` Kani, Toshimitsu
  0 siblings, 1 reply; 40+ messages in thread
From: Dan Williams @ 2016-06-21 16:45 UTC (permalink / raw)
  To: Kani, Toshimitsu
  Cc: linux-kernel, sandeen, linux-nvdimm, agk, linux-raid, snitzer,
	viro, axboe, axboe, ross.zwisler, dm-devel

On Tue, Jun 21, 2016 at 9:35 AM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> On Tue, 2016-06-21 at 09:25 -0700, Dan Williams wrote:
>> On Tue, Jun 21, 2016 at 8:44 AM, Kani, Toshimitsu <toshi.kani@hpe.com>
>> wrote:
>> >
>> > On Tue, 2016-06-21 at 09:41 -0400, Mike Snitzer wrote:
>> > >
>> > > On Mon, Jun 20 2016 at  6:22pm -0400,
>> > > Mike Snitzer <snitzer@redhat.com> wrote:
>> > > >
>> > > > On Mon, Jun 20 2016 at  5:28pm -0400,
>> > > > Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
>> >  :
>> > > > Looks good, I folded it in and tested it to work.  Pushed to my 'wip'
>> > > > branch.
>> > > >
>> > > > No longer seeing any corruption in my test that was using partitions
>> > > > to span pmem devices with a dm-linear device.
>> > > >
>> > > > Jens, any chance you'd be open to picking up the first 2 patches in
>> > > > this series?  Or would you like to see them folded or something
>> > > > different?
>> > >
>> > > I'm now wondering if we'd be better off setting a new QUEUE_FLAG_DAX
>> > > rather than establish GENHD_FL_DAX on the genhd?
>> > >
>> > > It'd be quite a bit easier to allow upper layers (e.g. XFS and ext4) to
>> > > check for a queue flag.
>> >
>> > I think GENHD_FL_DAX is more appropriate since DAX does not use a request
>> > queue, except for protecting the underlining device being disabled while
>> > direct_access() is called (b2e0d1625e19).
>> >
>> > About protecting direct_access, this patch assumes that the underlining
>> > device cannot be disabled until dtr() is called.  Is this correct?  If
>> > not, I will need to call dax_map_atomic().
>>
>> Kernel internal usages of dax should be using dax_map_atomic() to
>> safely resolve device removal races.
>
> Will do.  In such case, shall I move dax_[un]map_atomic() to block_dev.c and
> rename them to bdev_dax_[un]map_atomic()?

Sounds good to me.  I know Jeff and Christoph don't like the current
calling convention of passing in a structure.  Just note that they
might ask you to change it back to a list of parameters if it moves to
bdev_dax_map_atomic().

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/6] Support DAX for device-mapper dm-linear devices
  2016-06-21 16:45                         ` Dan Williams
@ 2016-06-21 16:56                           ` Kani, Toshimitsu
  0 siblings, 0 replies; 40+ messages in thread
From: Kani, Toshimitsu @ 2016-06-21 16:56 UTC (permalink / raw)
  To: dan.j.williams
  Cc: linux-kernel, sandeen, linux-nvdimm, agk, linux-raid, snitzer,
	viro, axboe, axboe, ross.zwisler, dm-devel

On Tue, 2016-06-21 at 09:45 -0700, Dan Williams wrote:
> On Tue, Jun 21, 2016 at 9:35 AM, Kani, Toshimitsu <toshi.kani@hpe.com>
> wrote:
> > On Tue, 2016-06-21 at 09:25 -0700, Dan Williams wrote:
> > > On Tue, Jun 21, 2016 at 8:44 AM, Kani, Toshimitsu <toshi.kani@hpe.com>
> > > wrote:
> > > > 
 :
> > > > I think GENHD_FL_DAX is more appropriate since DAX does not use a
> > > > request queue, except for protecting the underlining device being
> > > > disabled while direct_access() is called (b2e0d1625e19).
> > > > 
> > > > About protecting direct_access, this patch assumes that the
> > > > underlining device cannot be disabled until dtr() is called.  Is this
> > > > correct?  If not, I will need to call dax_map_atomic().
> > >
> > > Kernel internal usages of dax should be using dax_map_atomic() to
> > > safely resolve device removal races.
> >
> > Will do.  In such case, shall I move dax_[un]map_atomic() to block_dev.c
> > and rename them to bdev_dax_[un]map_atomic()?
>
> Sounds good to me.  I know Jeff and Christoph don't like the current
> calling convention of passing in a structure.  Just note that they
> might ask you to change it back to a list of parameters if it moves to
> bdev_dax_map_atomic().

OK, I will change it back to a list of parameters as well.

Thanks,
-Toshi

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/6] Support DAX for device-mapper dm-linear devices
  2016-06-21 15:44                   ` Kani, Toshimitsu
  2016-06-21 15:50                     ` Kani, Toshimitsu
  2016-06-21 16:25                     ` Dan Williams
@ 2016-06-21 18:17                     ` Mike Snitzer
  2016-06-22 17:44                       ` Kani, Toshimitsu
  2 siblings, 1 reply; 40+ messages in thread
From: Mike Snitzer @ 2016-06-21 18:17 UTC (permalink / raw)
  To: Kani, Toshimitsu
  Cc: axboe, linux-kernel, sandeen, linux-nvdimm, agk, linux-raid,
	viro, dan.j.williams, axboe, ross.zwisler, dm-devel

On Tue, Jun 21 2016 at 11:44am -0400,
Kani, Toshimitsu <toshi.kani@hpe.com> wrote:

> On Tue, 2016-06-21 at 09:41 -0400, Mike Snitzer wrote:
> > On Mon, Jun 20 2016 at  6:22pm -0400,
> > Mike Snitzer <snitzer@redhat.com> wrote:
> > > 
> > > On Mon, Jun 20 2016 at  5:28pm -0400,
> > > Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> > > 
>  :
> > > Looks good, I folded it in and tested it to work.  Pushed to my 'wip'
> > > branch.
> > > 
> > > No longer seeing any corruption in my test that was using partitions to
> > > span pmem devices with a dm-linear device.
> > > 
> > > Jens, any chance you'd be open to picking up the first 2 patches in this
> > > series?  Or would you like to see them folded or something different?
> >
> > I'm now wondering if we'd be better off setting a new QUEUE_FLAG_DAX
> > rather than establish GENHD_FL_DAX on the genhd?
> > 
> > It'd be quite a bit easier to allow upper layers (e.g. XFS and ext4) to
> > check for a queue flag.
> 
> I think GENHD_FL_DAX is more appropriate since DAX does not use a request
> queue, except for protecting the underlining device being disabled while
> direct_access() is called (b2e0d1625e19).  

The devices in question have a request_queue.  All bio-based device have
a request_queue.

I don't have a big problem with GENHD_FL_DAX.  Just wanted to point out
that such block device capabilities are generally advertised in terms of
a QUEUE_FLAG.
 
> About protecting direct_access, this patch assumes that the underlining
> device cannot be disabled until dtr() is called.  Is this correct?  If not,
> I will need to call dax_map_atomic().

One of the big design considerations for DM that a DM device can be
suspended (with or without flush) and any new IO will be blocked until
the DM device is resumed.

So ideally DM should be able to have the same capability even if using
DAX.

But that is different than what commit b2e0d1625e19 is addressing.  For
DM, I wouldn't think you'd need the extra protections that
dax_map_atomic() is providing given that the underlying block device
lifetime is managed via DM core's dm_get_device/dm_put_device (see also:
dm.c:open_table_device/close_table_device).

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/6] Support DAX for device-mapper dm-linear devices
  2016-06-21 18:17                     ` Mike Snitzer
@ 2016-06-22 17:44                       ` Kani, Toshimitsu
  2016-06-22 19:15                         ` Dan Williams
  0 siblings, 1 reply; 40+ messages in thread
From: Kani, Toshimitsu @ 2016-06-22 17:44 UTC (permalink / raw)
  To: snitzer
  Cc: linux-kernel, sandeen, linux-nvdimm, agk, linux-raid, viro,
	dan.j.williams, axboe, axboe, ross.zwisler, dm-devel

On Tue, 2016-06-21 at 14:17 -0400, Mike Snitzer wrote:
> On Tue, Jun 21 2016 at 11:44am -0400,
> Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> > 
> > On Tue, 2016-06-21 at 09:41 -0400, Mike Snitzer wrote:
> > > 
> > > On Mon, Jun 20 2016 at  6:22pm -0400,
> > > Mike Snitzer <snitzer@redhat.com> wrote:
 :
> > > I'm now wondering if we'd be better off setting a new QUEUE_FLAG_DAX
> > > rather than establish GENHD_FL_DAX on the genhd?
> > > 
> > > It'd be quite a bit easier to allow upper layers (e.g. XFS and ext4) to
> > > check for a queue flag.
> >
> > I think GENHD_FL_DAX is more appropriate since DAX does not use a request
> > queue, except for protecting the underlining device being disabled while
> > direct_access() is called (b2e0d1625e19).  
>
> The devices in question have a request_queue.  All bio-based device have
> a request_queue.

DAX-capable devices have two operation modes, bio-based and DAX.  I agree that
bio-based operation is associated with a request queue, and its capabilities
should be set to it.  DAX, on the other hand, is rather independent from a
request queue.

> I don't have a big problem with GENHD_FL_DAX.  Just wanted to point out
> that such block device capabilities are generally advertised in terms of
> a QUEUE_FLAG.

I do not have a strong opinion, but feel a bit odd to associate DAX to a
request queue. 
 
> > About protecting direct_access, this patch assumes that the underlining
> > device cannot be disabled until dtr() is called.  Is this correct?  If
> > not, I will need to call dax_map_atomic().
>
> One of the big design considerations for DM that a DM device can be
> suspended (with or without flush) and any new IO will be blocked until
> the DM device is resumed.
> 
> So ideally DM should be able to have the same capability even if using
> DAX.

Supporting suspend for DAX is challenging since it allows user applications to
access a device directly.  Once a device range is mmap'd, there is no kernel
intervention to access the range, unless we invalidate user mappings.  This
isn't done today even after a driver is unbind'd from a device.

> But that is different than what commit b2e0d1625e19 is addressing.  For
> DM, I wouldn't think you'd need the extra protections that
> dax_map_atomic() is providing given that the underlying block device
> lifetime is managed via DM core's dm_get_device/dm_put_device (see also:
> dm.c:open_table_device/close_table_device).

I thought so as well.  But I realized that there is (almost) nothing that can
prevent the unbind operation.  It cannot fail, either.  This unbind proceeds
even when a device is in-use.  In case of a pmem device, it is only protected
by pmem_release_queue(), which is called when a pmem device is being deleted
and calls blk_cleanup_queue() to serialize a critical section between
blk_queue_enter() and blk_queue_exit() per b2e0d1625e19.  This prevents from a
kernel DTLB fault, but does not prevent a device disappeared while in-use.

Protecting DM's underlining device with blk_queue_enter() (or something
similar) requires more thoughts...  blk_queue_enter() to a DM device cannot be
redirected to its underlining device.  So, this is TBD for now.  But I do not
think this is a blocker issue since doing unbind to a underlining device is
quite harmful no matter what we do - even if it is protected with
blk_queue_enter().

Thanks,
-Toshi

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/6] Support DAX for device-mapper dm-linear devices
  2016-06-22 17:44                       ` Kani, Toshimitsu
@ 2016-06-22 19:15                         ` Dan Williams
  2016-06-22 20:16                           ` Kani, Toshimitsu
  0 siblings, 1 reply; 40+ messages in thread
From: Dan Williams @ 2016-06-22 19:15 UTC (permalink / raw)
  To: Kani, Toshimitsu
  Cc: snitzer, linux-kernel, sandeen, linux-nvdimm, agk, linux-raid,
	viro, axboe, axboe, ross.zwisler, dm-devel

On Wed, Jun 22, 2016 at 10:44 AM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> On Tue, 2016-06-21 at 14:17 -0400, Mike Snitzer wrote:
>> On Tue, Jun 21 2016 at 11:44am -0400,
>> Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
>> >
>> > On Tue, 2016-06-21 at 09:41 -0400, Mike Snitzer wrote:
>> > >
>> > > On Mon, Jun 20 2016 at  6:22pm -0400,
>> > > Mike Snitzer <snitzer@redhat.com> wrote:
>  :
>> > > I'm now wondering if we'd be better off setting a new QUEUE_FLAG_DAX
>> > > rather than establish GENHD_FL_DAX on the genhd?
>> > >
>> > > It'd be quite a bit easier to allow upper layers (e.g. XFS and ext4) to
>> > > check for a queue flag.
>> >
>> > I think GENHD_FL_DAX is more appropriate since DAX does not use a request
>> > queue, except for protecting the underlining device being disabled while
>> > direct_access() is called (b2e0d1625e19).
>>
>> The devices in question have a request_queue.  All bio-based device have
>> a request_queue.
>
> DAX-capable devices have two operation modes, bio-based and DAX.  I agree that
> bio-based operation is associated with a request queue, and its capabilities
> should be set to it.  DAX, on the other hand, is rather independent from a
> request queue.
>
>> I don't have a big problem with GENHD_FL_DAX.  Just wanted to point out
>> that such block device capabilities are generally advertised in terms of
>> a QUEUE_FLAG.
>
> I do not have a strong opinion, but feel a bit odd to associate DAX to a
> request queue.

Given that we do not support dax to a raw block device [1] it seems a
gendisk flag is more misleading than request_queue flag that specifies
what requests can be made of the device.

[1]: acc93d30d7d4 Revert "block: enable dax for raw block devices"


>> > About protecting direct_access, this patch assumes that the underlining
>> > device cannot be disabled until dtr() is called.  Is this correct?  If
>> > not, I will need to call dax_map_atomic().
>>
>> One of the big design considerations for DM that a DM device can be
>> suspended (with or without flush) and any new IO will be blocked until
>> the DM device is resumed.
>>
>> So ideally DM should be able to have the same capability even if using
>> DAX.
>
> Supporting suspend for DAX is challenging since it allows user applications to
> access a device directly.  Once a device range is mmap'd, there is no kernel
> intervention to access the range, unless we invalidate user mappings.  This
> isn't done today even after a driver is unbind'd from a device.
>
>> But that is different than what commit b2e0d1625e19 is addressing.  For
>> DM, I wouldn't think you'd need the extra protections that
>> dax_map_atomic() is providing given that the underlying block device
>> lifetime is managed via DM core's dm_get_device/dm_put_device (see also:
>> dm.c:open_table_device/close_table_device).
>
> I thought so as well.  But I realized that there is (almost) nothing that can
> prevent the unbind operation.  It cannot fail, either.  This unbind proceeds
> even when a device is in-use.  In case of a pmem device, it is only protected
> by pmem_release_queue(), which is called when a pmem device is being deleted
> and calls blk_cleanup_queue() to serialize a critical section between
> blk_queue_enter() and blk_queue_exit() per b2e0d1625e19.  This prevents from a
> kernel DTLB fault, but does not prevent a device disappeared while in-use.
>
> Protecting DM's underlining device with blk_queue_enter() (or something
> similar) requires more thoughts...  blk_queue_enter() to a DM device cannot be
> redirected to its underlining device.  So, this is TBD for now.  But I do not
> think this is a blocker issue since doing unbind to a underlining device is
> quite harmful no matter what we do - even if it is protected with
> blk_queue_enter().

I still have the "block device removed" notification patches on my
todo list.  It's not a blocker, but there are scenarios where we can
keep accessing memory via dax of a disabled device leading to memory
corruption.  I'll bump that up in my queue now that we are looking at
additional scenarios where letting DAX mappings leak past the
reconfiguration of a block device could lead to trouble.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/6] Support DAX for device-mapper dm-linear devices
  2016-06-22 19:15                         ` Dan Williams
@ 2016-06-22 20:16                           ` Kani, Toshimitsu
  2016-06-22 22:38                             ` Mike Snitzer
  0 siblings, 1 reply; 40+ messages in thread
From: Kani, Toshimitsu @ 2016-06-22 20:16 UTC (permalink / raw)
  To: dan.j.williams
  Cc: linux-kernel, sandeen, linux-nvdimm, agk, linux-raid, snitzer,
	viro, axboe, axboe, ross.zwisler, dm-devel

On Wed, 2016-06-22 at 12:15 -0700, Dan Williams wrote:
> On Wed, Jun 22, 2016 at 10:44 AM, Kani, Toshimitsu <toshi.kani@hpe.com>
> wrote:
> > On Tue, 2016-06-21 at 14:17 -0400, Mike Snitzer wrote:
> > > 
> > > On Tue, Jun 21 2016 at 11:44am -0400,
> > > Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> > > > 
> > > > On Tue, 2016-06-21 at 09:41 -0400, Mike Snitzer wrote:
> > > > > On Mon, Jun 20 2016 at  6:22pm -0400,
> > > > > Mike Snitzer <snitzer@redhat.com> wrote:
> > > > > I'm now wondering if we'd be better off setting a new QUEUE_FLAG_DAX
> > > > > rather than establish GENHD_FL_DAX on the genhd?
> > > > > 
> > > > > It'd be quite a bit easier to allow upper layers (e.g. XFS and ext4)
> > > > > to check for a queue flag.
> > > > 
> > > > I think GENHD_FL_DAX is more appropriate since DAX does not use a
> > > > request queue, except for protecting the underlining device being
> > > > disabled while direct_access() is called (b2e0d1625e19).
> > > 
> > > The devices in question have a request_queue.  All bio-based device have
> > > a request_queue.
> >
> > DAX-capable devices have two operation modes, bio-based and DAX.  I agree
> > that bio-based operation is associated with a request queue, and its
> > capabilities should be set to it.  DAX, on the other hand, is rather
> > independent from a request queue.
> > 
> > > I don't have a big problem with GENHD_FL_DAX.  Just wanted to point out
> > > that such block device capabilities are generally advertised in terms of
> > > a QUEUE_FLAG.
> >
> > I do not have a strong opinion, but feel a bit odd to associate DAX to a
> > request queue.
>
> Given that we do not support dax to a raw block device [1] it seems a
> gendisk flag is more misleading than request_queue flag that specifies
> what requests can be made of the device.
> 
> [1]: acc93d30d7d4 Revert "block: enable dax for raw block devices"

Oh, I see.  I will change to use request_queue flag.


> > > > About protecting direct_access, this patch assumes that the
> > > > underlining device cannot be disabled until dtr() is called.  Is this
> > > > correct?  If not, I will need to call dax_map_atomic().
> > >
> > > One of the big design considerations for DM that a DM device can be
> > > suspended (with or without flush) and any new IO will be blocked until
> > > the DM device is resumed.
> > > 
> > > So ideally DM should be able to have the same capability even if using
> > > DAX.
> >
> > Supporting suspend for DAX is challenging since it allows user
> > applications to access a device directly.  Once a device range is mmap'd,
> > there is no kernel intervention to access the range, unless we invalidate
> > user mappings.  This isn't done today even after a driver is unbind'd from
> > a device.
> > 
> > > But that is different than what commit b2e0d1625e19 is addressing.  For
> > > DM, I wouldn't think you'd need the extra protections that
> > > dax_map_atomic() is providing given that the underlying block device
> > > lifetime is managed via DM core's dm_get_device/dm_put_device (see also:
> > > dm.c:open_table_device/close_table_device).
> >
> > I thought so as well.  But I realized that there is (almost) nothing that
> > can prevent the unbind operation.  It cannot fail, either.  This unbind
> > proceeds even when a device is in-use.  In case of a pmem device, it is
> > only protected by pmem_release_queue(), which is called when a pmem device
> > is being deleted and calls blk_cleanup_queue() to serialize a critical
> > section between
> > blk_queue_enter() and blk_queue_exit() per b2e0d1625e19.  This prevents
> > from a kernel DTLB fault, but does not prevent a device disappeared while
> > in-use.
> > 
> > Protecting DM's underlining device with blk_queue_enter() (or something
> > similar) requires more thoughts...  blk_queue_enter() to a DM device
> > cannot be redirected to its underlining device.  So, this is TBD for
> > now.  But I do not think this is a blocker issue since doing unbind to a
> > underlining device is quite harmful no matter what we do - even if it is
> > protected with blk_queue_enter().
>
> I still have the "block device removed" notification patches on my
> todo list.  It's not a blocker, but there are scenarios where we can
> keep accessing memory via dax of a disabled device leading to memory
> corruption.  

Right, I noticed that user applications can access mmap'd ranges on a disabled
device.

> I'll bump that up in my queue now that we are looking at
> additional scenarios where letting DAX mappings leak past the
> reconfiguration of a block device could lead to trouble.

Great.  With DM, removing a underlining device while in-use can lead to
trouble, esp. with RAID0.  Users need to remove a device from DM first...

Thanks,
-Toshi

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/6] Support DAX for device-mapper dm-linear devices
  2016-06-22 20:16                           ` Kani, Toshimitsu
@ 2016-06-22 22:38                             ` Mike Snitzer
  2016-06-22 22:59                               ` Kani, Toshimitsu
  0 siblings, 1 reply; 40+ messages in thread
From: Mike Snitzer @ 2016-06-22 22:38 UTC (permalink / raw)
  To: Kani, Toshimitsu
  Cc: dan.j.williams, linux-kernel, sandeen, linux-nvdimm, agk,
	linux-raid, viro, axboe, axboe, ross.zwisler, dm-devel

On Wed, Jun 22 2016 at  4:16P -0400,
Kani, Toshimitsu <toshi.kani@hpe.com> wrote:

> On Wed, 2016-06-22 at 12:15 -0700, Dan Williams wrote:
> > On Wed, Jun 22, 2016 at 10:44 AM, Kani, Toshimitsu <toshi.kani@hpe.com>
> > wrote:
> > > On Tue, 2016-06-21 at 14:17 -0400, Mike Snitzer wrote:
> > > > 
> > > > On Tue, Jun 21 2016 at 11:44am -0400,
> > > > Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> > > > > 
> > > > > On Tue, 2016-06-21 at 09:41 -0400, Mike Snitzer wrote:
> > > > > > On Mon, Jun 20 2016 at  6:22pm -0400,
> > > > > > Mike Snitzer <snitzer@redhat.com> wrote:
> > > > > > I'm now wondering if we'd be better off setting a new QUEUE_FLAG_DAX
> > > > > > rather than establish GENHD_FL_DAX on the genhd?
> > > > > > 
> > > > > > It'd be quite a bit easier to allow upper layers (e.g. XFS and ext4)
> > > > > > to check for a queue flag.
> > > > > 
> > > > > I think GENHD_FL_DAX is more appropriate since DAX does not use a
> > > > > request queue, except for protecting the underlining device being
> > > > > disabled while direct_access() is called (b2e0d1625e19).
> > > > 
> > > > The devices in question have a request_queue.  All bio-based device have
> > > > a request_queue.
> > >
> > > DAX-capable devices have two operation modes, bio-based and DAX.  I agree
> > > that bio-based operation is associated with a request queue, and its
> > > capabilities should be set to it.  DAX, on the other hand, is rather
> > > independent from a request queue.
> > > 
> > > > I don't have a big problem with GENHD_FL_DAX.  Just wanted to point out
> > > > that such block device capabilities are generally advertised in terms of
> > > > a QUEUE_FLAG.
> > >
> > > I do not have a strong opinion, but feel a bit odd to associate DAX to a
> > > request queue.
> >
> > Given that we do not support dax to a raw block device [1] it seems a
> > gendisk flag is more misleading than request_queue flag that specifies
> > what requests can be made of the device.
> > 
> > [1]: acc93d30d7d4 Revert "block: enable dax for raw block devices"
> 
> Oh, I see.  I will change to use request_queue flag.

I implemented the block patch for this yesterday but based on your
feedback I stopped there (didn't carry the change through to the DM core
and DM linear -- can easily do so tomorrow though).

Feel free to use this as a starting point and fix/extend and add a
proper header:

>From e88736ce322f248157da6c7d402e940adafffa1e Mon Sep 17 00:00:00 2001
From: Mike Snitzer <snitzer@redhat.com>
Date: Tue, 21 Jun 2016 12:23:29 -0400
Subject: [PATCH] block: add QUEUE_FLAG_DAX for devices to advertise their DAX support

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/block/brd.c          | 3 +++
 drivers/nvdimm/pmem.c        | 1 +
 drivers/s390/block/dcssblk.c | 1 +
 fs/block_dev.c               | 5 +++--
 include/linux/blkdev.h       | 2 ++
 5 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index f5b0d6f..13eee12 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -508,6 +508,9 @@ static struct brd_device *brd_alloc(int i)
 	brd->brd_queue->limits.discard_granularity = PAGE_SIZE;
 	blk_queue_max_discard_sectors(brd->brd_queue, UINT_MAX);
 	brd->brd_queue->limits.discard_zeroes_data = 1;
+#ifdef CONFIG_BLK_DEV_RAM_DAX
+	queue_flag_set_unlocked(QUEUE_FLAG_DAX, brd->brd_queue);
+#endif
 	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, brd->brd_queue);
 
 	disk = brd->brd_disk = alloc_disk(max_part);
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 608fc44..53b701b 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -283,6 +283,7 @@ static int pmem_attach_disk(struct device *dev,
 	blk_queue_max_hw_sectors(q, UINT_MAX);
 	blk_queue_bounce_limit(q, BLK_BOUNCE_ANY);
 	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, q);
+	queue_flag_set_unlocked(QUEUE_FLAG_DAX, q);
 	q->queuedata = pmem;
 
 	disk = alloc_disk_node(0, nid);
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index bed53c4..093e9e1 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -618,6 +618,7 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char
 	dev_info->gd->driverfs_dev = &dev_info->dev;
 	blk_queue_make_request(dev_info->dcssblk_queue, dcssblk_make_request);
 	blk_queue_logical_block_size(dev_info->dcssblk_queue, 4096);
+	queue_flag_set_unlocked(QUEUE_FLAG_DAX, dev_info->dcssblk_queue);
 
 	seg_byte_size = (dev_info->end - dev_info->start + 1);
 	set_capacity(dev_info->gd, seg_byte_size >> 9); // size in sectors
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 71ccab1..9bcb3a9 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -484,6 +484,7 @@ long bdev_direct_access(struct block_device *bdev, struct blk_dax_ctl *dax)
 	sector_t sector = dax->sector;
 	long avail, size = dax->size;
 	const struct block_device_operations *ops = bdev->bd_disk->fops;
+	struct request_queue *q = bdev_get_queue(bdev);
 
 	/*
 	 * The device driver is allowed to sleep, in order to make the
@@ -493,7 +494,7 @@ long bdev_direct_access(struct block_device *bdev, struct blk_dax_ctl *dax)
 
 	if (size < 0)
 		return size;
-	if (!ops->direct_access)
+	if (!blk_queue_dax(q) || !ops->direct_access)
 		return -EOPNOTSUPP;
 	if ((sector + DIV_ROUND_UP(size, 512)) >
 					part_nr_sects_read(bdev->bd_part))
@@ -1287,7 +1288,7 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 		bdev->bd_disk = disk;
 		bdev->bd_queue = disk->queue;
 		bdev->bd_contains = bdev;
-		if (IS_ENABLED(CONFIG_BLK_DEV_DAX) && disk->fops->direct_access)
+		if (IS_ENABLED(CONFIG_BLK_DEV_DAX) && blk_queue_dax(disk->queue))
 			bdev->bd_inode->i_flags = S_DAX;
 		else
 			bdev->bd_inode->i_flags = 0;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 9746d22..d5cb326 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -505,6 +505,7 @@ struct request_queue {
 #define QUEUE_FLAG_WC	       23	/* Write back caching */
 #define QUEUE_FLAG_FUA	       24	/* device supports FUA writes */
 #define QUEUE_FLAG_FLUSH_NQ    25	/* flush not queueuable */
+#define QUEUE_FLAG_DAX	       26	/* device supports DAX */
 
 #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_STACKABLE)	|	\
@@ -594,6 +595,7 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
 #define blk_queue_discard(q)	test_bit(QUEUE_FLAG_DISCARD, &(q)->queue_flags)
 #define blk_queue_secdiscard(q)	(blk_queue_discard(q) && \
 	test_bit(QUEUE_FLAG_SECDISCARD, &(q)->queue_flags))
+#define blk_queue_dax(q)	test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags)
 
 #define blk_noretry_request(rq) \
 	((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
-- 
2.7.4 (Apple Git-66)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/6] Support DAX for device-mapper dm-linear devices
  2016-06-22 22:38                             ` Mike Snitzer
@ 2016-06-22 22:59                               ` Kani, Toshimitsu
  0 siblings, 0 replies; 40+ messages in thread
From: Kani, Toshimitsu @ 2016-06-22 22:59 UTC (permalink / raw)
  To: snitzer
  Cc: linux-kernel, sandeen, linux-nvdimm, agk, linux-raid, viro,
	dan.j.williams, axboe, axboe, ross.zwisler, dm-devel

On Wed, 2016-06-22 at 18:38 -0400, Mike Snitzer wrote:
> On Wed, Jun 22 2016 at  4:16P -0400,
> Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> > 
> > On Wed, 2016-06-22 at 12:15 -0700, Dan Williams wrote:
> > > 
> > > On Wed, Jun 22, 2016 at 10:44 AM, Kani, Toshimitsu <toshi.kani@hpe.com>
> > > wrote:
> > > > 
> > > > On Tue, 2016-06-21 at 14:17 -0400, Mike Snitzer wrote:
> > > > > 
> > > > > On Tue, Jun 21 2016 at 11:44am -0400,
> > > > > Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> > > > > > 
> > > > > > On Tue, 2016-06-21 at 09:41 -0400, Mike Snitzer wrote:
> > > > > > > 
> > > > > The devices in question have a request_queue.  All bio-based device
> > > > > have a request_queue.
> > > > 
> > > > DAX-capable devices have two operation modes, bio-based and DAX.  I
> > > > agree that bio-based operation is associated with a request queue, and
> > > > its capabilities should be set to it.  DAX, on the other hand, is
> > > > rather independent from a request queue.
> > > > 
> > > > > 
> > > > > I don't have a big problem with GENHD_FL_DAX.  Just wanted to point
> > > > > out that such block device capabilities are generally advertised in
> > > > > terms of a QUEUE_FLAG.
> > > > 
> > > > I do not have a strong opinion, but feel a bit odd to associate DAX to
> > > > a request queue.
> > >
> > > Given that we do not support dax to a raw block device [1] it seems a
> > > gendisk flag is more misleading than request_queue flag that specifies
> > > what requests can be made of the device.
> > > 
> > > [1]: acc93d30d7d4 Revert "block: enable dax for raw block devices"
> >
> > Oh, I see.  I will change to use request_queue flag.
>
> I implemented the block patch for this yesterday but based on your
> feedback I stopped there (didn't carry the change through to the DM core
> and DM linear -- can easily do so tomorrow though).
> 
> Feel free to use this as a starting point and fix/extend and add a
> proper header:

Thanks Mike! I made similar changes as well. I will take yours and finish up
the rest. :)
-Toshi

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2016-06-22 23:15 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-13 22:21 [PATCH 0/6] Support DAX for device-mapper dm-linear devices Toshi Kani
2016-06-13 22:21 ` [PATCH 1/6] genhd: Add GENHD_FL_DAX to gendisk flags Toshi Kani
2016-06-13 22:21 ` [PATCH 2/6] block: Check GENHD_FL_DAX for DAX capability Toshi Kani
2016-06-13 22:21 ` [PATCH 3/6] dm: Add dm_blk_direct_access() for mapped device Toshi Kani
2016-06-13 22:21 ` [PATCH 4/6] dm-linear: Add linear_direct_access() Toshi Kani
2016-06-13 22:21 ` [PATCH 5/6] dm, dm-linear: Add dax_supported to dm_target Toshi Kani
2016-06-13 22:21 ` [PATCH 6/6] dm: Enable DAX support for mapper device Toshi Kani
2016-06-13 22:57 ` [PATCH 0/6] Support DAX for device-mapper dm-linear devices Mike Snitzer
2016-06-20 18:00   ` Mike Snitzer
2016-06-20 18:31     ` Kani, Toshimitsu
2016-06-20 19:40       ` Mike Snitzer
2016-06-20 19:52         ` Mike Snitzer
2016-06-20 20:11           ` Kani, Toshimitsu
2016-06-20 21:28             ` Kani, Toshimitsu
2016-06-20 22:22               ` Mike Snitzer
2016-06-21 13:41                 ` Mike Snitzer
2016-06-21 15:44                   ` Kani, Toshimitsu
2016-06-21 15:50                     ` Kani, Toshimitsu
2016-06-21 16:25                     ` Dan Williams
2016-06-21 16:35                       ` Kani, Toshimitsu
2016-06-21 16:45                         ` Dan Williams
2016-06-21 16:56                           ` Kani, Toshimitsu
2016-06-21 18:17                     ` Mike Snitzer
2016-06-22 17:44                       ` Kani, Toshimitsu
2016-06-22 19:15                         ` Dan Williams
2016-06-22 20:16                           ` Kani, Toshimitsu
2016-06-22 22:38                             ` Mike Snitzer
2016-06-22 22:59                               ` Kani, Toshimitsu
2016-06-13 23:18 ` Dan Williams
2016-06-13 23:59   ` Kani, Toshimitsu
2016-06-14  0:02     ` Dan Williams
2016-06-14  7:30       ` Dan Williams
2016-06-14 13:50     ` Jeff Moyer
2016-06-14 15:41       ` Mike Snitzer
2016-06-14 18:00         ` Kani, Toshimitsu
2016-06-14 20:19         ` Jeff Moyer
2016-06-15  1:46           ` Mike Snitzer
2016-06-15  2:07             ` Dan Williams
2016-06-15  2:35               ` Mike Snitzer
2016-06-14 15:53       ` Kani, Toshimitsu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).