* [PATCH 00/15] libnvdimm: ->rw_bytes(), BLK-mode, unit tests, and misc features
From: Dan Williams @ 2015-06-17 23:54 UTC
  To: axboe, linux-nvdimm
  Cc: boaz, toshi.kani, Martin K. Petersen, Vishal Verma, Neil Brown,
	Greg KH, Rafael J. Wysocki, Dave Chinner, Robert Moore,
	Andy Lutomirski, Jens Axboe, linux-acpi, Jeff Moyer,
	Matthew Wilcox, H. Peter Anvin, linux-fsdevel, Ross Zwisler, hch,
	mingo, linux-kernel, Lv Zheng

This patchset takes the position that a new block_device_operations op
is needed for nvdimm devices.  Jens, see "[PATCH 01/15] block: introduce
an ->rw_bytes() block device operation"; it gates the rest of the series
moving forward.

Aside from adding a compile-time check to tools/testing/nvdimm/Kbuild
for validating all libnvdimm objects are built as modules, patches 2 to
6 are otherwise unchanged from the v6 libnvdimm posting [1].  The
remaining patches are feature additions and other cleanups that were
being held back while the base patchset was polished.

Patch 5 has an updated changelog speaking to the potential maintenance
burden of carrying tools/testing/nvdimm/ in-tree.  The benefits still
outweigh the risks in my opinion.

It should be noted that "[PATCH 14/15] libnvdimm: support read-only btt
backing devices" was developed in direct response to working through the
implementation of unit tests for "[PATCH 15/15] libnvdimm, nfit: handle
acpi_nfit_memory_map flags" and its new "read-only by default" policy.
See the updates to the libndctl unit tests posted on the
linux-nvdimm@01.org mailing list.

[PATCH 01/15] block: introduce an ->rw_bytes() block device operation
[PATCH 02/15] libnvdimm: infrastructure for btt devices
[PATCH 03/15] nd_btt: atomic sector updates
[PATCH 04/15] libnvdimm, nfit, nd_blk: driver for BLK-mode access persistent memory
[PATCH 05/15] tools/testing/nvdimm: libnvdimm unit test infrastructure
[PATCH 06/15] libnvdimm: Non-Volatile Devices
[PATCH 07/15] fs/block_dev.c: skip rw_page if bdev has integrity
[PATCH 08/15] libnvdimm, btt: add support for blk integrity
[PATCH 09/15] libnvdimm, blk: add support for blk integrity
[PATCH 10/15] libnvdimm: fix up max_hw_sectors
[PATCH 11/15] libnvdimm: pmem, blk, and btt make_request cleanups
[PATCH 12/15] libnvdimm: enable iostat
[PATCH 13/15] libnvdimm: flag libnvdimm block devices as non-rotational
[PATCH 14/15] libnvdimm: support read-only btt backing devices
[PATCH 15/15] libnvdimm, nfit: handle acpi_nfit_memory_map flags

[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-June/001166.html

---

Dan Williams (10):
      block: introduce an ->rw_bytes() block device operation
      libnvdimm: infrastructure for btt devices
      tools/testing/nvdimm: libnvdimm unit test infrastructure
      libnvdimm: Non-Volatile Devices
      libnvdimm: fix up max_hw_sectors
      libnvdimm: pmem, blk, and btt make_request cleanups
      libnvdimm: enable iostat
      libnvdimm: flag libnvdimm block devices as non-rotational
      libnvdimm: support read-only btt backing devices
      libnvdimm, nfit: handle acpi_nfit_memory_map flags

Ross Zwisler (1):
      libnvdimm, nfit, nd_blk: driver for BLK-mode access persistent memory

Vishal Verma (4):
      nd_btt: atomic sector updates
      fs/block_dev.c: skip rw_page if bdev has integrity
      libnvdimm, btt: add support for blk integrity
      libnvdimm, blk: add support for blk integrity


 Documentation/nvdimm/btt.txt          |  273 ++++++
 Documentation/nvdimm/nvdimm.txt       |  805 +++++++++++++++++
 MAINTAINERS                           |   39 +
 drivers/acpi/nfit.c                   |  491 ++++++++++
 drivers/acpi/nfit.h                   |   58 +
 drivers/nvdimm/Kconfig                |   54 +
 drivers/nvdimm/Makefile               |    7 
 drivers/nvdimm/blk.c                  |  368 ++++++++
 drivers/nvdimm/btt.c                  | 1569 +++++++++++++++++++++++++++++++++
 drivers/nvdimm/btt.h                  |  185 ++++
 drivers/nvdimm/btt_devs.c             |  473 ++++++++++
 drivers/nvdimm/bus.c                  |  176 ++++
 drivers/nvdimm/core.c                 |   99 ++
 drivers/nvdimm/dimm_devs.c            |    9 
 drivers/nvdimm/namespace_devs.c       |   63 +
 drivers/nvdimm/nd-core.h              |   48 +
 drivers/nvdimm/nd.h                   |   61 +
 drivers/nvdimm/pmem.c                 |   58 +
 drivers/nvdimm/region.c               |   97 ++
 drivers/nvdimm/region_devs.c          |  106 ++
 fs/block_dev.c                        |    4 
 include/linux/blkdev.h                |   44 +
 include/linux/libnvdimm.h             |   30 +
 include/uapi/linux/ndctl.h            |    2 
 tools/testing/nvdimm/Kbuild           |   40 +
 tools/testing/nvdimm/Makefile         |    7 
 tools/testing/nvdimm/config_check.c   |   15 
 tools/testing/nvdimm/test/Kbuild      |    8 
 tools/testing/nvdimm/test/iomap.c     |  151 +++
 tools/testing/nvdimm/test/nfit.c      | 1115 +++++++++++++++++++++++
 tools/testing/nvdimm/test/nfit_test.h |   29 +
 31 files changed, 6422 insertions(+), 62 deletions(-)
 create mode 100644 Documentation/nvdimm/btt.txt
 create mode 100644 Documentation/nvdimm/nvdimm.txt
 create mode 100644 drivers/nvdimm/blk.c
 create mode 100644 drivers/nvdimm/btt.c
 create mode 100644 drivers/nvdimm/btt.h
 create mode 100644 drivers/nvdimm/btt_devs.c
 create mode 100644 tools/testing/nvdimm/Kbuild
 create mode 100644 tools/testing/nvdimm/Makefile
 create mode 100644 tools/testing/nvdimm/config_check.c
 create mode 100644 tools/testing/nvdimm/test/Kbuild
 create mode 100644 tools/testing/nvdimm/test/iomap.c
 create mode 100644 tools/testing/nvdimm/test/nfit.c
 create mode 100644 tools/testing/nvdimm/test/nfit_test.h


* [PATCH 01/15] block: introduce an ->rw_bytes() block device operation
From: Dan Williams @ 2015-06-17 23:54 UTC
  To: axboe, linux-nvdimm
  Cc: boaz, toshi.kani, linux-kernel, Andy Lutomirski, Jens Axboe,
	linux-acpi, H. Peter Anvin, linux-fsdevel, Ross Zwisler, hch,
	mingo

Why do we need a new method in block_device_operations?

The capacities of persistent memory make it too large to map as RAM (no
struct page coverage by default), so Linux arranges for it to be managed
as a block device.  The bio interface to a block device enforces
sector-at-a-time i/o and has infrastructure for asynchronous completion
of i/o.  The ->rw_page() interface is closely tied to the page cache and
also carries asynchronous i/o completion assumptions.  NVDIMM devices
are fast enough to complete i/o synchronously (memcpy) and some
in-kernel applications can take advantage of the byte-aligned (as
opposed to sector-aligned) nature of the media.  The ->rw_bytes()
operation is added to fill this role, which does not fit into any
existing access method.
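
As a rough illustration of the access model described above, here is a
minimal userspace sketch of a synchronous, byte-granular read/write
helper.  It is not the kernel code from this patch: struct model_pmem,
model_rw_bytes() and the MODEL_READ/MODEL_WRITE constants are
hypothetical stand-ins, and the range check is written in an
overflow-safe form as an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { MODEL_READ = 0, MODEL_WRITE = 1 };	/* stand-ins for READ/WRITE */

/* Hypothetical userspace stand-in for a memory-backed disk. */
struct model_pmem {
	uint8_t *virt_addr;	/* start of the byte-addressable media */
	size_t size;		/* capacity in bytes */
};

/*
 * Synchronous, byte-granular transfer: no bio, no completion callback.
 * The range check is phrased so that offset + size cannot wrap around.
 */
static int model_rw_bytes(struct model_pmem *pmem, size_t offset,
			  void *buf, size_t size, int rw)
{
	if (offset > pmem->size || size > pmem->size - offset)
		return -EFAULT;

	if (rw == MODEL_READ)
		memcpy(buf, pmem->virt_addr + offset, size);
	else
		memcpy(pmem->virt_addr + offset, buf, size);
	return 0;
}

/* Usage: a 3-byte write at an odd offset, then read it back. */
static void model_rw_bytes_demo(void)
{
	uint8_t media[64] = { 0 };
	struct model_pmem pmem = { .virt_addr = media, .size = sizeof(media) };
	char in[3] = { 'b', 't', 't' }, out[3] = { 0 };

	assert(model_rw_bytes(&pmem, 13, in, sizeof(in), MODEL_WRITE) == 0);
	assert(model_rw_bytes(&pmem, 13, out, sizeof(out), MODEL_READ) == 0);
	assert(memcmp(in, out, sizeof(in)) == 0);
	/* A transfer that runs past the end of the media is rejected. */
	assert(model_rw_bytes(&pmem, 62, out, sizeof(out), MODEL_READ) == -EFAULT);
}
```

The point of the sketch is only that the transfer is a bounded memcpy
that has completed by the time the call returns, at an offset and
length that need not be sector multiples.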

It could be argued that a ->rw_bytes() method makes a struct
block_device not a *block* device.  However, the applications for
persistent memory as storage devices make them more "block" devices
than "character" devices.

The first consumer of the ->rw_bytes() capability is a stacked
block_device driver (BTT - block translation table) that adds atomic
sector update semantics on top of an nvdimm storage device.

Why enable drivers like BTT on top of a new globally visible
block_device_operations op rather than as an internal detail of nvdimm
drivers?

1/ We want ->rw_bytes() consumers to be enabled on either a per-disk or
per-partition basis.  Consider the case of enabling DAX+XFS on a single
persistent memory disk whereby the metadata needs atomic sector update
guarantees, but the data would like to be DAX capable.  The solution is
to create two partitions and enable BTT on the "metadata/XFS-logdev"
partition.

2/ We want this configuration topology to be visible to the sysfs device
model, and not an internal detail of nvdimm drivers requiring special
tooling.  For example, if you ever wanted to "fsck" BTT metadata, that
could be carried out on the raw nvdimm device directly rather than
requiring custom tooling / mechanisms to access the raw media.

3/ It becomes trivial to add new BTT-like drivers without touching the
nvdimm drivers to add is_btt_mode(), is_foo_mode(), etc... checks in the
fast path.
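
To make the per-partition case in 1/ concrete, the following userspace
sketch models how a partition-relative byte offset becomes a
disk-relative one, mirroring the "offset += get_start_sect(bdev) << 9"
step in the bdev_read_bytes()/bdev_write_bytes() helpers of this patch.
struct model_part, model_disk_offset() and the partition start sectors
are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_SECTOR_SHIFT 9	/* 512-byte sectors, matching the "<< 9" */

/* Hypothetical stand-in for a partition; only the start sector matters. */
struct model_part {
	uint64_t start_sect;	/* first sector of the partition on the disk */
};

/* Translate a partition-relative byte offset to a disk-relative one. */
static uint64_t model_disk_offset(const struct model_part *part,
				  uint64_t offset)
{
	return offset + (part->start_sect << MODEL_SECTOR_SHIFT);
}

/* Two partitions on one disk: data first, BTT-backed log second. */
static void model_disk_offset_demo(void)
{
	struct model_part data = { .start_sect = 2048 };
	struct model_part logdev = { .start_sect = 4096 };

	/* 2048 sectors * 512 bytes = 1048576 */
	assert(model_disk_offset(&data, 0) == 1048576);
	/* 4096 sectors * 512 bytes + 100 = 2097252 */
	assert(model_disk_offset(&logdev, 100) == 2097252);
}
```

Because the translation happens in the generic helper, a consumer such
as BTT can work in partition-relative offsets and the driver's
->rw_bytes() only ever sees whole-disk offsets.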

Cc: Jens Axboe <axboe@fb.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/nvdimm/pmem.c  |   19 +++++++++++++++++++
 include/linux/blkdev.h |   44 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 63 insertions(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 90902a142e35..efa2cde7f6b6 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -96,6 +96,24 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
 	return 0;
 }
 
+static int pmem_rw_bytes(struct gendisk *disk, resource_size_t offset,
+			void *buf, size_t size, int rw)
+{
+	struct pmem_device *pmem = disk->private_data;
+
+	if (unlikely(offset + size > pmem->size)) {
+		dev_WARN_ONCE(disk_to_dev(disk), 1, "request out of range\n");
+		return -EFAULT;
+	}
+
+	if (rw == READ)
+		memcpy(buf, pmem->virt_addr + offset, size);
+	else
+		memcpy(pmem->virt_addr + offset, buf, size);
+
+	return 0;
+}
+
 static long pmem_direct_access(struct block_device *bdev, sector_t sector,
 			      void **kaddr, unsigned long *pfn, long size)
 {
@@ -114,6 +132,7 @@ static long pmem_direct_access(struct block_device *bdev, sector_t sector,
 static const struct block_device_operations pmem_fops = {
 	.owner =		THIS_MODULE,
 	.rw_page =		pmem_rw_page,
+	.rw_bytes =		pmem_rw_bytes,
 	.direct_access =	pmem_direct_access,
 };
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7f9a516f24de..25d6034a2e62 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1602,6 +1602,8 @@ struct block_device_operations {
 	int (*open) (struct block_device *, fmode_t);
 	void (*release) (struct gendisk *, fmode_t);
 	int (*rw_page)(struct block_device *, sector_t, struct page *, int rw);
+	int (*rw_bytes)(struct gendisk *, resource_size_t offset,
+			void *buf, size_t size, int rw);
 	int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
 	int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
 	long (*direct_access)(struct block_device *, sector_t,
@@ -1625,6 +1627,48 @@ extern int bdev_write_page(struct block_device *, sector_t, struct page *,
 						struct writeback_control *);
 extern long bdev_direct_access(struct block_device *, sector_t, void **addr,
 						unsigned long *pfn, long size);
+
+/**
+ * bdev_read_bytes() - synchronously read bytes from a memory-backed block dev
+ * @bdev: device to read
+ * @offset: bdev-relative starting offset
+ * @buf: buffer to fill
+ * @size: transfer length
+ *
+ * RAM and PMEM disks do not implement sectors internally.  @buf is
+ * up-to-date upon return from this routine.
+ */
+static inline int bdev_read_bytes(struct block_device *bdev,
+		resource_size_t offset, void *buf, size_t size)
+{
+	struct gendisk *disk = bdev->bd_disk;
+	const struct block_device_operations *ops = disk->fops;
+
+	offset += get_start_sect(bdev) << 9;
+	return ops->rw_bytes(disk, offset, buf, size, READ);
+}
+
+/**
+ * bdev_write_bytes() - synchronously write bytes to a memory-backed block dev
+ * @bdev: device to write
+ * @offset: bdev-relative starting offset
+ * @buf: buffer to drain
+ * @size: transfer length
+ *
+ * RAM and PMEM disks do not implement sectors internally.  Depending on
+ * the @bdev, the contents of @buf may be in cpu cache, platform buffers,
+ * or on backing memory media upon return from this routine.  Flushing
+ * to media is handled internal to the @bdev driver, if at all.
+ */
+static inline int bdev_write_bytes(struct block_device *bdev,
+		resource_size_t offset, void *buf, size_t size)
+{
+	struct gendisk *disk = bdev->bd_disk;
+	const struct block_device_operations *ops = disk->fops;
+
+	offset += get_start_sect(bdev) << 9;
+	return ops->rw_bytes(disk, offset, buf, size, WRITE);
+}
 #else /* CONFIG_BLOCK */
 
 struct block_device;



* [PATCH 02/15] libnvdimm: infrastructure for btt devices
From: Dan Williams @ 2015-06-17 23:54 UTC
  To: axboe, linux-nvdimm
  Cc: boaz, toshi.kani, Neil Brown, Greg KH, linux-kernel, mingo,
	linux-acpi, linux-fsdevel, hch

Block devices from an nd bus, in addition to accepting "struct bio"
based requests, also have the capability to perform byte-aligned
accesses.  By default only the bio/block interface is used.  However, if
another driver can make effective use of the byte-aligned capability it
can claim the block interface and use the byte-aligned ->rw_bytes()
interface.

The BTT driver is the first consumer of this mechanism, allowing atomic
sector update guarantees to be layered on top of ->rw_bytes() capable
libnvdimm-block-devices, or their partitions.
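
As a sanity check on the on-media info block that this patch declares in
drivers/nvdimm/btt.h (struct btt_sb), the userspace sketch below mirrors
its field layout with fixed-width types and verifies that the fields add
up to 4096 bytes with the checksum in the last 8 bytes.  The model_*
names and the signature helper are hypothetical, the uintN_t types stand
in for the kernel's __le16/__le32/__le64, and no endianness conversion
is attempted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_BTT_SIG_LEN 16
#define MODEL_BTT_SIG "BTT_ARENA_INFO\0"	/* 14 chars + 2 NULs = 16 bytes */

/* Userspace mirror of the btt_sb field layout (little-endian on media). */
struct model_btt_sb {
	uint8_t signature[MODEL_BTT_SIG_LEN];
	uint8_t uuid[16];
	uint8_t parent_uuid[16];
	uint32_t flags;
	uint16_t version_major;
	uint16_t version_minor;
	uint32_t external_lbasize;
	uint32_t external_nlba;
	uint32_t internal_lbasize;
	uint32_t internal_nlba;
	uint32_t nfree;
	uint32_t infosize;
	uint64_t nextoff;
	uint64_t dataoff;
	uint64_t mapoff;
	uint64_t logoff;
	uint64_t info2off;
	uint8_t padding[3968];
	uint64_t checksum;
};

/* Every field is naturally aligned, so no compiler padding is expected. */
_Static_assert(sizeof(MODEL_BTT_SIG) == MODEL_BTT_SIG_LEN,
	       "signature literal fills the signature field");
_Static_assert(offsetof(struct model_btt_sb, checksum) == 4088,
	       "checksum occupies the last 8 bytes");
_Static_assert(sizeof(struct model_btt_sb) == 4096,
	       "info block is 4096 bytes");

/* Hypothetical helper: does this block carry the BTT signature? */
static int model_btt_sig_ok(const struct model_btt_sb *sb)
{
	return memcmp(sb->signature, MODEL_BTT_SIG, MODEL_BTT_SIG_LEN) == 0;
}
```

A fixed 4096-byte block like this can be fetched with a single
byte-granular read and checked for the signature before anything else
in it is trusted.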

Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/nvdimm/Kconfig     |    3 
 drivers/nvdimm/Makefile    |    1 
 drivers/nvdimm/btt.h       |   45 +++++
 drivers/nvdimm/btt_devs.c  |  431 ++++++++++++++++++++++++++++++++++++++++++++
 drivers/nvdimm/bus.c       |   82 ++++++++
 drivers/nvdimm/core.c      |   20 ++
 drivers/nvdimm/nd-core.h   |   34 +++
 drivers/nvdimm/nd.h        |   19 ++
 drivers/nvdimm/pmem.c      |    6 -
 include/uapi/linux/ndctl.h |    2 
 10 files changed, 637 insertions(+), 6 deletions(-)
 create mode 100644 drivers/nvdimm/btt.h
 create mode 100644 drivers/nvdimm/btt_devs.c

diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index 07a29113b870..f16ba9d14740 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -33,4 +33,7 @@ config BLK_DEV_PMEM
 
 	  Say Y if you want to use an NVDIMM
 
+config ND_BTT_DEVS
+	def_bool y
+
 endif
diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
index abce98f87f16..eb1bbce86592 100644
--- a/drivers/nvdimm/Makefile
+++ b/drivers/nvdimm/Makefile
@@ -11,3 +11,4 @@ libnvdimm-y += region_devs.o
 libnvdimm-y += region.o
 libnvdimm-y += namespace_devs.o
 libnvdimm-y += label.o
+libnvdimm-$(CONFIG_ND_BTT_DEVS) += btt_devs.o
diff --git a/drivers/nvdimm/btt.h b/drivers/nvdimm/btt.h
new file mode 100644
index 000000000000..e8f6d8e0ddd3
--- /dev/null
+++ b/drivers/nvdimm/btt.h
@@ -0,0 +1,45 @@
+/*
+ * Block Translation Table library
+ * Copyright (c) 2014-2015, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef _LINUX_BTT_H
+#define _LINUX_BTT_H
+
+#include <linux/types.h>
+
+#define BTT_SIG_LEN 16
+#define BTT_SIG "BTT_ARENA_INFO\0"
+
+struct btt_sb {
+	u8 signature[BTT_SIG_LEN];
+	u8 uuid[16];
+	u8 parent_uuid[16];
+	__le32 flags;
+	__le16 version_major;
+	__le16 version_minor;
+	__le32 external_lbasize;
+	__le32 external_nlba;
+	__le32 internal_lbasize;
+	__le32 internal_nlba;
+	__le32 nfree;
+	__le32 infosize;
+	__le64 nextoff;
+	__le64 dataoff;
+	__le64 mapoff;
+	__le64 logoff;
+	__le64 info2off;
+	u8 padding[3968];
+	__le64 checksum;
+};
+
+#endif
diff --git a/drivers/nvdimm/btt_devs.c b/drivers/nvdimm/btt_devs.c
new file mode 100644
index 000000000000..2148fd8f535b
--- /dev/null
+++ b/drivers/nvdimm/btt_devs.c
@@ -0,0 +1,431 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/blkdev.h>
+#include <linux/device.h>
+#include <linux/genhd.h>
+#include <linux/sizes.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include "nd-core.h"
+#include "btt.h"
+#include "nd.h"
+
+static DEFINE_IDA(btt_ida);
+
+static void nd_btt_release(struct device *dev)
+{
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+
+	dev_dbg(dev, "%s\n", __func__);
+	WARN_ON(nd_btt->backing_dev);
+	ida_simple_remove(&btt_ida, nd_btt->id);
+	kfree(nd_btt->uuid);
+	kfree(nd_btt);
+}
+
+static struct device_type nd_btt_device_type = {
+	.name = "nd_btt",
+	.release = nd_btt_release,
+};
+
+bool is_nd_btt(struct device *dev)
+{
+	return dev->type == &nd_btt_device_type;
+}
+
+struct nd_btt *to_nd_btt(struct device *dev)
+{
+	struct nd_btt *nd_btt = container_of(dev, struct nd_btt, dev);
+
+	WARN_ON(!is_nd_btt(dev));
+	return nd_btt;
+}
+EXPORT_SYMBOL(to_nd_btt);
+
+static const unsigned long btt_lbasize_supported[] = { 512, 4096, 0 };
+
+static ssize_t sector_size_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+
+	return nd_sector_size_show(nd_btt->lbasize, btt_lbasize_supported, buf);
+}
+
+static ssize_t sector_size_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+	ssize_t rc;
+
+	device_lock(dev);
+	nvdimm_bus_lock(dev);
+	rc = nd_sector_size_store(dev, buf, &nd_btt->lbasize,
+			btt_lbasize_supported);
+	dev_dbg(dev, "%s: result: %zd wrote: %s%s", __func__,
+			rc, buf, buf[len - 1] == '\n' ? "" : "\n");
+	nvdimm_bus_unlock(dev);
+	device_unlock(dev);
+
+	return rc ? rc : len;
+}
+static DEVICE_ATTR_RW(sector_size);
+
+static ssize_t uuid_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+
+	if (nd_btt->uuid)
+		return sprintf(buf, "%pUb\n", nd_btt->uuid);
+	return sprintf(buf, "\n");
+}
+
+static ssize_t uuid_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+	ssize_t rc;
+
+	device_lock(dev);
+	rc = nd_uuid_store(dev, &nd_btt->uuid, buf, len);
+	dev_dbg(dev, "%s: result: %zd wrote: %s%s", __func__,
+			rc, buf, buf[len - 1] == '\n' ? "" : "\n");
+	device_unlock(dev);
+
+	return rc ? rc : len;
+}
+static DEVICE_ATTR_RW(uuid);
+
+static ssize_t backing_dev_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+	char name[BDEVNAME_SIZE];
+
+	if (nd_btt->backing_dev)
+		return sprintf(buf, "/dev/%s\n",
+				bdevname(nd_btt->backing_dev, name));
+	else
+		return sprintf(buf, "\n");
+}
+
+static const fmode_t nd_btt_devs_mode = FMODE_READ | FMODE_WRITE | FMODE_EXCL;
+
+static void nd_btt_remove_bdev(struct nd_btt *nd_btt, const char *caller)
+{
+	struct block_device *bdev = nd_btt->backing_dev;
+	char bdev_name[BDEVNAME_SIZE];
+
+	if (!nd_btt->backing_dev)
+		return;
+
+	WARN_ON_ONCE(!is_nvdimm_bus_locked(&nd_btt->dev));
+	dev_dbg(&nd_btt->dev, "%s: %s: release %s\n", caller, __func__,
+			bdevname(bdev, bdev_name));
+	blkdev_put(bdev, nd_btt_devs_mode);
+	nd_btt->backing_dev = NULL;
+
+	/*
+	 * Once we've had our backing device removed we need to be fully
+	 * reconfigured.  The bus will have already created a new seed
+	 * for this purpose, so now is a good time to clean up this
+	 * stale nd_btt instance.
+	 */
+	if (nd_btt->dev.driver)
+		nd_device_unregister(&nd_btt->dev, ND_ASYNC);
+}
+
+static int __nd_btt_remove_disk(struct device *dev, void *data)
+{
+	struct gendisk *disk = data;
+	struct block_device *bdev;
+	struct nd_btt *nd_btt;
+
+	if (!is_nd_btt(dev))
+		return 0;
+
+	nd_btt = to_nd_btt(dev);
+	bdev = nd_btt->backing_dev;
+	if (bdev && bdev->bd_disk == disk)
+		nd_btt_remove_bdev(nd_btt, __func__);
+	return 0;
+}
+
+void nd_btt_remove_disk(struct nvdimm_bus *nvdimm_bus, struct gendisk *disk)
+{
+	device_for_each_child(&nvdimm_bus->dev, disk, __nd_btt_remove_disk);
+}
+
+static ssize_t __backing_dev_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	struct nvdimm_bus *nvdimm_bus = walk_to_nvdimm_bus(dev);
+	const struct block_device_operations *ops;
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+	struct block_device *bdev;
+	char *path;
+
+	if (dev->driver) {
+		dev_dbg(dev, "%s: -EBUSY\n", __func__);
+		return -EBUSY;
+	}
+
+	path = kstrndup(buf, len, GFP_KERNEL);
+	if (!path)
+		return -ENOMEM;
+	strim(path);
+
+	/* detach the backing device */
+	if (strcmp(path, "") == 0) {
+		nd_btt_remove_bdev(nd_btt, __func__);
+		goto out;
+	} else if (nd_btt->backing_dev) {
+		dev_dbg(dev, "backing_dev already set\n");
+		len = -EBUSY;
+		goto out;
+	}
+
+	bdev = blkdev_get_by_path(path, nd_btt_devs_mode, nd_btt);
+	if (IS_ERR(bdev)) {
+		dev_dbg(dev, "open '%s' failed: %ld\n", path, PTR_ERR(bdev));
+		len = PTR_ERR(bdev);
+		goto out;
+	}
+
+	if (nvdimm_bus != walk_to_nvdimm_bus(disk_to_dev(bdev->bd_disk))) {
+		dev_dbg(dev, "%s not a descendant of %s\n", path,
+				dev_name(&nvdimm_bus->dev));
+		blkdev_put(bdev, nd_btt_devs_mode);
+		len = -EINVAL;
+		goto out;
+	}
+
+	if (get_capacity(bdev->bd_disk) < SZ_16M / 512) {
+		dev_dbg(dev, "%s too small to host btt\n", path);
+		blkdev_put(bdev, nd_btt_devs_mode);
+		len = -ENXIO;
+		goto out;
+	}
+
+	ops = bdev->bd_disk->fops;
+	if (!ops->rw_bytes) {
+		dev_dbg(dev, "%s does not implement ->rw_bytes()\n", path);
+		blkdev_put(bdev, nd_btt_devs_mode);
+		len = -EINVAL;
+		goto out;
+	}
+
+	WARN_ON_ONCE(!is_nvdimm_bus_locked(&nd_btt->dev));
+	nd_btt->backing_dev = bdev;
+
+ out:
+	kfree(path);
+	return len;
+}
+
+static ssize_t backing_dev_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	ssize_t rc;
+
+	nvdimm_bus_lock(dev);
+	device_lock(dev);
+	rc = __backing_dev_store(dev, attr, buf, len);
+	dev_dbg(dev, "%s: result: %zd wrote: %s%s", __func__,
+			rc, buf, buf[len - 1] == '\n' ? "" : "\n");
+	device_unlock(dev);
+	nvdimm_bus_unlock(dev);
+
+	return rc;
+}
+static DEVICE_ATTR_RW(backing_dev);
+
+static bool is_nd_btt_idle(struct device *dev)
+{
+	struct nvdimm_bus *nvdimm_bus = walk_to_nvdimm_bus(dev);
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+
+	if (nvdimm_bus->nd_btt == nd_btt || dev->driver || nd_btt->backing_dev)
+		return false;
+	return true;
+}
+
+static ssize_t delete_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	/* return 1 if can be deleted */
+	return sprintf(buf, "%d\n", is_nd_btt_idle(dev));
+}
+
+static ssize_t delete_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	unsigned long val;
+
+	/* write 1 to delete */
+	if (kstrtoul(buf, 0, &val) != 0 || val != 1)
+		return -EINVAL;
+
+	/* prevent deletion while this btt is active, or is the current seed */
+	if (!is_nd_btt_idle(dev))
+		return -EBUSY;
+
+	/*
+	 * userspace raced itself if device goes active here and it gets
+	 * to keep the pieces
+	 */
+	nd_device_unregister(dev, ND_ASYNC);
+
+	return len;
+}
+static DEVICE_ATTR_RW(delete);
+
+static struct attribute *nd_btt_attributes[] = {
+	&dev_attr_sector_size.attr,
+	&dev_attr_backing_dev.attr,
+	&dev_attr_delete.attr,
+	&dev_attr_uuid.attr,
+	NULL,
+};
+
+static struct attribute_group nd_btt_attribute_group = {
+	.attrs = nd_btt_attributes,
+};
+
+static const struct attribute_group *nd_btt_attribute_groups[] = {
+	&nd_btt_attribute_group,
+	&nd_device_attribute_group,
+	NULL,
+};
+
+static struct nd_btt *__nd_btt_create(struct nvdimm_bus *nvdimm_bus,
+		unsigned long lbasize, u8 *uuid, struct block_device *bdev)
+{
+	struct nd_btt *nd_btt = kzalloc(sizeof(*nd_btt), GFP_KERNEL);
+	struct device *dev;
+
+	if (!nd_btt)
+		return NULL;
+	nd_btt->id = ida_simple_get(&btt_ida, 0, 0, GFP_KERNEL);
+	if (nd_btt->id < 0) {
+		kfree(nd_btt);
+		return NULL;
+	}
+
+	nd_btt->lbasize = lbasize;
+	if (uuid)
+		uuid = kmemdup(uuid, 16, GFP_KERNEL);
+	nd_btt->uuid = uuid;
+	nd_btt->backing_dev = bdev;
+	dev = &nd_btt->dev;
+	dev_set_name(dev, "btt%d", nd_btt->id);
+	dev->parent = &nvdimm_bus->dev;
+	dev->type = &nd_btt_device_type;
+	dev->groups = nd_btt_attribute_groups;
+	nd_device_register(&nd_btt->dev);
+	return nd_btt;
+}
+
+struct nd_btt *nd_btt_create(struct nvdimm_bus *nvdimm_bus)
+{
+	return __nd_btt_create(nvdimm_bus, 0, NULL, NULL);
+}
+
+/*
+ * nd_btt_sb_checksum: compute checksum for btt info block
+ *
+ * Returns a fletcher64 checksum of everything in the given info block
+ * except the last field (since that's where the checksum lives).
+ */
+u64 nd_btt_sb_checksum(struct btt_sb *btt_sb)
+{
+	u64 sum, sum_save;
+
+	sum_save = btt_sb->checksum;
+	btt_sb->checksum = 0;
+	sum = nd_fletcher64(btt_sb, sizeof(*btt_sb), 1);
+	btt_sb->checksum = sum_save;
+	return sum;
+}
+EXPORT_SYMBOL(nd_btt_sb_checksum);
+
+static struct nd_btt *__nd_btt_autodetect(struct nvdimm_bus *nvdimm_bus,
+		struct block_device *bdev, struct btt_sb *btt_sb)
+{
+	u64 checksum;
+	u32 lbasize;
+
+	if (!btt_sb || !bdev || !nvdimm_bus)
+		return NULL;
+
+	if (bdev_read_bytes(bdev, SZ_4K, btt_sb, sizeof(*btt_sb)))
+		return NULL;
+
+	if (get_capacity(bdev->bd_disk) < SZ_16M / 512)
+		return NULL;
+
+	if (memcmp(btt_sb->signature, BTT_SIG, BTT_SIG_LEN) != 0)
+		return NULL;
+
+	checksum = le64_to_cpu(btt_sb->checksum);
+	btt_sb->checksum = 0;
+	if (checksum != nd_btt_sb_checksum(btt_sb))
+		return NULL;
+	btt_sb->checksum = cpu_to_le64(checksum);
+
+	lbasize = le32_to_cpu(btt_sb->external_lbasize);
+	return __nd_btt_create(nvdimm_bus, lbasize, btt_sb->uuid, bdev);
+}
+
+static int nd_btt_autodetect(struct nvdimm_bus *nvdimm_bus,
+		struct block_device *bdev)
+{
+	char name[BDEVNAME_SIZE];
+	struct nd_btt *nd_btt;
+	struct btt_sb *btt_sb;
+
+	btt_sb = kzalloc(sizeof(*btt_sb), GFP_KERNEL);
+	nd_btt = __nd_btt_autodetect(nvdimm_bus, bdev, btt_sb);
+	kfree(btt_sb);
+	dev_dbg(&nvdimm_bus->dev, "%s: %s btt: %s\n", __func__,
+			bdevname(bdev, name), nd_btt
+			? dev_name(&nd_btt->dev) : "<none>");
+	return nd_btt ? 0 : -ENODEV;
+}
+
+void nd_btt_add_disk(struct nvdimm_bus *nvdimm_bus, struct gendisk *disk)
+{
+	struct disk_part_iter piter;
+	struct hd_struct *part;
+
+	disk_part_iter_init(&piter, disk, DISK_PITER_INCL_PART0);
+	while ((part = disk_part_iter_next(&piter))) {
+		struct block_device *bdev;
+		int rc;
+
+		bdev = bdget_disk(disk, part->partno);
+		if (!bdev)
+			continue;
+		if (blkdev_get(bdev, nd_btt_devs_mode, nvdimm_bus) != 0)
+			continue;
+		rc = nd_btt_autodetect(nvdimm_bus, bdev);
+		if (rc)
+			blkdev_put(bdev, nd_btt_devs_mode);
+		/* no need to scan further in the case of whole disk btt */
+		if (rc == 0 && part->partno == 0)
+			break;
+	}
+	disk_part_iter_exit(&piter);
+}
diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index ca802702440e..3c14fee5aff4 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -14,8 +14,10 @@
 #include <linux/vmalloc.h>
 #include <linux/uaccess.h>
 #include <linux/module.h>
+#include <linux/blkdev.h>
 #include <linux/fcntl.h>
 #include <linux/async.h>
+#include <linux/genhd.h>
 #include <linux/ndctl.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
@@ -40,6 +42,8 @@ static int to_nd_device_type(struct device *dev)
 		return ND_DEVICE_REGION_BLK;
 	else if (is_nd_pmem(dev->parent) || is_nd_blk(dev->parent))
 		return nd_region_to_nstype(to_nd_region(dev->parent));
+	else if (is_nd_btt(dev))
+		return ND_DEVICE_BTT;
 
 	return 0;
 }
@@ -103,6 +107,21 @@ static int nvdimm_bus_probe(struct device *dev)
 
 	dev_dbg(&nvdimm_bus->dev, "%s.probe(%s) = %d\n", dev->driver->name,
 			dev_name(dev), rc);
+
+	/* check if our btt-seed has sprouted, and plant another */
+	if (rc == 0 && is_nd_btt(dev) && dev == &nvdimm_bus->nd_btt->dev) {
+		const char *sep = "", *name = "", *status = "failed";
+
+		nvdimm_bus->nd_btt = nd_btt_create(nvdimm_bus);
+		if (nvdimm_bus->nd_btt) {
+			status = "succeeded";
+			sep = ": ";
+			name = dev_name(&nvdimm_bus->nd_btt->dev);
+		}
+		dev_dbg(&nvdimm_bus->dev, "btt seed creation %s%s%s\n",
+				status, sep, name);
+	}
+
 	if (rc != 0)
 		module_put(provider);
 	return rc;
@@ -163,14 +182,19 @@ static void nd_async_device_unregister(void *d, async_cookie_t cookie)
 	put_device(dev);
 }
 
-void nd_device_register(struct device *dev)
+void __nd_device_register(struct device *dev)
 {
 	dev->bus = &nvdimm_bus_type;
-	device_initialize(dev);
 	get_device(dev);
 	async_schedule_domain(nd_async_device_register, dev,
 			&nd_async_domain);
 }
+
+void nd_device_register(struct device *dev)
+{
+	device_initialize(dev);
+	__nd_device_register(dev);
+}
 EXPORT_SYMBOL(nd_device_register);
 
 void nd_device_unregister(struct device *dev, enum nd_async_mode mode)
@@ -219,6 +243,60 @@ int __nd_driver_register(struct nd_device_driver *nd_drv, struct module *owner,
 }
 EXPORT_SYMBOL(__nd_driver_register);
 
+/**
+ * nvdimm_bus_add_disk() - attach and run actions on an nvdimm block device
+ * @disk: disk device being registered
+ *
+ * Note that @disk must be a descendant of an nvdimm_bus
+ */
+int nvdimm_bus_add_disk(struct gendisk *disk)
+{
+	const struct block_device_operations *ops = disk->fops;
+	struct device *dev = disk->driverfs_dev;
+	struct nvdimm_bus *nvdimm_bus;
+
+	nvdimm_bus = walk_to_nvdimm_bus(dev);
+	if (!nvdimm_bus || !ops->rw_bytes)
+		return -EINVAL;
+
+	/*
+	 * Take the bus lock here to prevent userspace racing to
+	 * initiate actions on the newly available block device while
+	 * autodetect scanning is still in flight.
+	 */
+	nvdimm_bus_lock(&nvdimm_bus->dev);
+	add_disk(disk);
+	nd_btt_add_disk(nvdimm_bus, disk);
+	nvdimm_bus_unlock(&nvdimm_bus->dev);
+
+	return 0;
+}
+EXPORT_SYMBOL(nvdimm_bus_add_disk);
+
+void nvdimm_bus_remove_disk(struct gendisk *disk)
+{
+	struct device *dev = disk_to_dev(disk);
+	struct nvdimm_bus *nvdimm_bus;
+
+	nvdimm_bus = walk_to_nvdimm_bus(dev);
+	if (!nvdimm_bus)
+		return;
+
+	nvdimm_bus_lock(&nvdimm_bus->dev);
+	nd_btt_remove_disk(nvdimm_bus, disk);
+	nvdimm_bus_unlock(&nvdimm_bus->dev);
+
+	/*
+	 * Flush in case *_notify_remove() kicked off asynchronous
+	 * device unregistration
+	 */
+	nd_synchronize();
+
+	del_gendisk(disk);
+	put_disk(disk);
+}
+EXPORT_SYMBOL(nvdimm_bus_remove_disk);
+
 static ssize_t modalias_show(struct device *dev, struct device_attribute *attr,
 		char *buf)
 {
diff --git a/drivers/nvdimm/core.c b/drivers/nvdimm/core.c
index dd824d7c2669..0fa9b6225450 100644
--- a/drivers/nvdimm/core.c
+++ b/drivers/nvdimm/core.c
@@ -273,10 +273,28 @@ static ssize_t wait_probe_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(wait_probe);
 
+static ssize_t btt_seed_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nvdimm_bus *nvdimm_bus = to_nvdimm_bus(dev);
+	ssize_t rc;
+
+	nvdimm_bus_lock(dev);
+	if (nvdimm_bus->nd_btt)
+		rc = sprintf(buf, "%s\n", dev_name(&nvdimm_bus->nd_btt->dev));
+	else
+		rc = sprintf(buf, "\n");
+	nvdimm_bus_unlock(dev);
+
+	return rc;
+}
+static DEVICE_ATTR_RO(btt_seed);
+
 static struct attribute *nvdimm_bus_attributes[] = {
 	&dev_attr_commands.attr,
 	&dev_attr_wait_probe.attr,
 	&dev_attr_provider.attr,
+	&dev_attr_btt_seed.attr,
 	NULL,
 };
 
@@ -322,6 +340,8 @@ struct nvdimm_bus *__nvdimm_bus_register(struct device *parent,
 	list_add_tail(&nvdimm_bus->list, &nvdimm_bus_list);
 	mutex_unlock(&nvdimm_bus_list_mutex);
 
+	nvdimm_bus->nd_btt = nd_btt_create(nvdimm_bus);
+
 	return nvdimm_bus;
  err:
 	put_device(&nvdimm_bus->dev);
diff --git a/drivers/nvdimm/nd-core.h b/drivers/nvdimm/nd-core.h
index 78d6c51f4bac..1375d30b3da5 100644
--- a/drivers/nvdimm/nd-core.h
+++ b/drivers/nvdimm/nd-core.h
@@ -23,6 +23,11 @@ extern struct list_head nvdimm_bus_list;
 extern struct mutex nvdimm_bus_list_mutex;
 extern int nvdimm_major;
 
+struct block_device;
+struct nd_io_claim;
+struct nd_btt;
+struct nd_io;
+
 struct nvdimm_bus {
 	struct nvdimm_bus_descriptor *nd_desc;
 	wait_queue_head_t probe_wait;
@@ -31,6 +36,7 @@ struct nvdimm_bus {
 	struct device dev;
 	int id, probe_active;
 	struct mutex reconfig_mutex;
+	struct nd_btt *nd_btt;
 };
 
 struct nvdimm {
@@ -45,6 +51,33 @@ struct nvdimm {
 bool is_nvdimm(struct device *dev);
 bool is_nd_blk(struct device *dev);
 bool is_nd_pmem(struct device *dev);
+struct gendisk;
+#if IS_ENABLED(CONFIG_ND_BTT_DEVS)
+bool is_nd_btt(struct device *dev);
+struct nd_btt *nd_btt_create(struct nvdimm_bus *nvdimm_bus);
+void nd_btt_add_disk(struct nvdimm_bus *nvdimm_bus, struct gendisk *disk);
+void nd_btt_remove_disk(struct nvdimm_bus *nvdimm_bus, struct gendisk *disk);
+#else
+static inline bool is_nd_btt(struct device *dev)
+{
+	return false;
+}
+
+static inline struct nd_btt *nd_btt_create(struct nvdimm_bus *nvdimm_bus)
+{
+	return NULL;
+}
+
+static inline void nd_btt_add_disk(struct nvdimm_bus *nvdimm_bus,
+		struct gendisk *disk)
+{
+}
+
+static inline void nd_btt_remove_disk(struct nvdimm_bus *nvdimm_bus,
+		struct gendisk *disk)
+{
+}
+#endif
 struct nvdimm_bus *walk_to_nvdimm_bus(struct device *nd_dev);
 int __init nvdimm_bus_init(void);
 void nvdimm_bus_exit(void);
@@ -58,6 +91,7 @@ void nd_synchronize(void);
 int nvdimm_bus_register_dimms(struct nvdimm_bus *nvdimm_bus);
 int nvdimm_bus_register_regions(struct nvdimm_bus *nvdimm_bus);
 int nvdimm_bus_init_interleave_sets(struct nvdimm_bus *nvdimm_bus);
+void __nd_device_register(struct device *dev);
 int nd_match_dimm(struct device *dev, void *data);
 struct nd_label_id;
 char *nd_label_gen_id(struct nd_label_id *label_id, u8 *uuid, u32 flags);
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index bfa849617358..3bd8d650340e 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -14,11 +14,17 @@
 #define __ND_H__
 #include <linux/libnvdimm.h>
 #include <linux/device.h>
+#include <linux/genhd.h>
 #include <linux/mutex.h>
 #include <linux/ndctl.h>
 #include <linux/types.h>
+#include <linux/fs.h>
 #include "label.h"
 
+enum {
+	SECTOR_SHIFT = 9,
+};
+
 struct nvdimm_drvdata {
 	struct device *dev;
 	int nsindex_size;
@@ -94,6 +100,14 @@ static inline unsigned nd_inc_seq(unsigned seq)
 	return next[seq & 3];
 }
 
+struct nd_btt {
+	struct device dev;
+	struct block_device *backing_dev;
+	unsigned long lbasize;
+	u8 *uuid;
+	int id;
+};
+
 enum nd_async_mode {
 	ND_SYNC,
 	ND_ASYNC,
@@ -118,6 +132,9 @@ int nvdimm_init_nsarea(struct nvdimm_drvdata *ndd);
 int nvdimm_init_config_data(struct nvdimm_drvdata *ndd);
 int nvdimm_set_config_data(struct nvdimm_drvdata *ndd, size_t offset,
 		void *buf, size_t len);
+struct nd_btt *to_nd_btt(struct device *dev);
+struct btt_sb;
+u64 nd_btt_sb_checksum(struct btt_sb *btt_sb);
 struct nd_region *to_nd_region(struct device *dev);
 int nd_region_to_nstype(struct nd_region *nd_region);
 int nd_region_register_namespaces(struct nd_region *nd_region, int *err);
@@ -125,6 +142,8 @@ u64 nd_region_interleave_set_cookie(struct nd_region *nd_region);
 void nvdimm_bus_lock(struct device *dev);
 void nvdimm_bus_unlock(struct device *dev);
 bool is_nvdimm_bus_locked(struct device *dev);
+int nvdimm_bus_add_disk(struct gendisk *disk);
+void nvdimm_bus_remove_disk(struct gendisk *disk);
 void nvdimm_drvdata_release(struct kref *kref);
 void put_ndd(struct nvdimm_drvdata *ndd);
 int nd_label_reserve_dpa(struct nvdimm_drvdata *ndd);
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index efa2cde7f6b6..1f4767150975 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -190,8 +190,6 @@ static struct pmem_device *pmem_alloc(struct device *dev,
 	set_capacity(disk, pmem->size >> 9);
 	pmem->pmem_disk = disk;
 
-	add_disk(disk);
-
 	return pmem;
 
 out_free_queue:
@@ -208,8 +206,7 @@ out:
 
 static void pmem_free(struct pmem_device *pmem)
 {
-	del_gendisk(pmem->pmem_disk);
-	put_disk(pmem->pmem_disk);
+	nvdimm_bus_remove_disk(pmem->pmem_disk);
 	blk_cleanup_queue(pmem->pmem_queue);
 	iounmap(pmem->virt_addr);
 	release_mem_region(pmem->phys_addr, pmem->size);
@@ -244,6 +241,7 @@ static int nd_pmem_probe(struct device *dev)
 		return PTR_ERR(pmem);
 
 	dev_set_drvdata(dev, pmem);
+	nvdimm_bus_add_disk(pmem->pmem_disk);
 
 	return 0;
 }
diff --git a/include/uapi/linux/ndctl.h b/include/uapi/linux/ndctl.h
index 2b94ea2287bb..4c2e3ff374b2 100644
--- a/include/uapi/linux/ndctl.h
+++ b/include/uapi/linux/ndctl.h
@@ -181,6 +181,7 @@ static inline const char *nvdimm_cmd_name(unsigned cmd)
 #define ND_DEVICE_NAMESPACE_IO 4    /* legacy persistent memory */
 #define ND_DEVICE_NAMESPACE_PMEM 5  /* PMEM namespace (may alias with BLK) */
 #define ND_DEVICE_NAMESPACE_BLK 6   /* BLK namespace (may alias with PMEM) */
+#define ND_DEVICE_BTT 7		    /* block-translation table device */
 
 enum nd_driver_flags {
 	ND_DRIVER_DIMM            = 1 << ND_DEVICE_DIMM,
@@ -189,6 +190,7 @@ enum nd_driver_flags {
 	ND_DRIVER_NAMESPACE_IO    = 1 << ND_DEVICE_NAMESPACE_IO,
 	ND_DRIVER_NAMESPACE_PMEM  = 1 << ND_DEVICE_NAMESPACE_PMEM,
 	ND_DRIVER_NAMESPACE_BLK   = 1 << ND_DEVICE_NAMESPACE_BLK,
+	ND_DRIVER_BTT		  = 1 << ND_DEVICE_BTT,
 };
 
 enum {

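For readers following the info-block validation in __nd_btt_autodetect()
and nd_btt_sb_checksum() above, here is a minimal userspace sketch of the
"zero the checksum field, sum the whole block, restore the field" pattern.
It is not the kernel code: toy_sb, toy_fletcher64(), toy_sb_checksum() and
toy_sb_valid() are hypothetical names, the block layout is shortened, and
the summing loop is only an assumption about how a Fletcher-style 64-bit
sum over little-endian 32-bit words might look (nd_fletcher64() itself is
not part of this patch).  The stored checksum is kept host-endian here for
simplicity, whereas the patch stores it as __le64.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, shortened info block; the checksum is the last field. */
struct toy_sb {
	uint8_t signature[16];
	uint8_t payload[40];
	uint64_t checksum;
};

/*
 * Assumed Fletcher-style sum: accumulate little-endian 32-bit words in
 * lo32 (wrapping at 32 bits) and accumulate the running lo32 in hi32.
 */
static uint64_t toy_fletcher64(const void *addr, size_t len)
{
	const unsigned char *p = addr;
	uint32_t lo32 = 0;
	uint64_t hi32 = 0;
	size_t i;

	for (i = 0; i + 4 <= len; i += 4) {
		uint32_t word = (uint32_t)p[i] |
				(uint32_t)p[i + 1] << 8 |
				(uint32_t)p[i + 2] << 16 |
				(uint32_t)p[i + 3] << 24;

		lo32 += word;
		hi32 += lo32;
	}
	return hi32 << 32 | lo32;
}

/* Sum the block with the checksum field temporarily zeroed. */
static uint64_t toy_sb_checksum(struct toy_sb *sb)
{
	uint64_t saved = sb->checksum;
	uint64_t sum;

	sb->checksum = 0;
	sum = toy_fletcher64(sb, sizeof(*sb));
	sb->checksum = saved;	/* leave the caller's block unchanged */
	return sum;
}

/* A block is accepted only if the stored and recomputed sums agree. */
static int toy_sb_valid(struct toy_sb *sb)
{
	return sb->checksum == toy_sb_checksum(sb);
}
```

Because the helper restores the field before returning, a caller can
validate a block that was just read from media without disturbing it.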


* [PATCH 02/15] libnvdimm: infrastructure for btt devices
@ 2015-06-17 23:54   ` Dan Williams
  0 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-17 23:54 UTC (permalink / raw)
  To: axboe, linux-nvdimm
  Cc: boaz, toshi.kani, Neil Brown, Greg KH, linux-kernel, mingo,
	linux-acpi, linux-fsdevel, hch

Block devices from an nd bus, in addition to accepting "struct bio"
based requests, also have the capability to perform byte-aligned
accesses.  By default only the bio/block interface is used.  However, if
another driver can make effective use of the byte-aligned capability it
can claim the block interface and use the byte-aligned ->rw_bytes()
interface.

The BTT driver is the first consumer of this mechanism.  It layers
atomic sector update guarantees on top of ->rw_bytes() capable
libnvdimm block devices, or their partitions.
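
As a rough userspace illustration of the claim-time gating described
above (not kernel code; toy_ops, toy_dev, toy_rw_bytes() and toy_claim()
are hypothetical names invented for this sketch), a consumer might refuse
a device that is already claimed, is smaller than the 16MiB minimum that
the patch below checks for, or lacks a byte-granular accessor.  The error
codes mirror the ones returned by __backing_dev_store() in the diff.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an ops table with an optional ->rw_bytes(). */
struct toy_ops {
	/* rw != 0 copies buf to the media, rw == 0 copies media to buf */
	int (*rw_bytes)(void *media, size_t media_len, void *buf,
			size_t len, size_t offset, int rw);
};

struct toy_dev {
	const struct toy_ops *ops;
	void *media;
	size_t media_len;	/* capacity in bytes */
	void *holder;		/* non-NULL once a consumer has claimed it */
};

#define TOY_MIN_SIZE ((size_t)16 << 20)	/* 16MiB, as in the patch */

/* Byte-aligned access: any offset and length within the media is valid. */
static int toy_rw_bytes(void *media, size_t media_len, void *buf,
		size_t len, size_t offset, int rw)
{
	if (offset > media_len || len > media_len - offset)
		return -EIO;
	if (rw)
		memcpy((unsigned char *)media + offset, buf, len);
	else
		memcpy(buf, (unsigned char *)media + offset, len);
	return 0;
}

/* Claim the device for exclusive byte-aligned use, or say why not. */
static int toy_claim(struct toy_dev *dev, void *holder)
{
	if (dev->holder)
		return -EBUSY;		/* backing device already set */
	if (dev->media_len < TOY_MIN_SIZE)
		return -ENXIO;		/* too small to host a btt */
	if (!dev->ops || !dev->ops->rw_bytes)
		return -EINVAL;		/* no byte-aligned interface */
	dev->holder = holder;
	return 0;
}
```

In this sketch the claim is recorded with a plain pointer; the patch
itself relies on an exclusive open (FMODE_EXCL) of the block device for
the same purpose.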

Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/nvdimm/Kconfig     |    3 
 drivers/nvdimm/Makefile    |    1 
 drivers/nvdimm/btt.h       |   45 +++++
 drivers/nvdimm/btt_devs.c  |  431 ++++++++++++++++++++++++++++++++++++++++++++
 drivers/nvdimm/bus.c       |   82 ++++++++
 drivers/nvdimm/core.c      |   20 ++
 drivers/nvdimm/nd-core.h   |   34 +++
 drivers/nvdimm/nd.h        |   19 ++
 drivers/nvdimm/pmem.c      |    6 -
 include/uapi/linux/ndctl.h |    2 
 10 files changed, 637 insertions(+), 6 deletions(-)
 create mode 100644 drivers/nvdimm/btt.h
 create mode 100644 drivers/nvdimm/btt_devs.c

diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index 07a29113b870..f16ba9d14740 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -33,4 +33,7 @@ config BLK_DEV_PMEM
 
 	  Say Y if you want to use an NVDIMM
 
+config ND_BTT_DEVS
+	def_bool y
+
 endif
diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
index abce98f87f16..eb1bbce86592 100644
--- a/drivers/nvdimm/Makefile
+++ b/drivers/nvdimm/Makefile
@@ -11,3 +11,4 @@ libnvdimm-y += region_devs.o
 libnvdimm-y += region.o
 libnvdimm-y += namespace_devs.o
 libnvdimm-y += label.o
+libnvdimm-$(CONFIG_ND_BTT_DEVS) += btt_devs.o
diff --git a/drivers/nvdimm/btt.h b/drivers/nvdimm/btt.h
new file mode 100644
index 000000000000..e8f6d8e0ddd3
--- /dev/null
+++ b/drivers/nvdimm/btt.h
@@ -0,0 +1,45 @@
+/*
+ * Block Translation Table library
+ * Copyright (c) 2014-2015, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef _LINUX_BTT_H
+#define _LINUX_BTT_H
+
+#include <linux/types.h>
+
+#define BTT_SIG_LEN 16
+#define BTT_SIG "BTT_ARENA_INFO\0"
+
+struct btt_sb {
+	u8 signature[BTT_SIG_LEN];
+	u8 uuid[16];
+	u8 parent_uuid[16];
+	__le32 flags;
+	__le16 version_major;
+	__le16 version_minor;
+	__le32 external_lbasize;
+	__le32 external_nlba;
+	__le32 internal_lbasize;
+	__le32 internal_nlba;
+	__le32 nfree;
+	__le32 infosize;
+	__le64 nextoff;
+	__le64 dataoff;
+	__le64 mapoff;
+	__le64 logoff;
+	__le64 info2off;
+	u8 padding[3968];
+	__le64 checksum;
+};
+
+#endif
diff --git a/drivers/nvdimm/btt_devs.c b/drivers/nvdimm/btt_devs.c
new file mode 100644
index 000000000000..2148fd8f535b
--- /dev/null
+++ b/drivers/nvdimm/btt_devs.c
@@ -0,0 +1,431 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/blkdev.h>
+#include <linux/device.h>
+#include <linux/genhd.h>
+#include <linux/sizes.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include "nd-core.h"
+#include "btt.h"
+#include "nd.h"
+
+static DEFINE_IDA(btt_ida);
+
+static void nd_btt_release(struct device *dev)
+{
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+
+	dev_dbg(dev, "%s\n", __func__);
+	WARN_ON(nd_btt->backing_dev);
+	ida_simple_remove(&btt_ida, nd_btt->id);
+	kfree(nd_btt->uuid);
+	kfree(nd_btt);
+}
+
+static struct device_type nd_btt_device_type = {
+	.name = "nd_btt",
+	.release = nd_btt_release,
+};
+
+bool is_nd_btt(struct device *dev)
+{
+	return dev->type == &nd_btt_device_type;
+}
+
+struct nd_btt *to_nd_btt(struct device *dev)
+{
+	struct nd_btt *nd_btt = container_of(dev, struct nd_btt, dev);
+
+	WARN_ON(!is_nd_btt(dev));
+	return nd_btt;
+}
+EXPORT_SYMBOL(to_nd_btt);
+
+static const unsigned long btt_lbasize_supported[] = { 512, 4096, 0 };
+
+static ssize_t sector_size_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+
+	return nd_sector_size_show(nd_btt->lbasize, btt_lbasize_supported, buf);
+}
+
+static ssize_t sector_size_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+	ssize_t rc;
+
+	device_lock(dev);
+	nvdimm_bus_lock(dev);
+	rc = nd_sector_size_store(dev, buf, &nd_btt->lbasize,
+			btt_lbasize_supported);
+	dev_dbg(dev, "%s: result: %zd wrote: %s%s", __func__,
+			rc, buf, buf[len - 1] == '\n' ? "" : "\n");
+	nvdimm_bus_unlock(dev);
+	device_unlock(dev);
+
+	return rc ? rc : len;
+}
+static DEVICE_ATTR_RW(sector_size);
+
+static ssize_t uuid_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+
+	if (nd_btt->uuid)
+		return sprintf(buf, "%pUb\n", nd_btt->uuid);
+	return sprintf(buf, "\n");
+}
+
+static ssize_t uuid_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+	ssize_t rc;
+
+	device_lock(dev);
+	rc = nd_uuid_store(dev, &nd_btt->uuid, buf, len);
+	dev_dbg(dev, "%s: result: %zd wrote: %s%s", __func__,
+			rc, buf, buf[len - 1] == '\n' ? "" : "\n");
+	device_unlock(dev);
+
+	return rc ? rc : len;
+}
+static DEVICE_ATTR_RW(uuid);
+
+static ssize_t backing_dev_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+	char name[BDEVNAME_SIZE];
+
+	if (nd_btt->backing_dev)
+		return sprintf(buf, "/dev/%s\n",
+				bdevname(nd_btt->backing_dev, name));
+	else
+		return sprintf(buf, "\n");
+}
+
+static const fmode_t nd_btt_devs_mode = FMODE_READ | FMODE_WRITE | FMODE_EXCL;
+
+static void nd_btt_remove_bdev(struct nd_btt *nd_btt, const char *caller)
+{
+	struct block_device *bdev = nd_btt->backing_dev;
+	char bdev_name[BDEVNAME_SIZE];
+
+	if (!nd_btt->backing_dev)
+		return;
+
+	WARN_ON_ONCE(!is_nvdimm_bus_locked(&nd_btt->dev));
+	dev_dbg(&nd_btt->dev, "%s: %s: release %s\n", caller, __func__,
+			bdevname(bdev, bdev_name));
+	blkdev_put(bdev, nd_btt_devs_mode);
+	nd_btt->backing_dev = NULL;
+
+	/*
+	 * Once we've had our backing device removed we need to be fully
+	 * reconfigured.  The bus will have already created a new seed
+	 * for this purpose, so now is a good time to clean up this
+	 * stale nd_btt instance.
+	 */
+	if (nd_btt->dev.driver)
+		nd_device_unregister(&nd_btt->dev, ND_ASYNC);
+}
+
+static int __nd_btt_remove_disk(struct device *dev, void *data)
+{
+	struct gendisk *disk = data;
+	struct block_device *bdev;
+	struct nd_btt *nd_btt;
+
+	if (!is_nd_btt(dev))
+		return 0;
+
+	nd_btt = to_nd_btt(dev);
+	bdev = nd_btt->backing_dev;
+	if (bdev && bdev->bd_disk == disk)
+		nd_btt_remove_bdev(nd_btt, __func__);
+	return 0;
+}
+
+void nd_btt_remove_disk(struct nvdimm_bus *nvdimm_bus, struct gendisk *disk)
+{
+	device_for_each_child(&nvdimm_bus->dev, disk, __nd_btt_remove_disk);
+}
+
+static ssize_t __backing_dev_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	struct nvdimm_bus *nvdimm_bus = walk_to_nvdimm_bus(dev);
+	const struct block_device_operations *ops;
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+	struct block_device *bdev;
+	char *path;
+
+	if (dev->driver) {
+		dev_dbg(dev, "%s: -EBUSY\n", __func__);
+		return -EBUSY;
+	}
+
+	path = kstrndup(buf, len, GFP_KERNEL);
+	if (!path)
+		return -ENOMEM;
+	strim(path);
+
+	/* detach the backing device */
+	if (strcmp(path, "") == 0) {
+		nd_btt_remove_bdev(nd_btt, __func__);
+		goto out;
+	} else if (nd_btt->backing_dev) {
+		dev_dbg(dev, "backing_dev already set\n");
+		len = -EBUSY;
+		goto out;
+	}
+
+	bdev = blkdev_get_by_path(path, nd_btt_devs_mode, nd_btt);
+	if (IS_ERR(bdev)) {
+		dev_dbg(dev, "open '%s' failed: %ld\n", path, PTR_ERR(bdev));
+		len = PTR_ERR(bdev);
+		goto out;
+	}
+
+	if (nvdimm_bus != walk_to_nvdimm_bus(disk_to_dev(bdev->bd_disk))) {
+		dev_dbg(dev, "%s not a descendant of %s\n", path,
+				dev_name(&nvdimm_bus->dev));
+		blkdev_put(bdev, nd_btt_devs_mode);
+		len = -EINVAL;
+		goto out;
+	}
+
+	if (get_capacity(bdev->bd_disk) < SZ_16M / 512) {
+		dev_dbg(dev, "%s too small to host btt\n", path);
+		blkdev_put(bdev, nd_btt_devs_mode);
+		len = -ENXIO;
+		goto out;
+	}
+
+	ops = bdev->bd_disk->fops;
+	if (!ops->rw_bytes) {
+		dev_dbg(dev, "%s does not implement ->rw_bytes()\n", path);
+		blkdev_put(bdev, nd_btt_devs_mode);
+		len = -EINVAL;
+		goto out;
+	}
+
+	WARN_ON_ONCE(!is_nvdimm_bus_locked(&nd_btt->dev));
+	nd_btt->backing_dev = bdev;
+
+ out:
+	kfree(path);
+	return len;
+}
+
+static ssize_t backing_dev_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	ssize_t rc;
+
+	nvdimm_bus_lock(dev);
+	device_lock(dev);
+	rc = __backing_dev_store(dev, attr, buf, len);
+	dev_dbg(dev, "%s: result: %zd wrote: %s%s", __func__,
+			rc, buf, buf[len - 1] == '\n' ? "" : "\n");
+	device_unlock(dev);
+	nvdimm_bus_unlock(dev);
+
+	return rc;
+}
+static DEVICE_ATTR_RW(backing_dev);
+
+static bool is_nd_btt_idle(struct device *dev)
+{
+	struct nvdimm_bus *nvdimm_bus = walk_to_nvdimm_bus(dev);
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+
+	if (nvdimm_bus->nd_btt == nd_btt || dev->driver || nd_btt->backing_dev)
+		return false;
+	return true;
+}
+
+static ssize_t delete_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	/* return 1 if can be deleted */
+	return sprintf(buf, "%d\n", is_nd_btt_idle(dev));
+}
+
+static ssize_t delete_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	unsigned long val;
+
+	/* write 1 to delete */
+	if (kstrtoul(buf, 0, &val) != 0 || val != 1)
+		return -EINVAL;
+
+	/* prevent deletion while this btt is active, or is the current seed */
+	if (!is_nd_btt_idle(dev))
+		return -EBUSY;
+
+	/*
+	 * userspace raced itself if device goes active here and it gets
+	 * to keep the pieces
+	 */
+	nd_device_unregister(dev, ND_ASYNC);
+
+	return len;
+}
+static DEVICE_ATTR_RW(delete);
+
+static struct attribute *nd_btt_attributes[] = {
+	&dev_attr_sector_size.attr,
+	&dev_attr_backing_dev.attr,
+	&dev_attr_delete.attr,
+	&dev_attr_uuid.attr,
+	NULL,
+};
+
+static struct attribute_group nd_btt_attribute_group = {
+	.attrs = nd_btt_attributes,
+};
+
+static const struct attribute_group *nd_btt_attribute_groups[] = {
+	&nd_btt_attribute_group,
+	&nd_device_attribute_group,
+	NULL,
+};
+
+static struct nd_btt *__nd_btt_create(struct nvdimm_bus *nvdimm_bus,
+		unsigned long lbasize, u8 *uuid, struct block_device *bdev)
+{
+	struct nd_btt *nd_btt = kzalloc(sizeof(*nd_btt), GFP_KERNEL);
+	struct device *dev;
+
+	if (!nd_btt)
+		return NULL;
+	nd_btt->id = ida_simple_get(&btt_ida, 0, 0, GFP_KERNEL);
+	if (nd_btt->id < 0) {
+		kfree(nd_btt);
+		return NULL;
+	}
+
+	nd_btt->lbasize = lbasize;
+	if (uuid)
+		uuid = kmemdup(uuid, 16, GFP_KERNEL);
+	nd_btt->uuid = uuid;
+	nd_btt->backing_dev = bdev;
+	dev = &nd_btt->dev;
+	dev_set_name(dev, "btt%d", nd_btt->id);
+	dev->parent = &nvdimm_bus->dev;
+	dev->type = &nd_btt_device_type;
+	dev->groups = nd_btt_attribute_groups;
+	nd_device_register(&nd_btt->dev);
+	return nd_btt;
+}
+
+struct nd_btt *nd_btt_create(struct nvdimm_bus *nvdimm_bus)
+{
+	return __nd_btt_create(nvdimm_bus, 0, NULL, NULL);
+}
+
+/*
+ * nd_btt_sb_checksum: compute checksum for btt info block
+ *
+ * Returns a fletcher64 checksum of everything in the given info block
+ * except the last field (since that's where the checksum lives).
+ */
+u64 nd_btt_sb_checksum(struct btt_sb *btt_sb)
+{
+	u64 sum, sum_save;
+
+	sum_save = btt_sb->checksum;
+	btt_sb->checksum = 0;
+	sum = nd_fletcher64(btt_sb, sizeof(*btt_sb), 1);
+	btt_sb->checksum = sum_save;
+	return sum;
+}
+EXPORT_SYMBOL(nd_btt_sb_checksum);
+
+static struct nd_btt *__nd_btt_autodetect(struct nvdimm_bus *nvdimm_bus,
+		struct block_device *bdev, struct btt_sb *btt_sb)
+{
+	u64 checksum;
+	u32 lbasize;
+
+	if (!btt_sb || !bdev || !nvdimm_bus)
+		return NULL;
+
+	if (bdev_read_bytes(bdev, SZ_4K, btt_sb, sizeof(*btt_sb)))
+		return NULL;
+
+	if (get_capacity(bdev->bd_disk) < SZ_16M / 512)
+		return NULL;
+
+	if (memcmp(btt_sb->signature, BTT_SIG, BTT_SIG_LEN) != 0)
+		return NULL;
+
+	checksum = le64_to_cpu(btt_sb->checksum);
+	btt_sb->checksum = 0;
+	if (checksum != nd_btt_sb_checksum(btt_sb))
+		return NULL;
+	btt_sb->checksum = cpu_to_le64(checksum);
+
+	lbasize = le32_to_cpu(btt_sb->external_lbasize);
+	return __nd_btt_create(nvdimm_bus, lbasize, btt_sb->uuid, bdev);
+}
+
+static int nd_btt_autodetect(struct nvdimm_bus *nvdimm_bus,
+		struct block_device *bdev)
+{
+	char name[BDEVNAME_SIZE];
+	struct nd_btt *nd_btt;
+	struct btt_sb *btt_sb;
+
+	btt_sb = kzalloc(sizeof(*btt_sb), GFP_KERNEL);
+	nd_btt = __nd_btt_autodetect(nvdimm_bus, bdev, btt_sb);
+	kfree(btt_sb);
+	dev_dbg(&nvdimm_bus->dev, "%s: %s btt: %s\n", __func__,
+			bdevname(bdev, name), nd_btt
+			? dev_name(&nd_btt->dev) : "<none>");
+	return nd_btt ? 0 : -ENODEV;
+}
+
+void nd_btt_add_disk(struct nvdimm_bus *nvdimm_bus, struct gendisk *disk)
+{
+	struct disk_part_iter piter;
+	struct hd_struct *part;
+
+	disk_part_iter_init(&piter, disk, DISK_PITER_INCL_PART0);
+	while ((part = disk_part_iter_next(&piter))) {
+		struct block_device *bdev;
+		int rc;
+
+		bdev = bdget_disk(disk, part->partno);
+		if (!bdev)
+			continue;
+		if (blkdev_get(bdev, nd_btt_devs_mode, nvdimm_bus) != 0)
+			continue;
+		rc = nd_btt_autodetect(nvdimm_bus, bdev);
+		if (rc)
+			blkdev_put(bdev, nd_btt_devs_mode);
+		/* no need to scan further in the case of whole disk btt */
+		if (rc == 0 && part->partno == 0)
+			break;
+	}
+	disk_part_iter_exit(&piter);
+}
diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index ca802702440e..3c14fee5aff4 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -14,8 +14,10 @@
 #include <linux/vmalloc.h>
 #include <linux/uaccess.h>
 #include <linux/module.h>
+#include <linux/blkdev.h>
 #include <linux/fcntl.h>
 #include <linux/async.h>
+#include <linux/genhd.h>
 #include <linux/ndctl.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
@@ -40,6 +42,8 @@ static int to_nd_device_type(struct device *dev)
 		return ND_DEVICE_REGION_BLK;
 	else if (is_nd_pmem(dev->parent) || is_nd_blk(dev->parent))
 		return nd_region_to_nstype(to_nd_region(dev->parent));
+	else if (is_nd_btt(dev))
+		return ND_DEVICE_BTT;
 
 	return 0;
 }
@@ -103,6 +107,21 @@ static int nvdimm_bus_probe(struct device *dev)
 
 	dev_dbg(&nvdimm_bus->dev, "%s.probe(%s) = %d\n", dev->driver->name,
 			dev_name(dev), rc);
+
+	/* check if our btt-seed has sprouted, and plant another */
+	if (rc == 0 && is_nd_btt(dev) && dev == &nvdimm_bus->nd_btt->dev) {
+		const char *sep = "", *name = "", *status = "failed";
+
+		nvdimm_bus->nd_btt = nd_btt_create(nvdimm_bus);
+		if (nvdimm_bus->nd_btt) {
+			status = "succeeded";
+			sep = ": ";
+			name = dev_name(&nvdimm_bus->nd_btt->dev);
+		}
+		dev_dbg(&nvdimm_bus->dev, "btt seed creation %s%s%s\n",
+				status, sep, name);
+	}
+
 	if (rc != 0)
 		module_put(provider);
 	return rc;
@@ -163,14 +182,19 @@ static void nd_async_device_unregister(void *d, async_cookie_t cookie)
 	put_device(dev);
 }
 
-void nd_device_register(struct device *dev)
+void __nd_device_register(struct device *dev)
 {
 	dev->bus = &nvdimm_bus_type;
-	device_initialize(dev);
 	get_device(dev);
 	async_schedule_domain(nd_async_device_register, dev,
 			&nd_async_domain);
 }
+
+void nd_device_register(struct device *dev)
+{
+	device_initialize(dev);
+	__nd_device_register(dev);
+}
 EXPORT_SYMBOL(nd_device_register);
 
 void nd_device_unregister(struct device *dev, enum nd_async_mode mode)
@@ -219,6 +243,60 @@ int __nd_driver_register(struct nd_device_driver *nd_drv, struct module *owner,
 }
 EXPORT_SYMBOL(__nd_driver_register);
 
+/**
+ * nvdimm_bus_add_disk() - attach and run actions on an nvdimm block device
+ * @disk: disk device being registered
+ *
+ * Note that @disk must be a descendant of an nvdimm_bus
+ */
+int nvdimm_bus_add_disk(struct gendisk *disk)
+{
+	const struct block_device_operations *ops = disk->fops;
+	struct device *dev = disk->driverfs_dev;
+	struct nvdimm_bus *nvdimm_bus;
+
+	nvdimm_bus = walk_to_nvdimm_bus(dev);
+	if (!nvdimm_bus || !ops->rw_bytes)
+		return -EINVAL;
+
+	/*
+	 * Take the bus lock here to prevent userspace racing to
+	 * initiate actions on the newly available block device while
+	 * autodetect scanning is still in flight.
+	 */
+	nvdimm_bus_lock(&nvdimm_bus->dev);
+	add_disk(disk);
+	nd_btt_add_disk(nvdimm_bus, disk);
+	nvdimm_bus_unlock(&nvdimm_bus->dev);
+
+	return 0;
+}
+EXPORT_SYMBOL(nvdimm_bus_add_disk);
+
+void nvdimm_bus_remove_disk(struct gendisk *disk)
+{
+	struct device *dev = disk_to_dev(disk);
+	struct nvdimm_bus *nvdimm_bus;
+
+	nvdimm_bus = walk_to_nvdimm_bus(dev);
+	if (!nvdimm_bus)
+		return;
+
+	nvdimm_bus_lock(&nvdimm_bus->dev);
+	nd_btt_remove_disk(nvdimm_bus, disk);
+	nvdimm_bus_unlock(&nvdimm_bus->dev);
+
+	/*
+	 * Flush in case *_notify_remove() kicked off asynchronous
+	 * device unregistration
+	 */
+	nd_synchronize();
+
+	del_gendisk(disk);
+	put_disk(disk);
+}
+EXPORT_SYMBOL(nvdimm_bus_remove_disk);
+
 static ssize_t modalias_show(struct device *dev, struct device_attribute *attr,
 		char *buf)
 {
diff --git a/drivers/nvdimm/core.c b/drivers/nvdimm/core.c
index dd824d7c2669..0fa9b6225450 100644
--- a/drivers/nvdimm/core.c
+++ b/drivers/nvdimm/core.c
@@ -273,10 +273,28 @@ static ssize_t wait_probe_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(wait_probe);
 
+static ssize_t btt_seed_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nvdimm_bus *nvdimm_bus = to_nvdimm_bus(dev);
+	ssize_t rc;
+
+	nvdimm_bus_lock(dev);
+	if (nvdimm_bus->nd_btt)
+		rc = sprintf(buf, "%s\n", dev_name(&nvdimm_bus->nd_btt->dev));
+	else
+		rc = sprintf(buf, "\n");
+	nvdimm_bus_unlock(dev);
+
+	return rc;
+}
+static DEVICE_ATTR_RO(btt_seed);
+
 static struct attribute *nvdimm_bus_attributes[] = {
 	&dev_attr_commands.attr,
 	&dev_attr_wait_probe.attr,
 	&dev_attr_provider.attr,
+	&dev_attr_btt_seed.attr,
 	NULL,
 };
 
@@ -322,6 +340,8 @@ struct nvdimm_bus *__nvdimm_bus_register(struct device *parent,
 	list_add_tail(&nvdimm_bus->list, &nvdimm_bus_list);
 	mutex_unlock(&nvdimm_bus_list_mutex);
 
+	nvdimm_bus->nd_btt = nd_btt_create(nvdimm_bus);
+
 	return nvdimm_bus;
  err:
 	put_device(&nvdimm_bus->dev);
diff --git a/drivers/nvdimm/nd-core.h b/drivers/nvdimm/nd-core.h
index 78d6c51f4bac..1375d30b3da5 100644
--- a/drivers/nvdimm/nd-core.h
+++ b/drivers/nvdimm/nd-core.h
@@ -23,6 +23,11 @@ extern struct list_head nvdimm_bus_list;
 extern struct mutex nvdimm_bus_list_mutex;
 extern int nvdimm_major;
 
+struct block_device;
+struct nd_io_claim;
+struct nd_btt;
+struct nd_io;
+
 struct nvdimm_bus {
 	struct nvdimm_bus_descriptor *nd_desc;
 	wait_queue_head_t probe_wait;
@@ -31,6 +36,7 @@ struct nvdimm_bus {
 	struct device dev;
 	int id, probe_active;
 	struct mutex reconfig_mutex;
+	struct nd_btt *nd_btt;
 };
 
 struct nvdimm {
@@ -45,6 +51,33 @@ struct nvdimm {
 bool is_nvdimm(struct device *dev);
 bool is_nd_blk(struct device *dev);
 bool is_nd_pmem(struct device *dev);
+struct gendisk;
+#if IS_ENABLED(CONFIG_ND_BTT_DEVS)
+bool is_nd_btt(struct device *dev);
+struct nd_btt *nd_btt_create(struct nvdimm_bus *nvdimm_bus);
+void nd_btt_add_disk(struct nvdimm_bus *nvdimm_bus, struct gendisk *disk);
+void nd_btt_remove_disk(struct nvdimm_bus *nvdimm_bus, struct gendisk *disk);
+#else
+static inline bool is_nd_btt(struct device *dev)
+{
+	return false;
+}
+
+static inline struct nd_btt *nd_btt_create(struct nvdimm_bus *nvdimm_bus)
+{
+	return NULL;
+}
+
+static inline void nd_btt_add_disk(struct nvdimm_bus *nvdimm_bus,
+		struct gendisk *disk)
+{
+}
+
+static inline void nd_btt_remove_disk(struct nvdimm_bus *nvdimm_bus,
+		struct gendisk *disk)
+{
+}
+#endif
 struct nvdimm_bus *walk_to_nvdimm_bus(struct device *nd_dev);
 int __init nvdimm_bus_init(void);
 void nvdimm_bus_exit(void);
@@ -58,6 +91,7 @@ void nd_synchronize(void);
 int nvdimm_bus_register_dimms(struct nvdimm_bus *nvdimm_bus);
 int nvdimm_bus_register_regions(struct nvdimm_bus *nvdimm_bus);
 int nvdimm_bus_init_interleave_sets(struct nvdimm_bus *nvdimm_bus);
+void __nd_device_register(struct device *dev);
 int nd_match_dimm(struct device *dev, void *data);
 struct nd_label_id;
 char *nd_label_gen_id(struct nd_label_id *label_id, u8 *uuid, u32 flags);
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index bfa849617358..3bd8d650340e 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -14,11 +14,17 @@
 #define __ND_H__
 #include <linux/libnvdimm.h>
 #include <linux/device.h>
+#include <linux/genhd.h>
 #include <linux/mutex.h>
 #include <linux/ndctl.h>
 #include <linux/types.h>
+#include <linux/fs.h>
 #include "label.h"
 
+enum {
+	SECTOR_SHIFT = 9,
+};
+
 struct nvdimm_drvdata {
 	struct device *dev;
 	int nsindex_size;
@@ -94,6 +100,14 @@ static inline unsigned nd_inc_seq(unsigned seq)
 	return next[seq & 3];
 }
 
+struct nd_btt {
+	struct device dev;
+	struct block_device *backing_dev;
+	unsigned long lbasize;
+	u8 *uuid;
+	int id;
+};
+
 enum nd_async_mode {
 	ND_SYNC,
 	ND_ASYNC,
@@ -118,6 +132,9 @@ int nvdimm_init_nsarea(struct nvdimm_drvdata *ndd);
 int nvdimm_init_config_data(struct nvdimm_drvdata *ndd);
 int nvdimm_set_config_data(struct nvdimm_drvdata *ndd, size_t offset,
 		void *buf, size_t len);
+struct nd_btt *to_nd_btt(struct device *dev);
+struct btt_sb;
+u64 nd_btt_sb_checksum(struct btt_sb *btt_sb);
 struct nd_region *to_nd_region(struct device *dev);
 int nd_region_to_nstype(struct nd_region *nd_region);
 int nd_region_register_namespaces(struct nd_region *nd_region, int *err);
@@ -125,6 +142,8 @@ u64 nd_region_interleave_set_cookie(struct nd_region *nd_region);
 void nvdimm_bus_lock(struct device *dev);
 void nvdimm_bus_unlock(struct device *dev);
 bool is_nvdimm_bus_locked(struct device *dev);
+int nvdimm_bus_add_disk(struct gendisk *disk);
+void nvdimm_bus_remove_disk(struct gendisk *disk);
 void nvdimm_drvdata_release(struct kref *kref);
 void put_ndd(struct nvdimm_drvdata *ndd);
 int nd_label_reserve_dpa(struct nvdimm_drvdata *ndd);
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index efa2cde7f6b6..1f4767150975 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -190,8 +190,6 @@ static struct pmem_device *pmem_alloc(struct device *dev,
 	set_capacity(disk, pmem->size >> 9);
 	pmem->pmem_disk = disk;
 
-	add_disk(disk);
-
 	return pmem;
 
 out_free_queue:
@@ -208,8 +206,7 @@ out:
 
 static void pmem_free(struct pmem_device *pmem)
 {
-	del_gendisk(pmem->pmem_disk);
-	put_disk(pmem->pmem_disk);
+	nvdimm_bus_remove_disk(pmem->pmem_disk);
 	blk_cleanup_queue(pmem->pmem_queue);
 	iounmap(pmem->virt_addr);
 	release_mem_region(pmem->phys_addr, pmem->size);
@@ -244,6 +241,7 @@ static int nd_pmem_probe(struct device *dev)
 		return PTR_ERR(pmem);
 
 	dev_set_drvdata(dev, pmem);
+	nvdimm_bus_add_disk(pmem->pmem_disk);
 
 	return 0;
 }
diff --git a/include/uapi/linux/ndctl.h b/include/uapi/linux/ndctl.h
index 2b94ea2287bb..4c2e3ff374b2 100644
--- a/include/uapi/linux/ndctl.h
+++ b/include/uapi/linux/ndctl.h
@@ -181,6 +181,7 @@ static inline const char *nvdimm_cmd_name(unsigned cmd)
 #define ND_DEVICE_NAMESPACE_IO 4    /* legacy persistent memory */
 #define ND_DEVICE_NAMESPACE_PMEM 5  /* PMEM namespace (may alias with BLK) */
 #define ND_DEVICE_NAMESPACE_BLK 6   /* BLK namespace (may alias with PMEM) */
+#define ND_DEVICE_BTT 7		    /* block-translation table device */
 
 enum nd_driver_flags {
 	ND_DRIVER_DIMM            = 1 << ND_DEVICE_DIMM,
@@ -189,6 +190,7 @@ enum nd_driver_flags {
 	ND_DRIVER_NAMESPACE_IO    = 1 << ND_DEVICE_NAMESPACE_IO,
 	ND_DRIVER_NAMESPACE_PMEM  = 1 << ND_DEVICE_NAMESPACE_PMEM,
 	ND_DRIVER_NAMESPACE_BLK   = 1 << ND_DEVICE_NAMESPACE_BLK,
+	ND_DRIVER_BTT		  = 1 << ND_DEVICE_BTT,
 };
 
 enum {



* [PATCH 03/15] nd_btt: atomic sector updates
  2015-06-17 23:54 ` Dan Williams
@ 2015-06-17 23:55   ` Dan Williams
  -1 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-17 23:55 UTC (permalink / raw)
  To: axboe, linux-nvdimm
  Cc: boaz, toshi.kani, Vishal Verma, Neil Brown, Greg KH,
	Dave Chinner, linux-kernel, Andy Lutomirski, Jens Axboe,
	linux-acpi, Jeff Moyer, H. Peter Anvin, linux-fsdevel, hch,
	mingo

From: Vishal Verma <vishal.l.verma@linux.intel.com>

BTT stands for Block Translation Table, and is a way to provide power
fail sector atomicity semantics for block devices that have the ability
to perform byte granularity IO. It relies on the capability of libnvdimm
namespace devices to do byte aligned IO.

The BTT works as a stacked block device, and reserves a chunk of space
from the backing device for its accounting metadata. It is a bio-based
driver because all IO is done synchronously, and there is no queuing or
asynchronous completions at either the device or the driver level.

The BTT uses 'lanes' to index into various 'on-disk' data structures,
and lanes also act as a synchronization mechanism in case there are more
CPUs than available lanes. We did a comparison between two lane lock
strategies - first where we kept an atomic counter around that tracked
which was the last lane that was used, and 'our' lane was determined by
atomically incrementing that. That way, for the nr_cpus > nr_lanes case,
theoretically, no CPU would be blocked waiting for a lane. The other
strategy was to use the cpu number we're scheduled on and hash it to
a lane number. Theoretically, this could block an IO that could've
otherwise run using a different, free lane. But some fio workloads
showed that the direct cpu -> lane hash performed faster than tracking
'last lane' - my reasoning is the cache thrash caused by moving the
atomic variable made that approach slower than simply waiting out the
in-progress IO. This supports the conclusion that the driver can be a
very simple bio-based one that does synchronous IOs instead of queuing.
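
To make the two strategies concrete, here is a minimal userspace sketch
(not code from this patch). The names lane_pool, lane_from_counter() and
lane_from_cpu() are made up for illustration, and plain modulo is assumed
as the cpu -> lane hash:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical illustration only; none of these names exist in the patch. */
struct lane_pool {
	uint32_t nr_lanes;	/* number of lanes available */
	atomic_uint last_lane;	/* shared counter used by strategy 1 only */
};

/*
 * Strategy 1: hand out lanes round-robin from a shared atomic counter.
 * Every caller writes the same variable, so its cache line bounces
 * between CPUs.
 */
static uint32_t lane_from_counter(struct lane_pool *pool)
{
	return atomic_fetch_add(&pool->last_lane, 1) % pool->nr_lanes;
}

/*
 * Strategy 2: map the current cpu number straight to a lane. Modulo is
 * an assumed stand-in for the hash. No shared state is written, but two
 * CPUs that map to the same lane have to wait for each other.
 */
static uint32_t lane_from_cpu(const struct lane_pool *pool, uint32_t cpu)
{
	return cpu % pool->nr_lanes;
}
```

With 4 lanes, cpus 1 and 5 both land on lane 1 under strategy 2, which is
the blocking case described above, while strategy 1 would give successive
callers lanes 0, 1, 2, 3, 0, ...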

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Neil Brown <neilb@suse.de>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
[jmoyer: fix nmi watchdog timeout in btt_map_init]
[jmoyer: move btt initialization to module load path]
[jmoyer: fix memory leak in the btt initialization path]
[jmoyer: Don't overwrite corrupted arenas]
Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
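
Not part of the patch: a minimal userspace sketch of how a 32-bit map
entry could be decoded, following the bit table in section 3a of btt.txt
and the flag handling in btt_map_read() below. The EX_* constants and
ex_map_decode() are hypothetical; the real values live in btt.h, which
is assumed here to match the documented bit positions (31 and 30).

```c
#include <stdint.h>

/* Assumed from the bit table in btt.txt; the patch defines its own in btt.h. */
#define EX_TRIM_SHIFT	31
#define EX_ERR_SHIFT	30
#define EX_LBA_MASK	((1u << 30) - 1)

struct ex_map_result {
	uint32_t postmap;	/* postmap ABA to use for the IO */
	int trim;		/* 1 if only the TRIM bit is set */
	int error;		/* 1 if only the ERROR bit is set */
};

/* Decode a raw (cpu-endian) map entry for the given premap ABA. */
static struct ex_map_result ex_map_decode(uint32_t premap, uint32_t raw)
{
	uint32_t z = (raw >> EX_TRIM_SHIFT) & 1;
	uint32_t e = (raw >> EX_ERR_SHIFT) & 1;
	struct ex_map_result res = { .postmap = raw & EX_LBA_MASK };

	if (!z && !e)
		res.postmap = premap;	/* initial state: identity mapping */
	else if (z && !e)
		res.trim = 1;
	else if (!z && e)
		res.error = 1;
	/* both bits set: a 'normal' entry, no flag is reported */

	return res;
}
```

For example, a raw entry of 0xC0000040 decodes to postmap ABA 64 with no
flags, and an all-zero entry maps a premap ABA to itself.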
 Documentation/nvdimm/btt.txt |  273 ++++++++
 drivers/acpi/nfit.c          |    1 
 drivers/nvdimm/Kconfig       |   27 +
 drivers/nvdimm/Makefile      |    3 
 drivers/nvdimm/btt.c         | 1449 ++++++++++++++++++++++++++++++++++++++++++
 drivers/nvdimm/btt.h         |  141 ++++
 drivers/nvdimm/btt_devs.c    |    3 
 drivers/nvdimm/nd.h          |   10 
 drivers/nvdimm/region.c      |   89 +++
 drivers/nvdimm/region_devs.c |   10 
 include/linux/libnvdimm.h    |    1 
 11 files changed, 2004 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/nvdimm/btt.txt
 create mode 100644 drivers/nvdimm/btt.c
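
Also outside the patch proper: a small sketch of the flog sequence
numbers described in section 3b of btt.txt (01 -> 10 -> 11 -> 01, with 00
meaning uninitialized). ex_next_seq() and ex_newer_section() are
hypothetical names, and the rule used to pick the newer of the two
sections (the one whose number follows the other's in the cycle, or the
only non-zero one) is an assumption drawn from that description, not a
copy of the driver's logic.

```c
#include <stdint.h>

/* Advance a sequence number through the cycle 1 -> 2 -> 3 -> 1. */
static uint32_t ex_next_seq(uint32_t seq)
{
	static const uint32_t next[4] = { 0, 2, 3, 1 };

	return next[seq & 3];
}

/*
 * Return the index (0 or 1) of the newer section of a flog entry, or -1
 * if the pair of sequence numbers cannot be ordered.
 */
static int ex_newer_section(uint32_t seq0, uint32_t seq1)
{
	if (seq0 > 3 || seq1 > 3)
		return -1;	/* only two bits are meaningful */
	if (seq0 == seq1)
		return -1;	/* both uninitialized, or a duplicate */
	if (seq1 == 0)
		return 0;	/* only section 0 was ever written */
	if (seq0 == 0)
		return 1;	/* only section 1 was ever written */
	if (ex_next_seq(seq0) == seq1)
		return 1;	/* section 1 follows section 0 */
	if (ex_next_seq(seq1) == seq0)
		return 0;	/* section 0 follows section 1 */
	return -1;
}
```

Because the cycle has three live values, any two distinct non-zero numbers
are always adjacent in it, so exactly one section is the newer one.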

diff --git a/Documentation/nvdimm/btt.txt b/Documentation/nvdimm/btt.txt
new file mode 100644
index 000000000000..95134d5ec4a0
--- /dev/null
+++ b/Documentation/nvdimm/btt.txt
@@ -0,0 +1,273 @@
+BTT - Block Translation Table
+=============================
+
+
+1. Introduction
+---------------
+
+Persistent memory based storage is able to perform IO at byte (or more
+accurately, cache line) granularity. However, we often want to expose such
+storage as traditional block devices. The block drivers for persistent memory
+will do exactly this. However, they do not provide any atomicity guarantees.
+Traditional SSDs typically provide protection against torn sectors in hardware,
+using stored energy in capacitors to complete in-flight block writes, or perhaps
+in firmware. We don't have this luxury with persistent memory - if a write is in
+progress, and we experience a power failure, the block will contain a mix of old
+and new data. Applications may not be prepared to handle such a scenario.
+
+The Block Translation Table (BTT) provides atomic sector update semantics for
+persistent memory devices, so that applications that rely on sector writes not
+being torn can continue to do so. The BTT manifests itself as a stacked block
+device, and reserves a portion of the underlying storage for its metadata. At
+the heart of it, is an indirection table that re-maps all the blocks on the
+volume. It can be thought of as an extremely simple file system that only
+provides atomic sector updates.
+
+
+2. Static Layout
+----------------
+
+The underlying storage on which a BTT can be laid out is not limited in any way.
+The BTT, however, splits the available space into chunks of up to 512 GiB,
+called "Arenas".
+
+Each arena follows the same layout for its metadata, and all references in an
+arena are internal to it (with the exception of one field that points to the
+next arena). The following depicts the "On-disk" metadata layout:
+
+
+  Backing Store     +------->  Arena
++---------------+   |   +------------------+
+|               |   |   | Arena info block |
+|    Arena 0    +---+   |       4K         |
+|     512G      |       +------------------+
+|               |       |                  |
++---------------+       |                  |
+|               |       |                  |
+|    Arena 1    |       |   Data Blocks    |
+|     512G      |       |                  |
+|               |       |                  |
++---------------+       |                  |
+|       .       |       |                  |
+|       .       |       |                  |
+|       .       |       |                  |
+|               |       |                  |
+|               |       |                  |
++---------------+       +------------------+
+                        |                  |
+                        |     BTT Map      |
+                        |                  |
+                        |                  |
+                        +------------------+
+                        |                  |
+                        |     BTT Flog     |
+                        |                  |
+                        +------------------+
+                        | Info block copy  |
+                        |       4K         |
+                        +------------------+
+
+
+3. Theory of Operation
+----------------------
+
+
+a. The BTT Map
+--------------
+
+The map is a simple lookup/indirection table that maps an LBA to an internal
+block. Each map entry is 32 bits. The two most significant bits are special
+flags, and the remaining form the internal block number.
+
+Bit      Description
+31     : TRIM flag - marks if the block was trimmed or discarded
+30     : ERROR flag - marks an error block. Cleared on write.
+29 - 0 : Mappings to internal 'postmap' blocks
+
+
+Some of the terminology that will be subsequently used:
+
+External LBA  : LBA as made visible to upper layers.
+ABA           : Arena Block Address - Block offset/number within an arena
+Premap ABA    : The block offset into an arena, which was decided upon by range
+		checking the External LBA
+Postmap ABA   : The block number in the "Data Blocks" area obtained after
+		indirection from the map
+nfree	      : The number of free blocks that are maintained at any given time.
+		This is the number of concurrent writes that can happen to the
+		arena.
+
+
+For example, after adding a BTT, we surface a disk of 1024G. We get a read for
+the external LBA at 768G. This falls into the second arena, and of the 512G
+worth of blocks that this arena contributes, this block is at 256G. Thus, the
+premap ABA is 256G. We now refer to the map, and find out the mapping for block
+'X' (256G) points to block 'Y', say '64'. Thus the postmap ABA is 64.
+
+
+b. The BTT Flog
+---------------
+
+The BTT provides sector atomicity by making every write an "allocating write",
+i.e. every write goes to a "free" block. A running list of free blocks is
+maintained in the form of the BTT flog. 'Flog' is a combination of the words
+"free list" and "log". The flog contains 'nfree' entries, and an entry contains:
+
+lba     : The premap ABA that is being written to
+old_map : The old postmap ABA - after 'this' write completes, this will be a
+	  free block.
+new_map : The new postmap ABA. The map will be updated to reflect this
+	  lba->postmap_aba mapping, but we log it here in case we have to
+	  recover.
+seq	: Sequence number to mark which of the 2 sections of this flog entry is
+	  valid/newest. It cycles between 01->10->11->01 (binary) under normal
+	  operation, with 00 indicating an uninitialized state.
+lba'	: alternate lba entry
+old_map': alternate old postmap entry
+new_map': alternate new postmap entry
+seq'	: alternate sequence number.
+
+Each of the above fields is 32-bit, so each four-field section is 16 bytes.
+Flog updates are done such that for any entry being written, it:
+a. overwrites the 'old' section in the entry based on sequence numbers
+b. writes the new entry such that the sequence number is written last.
+
+
+c. The concept of lanes
+-----------------------
+
+While 'nfree' describes the number of IOs an arena can process
+concurrently, 'nlanes' is the number of IOs the BTT device as a whole can
+process.
+ nlanes = min(nfree, num_cpus)
+A lane number is obtained at the start of any IO, and is used for indexing into
+all the on-disk and in-memory data structures for the duration of the IO. It is
+protected by a spinlock.
+
+
+d. In-memory data structure: Read Tracking Table (RTT)
+------------------------------------------------------
+
+Consider a case where we have two threads, one doing reads and the other,
+writes. We can hit a condition where the writer thread grabs a free block to do
+a new IO, but the (slow) reader thread is still reading from it. In other words,
+the reader consulted a map entry, and started reading the corresponding block. A
+writer started writing to the same external LBA, and finished the write updating
+the map for that external LBA to point to its new postmap ABA. At this point the
+internal, postmap block that the reader is (still) reading has been inserted
+into the list of free blocks. If another write comes in for the same LBA, it can
+grab this free block, and start writing to it, causing the reader to read
+incorrect data. To prevent this, we introduce the RTT.
+
+The RTT is a simple, per arena table with 'nfree' entries. Every reader inserts
+into rtt[lane_number], the postmap ABA it is reading, and clears it after the
+read is complete. Every writer thread, after grabbing a free block, checks the
+RTT for its presence. If the postmap free block is in the RTT, it waits till the
+reader clears the RTT entry, and only then starts writing to it.
+
+
+e. In-memory data structure: map locks
+--------------------------------------
+
+Consider a case where two writer threads are writing to the same LBA. There can
+be a race in the following sequence of steps:
+
+free[lane] = map[premap_aba]
+map[premap_aba] = postmap_aba
+
+Both threads can update their respective free[lane] with the same old, freed
+postmap_aba. This has made the layout inconsistent by losing a free entry, and
+at the same time, duplicating another free entry for two lanes.
+
+To solve this, we could have a single map lock (per arena) that has to be taken
+before performing the above sequence, but we feel that could be too contentious.
+Instead we use an array of (nfree) map_locks that is indexed by
+(premap_aba modulo nfree).
+
+
+f. Reconstruction from the Flog
+-------------------------------
+
+On startup, we analyze the BTT flog to create our list of free blocks. We walk
+through all the entries, and for each lane, of the set of two possible
+'sections', we always look at the most recent one only (based on the sequence
+number). The reconstruction rules/steps are simple:
+- Read map[log_entry.lba].
+- If log_entry.new matches the map entry, then log_entry.old is free.
+- If log_entry.new does not match the map entry, then log_entry.new is free.
+  (This case can only be caused by power-fails/unsafe shutdowns)
+
+
+g. Summarizing - Read and Write flows
+-------------------------------------
+
+Read:
+
+1.  Convert external LBA to arena number + pre-map ABA
+2.  Get a lane (and take lane_lock)
+3.  Read map to get the entry for this pre-map ABA
+4.  Enter post-map ABA into RTT[lane]
+5.  If TRIM flag set in map, return zeroes, and end IO (go to step 8)
+6.  If ERROR flag set in map, end IO with EIO (go to step 8)
+7.  Read data from this block
+8.  Remove post-map ABA entry from RTT[lane]
+9.  Release lane (and lane_lock)
+
+Write:
+
+1.  Convert external LBA to Arena number + pre-map ABA
+2.  Get a lane (and take lane_lock)
+3.  Use lane to index into in-memory free list and obtain a new block, next flog
+        index, next sequence number
+4.  Scan the RTT to check if free block is present, and spin/wait if it is.
+5.  Write data to this free block
+6.  Read map to get the existing post-map ABA entry for this pre-map ABA
+7.  Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq_num]
+8.  Write new post-map ABA into map.
+9.  Write old post-map entry into the free list
+10. Calculate next sequence number and write into the free list entry
+11. Release lane (and lane_lock)
+
+
+4. Error Handling
+=================
+
+An arena would be in an error state if any of the metadata is corrupted
+irrecoverably, either due to a bug or a media error. The following conditions
+indicate an error:
+- Info block checksum does not match (and recovering from the copy also fails)
+- Not all internal available blocks are uniquely and entirely addressed by the
+  sum of mapped blocks and free blocks (from the BTT flog).
+- Rebuilding free list from the flog reveals missing/duplicate/impossible
+  entries
+- A map entry is out of bounds
+
+If any of these error conditions is encountered, the arena is put into a read
+only state using a flag in the info block.
+
+
+5. In-kernel usage
+==================
+
+Any block driver that supports byte granularity IO to the storage may register
+with the BTT. It will have to provide the rw_bytes interface in its
+block_device_operations struct:
+
+	int (*rw_bytes)(struct gendisk *, void *, size_t, off_t, int rw);
+
+It may register with the BTT after it adds its own gendisk, using btt_init:
+
+	struct btt *btt_init(struct gendisk *disk, unsigned long long rawsize,
+			u32 lbasize, u8 uuid[], int maxlane);
+
+Note that maxlane is the maximum amount of concurrency the driver wishes to
+allow the BTT to use.
+
+The BTT 'disk' appears as a stacked block device that grabs the underlying block
+device in the O_EXCL mode.
+
+When the driver wishes to remove the backing disk, it should similarly call
+btt_fini using the same struct btt* handle that was provided to it by btt_init.
+
+	void btt_fini(struct btt *btt);
+
diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index 35af6f7f0abd..fc38b49eff7d 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -902,6 +902,7 @@ static int acpi_nfit_init_mapping(struct acpi_nfit_desc *acpi_desc,
 		} else {
 			nd_mapping->size = nfit_mem->bdw->capacity;
 			nd_mapping->start = nfit_mem->bdw->start_address;
+			ndr_desc->num_lanes = nfit_mem->bdw->windows;
 			blk_valid = 1;
 		}
 
diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index f16ba9d14740..a9def3839655 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -34,6 +34,31 @@ config BLK_DEV_PMEM
 	  Say Y if you want to use an NVDIMM
 
 config ND_BTT_DEVS
-	def_bool y
+	bool
+
+config ND_BTT
+	tristate "BTT: Block Translation Table (atomic sector updates)"
+	default LIBNVDIMM
+	select ND_BTT_DEVS
+	help
+	  The Block Translation Table (BTT) provides atomic sector
+	  update semantics for persistent memory devices, so that
+	  applications that rely on sector writes not being torn (a
+	  guarantee that typical disks provide) can continue to do so.
+	  The BTT manifests itself as a stacked block device, and
+	  reserves a portion of the underlying storage for its
+	  metadata.
+
+	  Select Y if unsure.
+
+config ND_MAX_REGIONS
+	int "Maximum number of regions supported by the sub-system"
+	default 64
+	---help---
+	  A 'region' corresponds to an individual DIMM or an interleave
+	  set of DIMMs.  A typical maximally configured system may have
+	  up to 32 DIMMs.
+
+	  Leave the default of 64 if you are unsure.
 
 endif
diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
index eb1bbce86592..aa5bb1acf831 100644
--- a/drivers/nvdimm/Makefile
+++ b/drivers/nvdimm/Makefile
@@ -1,8 +1,11 @@
 obj-$(CONFIG_LIBNVDIMM) += libnvdimm.o
 obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
+obj-$(CONFIG_ND_BTT) += nd_btt.o
 
 nd_pmem-y := pmem.o
 
+nd_btt-y := btt.o
+
 libnvdimm-y := core.o
 libnvdimm-y += bus.o
 libnvdimm-y += dimm_devs.o
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
new file mode 100644
index 000000000000..58becbd69ae1
--- /dev/null
+++ b/drivers/nvdimm/btt.c
@@ -0,0 +1,1449 @@
+/*
+ * Block Translation Table
+ * Copyright (c) 2014-2015, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+#include <linux/highmem.h>
+#include <linux/debugfs.h>
+#include <linux/blkdev.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/mutex.h>
+#include <linux/hdreg.h>
+#include <linux/genhd.h>
+#include <linux/sizes.h>
+#include <linux/ndctl.h>
+#include <linux/fs.h>
+#include <linux/nd.h>
+#include "btt.h"
+#include "nd.h"
+
+enum log_ent_request {
+	LOG_NEW_ENT = 0,
+	LOG_OLD_ENT
+};
+
+static int btt_major;
+
+static int arena_read_bytes(struct arena_info *arena, resource_size_t offset,
+		void *buf, size_t n)
+{
+	struct nd_btt *nd_btt = arena->nd_btt;
+	struct block_device *bdev = nd_btt->backing_dev;
+
+	/* arena offsets are 4K from the base of the device */
+	offset += SZ_4K;
+	return bdev_read_bytes(bdev, offset, buf, n);
+}
+
+static int arena_write_bytes(struct arena_info *arena, resource_size_t offset,
+		void *buf, size_t n)
+{
+	struct nd_btt *nd_btt = arena->nd_btt;
+	struct block_device *bdev = nd_btt->backing_dev;
+
+	/* arena offsets are 4K from the base of the device */
+	offset += SZ_4K;
+	return bdev_write_bytes(bdev, offset, buf, n);
+}
+
+static int btt_info_write(struct arena_info *arena, struct btt_sb *super)
+{
+	int ret;
+
+	ret = arena_write_bytes(arena, arena->info2off, super,
+			sizeof(struct btt_sb));
+	if (ret)
+		return ret;
+
+	return arena_write_bytes(arena, arena->infooff, super,
+			sizeof(struct btt_sb));
+}
+
+static int btt_info_read(struct arena_info *arena, struct btt_sb *super)
+{
+	WARN_ON(!super);
+	return arena_read_bytes(arena, arena->infooff, super,
+			sizeof(struct btt_sb));
+}
+
+/*
+ * 'raw' version of btt_map write
+ * Assumptions:
+ *   mapping is in little-endian
+ *   mapping contains 'E' and 'Z' flags as desired
+ */
+static int __btt_map_write(struct arena_info *arena, u32 lba, __le32 mapping)
+{
+	u64 ns_off = arena->mapoff + (lba * MAP_ENT_SIZE);
+
+	WARN_ON(lba >= arena->external_nlba);
+	return arena_write_bytes(arena, ns_off, &mapping, MAP_ENT_SIZE);
+}
+
+static int btt_map_write(struct arena_info *arena, u32 lba, u32 mapping,
+			u32 z_flag, u32 e_flag)
+{
+	u32 ze;
+	__le32 mapping_le;
+
+	/*
+	 * This 'mapping' is supposed to be just the LBA mapping, without
+	 * any flags set, so strip the flag bits.
+	 */
+	mapping &= MAP_LBA_MASK;
+
+	ze = (z_flag << 1) + e_flag;
+	switch (ze) {
+	case 0:
+		/*
+		 * We want to set neither of the Z or E flags, and
+		 * in the actual layout, this means setting the bit
+		 * positions of both to '1' to indicate a 'normal'
+		 * map entry
+		 */
+		mapping |= MAP_ENT_NORMAL;
+		break;
+	case 1:
+		mapping |= (1 << MAP_ERR_SHIFT);
+		break;
+	case 2:
+		mapping |= (1 << MAP_TRIM_SHIFT);
+		break;
+	default:
+		/*
+		 * The case where Z and E are both sent in as '1' could be
+		 * construed as a valid 'normal' case, but we decide not to,
+		 * to avoid confusion
+		 */
+		WARN_ONCE(1, "Invalid use of Z and E flags\n");
+		return -EIO;
+	}
+
+	mapping_le = cpu_to_le32(mapping);
+	return __btt_map_write(arena, lba, mapping_le);
+}
+
+static int btt_map_read(struct arena_info *arena, u32 lba, u32 *mapping,
+			int *trim, int *error)
+{
+	int ret;
+	__le32 in;
+	u32 raw_mapping, postmap, ze, z_flag, e_flag;
+	u64 ns_off = arena->mapoff + (lba * MAP_ENT_SIZE);
+
+	WARN_ON(lba >= arena->external_nlba);
+
+	ret = arena_read_bytes(arena, ns_off, &in, MAP_ENT_SIZE);
+	if (ret)
+		return ret;
+
+	raw_mapping = le32_to_cpu(in);
+
+	z_flag = (raw_mapping & MAP_TRIM_MASK) >> MAP_TRIM_SHIFT;
+	e_flag = (raw_mapping & MAP_ERR_MASK) >> MAP_ERR_SHIFT;
+	ze = (z_flag << 1) + e_flag;
+	postmap = raw_mapping & MAP_LBA_MASK;
+
+	/* Reuse the {z,e}_flag variables for *trim and *error */
+	z_flag = 0;
+	e_flag = 0;
+
+	switch (ze) {
+	case 0:
+		/* Initial state. Return postmap = premap */
+		*mapping = lba;
+		break;
+	case 1:
+		*mapping = postmap;
+		e_flag = 1;
+		break;
+	case 2:
+		*mapping = postmap;
+		z_flag = 1;
+		break;
+	case 3:
+		*mapping = postmap;
+		break;
+	default:
+		return -EIO;
+	}
+
+	if (trim)
+		*trim = z_flag;
+	if (error)
+		*error = e_flag;
+
+	return ret;
+}
+
+static int btt_log_read_pair(struct arena_info *arena, u32 lane,
+			struct log_entry *ent)
+{
+	WARN_ON(!ent);
+	return arena_read_bytes(arena,
+			arena->logoff + (2 * lane * LOG_ENT_SIZE), ent,
+			2 * LOG_ENT_SIZE);
+}
+
+static struct dentry *debugfs_root;
+
+static void arena_debugfs_init(struct arena_info *a, struct dentry *parent,
+				int idx)
+{
+	char dirname[32];
+	struct dentry *d;
+
+	/* If for some reason, parent bttN was not created, exit */
+	if (!parent)
+		return;
+
+	snprintf(dirname, 32, "arena%d", idx);
+	d = debugfs_create_dir(dirname, parent);
+	if (IS_ERR_OR_NULL(d))
+		return;
+	a->debugfs_dir = d;
+
+	debugfs_create_x64("size", S_IRUGO, d, &a->size);
+	debugfs_create_x64("external_lba_start", S_IRUGO, d,
+				&a->external_lba_start);
+	debugfs_create_x32("internal_nlba", S_IRUGO, d, &a->internal_nlba);
+	debugfs_create_u32("internal_lbasize", S_IRUGO, d,
+				&a->internal_lbasize);
+	debugfs_create_x32("external_nlba", S_IRUGO, d, &a->external_nlba);
+	debugfs_create_u32("external_lbasize", S_IRUGO, d,
+				&a->external_lbasize);
+	debugfs_create_u32("nfree", S_IRUGO, d, &a->nfree);
+	debugfs_create_u16("version_major", S_IRUGO, d, &a->version_major);
+	debugfs_create_u16("version_minor", S_IRUGO, d, &a->version_minor);
+	debugfs_create_x64("nextoff", S_IRUGO, d, &a->nextoff);
+	debugfs_create_x64("infooff", S_IRUGO, d, &a->infooff);
+	debugfs_create_x64("dataoff", S_IRUGO, d, &a->dataoff);
+	debugfs_create_x64("mapoff", S_IRUGO, d, &a->mapoff);
+	debugfs_create_x64("logoff", S_IRUGO, d, &a->logoff);
+	debugfs_create_x64("info2off", S_IRUGO, d, &a->info2off);
+	debugfs_create_x32("flags", S_IRUGO, d, &a->flags);
+}
+
+static void btt_debugfs_init(struct btt *btt)
+{
+	int i = 0;
+	struct arena_info *arena;
+
+	btt->debugfs_dir = debugfs_create_dir(dev_name(&btt->nd_btt->dev),
+						debugfs_root);
+	if (IS_ERR_OR_NULL(btt->debugfs_dir))
+		return;
+
+	list_for_each_entry(arena, &btt->arena_list, list) {
+		arena_debugfs_init(arena, btt->debugfs_dir, i);
+		i++;
+	}
+}
+
+/*
+ * This function accepts two log entries, and uses the
+ * sequence number to find the 'older' entry.
+ * If ent[0].seq is 0 (the pair has not been used yet), it
+ * seeds ent[0].seq with 1 and reports slot 0 as the older one.
+ * Otherwise it returns the index (0 or 1) of the older entry,
+ * or -EINVAL if the two sequence numbers are inconsistent.
+ * TODO The logic feels a bit kludge-y. Make it better.
+ */
+static int btt_log_get_old(struct log_entry *ent)
+{
+	int old;
+
+	/*
+	 * the first ever time this is seen, the entry goes into [0]
+	 * the next time, the following logic works out to put this
+	 * (next) entry into [1]
+	 */
+	if (ent[0].seq == 0) {
+		ent[0].seq = cpu_to_le32(1);
+		return 0;
+	}
+
+	if (ent[0].seq == ent[1].seq)
+		return -EINVAL;
+	if (le32_to_cpu(ent[0].seq) + le32_to_cpu(ent[1].seq) > 5)
+		return -EINVAL;
+
+	if (le32_to_cpu(ent[0].seq) < le32_to_cpu(ent[1].seq)) {
+		if (le32_to_cpu(ent[1].seq) - le32_to_cpu(ent[0].seq) == 1)
+			old = 0;
+		else
+			old = 1;
+	} else {
+		if (le32_to_cpu(ent[0].seq) - le32_to_cpu(ent[1].seq) == 1)
+			old = 1;
+		else
+			old = 0;
+	}
+
+	return old;
+}
+
+static struct device *to_dev(struct arena_info *arena)
+{
+	return &arena->nd_btt->dev;
+}
+
+/*
+ * This function copies the desired (old/new) log entry into ent if
+ * it is not NULL. It returns the sub-slot number (0 or 1)
+ * where the desired log entry was found. Negative return values
+ * indicate errors.
+ */
+static int btt_log_read(struct arena_info *arena, u32 lane,
+			struct log_entry *ent, int old_flag)
+{
+	int ret;
+	int old_ent, ret_ent;
+	struct log_entry log[2];
+
+	ret = btt_log_read_pair(arena, lane, log);
+	if (ret)
+		return -EIO;
+
+	old_ent = btt_log_get_old(log);
+	if (old_ent < 0 || old_ent > 1) {
+		dev_info(to_dev(arena), "log corruption (%d): lane %u seq [%u, %u]\n",
+			old_ent, lane, le32_to_cpu(log[0].seq),
+			le32_to_cpu(log[1].seq));
+		/* TODO set error state? */
+		return -EIO;
+	}
+
+	ret_ent = (old_flag ? old_ent : (1 - old_ent));
+
+	if (ent != NULL)
+		memcpy(ent, &log[ret_ent], LOG_ENT_SIZE);
+
+	return ret_ent;
+}
+
+/*
+ * This function commits a log entry to media
+ * It does _not_ prepare the freelist entry for the next write
+ * btt_flog_write is the wrapper for updating the freelist elements
+ */
+static int __btt_log_write(struct arena_info *arena, u32 lane,
+			u32 sub, struct log_entry *ent)
+{
+	int ret;
+	/*
+	 * Ignore the padding in log_entry for calculating log_half.
+	 * The entry is 'committed' when we write the sequence number,
+	 * and we want to ensure that that is the last thing written.
+	 * We don't bother writing the padding as that would be extra
+	 * media wear and write amplification
+	 */
+	unsigned int log_half = (LOG_ENT_SIZE - 2 * sizeof(u64)) / 2;
+	u64 ns_off = arena->logoff + (((2 * lane) + sub) * LOG_ENT_SIZE);
+	void *src = ent;
+
+	/* split the 16B write into atomic, durable halves */
+	ret = arena_write_bytes(arena, ns_off, src, log_half);
+	if (ret)
+		return ret;
+
+	ns_off += log_half;
+	src += log_half;
+	return arena_write_bytes(arena, ns_off, src, log_half);
+}
+
+static int btt_flog_write(struct arena_info *arena, u32 lane, u32 sub,
+			struct log_entry *ent)
+{
+	int ret;
+
+	ret = __btt_log_write(arena, lane, sub, ent);
+	if (ret)
+		return ret;
+
+	/* prepare the next free entry */
+	arena->freelist[lane].sub = 1 - arena->freelist[lane].sub;
+	if (++(arena->freelist[lane].seq) == 4)
+		arena->freelist[lane].seq = 1;
+	arena->freelist[lane].block = le32_to_cpu(ent->old_map);
+
+	return ret;
+}
+
+/*
+ * This function initializes the BTT map to the initial state, which is
+ * all-zeroes, and indicates an identity mapping
+ */
+static int btt_map_init(struct arena_info *arena)
+{
+	int ret = -EINVAL;
+	void *zerobuf;
+	size_t offset = 0;
+	size_t chunk_size = SZ_2M;
+	size_t mapsize = arena->logoff - arena->mapoff;
+
+	zerobuf = kzalloc(chunk_size, GFP_KERNEL);
+	if (!zerobuf)
+		return -ENOMEM;
+
+	while (mapsize) {
+		size_t size = min(mapsize, chunk_size);
+
+		ret = arena_write_bytes(arena, arena->mapoff + offset, zerobuf,
+				size);
+		if (ret)
+			goto free;
+
+		offset += size;
+		mapsize -= size;
+		cond_resched();
+	}
+
+ free:
+	kfree(zerobuf);
+	return ret;
+}
+
+/*
+ * This function initializes the BTT log with 'fake' entries pointing
+ * to the initial reserved set of blocks as being free
+ */
+static int btt_log_init(struct arena_info *arena)
+{
+	int ret;
+	u32 i;
+	struct log_entry log, zerolog;
+
+	memset(&zerolog, 0, sizeof(zerolog));
+
+	for (i = 0; i < arena->nfree; i++) {
+		log.lba = cpu_to_le32(i);
+		log.old_map = cpu_to_le32(arena->external_nlba + i);
+		log.new_map = cpu_to_le32(arena->external_nlba + i);
+		log.seq = cpu_to_le32(LOG_SEQ_INIT);
+		ret = __btt_log_write(arena, i, 0, &log);
+		if (ret)
+			return ret;
+		ret = __btt_log_write(arena, i, 1, &zerolog);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+static int btt_freelist_init(struct arena_info *arena)
+{
+	int old, new, ret;
+	u32 i, map_entry;
+	struct log_entry log_new, log_old;
+
+	arena->freelist = kcalloc(arena->nfree, sizeof(struct free_entry),
+					GFP_KERNEL);
+	if (!arena->freelist)
+		return -ENOMEM;
+
+	for (i = 0; i < arena->nfree; i++) {
+		old = btt_log_read(arena, i, &log_old, LOG_OLD_ENT);
+		if (old < 0)
+			return old;
+
+		new = btt_log_read(arena, i, &log_new, LOG_NEW_ENT);
+		if (new < 0)
+			return new;
+
+		/* sub points to the next one to be overwritten */
+		arena->freelist[i].sub = 1 - new;
+		arena->freelist[i].seq = nd_inc_seq(le32_to_cpu(log_new.seq));
+		arena->freelist[i].block = le32_to_cpu(log_new.old_map);
+
+		/* This implies a newly created or untouched flog entry */
+		if (log_new.old_map == log_new.new_map)
+			continue;
+
+		/* Check if map recovery is needed */
+		ret = btt_map_read(arena, le32_to_cpu(log_new.lba), &map_entry,
+				NULL, NULL);
+		if (ret)
+			return ret;
+		if ((le32_to_cpu(log_new.new_map) != map_entry) &&
+				(le32_to_cpu(log_new.old_map) == map_entry)) {
+			/*
+			 * Last transaction wrote the flog, but wasn't able
+			 * to complete the map write. So fix up the map.
+			 */
+			ret = btt_map_write(arena, le32_to_cpu(log_new.lba),
+					le32_to_cpu(log_new.new_map), 0, 0);
+			if (ret)
+				return ret;
+		}
+
+	}
+
+	return 0;
+}
+
+static int btt_rtt_init(struct arena_info *arena)
+{
+	arena->rtt = kcalloc(arena->nfree, sizeof(u32), GFP_KERNEL);
+	if (arena->rtt == NULL)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int btt_maplocks_init(struct arena_info *arena)
+{
+	u32 i;
+
+	arena->map_locks = kcalloc(arena->nfree, sizeof(struct aligned_lock),
+				GFP_KERNEL);
+	if (!arena->map_locks)
+		return -ENOMEM;
+
+	for (i = 0; i < arena->nfree; i++)
+		spin_lock_init(&arena->map_locks[i].lock);
+
+	return 0;
+}
+
+static struct arena_info *alloc_arena(struct btt *btt, size_t size,
+				size_t start, size_t arena_off)
+{
+	struct arena_info *arena;
+	u64 logsize, mapsize, datasize;
+	u64 available = size;
+
+	arena = kzalloc(sizeof(struct arena_info), GFP_KERNEL);
+	if (!arena)
+		return NULL;
+	arena->nd_btt = btt->nd_btt;
+
+	if (!size)
+		return arena;
+
+	arena->size = size;
+	arena->external_lba_start = start;
+	arena->external_lbasize = btt->lbasize;
+	arena->internal_lbasize = roundup(arena->external_lbasize,
+					INT_LBASIZE_ALIGNMENT);
+	arena->nfree = BTT_DEFAULT_NFREE;
+	arena->version_major = 1;
+	arena->version_minor = 1;
+
+	if (available % BTT_PG_SIZE)
+		available -= (available % BTT_PG_SIZE);
+
+	/* Two pages are reserved for the super block and its copy */
+	available -= 2 * BTT_PG_SIZE;
+
+	/* The log takes a fixed amount of space based on nfree */
+	logsize = roundup(2 * arena->nfree * sizeof(struct log_entry),
+				BTT_PG_SIZE);
+	available -= logsize;
+
+	/* Calculate optimal split between map and data area */
+	arena->internal_nlba = div_u64(available - BTT_PG_SIZE,
+			arena->internal_lbasize + MAP_ENT_SIZE);
+	arena->external_nlba = arena->internal_nlba - arena->nfree;
+
+	mapsize = roundup((arena->external_nlba * MAP_ENT_SIZE), BTT_PG_SIZE);
+	datasize = available - mapsize;
+
+	/* 'Absolute' values, relative to start of storage space */
+	arena->infooff = arena_off;
+	arena->dataoff = arena->infooff + BTT_PG_SIZE;
+	arena->mapoff = arena->dataoff + datasize;
+	arena->logoff = arena->mapoff + mapsize;
+	arena->info2off = arena->logoff + logsize;
+	return arena;
+}
+
+static void free_arenas(struct btt *btt)
+{
+	struct arena_info *arena, *next;
+
+	list_for_each_entry_safe(arena, next, &btt->arena_list, list) {
+		list_del(&arena->list);
+		kfree(arena->rtt);
+		kfree(arena->map_locks);
+		kfree(arena->freelist);
+		debugfs_remove_recursive(arena->debugfs_dir);
+		kfree(arena);
+	}
+}
+
+/*
+ * This function checks if the metadata layout is valid and error free
+ */
+static int arena_is_valid(struct arena_info *arena, struct btt_sb *super,
+				u8 *uuid, u32 lbasize)
+{
+	u64 checksum;
+
+	if (memcmp(super->uuid, uuid, 16))
+		return 0;
+
+	checksum = le64_to_cpu(super->checksum);
+	super->checksum = 0;
+	if (checksum != nd_btt_sb_checksum(super))
+		return 0;
+	super->checksum = cpu_to_le64(checksum);
+
+	if (lbasize != le32_to_cpu(super->external_lbasize))
+		return 0;
+
+	/* TODO: figure out action for this */
+	if ((le32_to_cpu(super->flags) & IB_FLAG_ERROR_MASK) != 0)
+		dev_info(to_dev(arena), "Found arena with an error flag\n");
+
+	return 1;
+}
+
+/*
+ * This function reads an existing valid btt superblock and
+ * populates the corresponding arena_info struct
+ */
+static void parse_arena_meta(struct arena_info *arena, struct btt_sb *super,
+				u64 arena_off)
+{
+	arena->internal_nlba = le32_to_cpu(super->internal_nlba);
+	arena->internal_lbasize = le32_to_cpu(super->internal_lbasize);
+	arena->external_nlba = le32_to_cpu(super->external_nlba);
+	arena->external_lbasize = le32_to_cpu(super->external_lbasize);
+	arena->nfree = le32_to_cpu(super->nfree);
+	arena->version_major = le16_to_cpu(super->version_major);
+	arena->version_minor = le16_to_cpu(super->version_minor);
+
+	arena->nextoff = (super->nextoff == 0) ? 0 : (arena_off +
+			le64_to_cpu(super->nextoff));
+	arena->infooff = arena_off;
+	arena->dataoff = arena_off + le64_to_cpu(super->dataoff);
+	arena->mapoff = arena_off + le64_to_cpu(super->mapoff);
+	arena->logoff = arena_off + le64_to_cpu(super->logoff);
+	arena->info2off = arena_off + le64_to_cpu(super->info2off);
+
+	arena->size = (super->nextoff > 0) ? (le64_to_cpu(super->nextoff)) :
+			(arena->info2off - arena->infooff + BTT_PG_SIZE);
+
+	arena->flags = le32_to_cpu(super->flags);
+}
+
+static int discover_arenas(struct btt *btt)
+{
+	int ret = 0;
+	struct arena_info *arena;
+	struct btt_sb *super;
+	size_t remaining = btt->rawsize;
+	u64 cur_nlba = 0;
+	size_t cur_off = 0;
+	int num_arenas = 0;
+
+	super = kzalloc(sizeof(*super), GFP_KERNEL);
+	if (!super)
+		return -ENOMEM;
+
+	while (remaining) {
+		/* Alloc memory for arena */
+		arena = alloc_arena(btt, 0, 0, 0);
+		if (!arena) {
+			ret = -ENOMEM;
+			goto out_super;
+		}
+
+		arena->infooff = cur_off;
+		ret = btt_info_read(arena, super);
+		if (ret)
+			goto out;
+
+		if (!arena_is_valid(arena, super, btt->nd_btt->uuid,
+				btt->lbasize)) {
+			if (remaining == btt->rawsize) {
+				btt->init_state = INIT_NOTFOUND;
+				dev_info(to_dev(arena), "No existing arenas\n");
+				goto out;
+			} else {
+				dev_info(to_dev(arena),
+						"Found corrupted metadata!\n");
+				ret = -ENODEV;
+				goto out;
+			}
+		}
+
+		arena->external_lba_start = cur_nlba;
+		parse_arena_meta(arena, super, cur_off);
+
+		ret = btt_freelist_init(arena);
+		if (ret)
+			goto out;
+
+		ret = btt_rtt_init(arena);
+		if (ret)
+			goto out;
+
+		ret = btt_maplocks_init(arena);
+		if (ret)
+			goto out;
+
+		list_add_tail(&arena->list, &btt->arena_list);
+
+		remaining -= arena->size;
+		cur_off += arena->size;
+		cur_nlba += arena->external_nlba;
+		num_arenas++;
+
+		if (arena->nextoff == 0)
+			break;
+	}
+	btt->num_arenas = num_arenas;
+	btt->nlba = cur_nlba;
+	btt->init_state = INIT_READY;
+
+	kfree(super);
+	return ret;
+
+ out:
+	kfree(arena);
+	free_arenas(btt);
+ out_super:
+	kfree(super);
+	return ret;
+}
+
+static int create_arenas(struct btt *btt)
+{
+	size_t remaining = btt->rawsize;
+	size_t cur_off = 0;
+
+	while (remaining) {
+		struct arena_info *arena;
+		size_t arena_size = min_t(u64, ARENA_MAX_SIZE, remaining);
+
+		remaining -= arena_size;
+		if (arena_size < ARENA_MIN_SIZE)
+			break;
+
+		arena = alloc_arena(btt, arena_size, btt->nlba, cur_off);
+		if (!arena) {
+			free_arenas(btt);
+			return -ENOMEM;
+		}
+		btt->nlba += arena->external_nlba;
+		if (remaining >= ARENA_MIN_SIZE)
+			arena->nextoff = arena->size;
+		else
+			arena->nextoff = 0;
+		cur_off += arena_size;
+		list_add_tail(&arena->list, &btt->arena_list);
+	}
+
+	return 0;
+}
+
+/*
+ * This function completes arena initialization by writing
+ * all the metadata.
+ * It is only called for an uninitialized arena when a write
+ * to that arena occurs for the first time.
+ */
+static int btt_arena_write_layout(struct arena_info *arena, u8 *uuid)
+{
+	int ret;
+	struct btt_sb *super;
+
+	ret = btt_map_init(arena);
+	if (ret)
+		return ret;
+
+	ret = btt_log_init(arena);
+	if (ret)
+		return ret;
+
+	super = kzalloc(sizeof(struct btt_sb), GFP_NOIO);
+	if (!super)
+		return -ENOMEM;
+
+	strncpy(super->signature, BTT_SIG, BTT_SIG_LEN);
+	memcpy(super->uuid, uuid, 16);
+	super->flags = cpu_to_le32(arena->flags);
+	super->version_major = cpu_to_le16(arena->version_major);
+	super->version_minor = cpu_to_le16(arena->version_minor);
+	super->external_lbasize = cpu_to_le32(arena->external_lbasize);
+	super->external_nlba = cpu_to_le32(arena->external_nlba);
+	super->internal_lbasize = cpu_to_le32(arena->internal_lbasize);
+	super->internal_nlba = cpu_to_le32(arena->internal_nlba);
+	super->nfree = cpu_to_le32(arena->nfree);
+	super->infosize = cpu_to_le32(sizeof(struct btt_sb));
+	super->nextoff = cpu_to_le64(arena->nextoff);
+	/*
+	 * Subtract arena->infooff (arena start) so numbers are relative
+	 * to 'this' arena
+	 */
+	super->dataoff = cpu_to_le64(arena->dataoff - arena->infooff);
+	super->mapoff = cpu_to_le64(arena->mapoff - arena->infooff);
+	super->logoff = cpu_to_le64(arena->logoff - arena->infooff);
+	super->info2off = cpu_to_le64(arena->info2off - arena->infooff);
+
+	super->flags = 0;
+	super->checksum = cpu_to_le64(nd_btt_sb_checksum(super));
+
+	ret = btt_info_write(arena, super);
+
+	kfree(super);
+	return ret;
+}
+
+/*
+ * This function completes the initialization for the BTT namespace
+ * such that it is ready to accept IOs
+ */
+static int btt_meta_init(struct btt *btt)
+{
+	int ret = 0;
+	struct arena_info *arena;
+
+	mutex_lock(&btt->init_lock);
+	list_for_each_entry(arena, &btt->arena_list, list) {
+		ret = btt_arena_write_layout(arena, btt->nd_btt->uuid);
+		if (ret)
+			goto unlock;
+
+		ret = btt_freelist_init(arena);
+		if (ret)
+			goto unlock;
+
+		ret = btt_rtt_init(arena);
+		if (ret)
+			goto unlock;
+
+		ret = btt_maplocks_init(arena);
+		if (ret)
+			goto unlock;
+	}
+
+	btt->init_state = INIT_READY;
+
+ unlock:
+	mutex_unlock(&btt->init_lock);
+	return ret;
+}
+
+/*
+ * This function calculates the arena in which the given LBA lies
+ * by doing a linear walk. This is acceptable since we expect only
+ * a few arenas. If we have backing devices that get much larger,
+ * we can construct a balanced binary tree of arenas at init time
+ * so that this range search becomes faster.
+ */
+static int lba_to_arena(struct btt *btt, sector_t sector, __u32 *premap,
+				struct arena_info **arena)
+{
+	struct arena_info *arena_list;
+	__u64 lba = div_u64(sector << SECTOR_SHIFT, btt->sector_size);
+
+	list_for_each_entry(arena_list, &btt->arena_list, list) {
+		if (lba < arena_list->external_nlba) {
+			*arena = arena_list;
+			*premap = lba;
+			return 0;
+		}
+		lba -= arena_list->external_nlba;
+	}
+
+	return -EIO;
+}
+
+/*
+ * The following (lock_map, unlock_map) are mostly just to improve
+ * readability, since they index into an array of locks
+ */
+static void lock_map(struct arena_info *arena, u32 premap)
+		__acquires(&arena->map_locks[idx].lock)
+{
+	u32 idx = (premap * MAP_ENT_SIZE / L1_CACHE_BYTES) % arena->nfree;
+
+	spin_lock(&arena->map_locks[idx].lock);
+}
+
+static void unlock_map(struct arena_info *arena, u32 premap)
+		__releases(&arena->map_locks[idx].lock)
+{
+	u32 idx = (premap * MAP_ENT_SIZE / L1_CACHE_BYTES) % arena->nfree;
+
+	spin_unlock(&arena->map_locks[idx].lock);
+}
+
+static u64 to_namespace_offset(struct arena_info *arena, u64 lba)
+{
+	return arena->dataoff + ((u64)lba * arena->internal_lbasize);
+}
+
+static int btt_data_read(struct arena_info *arena, struct page *page,
+			unsigned int off, u32 lba, u32 len)
+{
+	int ret;
+	u64 nsoff = to_namespace_offset(arena, lba);
+	void *mem = kmap_atomic(page);
+
+	ret = arena_read_bytes(arena, nsoff, mem + off, len);
+	kunmap_atomic(mem);
+
+	return ret;
+}
+
+static int btt_data_write(struct arena_info *arena, u32 lba,
+			struct page *page, unsigned int off, u32 len)
+{
+	int ret;
+	u64 nsoff = to_namespace_offset(arena, lba);
+	void *mem = kmap_atomic(page);
+
+	ret = arena_write_bytes(arena, nsoff, mem + off, len);
+	kunmap_atomic(mem);
+
+	return ret;
+}
+
+static void zero_fill_data(struct page *page, unsigned int off, u32 len)
+{
+	void *mem = kmap_atomic(page);
+
+	memset(mem + off, 0, len);
+	kunmap_atomic(mem);
+}
+
+static int btt_read_pg(struct btt *btt, struct page *page, unsigned int off,
+			sector_t sector, unsigned int len)
+{
+	int ret = 0;
+	int t_flag, e_flag;
+	struct arena_info *arena = NULL;
+	u32 lane = 0, premap, postmap;
+
+	while (len) {
+		u32 cur_len;
+
+		lane = nd_region_acquire_lane(btt->nd_region);
+
+		ret = lba_to_arena(btt, sector, &premap, &arena);
+		if (ret)
+			goto out_lane;
+
+		cur_len = min(btt->sector_size, len);
+
+		ret = btt_map_read(arena, premap, &postmap, &t_flag, &e_flag);
+		if (ret)
+			goto out_lane;
+
+		/*
+		 * We loop to make sure that the post map LBA didn't change
+		 * from under us between writing the RTT and doing the actual
+		 * read.
+		 */
+		while (1) {
+			u32 new_map;
+
+			if (t_flag) {
+				zero_fill_data(page, off, cur_len);
+				goto out_lane;
+			}
+
+			if (e_flag) {
+				ret = -EIO;
+				goto out_lane;
+			}
+
+			arena->rtt[lane] = RTT_VALID | postmap;
+			/*
+			 * Barrier to make sure the RTT store above is not
+			 * reordered after the verification map_read below
+			 */
+			barrier();
+
+			ret = btt_map_read(arena, premap, &new_map, &t_flag,
+						&e_flag);
+			if (ret)
+				goto out_rtt;
+
+			if (postmap == new_map)
+				break;
+
+			postmap = new_map;
+		}
+
+		ret = btt_data_read(arena, page, off, postmap, cur_len);
+		if (ret)
+			goto out_rtt;
+
+		arena->rtt[lane] = RTT_INVALID;
+		nd_region_release_lane(btt->nd_region, lane);
+
+		len -= cur_len;
+		off += cur_len;
+		sector += btt->sector_size >> SECTOR_SHIFT;
+	}
+
+	return 0;
+
+ out_rtt:
+	arena->rtt[lane] = RTT_INVALID;
+ out_lane:
+	nd_region_release_lane(btt->nd_region, lane);
+	return ret;
+}
+
+static int btt_write_pg(struct btt *btt, sector_t sector, struct page *page,
+		unsigned int off, unsigned int len)
+{
+	int ret = 0;
+	struct arena_info *arena = NULL;
+	u32 premap = 0, old_postmap, new_postmap, lane = 0, i;
+	struct log_entry log;
+	int sub;
+
+	while (len) {
+		u32 cur_len;
+
+		lane = nd_region_acquire_lane(btt->nd_region);
+
+		ret = lba_to_arena(btt, sector, &premap, &arena);
+		if (ret)
+			goto out_lane;
+		cur_len = min(btt->sector_size, len);
+
+		if ((arena->flags & IB_FLAG_ERROR_MASK) != 0) {
+			ret = -EIO;
+			goto out_lane;
+		}
+
+		new_postmap = arena->freelist[lane].block;
+
+		/* Wait if the new block is being read from */
+		for (i = 0; i < arena->nfree; i++)
+			while (arena->rtt[i] == (RTT_VALID | new_postmap))
+				cpu_relax();
+
+
+		if (new_postmap >= arena->internal_nlba) {
+			ret = -EIO;
+			goto out_lane;
+		} else
+			ret = btt_data_write(arena, new_postmap, page,
+						off, cur_len);
+		if (ret)
+			goto out_lane;
+
+		lock_map(arena, premap);
+		ret = btt_map_read(arena, premap, &old_postmap, NULL, NULL);
+		if (ret)
+			goto out_map;
+		if (old_postmap >= arena->internal_nlba) {
+			ret = -EIO;
+			goto out_map;
+		}
+
+		log.lba = cpu_to_le32(premap);
+		log.old_map = cpu_to_le32(old_postmap);
+		log.new_map = cpu_to_le32(new_postmap);
+		log.seq = cpu_to_le32(arena->freelist[lane].seq);
+		sub = arena->freelist[lane].sub;
+		ret = btt_flog_write(arena, lane, sub, &log);
+		if (ret)
+			goto out_map;
+
+		ret = btt_map_write(arena, premap, new_postmap, 0, 0);
+		if (ret)
+			goto out_map;
+
+		unlock_map(arena, premap);
+		nd_region_release_lane(btt->nd_region, lane);
+
+		len -= cur_len;
+		off += cur_len;
+		sector += btt->sector_size >> SECTOR_SHIFT;
+	}
+
+	return 0;
+
+ out_map:
+	unlock_map(arena, premap);
+ out_lane:
+	nd_region_release_lane(btt->nd_region, lane);
+	return ret;
+}
+
+static int btt_do_bvec(struct btt *btt, struct page *page,
+			unsigned int len, unsigned int off, int rw,
+			sector_t sector)
+{
+	int ret;
+
+	if (rw == READ) {
+		ret = btt_read_pg(btt, page, off, sector, len);
+		flush_dcache_page(page);
+	} else {
+		flush_dcache_page(page);
+		ret = btt_write_pg(btt, sector, page, off, len);
+	}
+
+	return ret;
+}
+
+static void btt_make_request(struct request_queue *q, struct bio *bio)
+{
+	struct block_device *bdev = bio->bi_bdev;
+	struct btt *btt = q->queuedata;
+	int rw;
+	struct bio_vec bvec;
+	sector_t sector;
+	struct bvec_iter iter;
+	int err = 0;
+
+	sector = bio->bi_iter.bi_sector;
+	if (bio_end_sector(bio) > get_capacity(bdev->bd_disk)) {
+		err = -EIO;
+		goto out;
+	}
+
+	BUG_ON(bio->bi_rw & REQ_DISCARD);
+
+	rw = bio_rw(bio);
+	if (rw == READA)
+		rw = READ;
+
+	bio_for_each_segment(bvec, bio, iter) {
+		unsigned int len = bvec.bv_len;
+
+		BUG_ON(len > PAGE_SIZE);
+		/* Make sure len is in multiples of sector size. */
+		/* XXX is this right? */
+		BUG_ON(len < btt->sector_size);
+		BUG_ON(len % btt->sector_size);
+
+		err = btt_do_bvec(btt, bvec.bv_page, len, bvec.bv_offset,
+				rw, sector);
+		if (err) {
+			dev_info(&btt->nd_btt->dev,
+					"io error in %s sector %llu, len %u\n",
+					(rw == READ) ? "READ" : "WRITE",
+					(unsigned long long) sector, len);
+			goto out;
+		}
+		sector += len >> SECTOR_SHIFT;
+	}
+
+out:
+	bio_endio(bio, err);
+}
+
+static int btt_rw_page(struct block_device *bdev, sector_t sector,
+		struct page *page, int rw)
+{
+	struct btt *btt = bdev->bd_disk->private_data;
+
+	btt_do_bvec(btt, page, PAGE_CACHE_SIZE, 0, rw, sector);
+	page_endio(page, rw & WRITE, 0);
+	return 0;
+}
+
+
+static int btt_getgeo(struct block_device *bd, struct hd_geometry *geo)
+{
+	/* some standard values */
+	geo->heads = 1 << 6;
+	geo->sectors = 1 << 5;
+	geo->cylinders = get_capacity(bd->bd_disk) >> 11;
+	return 0;
+}
+
+static const struct block_device_operations btt_fops = {
+	.owner =		THIS_MODULE,
+	.rw_page =		btt_rw_page,
+	.getgeo =		btt_getgeo,
+};
+
+static int btt_blk_init(struct btt *btt)
+{
+	struct nd_btt *nd_btt = btt->nd_btt;
+	char name[BDEVNAME_SIZE];
+	int ret;
+
+	/* create a new disk and request queue for btt */
+	btt->btt_queue = blk_alloc_queue(GFP_KERNEL);
+	if (!btt->btt_queue)
+		return -ENOMEM;
+
+	btt->btt_disk = alloc_disk(0);
+	if (!btt->btt_disk) {
+		ret = -ENOMEM;
+		goto out_free_queue;
+	}
+
+	sprintf(btt->btt_disk->disk_name, "%ss",
+			bdevname(nd_btt->backing_dev, name));
+	btt->btt_disk->driverfs_dev = &btt->nd_btt->dev;
+	btt->btt_disk->major = btt_major;
+	btt->btt_disk->first_minor = 0;
+	btt->btt_disk->fops = &btt_fops;
+	btt->btt_disk->private_data = btt;
+	btt->btt_disk->queue = btt->btt_queue;
+	btt->btt_disk->flags = GENHD_FL_EXT_DEVT;
+
+	blk_queue_make_request(btt->btt_queue, btt_make_request);
+	blk_queue_max_hw_sectors(btt->btt_queue, 1024);
+	blk_queue_bounce_limit(btt->btt_queue, BLK_BOUNCE_ANY);
+	blk_queue_logical_block_size(btt->btt_queue, btt->sector_size);
+	btt->btt_queue->queuedata = btt;
+
+	set_capacity(btt->btt_disk,
+			btt->nlba * btt->sector_size >> SECTOR_SHIFT);
+	add_disk(btt->btt_disk);
+
+	return 0;
+
+out_free_queue:
+	blk_cleanup_queue(btt->btt_queue);
+	return ret;
+}
+
+static void btt_blk_cleanup(struct btt *btt)
+{
+	del_gendisk(btt->btt_disk);
+	put_disk(btt->btt_disk);
+	blk_cleanup_queue(btt->btt_queue);
+}
+
+/**
+ * btt_init - initialize a block translation table for the given device
+ * @nd_btt:	device with BTT geometry and backing device info
+ * @rawsize:	raw size in bytes of the backing device
+ * @lbasize:	lba size of the backing device
+ * @uuid:	A uuid for the backing device - this is stored on media
+ * @nd_region:	parent region, used to acquire lanes for parallel requests
+ *
+ * Initialize a Block Translation Table on a backing device to provide
+ * single sector power fail atomicity.
+ *
+ * Context:
+ * Might sleep.
+ *
+ * Returns:
+ * Pointer to a new struct btt on success, NULL on failure.
+ */
+static struct btt *btt_init(struct nd_btt *nd_btt, unsigned long long rawsize,
+		u32 lbasize, u8 *uuid, struct nd_region *nd_region)
+{
+	int ret;
+	struct btt *btt;
+	struct device *dev = &nd_btt->dev;
+
+	btt = kzalloc(sizeof(struct btt), GFP_KERNEL);
+	if (!btt)
+		return NULL;
+
+	btt->nd_btt = nd_btt;
+	btt->rawsize = rawsize;
+	btt->lbasize = lbasize;
+	btt->sector_size = ((lbasize >= 4096) ? 4096 : 512);
+	INIT_LIST_HEAD(&btt->arena_list);
+	mutex_init(&btt->init_lock);
+	btt->nd_region = nd_region;
+
+	ret = discover_arenas(btt);
+	if (ret) {
+		dev_err(dev, "init: error in discover_arenas: %d\n", ret);
+		goto out_free;
+	}
+
+	if (btt->init_state != INIT_READY) {
+		btt->num_arenas = (rawsize / ARENA_MAX_SIZE) +
+			((rawsize % ARENA_MAX_SIZE) ? 1 : 0);
+		dev_dbg(dev, "init: %d arenas for %llu rawsize\n",
+				btt->num_arenas, rawsize);
+
+		ret = create_arenas(btt);
+		if (ret) {
+			dev_info(dev, "init: create_arenas: %d\n", ret);
+			goto out_free;
+		}
+
+		ret = btt_meta_init(btt);
+		if (ret) {
+			dev_err(dev, "init: error in meta_init: %d\n", ret);
+			goto out_free;
+		}
+	}
+
+	ret = btt_blk_init(btt);
+	if (ret) {
+		dev_err(dev, "init: error in blk_init: %d\n", ret);
+		goto out_free;
+	}
+
+	btt_debugfs_init(btt);
+
+	return btt;
+
+ out_free:
+	kfree(btt);
+	return NULL;
+}
+
+/**
+ * btt_fini - de-initialize a BTT
+ * @btt:	the BTT handle that was generated by btt_init
+ *
+ * De-initialize a Block Translation Table on device removal
+ *
+ * Context:
+ * Might sleep.
+ */
+static void btt_fini(struct btt *btt)
+{
+	if (btt) {
+		btt_blk_cleanup(btt);
+		free_arenas(btt);
+		debugfs_remove_recursive(btt->debugfs_dir);
+		kfree(btt);
+	}
+}
+
+static int link_btt(struct nd_btt *nd_btt)
+{
+	struct block_device *bdev = nd_btt->backing_dev;
+	struct kobject *dir = &part_to_dev(bdev->bd_part)->kobj;
+
+	return sysfs_create_link(dir, &nd_btt->dev.kobj, "nd_btt");
+}
+
+static void unlink_btt(struct nd_btt *nd_btt)
+{
+	struct block_device *bdev = nd_btt->backing_dev;
+	struct kobject *dir;
+
+	/* if backing_dev was deleted first we may have nothing to unlink */
+	if (!nd_btt->backing_dev)
+		return;
+
+	dir = &part_to_dev(bdev->bd_part)->kobj;
+	sysfs_remove_link(dir, "nd_btt");
+}
+
+static struct nd_region *nd_btt_to_region(struct nd_btt *nd_btt)
+{
+	struct block_device *bdev = nd_btt->backing_dev;
+	struct device *disk_dev = disk_to_dev(bdev->bd_disk);
+	struct device *namespace_dev = disk_dev->parent;
+
+	return to_nd_region(namespace_dev->parent);
+}
+
+static int nd_btt_probe(struct device *dev)
+{
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+	struct nd_region *nd_region;
+	struct block_device *bdev;
+	struct btt *btt;
+	size_t rawsize;
+	int rc;
+
+	if (!nd_btt->uuid || !nd_btt->backing_dev || !nd_btt->lbasize)
+		return -ENODEV;
+
+	rc = link_btt(nd_btt);
+	if (rc)
+		return rc;
+
+	bdev = nd_btt->backing_dev;
+	sync_blockdev(bdev);
+	invalidate_bdev(bdev);
+	rawsize = (bdev->bd_part->nr_sects << SECTOR_SHIFT) - SZ_4K;
+	if (rawsize < ARENA_MIN_SIZE) {
+		rc = -ENXIO;
+		goto err_btt;
+	}
+	nd_region = nd_btt_to_region(nd_btt);
+	btt = btt_init(nd_btt, rawsize, nd_btt->lbasize, nd_btt->uuid,
+			nd_region);
+	if (!btt) {
+		rc = -ENOMEM;
+		goto err_btt;
+	}
+	dev_set_drvdata(dev, btt);
+
+	return 0;
+ err_btt:
+	unlink_btt(nd_btt);
+	return rc;
+}
+
+static int nd_btt_remove(struct device *dev)
+{
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+	struct btt *btt = dev_get_drvdata(dev);
+
+	btt_fini(btt);
+	unlink_btt(nd_btt);
+
+	return 0;
+}
+
+static struct nd_device_driver nd_btt_driver = {
+	.probe = nd_btt_probe,
+	.remove = nd_btt_remove,
+	.drv = {
+		.name = "nd_btt",
+	},
+	.type = ND_DRIVER_BTT,
+};
+
+static int __init nd_btt_init(void)
+{
+	int rc;
+
+	BUILD_BUG_ON(sizeof(struct btt_sb) != SZ_4K);
+
+	btt_major = register_blkdev(0, "btt");
+	if (btt_major < 0)
+		return btt_major;
+
+	debugfs_root = debugfs_create_dir("btt", NULL);
+	if (IS_ERR_OR_NULL(debugfs_root)) {
+		rc = -ENXIO;
+		goto err_debugfs;
+	}
+
+	rc = nd_driver_register(&nd_btt_driver);
+	if (rc < 0)
+		goto err_driver;
+	return 0;
+
+ err_driver:
+	debugfs_remove_recursive(debugfs_root);
+ err_debugfs:
+	unregister_blkdev(btt_major, "btt");
+
+	return rc;
+}
+
+static void __exit nd_btt_exit(void)
+{
+	driver_unregister(&nd_btt_driver.drv);
+	debugfs_remove_recursive(debugfs_root);
+	unregister_blkdev(btt_major, "btt");
+}
+
+MODULE_ALIAS_ND_DEVICE(ND_DEVICE_BTT);
+MODULE_AUTHOR("Vishal Verma <vishal.l.verma@linux.intel.com>");
+MODULE_LICENSE("GPL v2");
+module_init(nd_btt_init);
+module_exit(nd_btt_exit);
diff --git a/drivers/nvdimm/btt.h b/drivers/nvdimm/btt.h
index e8f6d8e0ddd3..8c95a7792c3e 100644
--- a/drivers/nvdimm/btt.h
+++ b/drivers/nvdimm/btt.h
@@ -19,6 +19,39 @@
 
 #define BTT_SIG_LEN 16
 #define BTT_SIG "BTT_ARENA_INFO\0"
+#define MAP_ENT_SIZE 4
+#define MAP_TRIM_SHIFT 31
+#define MAP_TRIM_MASK (1 << MAP_TRIM_SHIFT)
+#define MAP_ERR_SHIFT 30
+#define MAP_ERR_MASK (1 << MAP_ERR_SHIFT)
+#define MAP_LBA_MASK (~((1 << MAP_TRIM_SHIFT) | (1 << MAP_ERR_SHIFT)))
+#define MAP_ENT_NORMAL 0xC0000000
+#define LOG_ENT_SIZE sizeof(struct log_entry)
+#define ARENA_MIN_SIZE (1UL << 24)	/* 16 MB */
+#define ARENA_MAX_SIZE (1ULL << 39)	/* 512 GB */
+#define RTT_VALID (1UL << 31)
+#define RTT_INVALID 0
+#define INT_LBASIZE_ALIGNMENT 256
+#define BTT_PG_SIZE 4096
+#define BTT_DEFAULT_NFREE ND_MAX_LANES
+#define LOG_SEQ_INIT 1
+
+#define IB_FLAG_ERROR 0x00000001
+#define IB_FLAG_ERROR_MASK 0x00000001
+
+enum btt_init_state {
+	INIT_UNCHECKED = 0,
+	INIT_NOTFOUND,
+	INIT_READY
+};
+
+struct log_entry {
+	__le32 lba;
+	__le32 old_map;
+	__le32 new_map;
+	__le32 seq;
+	__le64 padding[2];
+};
 
 struct btt_sb {
 	u8 signature[BTT_SIG_LEN];
@@ -42,4 +75,112 @@ struct btt_sb {
 	__le64 checksum;
 };
 
+struct free_entry {
+	u32 block;
+	u8 sub;
+	u8 seq;
+};
+
+struct aligned_lock {
+	union {
+		spinlock_t lock;
+		u8 cacheline_padding[L1_CACHE_BYTES];
+	};
+};
+
+/**
+ * struct arena_info - handle for an arena
+ * @size:		Size in bytes this arena occupies on the raw device.
+ *			This includes arena metadata.
+ * @external_lba_start:	The first external LBA in this arena.
+ * @internal_nlba:	Number of internal blocks available in the arena
+ *			including nfree reserved blocks
+ * @internal_lbasize:	Internal and external lba sizes may be different as
+ *			we can round up 'odd' external lbasizes such as 520B
+ *			to be aligned.
+ * @external_nlba:	Number of blocks contributed by the arena to the number
+ *			reported to upper layers. (internal_nlba - nfree)
+ * @external_lbasize:	LBA size as exposed to upper layers.
+ * @nfree:		A reserve number of 'free' blocks that is used to
+ *			handle incoming writes.
+ * @version_major:	Metadata layout version major.
+ * @version_minor:	Metadata layout version minor.
+ * @nextoff:		Offset in bytes to the start of the next arena.
+ * @infooff:		Offset in bytes to the info block of this arena.
+ * @dataoff:		Offset in bytes to the data area of this arena.
+ * @mapoff:		Offset in bytes to the map area of this arena.
+ * @logoff:		Offset in bytes to the log area of this arena.
+ * @info2off:		Offset in bytes to the backup info block of this arena.
+ * @freelist:		Pointer to in-memory list of free blocks
+ * @rtt:		Pointer to in-memory "Read Tracking Table"
+ * @map_locks:		Spinlocks protecting concurrent map writes
+ * @nd_btt:		Pointer to parent nd_btt structure.
+ * @list:		List head for list of arenas
+ * @debugfs_dir:	Debugfs dentry
+ * @flags:		Arena flags - may signify error states.
+ *
+ * arena_info is a per-arena handle. Once an arena is narrowed down for an
+ * IO, this struct is passed around for the duration of the IO.
+ */
+struct arena_info {
+	u64 size;			/* Total bytes for this arena */
+	u64 external_lba_start;
+	u32 internal_nlba;
+	u32 internal_lbasize;
+	u32 external_nlba;
+	u32 external_lbasize;
+	u32 nfree;
+	u16 version_major;
+	u16 version_minor;
+	/* Byte offsets to the different on-media structures */
+	u64 nextoff;
+	u64 infooff;
+	u64 dataoff;
+	u64 mapoff;
+	u64 logoff;
+	u64 info2off;
+	/* Pointers to other in-memory structures for this arena */
+	struct free_entry *freelist;
+	u32 *rtt;
+	struct aligned_lock *map_locks;
+	struct nd_btt *nd_btt;
+	struct list_head list;
+	struct dentry *debugfs_dir;
+	/* Arena flags */
+	u32 flags;
+};
+
+/**
+ * struct btt - handle for a BTT instance
+ * @btt_disk:		Pointer to the gendisk for BTT device
+ * @btt_queue:		Pointer to the request queue for the BTT device
+ * @arena_list:		Head of the list of arenas
+ * @debugfs_dir:	Debugfs dentry
+ * @nd_btt:		Parent nd_btt struct
+ * @nlba:		Number of logical blocks exposed to the upper layers
+ *			after removing the amount of space needed by metadata
+ * @rawsize:		Total size in bytes of the available backing device
+ * @lbasize:		LBA size as requested and presented to upper layers.
+ *			This is sector_size + size of any metadata.
+ * @sector_size:	The Linux sector size - 512 or 4096
+ * @nd_region:		Pointer to the parent region (provides the I/O lanes)
+ * @init_lock:		Mutex used for the BTT initialization
+ * @init_state:		Flag describing the initialization state for the BTT
+ * @num_arenas:		Number of arenas in the BTT instance
+ */
+struct btt {
+	struct gendisk *btt_disk;
+	struct request_queue *btt_queue;
+	struct list_head arena_list;
+	struct dentry *debugfs_dir;
+	struct nd_btt *nd_btt;
+	u64 nlba;
+	unsigned long long rawsize;
+	u32 lbasize;
+	u32 sector_size;
+	struct nd_region *nd_region;
+	struct mutex init_lock;
+	int init_state;
+	int num_arenas;
+};
 #endif
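
As an illustration of the map-entry layout implied by the MAP_* definitions in the btt.h hunk above, here is a minimal userspace sketch that splits a 32-bit map entry into its flag bits and block number. The names map_fields, map_decode and map_encode are hypothetical and not part of the patch. The sketch works on CPU-order values (the on-media format is little-endian), uses unsigned literals for the shifts, and does not interpret flag combinations such as MAP_ENT_NORMAL.

```c
#include <stdint.h>

/* Bit positions as in the btt.h hunk; masks use unsigned literals here. */
#define MAP_TRIM_SHIFT 31
#define MAP_ERR_SHIFT  30
#define MAP_TRIM_MASK  (1U << MAP_TRIM_SHIFT)
#define MAP_ERR_MASK   (1U << MAP_ERR_SHIFT)
#define MAP_LBA_MASK   (~(MAP_TRIM_MASK | MAP_ERR_MASK))

/* Hypothetical helper type: the three fields packed into one map entry. */
struct map_fields {
	uint32_t postmap;	/* bits 29-0: internal (postmap) block number */
	int trim;		/* bit 31 */
	int error;		/* bit 30 */
};

/* Split a CPU-order map entry into its fields. */
static struct map_fields map_decode(uint32_t raw)
{
	struct map_fields f;

	f.postmap = raw & MAP_LBA_MASK;
	f.trim = !!(raw & MAP_TRIM_MASK);
	f.error = !!(raw & MAP_ERR_MASK);
	return f;
}

/* Pack a block number and the two flags back into one entry. */
static uint32_t map_encode(uint32_t postmap, int trim, int error)
{
	uint32_t raw = postmap & MAP_LBA_MASK;

	if (trim)
		raw |= MAP_TRIM_MASK;
	if (error)
		raw |= MAP_ERR_MASK;
	return raw;
}
```

A real caller would convert with le32_to_cpu()/cpu_to_le32() around these helpers before touching the media.
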
diff --git a/drivers/nvdimm/btt_devs.c b/drivers/nvdimm/btt_devs.c
index 2148fd8f535b..c03c854f892b 100644
--- a/drivers/nvdimm/btt_devs.c
+++ b/drivers/nvdimm/btt_devs.c
@@ -351,7 +351,8 @@ struct nd_btt *nd_btt_create(struct nvdimm_bus *nvdimm_bus)
  */
 u64 nd_btt_sb_checksum(struct btt_sb *btt_sb)
 {
-	u64 sum, sum_save;
+	u64 sum;
+	__le64 sum_save;
 
 	sum_save = btt_sb->checksum;
 	btt_sb->checksum = 0;
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index 3bd8d650340e..b876b839f49a 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -22,6 +22,12 @@
 #include "label.h"
 
 enum {
+	/*
+	 * Limits the maximum number of block apertures a dimm can
+	 * support and is an input to the geometry/on-disk-format of a
+	 * BTT instance
+	 */
+	ND_MAX_LANES = 256,
 	SECTOR_SHIFT = 9,
 };
 
@@ -84,7 +90,7 @@ struct nd_region {
 	u16 ndr_mappings;
 	u64 ndr_size;
 	u64 ndr_start;
-	int id;
+	int id, num_lanes;
 	void *provider_data;
 	struct nd_interleave_set *nd_set;
 	struct nd_mapping mapping[0];
@@ -136,6 +142,8 @@ struct nd_btt *to_nd_btt(struct device *dev);
 struct btt_sb;
 u64 nd_btt_sb_checksum(struct btt_sb *btt_sb);
 struct nd_region *to_nd_region(struct device *dev);
+unsigned int nd_region_acquire_lane(struct nd_region *nd_region);
+void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane);
 int nd_region_to_nstype(struct nd_region *nd_region);
 int nd_region_register_namespaces(struct nd_region *nd_region, int *err);
 u64 nd_region_interleave_set_cookie(struct nd_region *nd_region);
diff --git a/drivers/nvdimm/region.c b/drivers/nvdimm/region.c
index 9aba44e483e0..aa617bf86506 100644
--- a/drivers/nvdimm/region.c
+++ b/drivers/nvdimm/region.c
@@ -10,18 +10,106 @@
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  * General Public License for more details.
  */
+#include <linux/cpumask.h>
 #include <linux/module.h>
 #include <linux/device.h>
 #include <linux/nd.h>
 #include "nd.h"
 
+struct nd_percpu_lane {
+	int count[CONFIG_ND_MAX_REGIONS];
+	spinlock_t lock[CONFIG_ND_MAX_REGIONS];
+};
+
+static DEFINE_PER_CPU(struct nd_percpu_lane, nd_percpu_lane);
+
+static void __init nd_region_init_locks(void)
+{
+	unsigned int i, j;
+
+	for (i = 0; i < nr_cpu_ids; i++)
+		for (j = 0; j < CONFIG_ND_MAX_REGIONS; j++) {
+			struct nd_percpu_lane *ndl;
+
+			ndl = per_cpu_ptr(&nd_percpu_lane, i);
+			spin_lock_init(&ndl->lock[j]);
+			ndl->count[j] = 0;
+		}
+}
+
+/**
+ * nd_region_acquire_lane - allocate and lock a lane
+ * @nd_region: region id and number of lanes possible
+ *
+ * A lane correlates to a BLK-data-window and/or a log slot in the BTT.
+ * We optimize for the common case where there are 256 lanes, one
+ * per-cpu.  For larger systems we need to lock to share lanes.  For now
+ * this implementation assumes the cost of maintaining an allocator for
+ * free lanes is on the order of the lock hold time, so it implements a
+ * static lane = cpu % num_lanes mapping.
+ *
+ * In the case of a BTT instance on top of a BLK namespace a lane may be
+ * acquired recursively.  We lock on the first instance.
+ *
+ * In the case of a BTT instance on top of PMEM, we only acquire a lane
+ * for the BTT metadata updates.
+ */
+unsigned int nd_region_acquire_lane(struct nd_region *nd_region)
+{
+	unsigned int cpu, lane;
+
+	cpu = get_cpu();
+	if (nd_region->num_lanes < nr_cpu_ids) {
+		struct nd_percpu_lane *ndl_lock, *ndl_count;
+		unsigned int id = nd_region->id;
+
+		lane = cpu % nd_region->num_lanes;
+		ndl_count = per_cpu_ptr(&nd_percpu_lane, cpu);
+		ndl_lock = per_cpu_ptr(&nd_percpu_lane, lane);
+		if (ndl_count->count[id]++ == 0)
+			spin_lock(&ndl_lock->lock[id]);
+	} else
+		lane = cpu;
+
+	return lane;
+}
+EXPORT_SYMBOL(nd_region_acquire_lane);
+
+void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane)
+{
+	if (nd_region->num_lanes < nr_cpu_ids) {
+		unsigned int cpu = get_cpu();
+		unsigned int id = nd_region->id;
+		struct nd_percpu_lane *ndl_lock, *ndl_count;
+
+		ndl_count = per_cpu_ptr(&nd_percpu_lane, cpu);
+		ndl_lock = per_cpu_ptr(&nd_percpu_lane, lane);
+		if (--ndl_count->count[id] == 0)
+			spin_unlock(&ndl_lock->lock[id]);
+		put_cpu();
+	}
+	put_cpu();
+}
+EXPORT_SYMBOL(nd_region_release_lane);
+
 static int nd_region_probe(struct device *dev)
 {
 	int err;
+	static unsigned long once;
 	struct nd_region_namespaces *num_ns;
 	struct nd_region *nd_region = to_nd_region(dev);
 	int rc = nd_region_register_namespaces(nd_region, &err);
 
+	if (nd_region->num_lanes > num_online_cpus()
+			&& nd_region->num_lanes < num_possible_cpus()
+			&& !test_and_set_bit(0, &once)) {
+		dev_info(dev, "online cpus (%d) < concurrent i/o lanes (%d) < possible cpus (%d)\n",
+				num_online_cpus(), nd_region->num_lanes,
+				num_possible_cpus());
+		dev_info(dev, "setting nr_cpus=%d may yield better libnvdimm device performance\n",
+				nd_region->num_lanes);
+	}
+
 	num_ns = devm_kzalloc(dev, sizeof(*num_ns), GFP_KERNEL);
 	if (!num_ns)
 		return -ENOMEM;
@@ -84,6 +172,7 @@ static struct nd_device_driver nd_region_driver = {
 
 int __init nd_region_init(void)
 {
+	nd_region_init_locks();
 	return nd_driver_register(&nd_region_driver);
 }
 
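
The recursive acquisition described in the nd_region_acquire_lane() comment (lock on the first acquire, unlock on the last release) can be modelled in userspace roughly as follows. This is only a sketch under simplifying assumptions: a single region, a boolean standing in for the spinlock, no preemption control, and num_lanes smaller than the cpu count. lane_state, sketch_acquire and sketch_release are hypothetical names, not symbols from the patch.

```c
#include <stdbool.h>

#define SKETCH_NR_CPUS 8	/* assumed cpu count for the model */

struct lane_state {
	unsigned int num_lanes;
	int count[SKETCH_NR_CPUS];	/* nesting depth, indexed by cpu */
	bool locked[SKETCH_NR_CPUS];	/* lock state, indexed by lane */
};

/* Map the cpu to a lane; take the lane lock only on the first nesting level. */
static unsigned int sketch_acquire(struct lane_state *s, unsigned int cpu)
{
	unsigned int lane = cpu % s->num_lanes;

	if (s->count[cpu]++ == 0)
		s->locked[lane] = true;		/* spin_lock() in the patch */
	return lane;
}

/* Drop one nesting level; release the lane lock when the depth reaches 0. */
static void sketch_release(struct lane_state *s, unsigned int cpu,
		unsigned int lane)
{
	if (--s->count[cpu] == 0)
		s->locked[lane] = false;	/* spin_unlock() in the patch */
}
```

As in the patch, the nesting count is kept with the acquiring cpu while the lock lives in the slot of the lane it maps to, so a BTT on top of a BLK namespace can take the same lane twice without deadlocking.
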
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index ac21ce419beb..cfa68d6590d6 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -532,6 +532,12 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
 	if (nd_region->id < 0) {
 		kfree(nd_region);
 		return NULL;
+	} else if (nd_region->id >= CONFIG_ND_MAX_REGIONS) {
+		dev_err(&nvdimm_bus->dev, "max region limit %d reached\n",
+				CONFIG_ND_MAX_REGIONS);
+		ida_simple_remove(&region_ida, nd_region->id);
+		kfree(nd_region);
+		return NULL;
 	}
 
 	memcpy(nd_region->mapping, ndr_desc->nd_mapping,
@@ -545,6 +551,7 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
 	nd_region->ndr_mappings = ndr_desc->num_mappings;
 	nd_region->provider_data = ndr_desc->provider_data;
 	nd_region->nd_set = ndr_desc->nd_set;
+	nd_region->num_lanes = ndr_desc->num_lanes;
 	ida_init(&nd_region->ns_ida);
 	dev = &nd_region->dev;
 	dev_set_name(dev, "region%d", nd_region->id);
@@ -561,6 +568,7 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
 struct nd_region *nvdimm_pmem_region_create(struct nvdimm_bus *nvdimm_bus,
 		struct nd_region_desc *ndr_desc)
 {
+	ndr_desc->num_lanes = ND_MAX_LANES;
 	return nd_region_create(nvdimm_bus, ndr_desc, &nd_pmem_device_type,
 			__func__);
 }
@@ -571,6 +579,7 @@ struct nd_region *nvdimm_blk_region_create(struct nvdimm_bus *nvdimm_bus,
 {
 	if (ndr_desc->num_mappings > 1)
 		return NULL;
+	ndr_desc->num_lanes = min(ndr_desc->num_lanes, ND_MAX_LANES);
 	return nd_region_create(nvdimm_bus, ndr_desc, &nd_blk_device_type,
 			__func__);
 }
@@ -579,6 +588,7 @@ EXPORT_SYMBOL_GPL(nvdimm_blk_region_create);
 struct nd_region *nvdimm_volatile_region_create(struct nvdimm_bus *nvdimm_bus,
 		struct nd_region_desc *ndr_desc)
 {
+	ndr_desc->num_lanes = ND_MAX_LANES;
 	return nd_region_create(nvdimm_bus, ndr_desc, &nd_volatile_device_type,
 			__func__);
 }
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index a59dca17b3aa..531d99dfac68 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -85,6 +85,7 @@ struct nd_region_desc {
 	const struct attribute_group **attr_groups;
 	struct nd_interleave_set *nd_set;
 	void *provider_data;
+	int num_lanes;
 };
 
 struct nvdimm_bus;




* [PATCH 03/15] nd_btt: atomic sector updates
@ 2015-06-17 23:55   ` Dan Williams
  0 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-17 23:55 UTC (permalink / raw)
  To: axboe, linux-nvdimm
  Cc: boaz, toshi.kani, Vishal Verma, Neil Brown, Greg KH,
	Dave Chinner, linux-kernel, Andy Lutomirski, Jens Axboe,
	linux-acpi, Jeff Moyer, H. Peter Anvin, linux-fsdevel, hch,
	mingo

From: Vishal Verma <vishal.l.verma@linux.intel.com>

BTT stands for Block Translation Table, and is a way to provide power
fail sector atomicity semantics for block devices that have the ability
to perform byte granularity IO. It relies on the capability of libnvdimm
namespace devices to do byte aligned IO.

The BTT works as a stacked block device, and reserves a chunk of space
from the backing device for its accounting metadata. It is a bio-based
driver because all IO is done synchronously, and there is no queuing or
asynchronous completions at either the device or the driver level.

The BTT uses 'lanes' to index into various 'on-disk' data structures,
and lanes also act as a synchronization mechanism in case there are more
CPUs than available lanes. We compared two lane lock strategies. In the
first, we kept an atomic counter that tracked the last lane used, and
'our' lane was determined by atomically incrementing it. That way, for
the nr_cpus > nr_lanes case, theoretically, no CPU would be blocked
waiting for a lane. The other strategy was to take the number of the cpu
we're scheduled on and hash it to a lane number. Theoretically, this
could block an IO that could've otherwise run using a different, free
lane. But some fio workloads showed that the direct cpu -> lane hash
performed faster than tracking 'last lane' - my reasoning is that the
cache thrash caused by moving the atomic variable made that approach
slower than simply waiting out the in-progress IO. This supports the
conclusion that the driver can be a very simple bio-based one that does
synchronous IOs instead of queuing.
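
For readers comparing the two strategies, the following userspace sketch shows the lane selection step of each. It is an illustration only, not the code that was benchmarked: lane_round_robin and lane_from_cpu are hypothetical names, C11 atomics stand in for the kernel's atomic_t, and the per-lane locking that both strategies still need is left out.

```c
#include <stdatomic.h>

/* Strategy 1: one shared counter, bumped atomically on every IO. */
static atomic_uint last_lane;

static unsigned int lane_round_robin(unsigned int nr_lanes)
{
	/* atomic_fetch_add() returns the value before the increment */
	return atomic_fetch_add(&last_lane, 1) % nr_lanes;
}

/* Strategy 2: static hash of the cpu number, no shared state touched. */
static unsigned int lane_from_cpu(unsigned int cpu, unsigned int nr_lanes)
{
	return cpu % nr_lanes;
}
```

The first variant writes to one shared cache line on every IO, which is the cache thrash referred to above; the second reads nothing shared, at the cost of cpus 2 and 6 (with 4 lanes) always contending for the same lane.
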

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Neil Brown <neilb@suse.de>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
[jmoyer: fix nmi watchdog timeout in btt_map_init]
[jmoyer: move btt initialization to module load path]
[jmoyer: fix memory leak in the btt initialization path]
[jmoyer: Don't overwrite corrupted arenas]
Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 Documentation/nvdimm/btt.txt |  273 ++++++++
 drivers/acpi/nfit.c          |    1 
 drivers/nvdimm/Kconfig       |   27 +
 drivers/nvdimm/Makefile      |    3 
 drivers/nvdimm/btt.c         | 1449 ++++++++++++++++++++++++++++++++++++++++++
 drivers/nvdimm/btt.h         |  141 ++++
 drivers/nvdimm/btt_devs.c    |    3 
 drivers/nvdimm/nd.h          |   10 
 drivers/nvdimm/region.c      |   89 +++
 drivers/nvdimm/region_devs.c |   10 
 include/linux/libnvdimm.h    |    1 
 11 files changed, 2004 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/nvdimm/btt.txt
 create mode 100644 drivers/nvdimm/btt.c

diff --git a/Documentation/nvdimm/btt.txt b/Documentation/nvdimm/btt.txt
new file mode 100644
index 000000000000..95134d5ec4a0
--- /dev/null
+++ b/Documentation/nvdimm/btt.txt
@@ -0,0 +1,273 @@
+BTT - Block Translation Table
+=============================
+
+
+1. Introduction
+---------------
+
+Persistent memory based storage is able to perform IO at byte (or more
+accurately, cache line) granularity. However, we often want to expose such
+storage as traditional block devices. The block drivers for persistent memory
+will do exactly this. However, they do not provide any atomicity guarantees.
+Traditional SSDs typically provide protection against torn sectors in hardware,
+using stored energy in capacitors to complete in-flight block writes, or perhaps
+in firmware. We don't have this luxury with persistent memory - if a write is in
+progress, and we experience a power failure, the block will contain a mix of old
+and new data. Applications may not be prepared to handle such a scenario.
+
+The Block Translation Table (BTT) provides atomic sector update semantics for
+persistent memory devices, so that applications that rely on sector writes not
+being torn can continue to do so. The BTT manifests itself as a stacked block
+device, and reserves a portion of the underlying storage for its metadata. At
+the heart of it is an indirection table that re-maps all the blocks on the
+volume. It can be thought of as an extremely simple file system that only
+provides atomic sector updates.
+
+
+2. Static Layout
+----------------
+
+The underlying storage on which a BTT can be laid out is not limited in any way.
+The BTT, however, splits the available space into chunks of up to 512 GiB,
+called "Arenas".
+
+Each arena follows the same layout for its metadata, and all references in an
+arena are internal to it (with the exception of one field that points to the
+next arena). The following depicts the "On-disk" metadata layout:
+
+
+  Backing Store     +------->  Arena
++---------------+   |   +------------------+
+|               |   |   | Arena info block |
+|    Arena 0    +---+   |       4K         |
+|     512G      |       +------------------+
+|               |       |                  |
++---------------+       |                  |
+|               |       |                  |
+|    Arena 1    |       |   Data Blocks    |
+|     512G      |       |                  |
+|               |       |                  |
++---------------+       |                  |
+|       .       |       |                  |
+|       .       |       |                  |
+|       .       |       |                  |
+|               |       |                  |
+|               |       |                  |
++---------------+       +------------------+
+                        |                  |
+                        |     BTT Map      |
+                        |                  |
+                        |                  |
+                        +------------------+
+                        |                  |
+                        |     BTT Flog     |
+                        |                  |
+                        +------------------+
+                        | Info block copy  |
+                        |       4K         |
+                        +------------------+
+
+
+3. Theory of Operation
+----------------------
+
+
+a. The BTT Map
+--------------
+
+The map is a simple lookup/indirection table that maps an LBA to an internal
+block. Each map entry is 32 bits. The two most significant bits are special
+flags, and the remaining form the internal block number.
+
+Bit      Description
+31     : TRIM flag - marks if the block was trimmed or discarded
+30     : ERROR flag - marks an error block. Cleared on write.
+29 - 0 : Mappings to internal 'postmap' blocks
+
+
+Some of the terminology that will be subsequently used:
+
+External LBA  : LBA as made visible to upper layers.
+ABA           : Arena Block Address - Block offset/number within an arena
+Premap ABA    : The block offset into an arena, which was decided upon by range
+		checking the External LBA
+Postmap ABA   : The block number in the "Data Blocks" area obtained after
+		indirection from the map
+nfree	      : The number of free blocks that are maintained at any given time.
+		This is the number of concurrent writes that can happen to the
+		arena.
+
+
+For example, after adding a BTT, we surface a disk of 1024G. We get a read for
+the external LBA at 768G. This falls into the second arena, and of the 512G
+worth of blocks that this arena contributes, this block is at 256G. Thus, the
+premap ABA is 256G. We now refer to the map, and find out the mapping for block
+'X' (256G) points to block 'Y', say '64'. Thus the postmap ABA is 64.
+
+
+b. The BTT Flog
+---------------
+
+The BTT provides sector atomicity by making every write an "allocating write",
+i.e. every write goes to a "free" block. A running list of free blocks is
+maintained in the form of the BTT flog. 'Flog' is a combination of the words
+"free list" and "log". The flog contains 'nfree' entries, and an entry contains:
+
+lba     : The premap ABA that is being written to
+old_map : The old postmap ABA - after 'this' write completes, this will be a
+	  free block.
+new_map : The new postmap ABA. The map will be updated to reflect this
+	  lba->postmap_aba mapping, but we log it here in case we have to
+	  recover.
+seq	: Sequence number to mark which of the 2 sections of this flog entry is
+	  valid/newest. It cycles between 01->10->11->01 (binary) under normal
+	  operation, with 00 indicating an uninitialized state.
+lba'	: alternate lba entry
+old_map': alternate old postmap entry
+new_map': alternate new postmap entry
+seq'	: alternate sequence number.
+
+Each of the above fields is 32-bit, making one entry 32 bytes. Flog updates are
+done such that for any entry being written, it:
+a. overwrites the 'old' section in the entry based on sequence numbers
+b. writes the new entry such that the sequence number is written last.
+
+
+c. The concept of lanes
+-----------------------
+
+While 'nfree' describes the number of IOs an arena can process concurrently,
+'nlanes' is the number of IOs the BTT device as a whole can process
+concurrently.
+ nlanes = min(nfree, num_cpus)
+A lane number is obtained at the start of any IO, and is used for indexing into
+all the on-disk and in-memory data structures for the duration of the IO. It is
+protected by a spinlock.
+
+
+d. In-memory data structure: Read Tracking Table (RTT)
+------------------------------------------------------
+
+Consider a case where we have two threads, one doing reads and the other,
+writes. We can hit a condition where the writer thread grabs a free block to do
+a new IO, but the (slow) reader thread is still reading from it. In other words,
+the reader consulted a map entry, and started reading the corresponding block. A
+writer started writing to the same external LBA, and finished the write updating
+the map for that external LBA to point to its new postmap ABA. At this point the
+internal, postmap block that the reader is (still) reading has been inserted
+into the list of free blocks. If another write comes in for the same LBA, it can
+grab this free block, and start writing to it, causing the reader to read
+incorrect data. To prevent this, we introduce the RTT.
+
+The RTT is a simple, per arena table with 'nfree' entries. Every reader inserts
+into rtt[lane_number], the postmap ABA it is reading, and clears it after the
+read is complete. Every writer thread, after grabbing a free block, checks the
+RTT for its presence. If the postmap free block is in the RTT, it waits till the
+reader clears the RTT entry, and only then starts writing to it.
+
+
+e. In-memory data structure: map locks
+--------------------------------------
+
+Consider a case where two writer threads are writing to the same LBA. There can
+be a race in the following sequence of steps:
+
+free[lane] = map[premap_aba]
+map[premap_aba] = postmap_aba
+
+Both threads can update their respective free[lane] with the same old, freed
+postmap_aba. This has made the layout inconsistent by losing a free entry, and
+at the same time, duplicating another free entry for two lanes.
+
+To solve this, we could have a single map lock (per arena) that has to be taken
+before performing the above sequence, but we feel that could be too contentious.
+Instead we use an array of (nfree) map_locks that is indexed by
+(premap_aba modulo nfree).
+
+
+f. Reconstruction from the Flog
+-------------------------------
+
+On startup, we analyze the BTT flog to create our list of free blocks. We walk
+through all the entries, and for each lane, of the set of two possible
+'sections', we always look at the most recent one only (based on the sequence
+number). The reconstruction rules/steps are simple:
+- Read map[log_entry.lba].
+- If log_entry.new matches the map entry, then log_entry.old is free.
+- If log_entry.new does not match the map entry, then log_entry.new is free.
+  (This case can only be caused by power-fails/unsafe shutdowns)
+
+
+g. Summarizing - Read and Write flows
+-------------------------------------
+
+Read:
+
+1.  Convert external LBA to arena number + pre-map ABA
+2.  Get a lane (and take lane_lock)
+3.  Read map to get the entry for this pre-map ABA
+4.  Enter post-map ABA into RTT[lane]
+5.  If TRIM flag set in map, return zeroes, and end IO (go to step 8)
+6.  If ERROR flag set in map, end IO with EIO (go to step 8)
+7.  Read data from this block
+8.  Remove post-map ABA entry from RTT[lane]
+9.  Release lane (and lane_lock)
+
+Write:
+
+1.  Convert external LBA to Arena number + pre-map ABA
+2.  Get a lane (and take lane_lock)
+3.  Use lane to index into in-memory free list and obtain a new block, next flog
+        index, next sequence number
+4.  Scan the RTT to check if free block is present, and spin/wait if it is.
+5.  Write data to this free block
+6.  Read map to get the existing post-map ABA entry for this pre-map ABA
+7.  Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq_num]
+8.  Write new post-map ABA into map.
+9.  Write old post-map entry into the free list
+10. Calculate next sequence number and write into the free list entry
+11. Release lane (and lane_lock)
+
+
+4. Error Handling
+-----------------
+
+An arena would be in an error state if any of the metadata is corrupted
+irrecoverably, either due to a bug or a media error. The following conditions
+indicate an error:
+- Info block checksum does not match (and recovering from the copy also fails)
+- Not all internal available blocks are uniquely and entirely addressed by the
+  sum of mapped blocks and free blocks (from the BTT flog).
+- Rebuilding free list from the flog reveals missing/duplicate/impossible
+  entries
+- A map entry is out of bounds
+
+If any of these error conditions are encountered, the arena is put into a read
+only state using a flag in the info block.
+
+
+5. In-kernel usage
+------------------
+
+Any block driver that supports byte granularity IO to the storage may register
+with the BTT. It will have to provide the rw_bytes interface in its
+block_device_operations struct:
+
+	int (*rw_bytes)(struct gendisk *, void *, size_t, off_t, int rw);
+
+It may register with the BTT after it adds its own gendisk, using btt_init:
+
+	struct btt *btt_init(struct gendisk *disk, unsigned long long rawsize,
+			u32 lbasize, u8 uuid[], int maxlane);
+
+Note that maxlane is the maximum amount of concurrency the driver wishes to
+allow the BTT to use.
+
+The BTT 'disk' appears as a stacked block device that grabs the underlying block
+device in the O_EXCL mode.
+
+When the driver wishes to remove the backing disk, it should similarly call
+btt_fini using the same struct btt* handle that was provided to it by btt_init.
+
+	void btt_fini(struct btt *btt);
+
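
The reconstruction rules from section 3f of btt.txt above can be sketched in userspace as follows. This is a simplified model, not the driver's implementation: flog_section, seq_next, flog_newest and flog_free_block are hypothetical names, the map is a plain array of bare postmap block numbers (flag bits ignored), and the way the newer section is chosen is an assumption derived from the 01->10->11->01 sequence cycle described in section 3b.

```c
#include <stdint.h>

/* One of the two sections of a flog entry (CPU-order values). */
struct flog_section {
	uint32_t lba;		/* premap ABA that was written */
	uint32_t old_map;	/* postmap ABA before the write */
	uint32_t new_map;	/* postmap ABA after the write */
	uint32_t seq;		/* 1, 2 or 3; 0 means uninitialized */
};

/* Next value in the 1 -> 2 -> 3 -> 1 cycle. */
static uint32_t seq_next(uint32_t seq)
{
	return seq == 3 ? 1 : seq + 1;
}

/* Return the index (0 or 1) of the newer section, or -1 if undecidable. */
static int flog_newest(const struct flog_section pair[2])
{
	uint32_t a = pair[0].seq, b = pair[1].seq;

	if (a > 3 || b > 3 || (a == 0 && b == 0))
		return -1;
	if (b == 0)
		return 0;
	if (a == 0)
		return 1;
	if (seq_next(a) == b)
		return 1;
	if (seq_next(b) == a)
		return 0;
	return -1;	/* equal or otherwise inconsistent sequence numbers */
}

/* Apply the two rules: new matches map -> old is free, else new is free. */
static int flog_free_block(const struct flog_section pair[2],
		const uint32_t *map, uint32_t *free_block)
{
	int i = flog_newest(pair);
	const struct flog_section *e;

	if (i < 0)
		return -1;
	e = &pair[i];
	*free_block = (map[e->lba] == e->new_map) ? e->old_map : e->new_map;
	return 0;
}
```

The second branch (new_map is free) corresponds to a write whose flog entry reached the media but whose map update did not, which per the text can only follow a power failure or unsafe shutdown.
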
diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index 35af6f7f0abd..fc38b49eff7d 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -902,6 +902,7 @@ static int acpi_nfit_init_mapping(struct acpi_nfit_desc *acpi_desc,
 		} else {
 			nd_mapping->size = nfit_mem->bdw->capacity;
 			nd_mapping->start = nfit_mem->bdw->start_address;
+			ndr_desc->num_lanes = nfit_mem->bdw->windows;
 			blk_valid = 1;
 		}
 
diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index f16ba9d14740..a9def3839655 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -34,6 +34,31 @@ config BLK_DEV_PMEM
 	  Say Y if you want to use an NVDIMM
 
 config ND_BTT_DEVS
-	def_bool y
+	bool
+
+config ND_BTT
+	tristate "BTT: Block Translation Table (atomic sector updates)"
+	default LIBNVDIMM
+	select ND_BTT_DEVS
+	help
+	  The Block Translation Table (BTT) provides atomic sector
+	  update semantics for persistent memory devices, so that
+	  applications that rely on sector writes not being torn (a
+	  guarantee that typical disks provide) can continue to do so.
+	  The BTT manifests itself as a stacked block device, and
+	  reserves a portion of the underlying storage for its
+	  metadata.
+
+	  Select Y if unsure.
+
+config ND_MAX_REGIONS
+	int "Maximum number of regions supported by the sub-system"
+	default 64
+	---help---
+	  A 'region' corresponds to an individual DIMM or an interleave
+	  set of DIMMs.  A typical maximally configured system may have
+	  up to 32 DIMMs.
+
+	  Leave the default of 64 if you are unsure.
 
 endif
diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
index eb1bbce86592..aa5bb1acf831 100644
--- a/drivers/nvdimm/Makefile
+++ b/drivers/nvdimm/Makefile
@@ -1,8 +1,11 @@
 obj-$(CONFIG_LIBNVDIMM) += libnvdimm.o
 obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
+obj-$(CONFIG_ND_BTT) += nd_btt.o
 
 nd_pmem-y := pmem.o
 
+nd_btt-y := btt.o
+
 libnvdimm-y := core.o
 libnvdimm-y += bus.o
 libnvdimm-y += dimm_devs.o
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
new file mode 100644
index 000000000000..58becbd69ae1
--- /dev/null
+++ b/drivers/nvdimm/btt.c
@@ -0,0 +1,1449 @@
+/*
+ * Block Translation Table
+ * Copyright (c) 2014-2015, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+#include <linux/highmem.h>
+#include <linux/debugfs.h>
+#include <linux/blkdev.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/mutex.h>
+#include <linux/hdreg.h>
+#include <linux/genhd.h>
+#include <linux/sizes.h>
+#include <linux/ndctl.h>
+#include <linux/fs.h>
+#include <linux/nd.h>
+#include "btt.h"
+#include "nd.h"
+
+enum log_ent_request {
+	LOG_NEW_ENT = 0,
+	LOG_OLD_ENT
+};
+
+static int btt_major;
+
+static int arena_read_bytes(struct arena_info *arena, resource_size_t offset,
+		void *buf, size_t n)
+{
+	struct nd_btt *nd_btt = arena->nd_btt;
+	struct block_device *bdev = nd_btt->backing_dev;
+
+	/* arena offsets are 4K from the base of the device */
+	offset += SZ_4K;
+	return bdev_read_bytes(bdev, offset, buf, n);
+}
+
+static int arena_write_bytes(struct arena_info *arena, resource_size_t offset,
+		void *buf, size_t n)
+{
+	struct nd_btt *nd_btt = arena->nd_btt;
+	struct block_device *bdev = nd_btt->backing_dev;
+
+	/* arena offsets are 4K from the base of the device */
+	offset += SZ_4K;
+	return bdev_write_bytes(bdev, offset, buf, n);
+}
+
+static int btt_info_write(struct arena_info *arena, struct btt_sb *super)
+{
+	int ret;
+
+	ret = arena_write_bytes(arena, arena->info2off, super,
+			sizeof(struct btt_sb));
+	if (ret)
+		return ret;
+
+	return arena_write_bytes(arena, arena->infooff, super,
+			sizeof(struct btt_sb));
+}
+
+static int btt_info_read(struct arena_info *arena, struct btt_sb *super)
+{
+	WARN_ON(!super);
+	return arena_read_bytes(arena, arena->infooff, super,
+			sizeof(struct btt_sb));
+}
+
+/*
+ * 'raw' version of btt_map write
+ * Assumptions:
+ *   mapping is in little-endian
+ *   mapping contains 'E' and 'Z' flags as desired
+ */
+static int __btt_map_write(struct arena_info *arena, u32 lba, __le32 mapping)
+{
+	u64 ns_off = arena->mapoff + (lba * MAP_ENT_SIZE);
+
+	WARN_ON(lba >= arena->external_nlba);
+	return arena_write_bytes(arena, ns_off, &mapping, MAP_ENT_SIZE);
+}
+
+static int btt_map_write(struct arena_info *arena, u32 lba, u32 mapping,
+			u32 z_flag, u32 e_flag)
+{
+	u32 ze;
+	__le32 mapping_le;
+
+	/*
+	 * This 'mapping' is supposed to be just the LBA mapping, without
+	 * any flags set, so strip the flag bits.
+	 */
+	mapping &= MAP_LBA_MASK;
+
+	ze = (z_flag << 1) + e_flag;
+	switch (ze) {
+	case 0:
+		/*
+		 * We want to set neither of the Z or E flags, and
+		 * in the actual layout, this means setting the bit
+		 * positions of both to '1' to indicate a 'normal'
+		 * map entry
+		 */
+		mapping |= MAP_ENT_NORMAL;
+		break;
+	case 1:
+		mapping |= (1 << MAP_ERR_SHIFT);
+		break;
+	case 2:
+		mapping |= (1 << MAP_TRIM_SHIFT);
+		break;
+	default:
+		/*
+		 * The case where Z and E are both sent in as '1' could be
+		 * construed as a valid 'normal' case, but we decide not to,
+		 * to avoid confusion
+		 */
+		WARN_ONCE(1, "Invalid use of Z and E flags\n");
+		return -EIO;
+	}
+
+	mapping_le = cpu_to_le32(mapping);
+	return __btt_map_write(arena, lba, mapping_le);
+}
+
+static int btt_map_read(struct arena_info *arena, u32 lba, u32 *mapping,
+			int *trim, int *error)
+{
+	int ret;
+	__le32 in;
+	u32 raw_mapping, postmap, ze, z_flag, e_flag;
+	u64 ns_off = arena->mapoff + (lba * MAP_ENT_SIZE);
+
+	WARN_ON(lba >= arena->external_nlba);
+
+	ret = arena_read_bytes(arena, ns_off, &in, MAP_ENT_SIZE);
+	if (ret)
+		return ret;
+
+	raw_mapping = le32_to_cpu(in);
+
+	z_flag = (raw_mapping & MAP_TRIM_MASK) >> MAP_TRIM_SHIFT;
+	e_flag = (raw_mapping & MAP_ERR_MASK) >> MAP_ERR_SHIFT;
+	ze = (z_flag << 1) + e_flag;
+	postmap = raw_mapping & MAP_LBA_MASK;
+
+	/* Reuse the {z,e}_flag variables for *trim and *error */
+	z_flag = 0;
+	e_flag = 0;
+
+	switch (ze) {
+	case 0:
+		/* Initial state. Return postmap = premap */
+		*mapping = lba;
+		break;
+	case 1:
+		*mapping = postmap;
+		e_flag = 1;
+		break;
+	case 2:
+		*mapping = postmap;
+		z_flag = 1;
+		break;
+	case 3:
+		*mapping = postmap;
+		break;
+	default:
+		return -EIO;
+	}
+
+	if (trim)
+		*trim = z_flag;
+	if (error)
+		*error = e_flag;
+
+	return ret;
+}
+
+static int btt_log_read_pair(struct arena_info *arena, u32 lane,
+			struct log_entry *ent)
+{
+	WARN_ON(!ent);
+	return arena_read_bytes(arena,
+			arena->logoff + (2 * lane * LOG_ENT_SIZE), ent,
+			2 * LOG_ENT_SIZE);
+}
+
+static struct dentry *debugfs_root;
+
+static void arena_debugfs_init(struct arena_info *a, struct dentry *parent,
+				int idx)
+{
+	char dirname[32];
+	struct dentry *d;
+
+	/* If for some reason, parent bttN was not created, exit */
+	if (!parent)
+		return;
+
+	snprintf(dirname, 32, "arena%d", idx);
+	d = debugfs_create_dir(dirname, parent);
+	if (IS_ERR_OR_NULL(d))
+		return;
+	a->debugfs_dir = d;
+
+	debugfs_create_x64("size", S_IRUGO, d, &a->size);
+	debugfs_create_x64("external_lba_start", S_IRUGO, d,
+				&a->external_lba_start);
+	debugfs_create_x32("internal_nlba", S_IRUGO, d, &a->internal_nlba);
+	debugfs_create_u32("internal_lbasize", S_IRUGO, d,
+				&a->internal_lbasize);
+	debugfs_create_x32("external_nlba", S_IRUGO, d, &a->external_nlba);
+	debugfs_create_u32("external_lbasize", S_IRUGO, d,
+				&a->external_lbasize);
+	debugfs_create_u32("nfree", S_IRUGO, d, &a->nfree);
+	debugfs_create_u16("version_major", S_IRUGO, d, &a->version_major);
+	debugfs_create_u16("version_minor", S_IRUGO, d, &a->version_minor);
+	debugfs_create_x64("nextoff", S_IRUGO, d, &a->nextoff);
+	debugfs_create_x64("infooff", S_IRUGO, d, &a->infooff);
+	debugfs_create_x64("dataoff", S_IRUGO, d, &a->dataoff);
+	debugfs_create_x64("mapoff", S_IRUGO, d, &a->mapoff);
+	debugfs_create_x64("logoff", S_IRUGO, d, &a->logoff);
+	debugfs_create_x64("info2off", S_IRUGO, d, &a->info2off);
+	debugfs_create_x32("flags", S_IRUGO, d, &a->flags);
+}
+
+static void btt_debugfs_init(struct btt *btt)
+{
+	int i = 0;
+	struct arena_info *arena;
+
+	btt->debugfs_dir = debugfs_create_dir(dev_name(&btt->nd_btt->dev),
+						debugfs_root);
+	if (IS_ERR_OR_NULL(btt->debugfs_dir))
+		return;
+
+	list_for_each_entry(arena, &btt->arena_list, list) {
+		arena_debugfs_init(arena, btt->debugfs_dir, i);
+		i++;
+	}
+}
+
+/*
+ * This function accepts a pair of log entries and uses their
+ * sequence numbers to find the 'older' entry.
+ * If slot 0 has never been written (seq == 0), its sequence number
+ * is set to 1 and slot 0 is reported as the older entry.
+ * Otherwise it returns the index (0 or 1) of the older entry, or
+ * -EINVAL if the sequence numbers are inconsistent.
+ * TODO: the logic feels a bit kludgy; simplify it.
+ */
+static int btt_log_get_old(struct log_entry *ent)
+{
+	int old;
+
+	/*
+	 * the first ever time this is seen, the entry goes into [0]
+	 * the next time, the following logic works out to put this
+	 * (next) entry into [1]
+	 */
+	if (ent[0].seq == 0) {
+		ent[0].seq = cpu_to_le32(1);
+		return 0;
+	}
+
+	if (ent[0].seq == ent[1].seq)
+		return -EINVAL;
+	if (le32_to_cpu(ent[0].seq) + le32_to_cpu(ent[1].seq) > 5)
+		return -EINVAL;
+
+	if (le32_to_cpu(ent[0].seq) < le32_to_cpu(ent[1].seq)) {
+		if (le32_to_cpu(ent[1].seq) - le32_to_cpu(ent[0].seq) == 1)
+			old = 0;
+		else
+			old = 1;
+	} else {
+		if (le32_to_cpu(ent[0].seq) - le32_to_cpu(ent[1].seq) == 1)
+			old = 1;
+		else
+			old = 0;
+	}
+
+	return old;
+}
+
+static struct device *to_dev(struct arena_info *arena)
+{
+	return &arena->nd_btt->dev;
+}
+
+/*
+ * This function copies the desired (old/new) log entry into ent if
+ * it is not NULL. It returns the sub-slot number (0 or 1)
+ * where the desired log entry was found. Negative return values
+ * indicate errors.
+ */
+static int btt_log_read(struct arena_info *arena, u32 lane,
+			struct log_entry *ent, int old_flag)
+{
+	int ret;
+	int old_ent, ret_ent;
+	struct log_entry log[2];
+
+	ret = btt_log_read_pair(arena, lane, log);
+	if (ret)
+		return -EIO;
+
+	old_ent = btt_log_get_old(log);
+	if (old_ent < 0 || old_ent > 1) {
+		dev_info(to_dev(arena), "log corruption (%d): lane %d seq [%d, %d]\n",
+			old_ent, lane, le32_to_cpu(log[0].seq),
+			le32_to_cpu(log[1].seq));
+		/* TODO set error state? */
+		return -EIO;
+	}
+
+	ret_ent = (old_flag ? old_ent : (1 - old_ent));
+
+	if (ent != NULL)
+		memcpy(ent, &log[ret_ent], LOG_ENT_SIZE);
+
+	return ret_ent;
+}
+
+/*
+ * This function commits a log entry to media
+ * It does _not_ prepare the freelist entry for the next write
+ * btt_flog_write is the wrapper for updating the freelist elements
+ */
+static int __btt_log_write(struct arena_info *arena, u32 lane,
+			u32 sub, struct log_entry *ent)
+{
+	int ret;
+	/*
+	 * Ignore the padding in log_entry for calculating log_half.
+	 * The entry is 'committed' when we write the sequence number,
+	 * and we want to ensure that that is the last thing written.
+	 * We don't bother writing the padding as that would be extra
+	 * media wear and write amplification
+	 */
+	unsigned int log_half = (LOG_ENT_SIZE - 2 * sizeof(u64)) / 2;
+	u64 ns_off = arena->logoff + (((2 * lane) + sub) * LOG_ENT_SIZE);
+	void *src = ent;
+
+	/* split the 16B write into atomic, durable halves */
+	ret = arena_write_bytes(arena, ns_off, src, log_half);
+	if (ret)
+		return ret;
+
+	ns_off += log_half;
+	src += log_half;
+	return arena_write_bytes(arena, ns_off, src, log_half);
+}
+
+static int btt_flog_write(struct arena_info *arena, u32 lane, u32 sub,
+			struct log_entry *ent)
+{
+	int ret;
+
+	ret = __btt_log_write(arena, lane, sub, ent);
+	if (ret)
+		return ret;
+
+	/* prepare the next free entry */
+	arena->freelist[lane].sub = 1 - arena->freelist[lane].sub;
+	if (++(arena->freelist[lane].seq) == 4)
+		arena->freelist[lane].seq = 1;
+	arena->freelist[lane].block = le32_to_cpu(ent->old_map);
+
+	return ret;
+}
+
+/*
+ * This function initializes the BTT map to the initial state, which is
+ * all-zeroes, and indicates an identity mapping
+ */
+static int btt_map_init(struct arena_info *arena)
+{
+	int ret = -EINVAL;
+	void *zerobuf;
+	size_t offset = 0;
+	size_t chunk_size = SZ_2M;
+	size_t mapsize = arena->logoff - arena->mapoff;
+
+	zerobuf = kzalloc(chunk_size, GFP_KERNEL);
+	if (!zerobuf)
+		return -ENOMEM;
+
+	while (mapsize) {
+		size_t size = min(mapsize, chunk_size);
+
+		ret = arena_write_bytes(arena, arena->mapoff + offset, zerobuf,
+				size);
+		if (ret)
+			goto free;
+
+		offset += size;
+		mapsize -= size;
+		cond_resched();
+	}
+
+ free:
+	kfree(zerobuf);
+	return ret;
+}
+
+/*
+ * This function initializes the BTT log with 'fake' entries pointing
+ * to the initial reserved set of blocks as being free
+ */
+static int btt_log_init(struct arena_info *arena)
+{
+	int ret;
+	u32 i;
+	struct log_entry log, zerolog;
+
+	memset(&zerolog, 0, sizeof(zerolog));
+
+	for (i = 0; i < arena->nfree; i++) {
+		log.lba = cpu_to_le32(i);
+		log.old_map = cpu_to_le32(arena->external_nlba + i);
+		log.new_map = cpu_to_le32(arena->external_nlba + i);
+		log.seq = cpu_to_le32(LOG_SEQ_INIT);
+		ret = __btt_log_write(arena, i, 0, &log);
+		if (ret)
+			return ret;
+		ret = __btt_log_write(arena, i, 1, &zerolog);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+static int btt_freelist_init(struct arena_info *arena)
+{
+	int old, new, ret;
+	u32 i, map_entry;
+	struct log_entry log_new, log_old;
+
+	arena->freelist = kcalloc(arena->nfree, sizeof(struct free_entry),
+					GFP_KERNEL);
+	if (!arena->freelist)
+		return -ENOMEM;
+
+	for (i = 0; i < arena->nfree; i++) {
+		old = btt_log_read(arena, i, &log_old, LOG_OLD_ENT);
+		if (old < 0)
+			return old;
+
+		new = btt_log_read(arena, i, &log_new, LOG_NEW_ENT);
+		if (new < 0)
+			return new;
+
+		/* sub points to the next one to be overwritten */
+		arena->freelist[i].sub = 1 - new;
+		arena->freelist[i].seq = nd_inc_seq(le32_to_cpu(log_new.seq));
+		arena->freelist[i].block = le32_to_cpu(log_new.old_map);
+
+		/* This implies a newly created or untouched flog entry */
+		if (log_new.old_map == log_new.new_map)
+			continue;
+
+		/* Check if map recovery is needed */
+		ret = btt_map_read(arena, le32_to_cpu(log_new.lba), &map_entry,
+				NULL, NULL);
+		if (ret)
+			return ret;
+		if ((le32_to_cpu(log_new.new_map) != map_entry) &&
+				(le32_to_cpu(log_new.old_map) == map_entry)) {
+			/*
+			 * Last transaction wrote the flog, but wasn't able
+			 * to complete the map write. So fix up the map.
+			 */
+			ret = btt_map_write(arena, le32_to_cpu(log_new.lba),
+					le32_to_cpu(log_new.new_map), 0, 0);
+			if (ret)
+				return ret;
+		}
+
+	}
+
+	return 0;
+}
+
+static int btt_rtt_init(struct arena_info *arena)
+{
+	arena->rtt = kcalloc(arena->nfree, sizeof(u32), GFP_KERNEL);
+	if (arena->rtt == NULL)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int btt_maplocks_init(struct arena_info *arena)
+{
+	u32 i;
+
+	arena->map_locks = kcalloc(arena->nfree, sizeof(struct aligned_lock),
+				GFP_KERNEL);
+	if (!arena->map_locks)
+		return -ENOMEM;
+
+	for (i = 0; i < arena->nfree; i++)
+		spin_lock_init(&arena->map_locks[i].lock);
+
+	return 0;
+}
+
+static struct arena_info *alloc_arena(struct btt *btt, size_t size,
+				size_t start, size_t arena_off)
+{
+	struct arena_info *arena;
+	u64 logsize, mapsize, datasize;
+	u64 available = size;
+
+	arena = kzalloc(sizeof(struct arena_info), GFP_KERNEL);
+	if (!arena)
+		return NULL;
+	arena->nd_btt = btt->nd_btt;
+
+	if (!size)
+		return arena;
+
+	arena->size = size;
+	arena->external_lba_start = start;
+	arena->external_lbasize = btt->lbasize;
+	arena->internal_lbasize = roundup(arena->external_lbasize,
+					INT_LBASIZE_ALIGNMENT);
+	arena->nfree = BTT_DEFAULT_NFREE;
+	arena->version_major = 1;
+	arena->version_minor = 1;
+
+	if (available % BTT_PG_SIZE)
+		available -= (available % BTT_PG_SIZE);
+
+	/* Two pages are reserved for the super block and its copy */
+	available -= 2 * BTT_PG_SIZE;
+
+	/* The log takes a fixed amount of space based on nfree */
+	logsize = roundup(2 * arena->nfree * sizeof(struct log_entry),
+				BTT_PG_SIZE);
+	available -= logsize;
+
+	/* Calculate optimal split between map and data area */
+	arena->internal_nlba = div_u64(available - BTT_PG_SIZE,
+			arena->internal_lbasize + MAP_ENT_SIZE);
+	arena->external_nlba = arena->internal_nlba - arena->nfree;
+
+	mapsize = roundup((arena->external_nlba * MAP_ENT_SIZE), BTT_PG_SIZE);
+	datasize = available - mapsize;
+
+	/* 'Absolute' values, relative to start of storage space */
+	arena->infooff = arena_off;
+	arena->dataoff = arena->infooff + BTT_PG_SIZE;
+	arena->mapoff = arena->dataoff + datasize;
+	arena->logoff = arena->mapoff + mapsize;
+	arena->info2off = arena->logoff + logsize;
+	return arena;
+}
+
+static void free_arenas(struct btt *btt)
+{
+	struct arena_info *arena, *next;
+
+	list_for_each_entry_safe(arena, next, &btt->arena_list, list) {
+		list_del(&arena->list);
+		kfree(arena->rtt);
+		kfree(arena->map_locks);
+		kfree(arena->freelist);
+		debugfs_remove_recursive(arena->debugfs_dir);
+		kfree(arena);
+	}
+}
+
+/*
+ * This function checks if the metadata layout is valid and error free
+ */
+static int arena_is_valid(struct arena_info *arena, struct btt_sb *super,
+				u8 *uuid, u32 lbasize)
+{
+	u64 checksum;
+
+	if (memcmp(super->uuid, uuid, 16))
+		return 0;
+
+	checksum = le64_to_cpu(super->checksum);
+	super->checksum = 0;
+	if (checksum != nd_btt_sb_checksum(super))
+		return 0;
+	super->checksum = cpu_to_le64(checksum);
+
+	if (lbasize != le32_to_cpu(super->external_lbasize))
+		return 0;
+
+	/* TODO: figure out action for this */
+	if ((le32_to_cpu(super->flags) & IB_FLAG_ERROR_MASK) != 0)
+		dev_info(to_dev(arena), "Found arena with an error flag\n");
+
+	return 1;
+}
+
+/*
+ * This function reads an existing valid btt superblock and
+ * populates the corresponding arena_info struct
+ */
+static void parse_arena_meta(struct arena_info *arena, struct btt_sb *super,
+				u64 arena_off)
+{
+	arena->internal_nlba = le32_to_cpu(super->internal_nlba);
+	arena->internal_lbasize = le32_to_cpu(super->internal_lbasize);
+	arena->external_nlba = le32_to_cpu(super->external_nlba);
+	arena->external_lbasize = le32_to_cpu(super->external_lbasize);
+	arena->nfree = le32_to_cpu(super->nfree);
+	arena->version_major = le16_to_cpu(super->version_major);
+	arena->version_minor = le16_to_cpu(super->version_minor);
+
+	arena->nextoff = (super->nextoff == 0) ? 0 : (arena_off +
+			le64_to_cpu(super->nextoff));
+	arena->infooff = arena_off;
+	arena->dataoff = arena_off + le64_to_cpu(super->dataoff);
+	arena->mapoff = arena_off + le64_to_cpu(super->mapoff);
+	arena->logoff = arena_off + le64_to_cpu(super->logoff);
+	arena->info2off = arena_off + le64_to_cpu(super->info2off);
+
+	arena->size = (super->nextoff > 0) ? (le64_to_cpu(super->nextoff)) :
+			(arena->info2off - arena->infooff + BTT_PG_SIZE);
+
+	arena->flags = le32_to_cpu(super->flags);
+}
+
+static int discover_arenas(struct btt *btt)
+{
+	int ret = 0;
+	struct arena_info *arena;
+	struct btt_sb *super;
+	size_t remaining = btt->rawsize;
+	u64 cur_nlba = 0;
+	size_t cur_off = 0;
+	int num_arenas = 0;
+
+	super = kzalloc(sizeof(*super), GFP_KERNEL);
+	if (!super)
+		return -ENOMEM;
+
+	while (remaining) {
+		/* Alloc memory for arena */
+		arena = alloc_arena(btt, 0, 0, 0);
+		if (!arena) {
+			ret = -ENOMEM;
+			goto out_super;
+		}
+
+		arena->infooff = cur_off;
+		ret = btt_info_read(arena, super);
+		if (ret)
+			goto out;
+
+		if (!arena_is_valid(arena, super, btt->nd_btt->uuid,
+				btt->lbasize)) {
+			if (remaining == btt->rawsize) {
+				btt->init_state = INIT_NOTFOUND;
+				dev_info(to_dev(arena), "No existing arenas\n");
+				goto out;
+			} else {
+				dev_info(to_dev(arena),
+						"Found corrupted metadata!\n");
+				ret = -ENODEV;
+				goto out;
+			}
+		}
+
+		arena->external_lba_start = cur_nlba;
+		parse_arena_meta(arena, super, cur_off);
+
+		ret = btt_freelist_init(arena);
+		if (ret)
+			goto out;
+
+		ret = btt_rtt_init(arena);
+		if (ret)
+			goto out;
+
+		ret = btt_maplocks_init(arena);
+		if (ret)
+			goto out;
+
+		list_add_tail(&arena->list, &btt->arena_list);
+
+		remaining -= arena->size;
+		cur_off += arena->size;
+		cur_nlba += arena->external_nlba;
+		num_arenas++;
+
+		if (arena->nextoff == 0)
+			break;
+	}
+	btt->num_arenas = num_arenas;
+	btt->nlba = cur_nlba;
+	btt->init_state = INIT_READY;
+
+	kfree(super);
+	return ret;
+
+ out:
+	kfree(arena);
+	free_arenas(btt);
+ out_super:
+	kfree(super);
+	return ret;
+}
+
+static int create_arenas(struct btt *btt)
+{
+	size_t remaining = btt->rawsize;
+	size_t cur_off = 0;
+
+	while (remaining) {
+		struct arena_info *arena;
+		size_t arena_size = min_t(u64, ARENA_MAX_SIZE, remaining);
+
+		remaining -= arena_size;
+		if (arena_size < ARENA_MIN_SIZE)
+			break;
+
+		arena = alloc_arena(btt, arena_size, btt->nlba, cur_off);
+		if (!arena) {
+			free_arenas(btt);
+			return -ENOMEM;
+		}
+		btt->nlba += arena->external_nlba;
+		if (remaining >= ARENA_MIN_SIZE)
+			arena->nextoff = arena->size;
+		else
+			arena->nextoff = 0;
+		cur_off += arena_size;
+		list_add_tail(&arena->list, &btt->arena_list);
+	}
+
+	return 0;
+}
+
+/*
+ * This function completes arena initialization by writing
+ * all the metadata.
+ * It is called from btt_meta_init() for each newly created arena
+ * that does not yet have an on-media layout.
+ */
+static int btt_arena_write_layout(struct arena_info *arena, u8 *uuid)
+{
+	int ret;
+	struct btt_sb *super;
+
+	ret = btt_map_init(arena);
+	if (ret)
+		return ret;
+
+	ret = btt_log_init(arena);
+	if (ret)
+		return ret;
+
+	super = kzalloc(sizeof(struct btt_sb), GFP_NOIO);
+	if (!super)
+		return -ENOMEM;
+
+	strncpy(super->signature, BTT_SIG, BTT_SIG_LEN);
+	memcpy(super->uuid, uuid, 16);
+	super->flags = cpu_to_le32(arena->flags);
+	super->version_major = cpu_to_le16(arena->version_major);
+	super->version_minor = cpu_to_le16(arena->version_minor);
+	super->external_lbasize = cpu_to_le32(arena->external_lbasize);
+	super->external_nlba = cpu_to_le32(arena->external_nlba);
+	super->internal_lbasize = cpu_to_le32(arena->internal_lbasize);
+	super->internal_nlba = cpu_to_le32(arena->internal_nlba);
+	super->nfree = cpu_to_le32(arena->nfree);
+	super->infosize = cpu_to_le32(sizeof(struct btt_sb));
+	super->nextoff = cpu_to_le64(arena->nextoff);
+	/*
+	 * Subtract arena->infooff (arena start) so numbers are relative
+	 * to 'this' arena
+	 */
+	super->dataoff = cpu_to_le64(arena->dataoff - arena->infooff);
+	super->mapoff = cpu_to_le64(arena->mapoff - arena->infooff);
+	super->logoff = cpu_to_le64(arena->logoff - arena->infooff);
+	super->info2off = cpu_to_le64(arena->info2off - arena->infooff);
+
+	super->flags = 0;
+	super->checksum = cpu_to_le64(nd_btt_sb_checksum(super));
+
+	ret = btt_info_write(arena, super);
+
+	kfree(super);
+	return ret;
+}
+
+/*
+ * This function completes the initialization for the BTT namespace
+ * such that it is ready to accept IOs
+ */
+static int btt_meta_init(struct btt *btt)
+{
+	int ret = 0;
+	struct arena_info *arena;
+
+	mutex_lock(&btt->init_lock);
+	list_for_each_entry(arena, &btt->arena_list, list) {
+		ret = btt_arena_write_layout(arena, btt->nd_btt->uuid);
+		if (ret)
+			goto unlock;
+
+		ret = btt_freelist_init(arena);
+		if (ret)
+			goto unlock;
+
+		ret = btt_rtt_init(arena);
+		if (ret)
+			goto unlock;
+
+		ret = btt_maplocks_init(arena);
+		if (ret)
+			goto unlock;
+	}
+
+	btt->init_state = INIT_READY;
+
+ unlock:
+	mutex_unlock(&btt->init_lock);
+	return ret;
+}
+
+/*
+ * This function calculates the arena in which the given LBA lies
+ * by doing a linear walk. This is acceptable since we expect only
+ * a few arenas. If we have backing devices that get much larger,
+ * we can construct a balanced binary tree of arenas at init time
+ * so that this range search becomes faster.
+ */
+static int lba_to_arena(struct btt *btt, sector_t sector, __u32 *premap,
+				struct arena_info **arena)
+{
+	struct arena_info *arena_list;
+	__u64 lba = div_u64(sector << SECTOR_SHIFT, btt->sector_size);
+
+	list_for_each_entry(arena_list, &btt->arena_list, list) {
+		if (lba < arena_list->external_nlba) {
+			*arena = arena_list;
+			*premap = lba;
+			return 0;
+		}
+		lba -= arena_list->external_nlba;
+	}
+
+	return -EIO;
+}
+
+/*
+ * The following (lock_map, unlock_map) are mostly just to improve
+ * readability, since they index into an array of locks
+ */
+static void lock_map(struct arena_info *arena, u32 premap)
+		__acquires(&arena->map_locks[idx].lock)
+{
+	u32 idx = (premap * MAP_ENT_SIZE / L1_CACHE_BYTES) % arena->nfree;
+
+	spin_lock(&arena->map_locks[idx].lock);
+}
+
+static void unlock_map(struct arena_info *arena, u32 premap)
+		__releases(&arena->map_locks[idx].lock)
+{
+	u32 idx = (premap * MAP_ENT_SIZE / L1_CACHE_BYTES) % arena->nfree;
+
+	spin_unlock(&arena->map_locks[idx].lock);
+}
+
+static u64 to_namespace_offset(struct arena_info *arena, u64 lba)
+{
+	return arena->dataoff + ((u64)lba * arena->internal_lbasize);
+}
+
+static int btt_data_read(struct arena_info *arena, struct page *page,
+			unsigned int off, u32 lba, u32 len)
+{
+	int ret;
+	u64 nsoff = to_namespace_offset(arena, lba);
+	void *mem = kmap_atomic(page);
+
+	ret = arena_read_bytes(arena, nsoff, mem + off, len);
+	kunmap_atomic(mem);
+
+	return ret;
+}
+
+static int btt_data_write(struct arena_info *arena, u32 lba,
+			struct page *page, unsigned int off, u32 len)
+{
+	int ret;
+	u64 nsoff = to_namespace_offset(arena, lba);
+	void *mem = kmap_atomic(page);
+
+	ret = arena_write_bytes(arena, nsoff, mem + off, len);
+	kunmap_atomic(mem);
+
+	return ret;
+}
+
+static void zero_fill_data(struct page *page, unsigned int off, u32 len)
+{
+	void *mem = kmap_atomic(page);
+
+	memset(mem + off, 0, len);
+	kunmap_atomic(mem);
+}
+
+static int btt_read_pg(struct btt *btt, struct page *page, unsigned int off,
+			sector_t sector, unsigned int len)
+{
+	int ret = 0;
+	int t_flag, e_flag;
+	struct arena_info *arena = NULL;
+	u32 lane = 0, premap, postmap;
+
+	while (len) {
+		u32 cur_len;
+
+		lane = nd_region_acquire_lane(btt->nd_region);
+
+		ret = lba_to_arena(btt, sector, &premap, &arena);
+		if (ret)
+			goto out_lane;
+
+		cur_len = min(btt->sector_size, len);
+
+		ret = btt_map_read(arena, premap, &postmap, &t_flag, &e_flag);
+		if (ret)
+			goto out_lane;
+
+		/*
+		 * We loop to make sure that the post map LBA didn't change
+		 * from under us between writing the RTT and doing the actual
+		 * read.
+		 */
+		while (1) {
+			u32 new_map;
+
+			if (t_flag) {
+				zero_fill_data(page, off, cur_len);
+				goto out_lane;
+			}
+
+			if (e_flag) {
+				ret = -EIO;
+				goto out_lane;
+			}
+
+			arena->rtt[lane] = RTT_VALID | postmap;
+			/*
+			 * Barrier to make sure this write is not reordered
+			 * to do the verification map_read before the RTT store
+			 */
+			barrier();
+
+			ret = btt_map_read(arena, premap, &new_map, &t_flag,
+						&e_flag);
+			if (ret)
+				goto out_rtt;
+
+			if (postmap == new_map)
+				break;
+
+			postmap = new_map;
+		}
+
+		ret = btt_data_read(arena, page, off, postmap, cur_len);
+		if (ret)
+			goto out_rtt;
+
+		arena->rtt[lane] = RTT_INVALID;
+		nd_region_release_lane(btt->nd_region, lane);
+
+		len -= cur_len;
+		off += cur_len;
+		sector += btt->sector_size >> SECTOR_SHIFT;
+	}
+
+	return 0;
+
+ out_rtt:
+	arena->rtt[lane] = RTT_INVALID;
+ out_lane:
+	nd_region_release_lane(btt->nd_region, lane);
+	return ret;
+}
+
+static int btt_write_pg(struct btt *btt, sector_t sector, struct page *page,
+		unsigned int off, unsigned int len)
+{
+	int ret = 0;
+	struct arena_info *arena = NULL;
+	u32 premap = 0, old_postmap, new_postmap, lane = 0, i;
+	struct log_entry log;
+	int sub;
+
+	while (len) {
+		u32 cur_len;
+
+		lane = nd_region_acquire_lane(btt->nd_region);
+
+		ret = lba_to_arena(btt, sector, &premap, &arena);
+		if (ret)
+			goto out_lane;
+		cur_len = min(btt->sector_size, len);
+
+		if ((arena->flags & IB_FLAG_ERROR_MASK) != 0) {
+			ret = -EIO;
+			goto out_lane;
+		}
+
+		new_postmap = arena->freelist[lane].block;
+
+		/* Wait if the new block is being read from */
+		for (i = 0; i < arena->nfree; i++)
+			while (arena->rtt[i] == (RTT_VALID | new_postmap))
+				cpu_relax();
+
+
+		if (new_postmap >= arena->internal_nlba) {
+			ret = -EIO;
+			goto out_lane;
+		}
+		ret = btt_data_write(arena, new_postmap, page,
+					off, cur_len);
+		if (ret)
+			goto out_lane;
+
+		lock_map(arena, premap);
+		ret = btt_map_read(arena, premap, &old_postmap, NULL, NULL);
+		if (ret)
+			goto out_map;
+		if (old_postmap >= arena->internal_nlba) {
+			ret = -EIO;
+			goto out_map;
+		}
+
+		log.lba = cpu_to_le32(premap);
+		log.old_map = cpu_to_le32(old_postmap);
+		log.new_map = cpu_to_le32(new_postmap);
+		log.seq = cpu_to_le32(arena->freelist[lane].seq);
+		sub = arena->freelist[lane].sub;
+		ret = btt_flog_write(arena, lane, sub, &log);
+		if (ret)
+			goto out_map;
+
+		ret = btt_map_write(arena, premap, new_postmap, 0, 0);
+		if (ret)
+			goto out_map;
+
+		unlock_map(arena, premap);
+		nd_region_release_lane(btt->nd_region, lane);
+
+		len -= cur_len;
+		off += cur_len;
+		sector += btt->sector_size >> SECTOR_SHIFT;
+	}
+
+	return 0;
+
+ out_map:
+	unlock_map(arena, premap);
+ out_lane:
+	nd_region_release_lane(btt->nd_region, lane);
+	return ret;
+}
+
+static int btt_do_bvec(struct btt *btt, struct page *page,
+			unsigned int len, unsigned int off, int rw,
+			sector_t sector)
+{
+	int ret;
+
+	if (rw == READ) {
+		ret = btt_read_pg(btt, page, off, sector, len);
+		flush_dcache_page(page);
+	} else {
+		flush_dcache_page(page);
+		ret = btt_write_pg(btt, sector, page, off, len);
+	}
+
+	return ret;
+}
+
+static void btt_make_request(struct request_queue *q, struct bio *bio)
+{
+	struct block_device *bdev = bio->bi_bdev;
+	struct btt *btt = q->queuedata;
+	int rw;
+	struct bio_vec bvec;
+	sector_t sector;
+	struct bvec_iter iter;
+	int err = 0;
+
+	sector = bio->bi_iter.bi_sector;
+	if (bio_end_sector(bio) > get_capacity(bdev->bd_disk)) {
+		err = -EIO;
+		goto out;
+	}
+
+	BUG_ON(bio->bi_rw & REQ_DISCARD);
+
+	rw = bio_rw(bio);
+	if (rw == READA)
+		rw = READ;
+
+	bio_for_each_segment(bvec, bio, iter) {
+		unsigned int len = bvec.bv_len;
+
+		BUG_ON(len > PAGE_SIZE);
+		/* Make sure len is in multiples of sector size. */
+		/* XXX is this right? */
+		BUG_ON(len < btt->sector_size);
+		BUG_ON(len % btt->sector_size);
+
+		err = btt_do_bvec(btt, bvec.bv_page, len, bvec.bv_offset,
+				rw, sector);
+		if (err) {
+			dev_info(&btt->nd_btt->dev,
+					"io error in %s sector %llu, len %u\n",
+					(rw == READ) ? "READ" : "WRITE",
+					(unsigned long long) sector, len);
+			goto out;
+		}
+		sector += len >> SECTOR_SHIFT;
+	}
+
+out:
+	bio_endio(bio, err);
+}
+
+static int btt_rw_page(struct block_device *bdev, sector_t sector,
+		struct page *page, int rw)
+{
+	struct btt *btt = bdev->bd_disk->private_data;
+
+	btt_do_bvec(btt, page, PAGE_CACHE_SIZE, 0, rw, sector);
+	page_endio(page, rw & WRITE, 0);
+	return 0;
+}
+
+
+static int btt_getgeo(struct block_device *bd, struct hd_geometry *geo)
+{
+	/* some standard values */
+	geo->heads = 1 << 6;
+	geo->sectors = 1 << 5;
+	geo->cylinders = get_capacity(bd->bd_disk) >> 11;
+	return 0;
+}
+
+static const struct block_device_operations btt_fops = {
+	.owner =		THIS_MODULE,
+	.rw_page =		btt_rw_page,
+	.getgeo =		btt_getgeo,
+};
+
+static int btt_blk_init(struct btt *btt)
+{
+	struct nd_btt *nd_btt = btt->nd_btt;
+	char name[BDEVNAME_SIZE];
+	int ret;
+
+	/* create a new disk and request queue for btt */
+	btt->btt_queue = blk_alloc_queue(GFP_KERNEL);
+	if (!btt->btt_queue)
+		return -ENOMEM;
+
+	btt->btt_disk = alloc_disk(0);
+	if (!btt->btt_disk) {
+		ret = -ENOMEM;
+		goto out_free_queue;
+	}
+
+	sprintf(btt->btt_disk->disk_name, "%ss",
+			bdevname(nd_btt->backing_dev, name));
+	btt->btt_disk->driverfs_dev = &btt->nd_btt->dev;
+	btt->btt_disk->major = btt_major;
+	btt->btt_disk->first_minor = 0;
+	btt->btt_disk->fops = &btt_fops;
+	btt->btt_disk->private_data = btt;
+	btt->btt_disk->queue = btt->btt_queue;
+	btt->btt_disk->flags = GENHD_FL_EXT_DEVT;
+
+	blk_queue_make_request(btt->btt_queue, btt_make_request);
+	blk_queue_max_hw_sectors(btt->btt_queue, 1024);
+	blk_queue_bounce_limit(btt->btt_queue, BLK_BOUNCE_ANY);
+	blk_queue_logical_block_size(btt->btt_queue, btt->sector_size);
+	btt->btt_queue->queuedata = btt;
+
+	set_capacity(btt->btt_disk,
+			btt->nlba * btt->sector_size >> SECTOR_SHIFT);
+	add_disk(btt->btt_disk);
+
+	return 0;
+
+out_free_queue:
+	blk_cleanup_queue(btt->btt_queue);
+	return ret;
+}
+
+static void btt_blk_cleanup(struct btt *btt)
+{
+	del_gendisk(btt->btt_disk);
+	put_disk(btt->btt_disk);
+	blk_cleanup_queue(btt->btt_queue);
+}
+
+/**
+ * btt_init - initialize a block translation table for the given device
+ * @nd_btt:	device with BTT geometry and backing device info
+ * @rawsize:	raw size in bytes of the backing device
+ * @lbasize:	lba size of the backing device
+ * @uuid:	A uuid for the backing device - this is stored on media
+ * @nd_region:	parent region, used to acquire per-request lanes
+ *
+ * Initialize a Block Translation Table on a backing device to provide
+ * single sector power fail atomicity.
+ *
+ * Context:
+ * Might sleep.
+ *
+ * Returns:
+ * Pointer to a new struct btt on success, NULL on failure.
+ */
+static struct btt *btt_init(struct nd_btt *nd_btt, unsigned long long rawsize,
+		u32 lbasize, u8 *uuid, struct nd_region *nd_region)
+{
+	int ret;
+	struct btt *btt;
+	struct device *dev = &nd_btt->dev;
+
+	btt = kzalloc(sizeof(struct btt), GFP_KERNEL);
+	if (!btt)
+		return NULL;
+
+	btt->nd_btt = nd_btt;
+	btt->rawsize = rawsize;
+	btt->lbasize = lbasize;
+	btt->sector_size = ((lbasize >= 4096) ? 4096 : 512);
+	INIT_LIST_HEAD(&btt->arena_list);
+	mutex_init(&btt->init_lock);
+	btt->nd_region = nd_region;
+
+	ret = discover_arenas(btt);
+	if (ret) {
+		dev_err(dev, "init: error in discover_arenas: %d\n", ret);
+		goto out_free;
+	}
+
+	if (btt->init_state != INIT_READY) {
+		btt->num_arenas = (rawsize / ARENA_MAX_SIZE) +
+			((rawsize % ARENA_MAX_SIZE) ? 1 : 0);
+		dev_dbg(dev, "init: %d arenas for %llu rawsize\n",
+				btt->num_arenas, rawsize);
+
+		ret = create_arenas(btt);
+		if (ret) {
+			dev_info(dev, "init: create_arenas: %d\n", ret);
+			goto out_free;
+		}
+
+		ret = btt_meta_init(btt);
+		if (ret) {
+			dev_err(dev, "init: error in meta_init: %d\n", ret);
+			goto out_free;
+		}
+	}
+
+	ret = btt_blk_init(btt);
+	if (ret) {
+		dev_err(dev, "init: error in blk_init: %d\n", ret);
+		goto out_free;
+	}
+
+	btt_debugfs_init(btt);
+
+	return btt;
+
+ out_free:
+	kfree(btt);
+	return NULL;
+}
+
+/**
+ * btt_fini - de-initialize a BTT
+ * @btt:	the BTT handle that was generated by btt_init
+ *
+ * De-initialize a Block Translation Table on device removal
+ *
+ * Context:
+ * Might sleep.
+ */
+static void btt_fini(struct btt *btt)
+{
+	if (btt) {
+		btt_blk_cleanup(btt);
+		free_arenas(btt);
+		debugfs_remove_recursive(btt->debugfs_dir);
+		kfree(btt);
+	}
+}
+
+static int link_btt(struct nd_btt *nd_btt)
+{
+	struct block_device *bdev = nd_btt->backing_dev;
+	struct kobject *dir = &part_to_dev(bdev->bd_part)->kobj;
+
+	return sysfs_create_link(dir, &nd_btt->dev.kobj, "nd_btt");
+}
+
+static void unlink_btt(struct nd_btt *nd_btt)
+{
+	struct block_device *bdev = nd_btt->backing_dev;
+	struct kobject *dir;
+
+	/* if backing_dev was deleted first we may have nothing to unlink */
+	if (!nd_btt->backing_dev)
+		return;
+
+	dir = &part_to_dev(bdev->bd_part)->kobj;
+	sysfs_remove_link(dir, "nd_btt");
+}
+
+static struct nd_region *nd_btt_to_region(struct nd_btt *nd_btt)
+{
+	struct block_device *bdev = nd_btt->backing_dev;
+	struct device *disk_dev = disk_to_dev(bdev->bd_disk);
+	struct device *namespace_dev = disk_dev->parent;
+
+	return to_nd_region(namespace_dev->parent);
+}
+
+static int nd_btt_probe(struct device *dev)
+{
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+	struct nd_region *nd_region;
+	struct block_device *bdev;
+	struct btt *btt;
+	size_t rawsize;
+	int rc;
+
+	if (!nd_btt->uuid || !nd_btt->backing_dev || !nd_btt->lbasize)
+		return -ENODEV;
+
+	rc = link_btt(nd_btt);
+	if (rc)
+		return rc;
+
+	bdev = nd_btt->backing_dev;
+	sync_blockdev(bdev);
+	invalidate_bdev(bdev);
+	rawsize = (bdev->bd_part->nr_sects << SECTOR_SHIFT) - SZ_4K;
+	if (rawsize < ARENA_MIN_SIZE) {
+		rc = -ENXIO;
+		goto err_btt;
+	}
+	nd_region = nd_btt_to_region(nd_btt);
+	btt = btt_init(nd_btt, rawsize, nd_btt->lbasize, nd_btt->uuid,
+			nd_region);
+	if (!btt) {
+		rc = -ENOMEM;
+		goto err_btt;
+	}
+	dev_set_drvdata(dev, btt);
+
+	return 0;
+ err_btt:
+	unlink_btt(nd_btt);
+	return rc;
+}
+
+static int nd_btt_remove(struct device *dev)
+{
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+	struct btt *btt = dev_get_drvdata(dev);
+
+	btt_fini(btt);
+	unlink_btt(nd_btt);
+
+	return 0;
+}
+
+static struct nd_device_driver nd_btt_driver = {
+	.probe = nd_btt_probe,
+	.remove = nd_btt_remove,
+	.drv = {
+		.name = "nd_btt",
+	},
+	.type = ND_DRIVER_BTT,
+};
+
+static int __init nd_btt_init(void)
+{
+	int rc;
+
+	BUILD_BUG_ON(sizeof(struct btt_sb) != SZ_4K);
+
+	btt_major = register_blkdev(0, "btt");
+	if (btt_major < 0)
+		return btt_major;
+
+	debugfs_root = debugfs_create_dir("btt", NULL);
+	if (IS_ERR_OR_NULL(debugfs_root)) {
+		rc = -ENXIO;
+		goto err_debugfs;
+	}
+
+	rc = nd_driver_register(&nd_btt_driver);
+	if (rc < 0)
+		goto err_driver;
+	return 0;
+
+ err_driver:
+	debugfs_remove_recursive(debugfs_root);
+ err_debugfs:
+	unregister_blkdev(btt_major, "btt");
+
+	return rc;
+}
+
+static void __exit nd_btt_exit(void)
+{
+	driver_unregister(&nd_btt_driver.drv);
+	debugfs_remove_recursive(debugfs_root);
+	unregister_blkdev(btt_major, "btt");
+}
+
+MODULE_ALIAS_ND_DEVICE(ND_DEVICE_BTT);
+MODULE_AUTHOR("Vishal Verma <vishal.l.verma@linux.intel.com>");
+MODULE_LICENSE("GPL v2");
+module_init(nd_btt_init);
+module_exit(nd_btt_exit);
diff --git a/drivers/nvdimm/btt.h b/drivers/nvdimm/btt.h
index e8f6d8e0ddd3..8c95a7792c3e 100644
--- a/drivers/nvdimm/btt.h
+++ b/drivers/nvdimm/btt.h
@@ -19,6 +19,39 @@
 
 #define BTT_SIG_LEN 16
 #define BTT_SIG "BTT_ARENA_INFO\0"
+#define MAP_ENT_SIZE 4
+#define MAP_TRIM_SHIFT 31
+#define MAP_TRIM_MASK (1 << MAP_TRIM_SHIFT)
+#define MAP_ERR_SHIFT 30
+#define MAP_ERR_MASK (1 << MAP_ERR_SHIFT)
+#define MAP_LBA_MASK (~((1 << MAP_TRIM_SHIFT) | (1 << MAP_ERR_SHIFT)))
+#define MAP_ENT_NORMAL 0xC0000000
+#define LOG_ENT_SIZE sizeof(struct log_entry)
+#define ARENA_MIN_SIZE (1UL << 24)	/* 16 MB */
+#define ARENA_MAX_SIZE (1ULL << 39)	/* 512 GB */
+#define RTT_VALID (1UL << 31)
+#define RTT_INVALID 0
+#define INT_LBASIZE_ALIGNMENT 256
+#define BTT_PG_SIZE 4096
+#define BTT_DEFAULT_NFREE ND_MAX_LANES
+#define LOG_SEQ_INIT 1
+
+#define IB_FLAG_ERROR 0x00000001
+#define IB_FLAG_ERROR_MASK 0x00000001
+
+enum btt_init_state {
+	INIT_UNCHECKED = 0,
+	INIT_NOTFOUND,
+	INIT_READY
+};
+
+struct log_entry {
+	__le32 lba;
+	__le32 old_map;
+	__le32 new_map;
+	__le32 seq;
+	__le64 padding[2];
+};
 
 struct btt_sb {
 	u8 signature[BTT_SIG_LEN];
@@ -42,4 +75,112 @@ struct btt_sb {
 	__le64 checksum;
 };
 
+struct free_entry {
+	u32 block;
+	u8 sub;
+	u8 seq;
+};
+
+struct aligned_lock {
+	union {
+		spinlock_t lock;
+		u8 cacheline_padding[L1_CACHE_BYTES];
+	};
+};
+
+/**
+ * struct arena_info - handle for an arena
+ * @size:		Size in bytes this arena occupies on the raw device.
+ *			This includes arena metadata.
+ * @external_lba_start:	The first external LBA in this arena.
+ * @internal_nlba:	Number of internal blocks available in the arena
+ *			including nfree reserved blocks
+ * @internal_lbasize:	Internal and external lba sizes may be different as
+ *			we can round up 'odd' external lbasizes such as 520B
+ *			to be aligned.
+ * @external_nlba:	Number of blocks contributed by the arena to the number
+ *			reported to upper layers. (internal_nlba - nfree)
+ * @external_lbasize:	LBA size as exposed to upper layers.
+ * @nfree:		A reserve number of 'free' blocks that is used to
+ *			handle incoming writes.
+ * @version_major:	Metadata layout version major.
+ * @version_minor:	Metadata layout version minor.
+ * @nextoff:		Offset in bytes to the start of the next arena.
+ * @infooff:		Offset in bytes to the info block of this arena.
+ * @dataoff:		Offset in bytes to the data area of this arena.
+ * @mapoff:		Offset in bytes to the map area of this arena.
+ * @logoff:		Offset in bytes to the log area of this arena.
+ * @info2off:		Offset in bytes to the backup info block of this arena.
+ * @freelist:		Pointer to in-memory list of free blocks
+ * @rtt:		Pointer to in-memory "Read Tracking Table"
+ * @map_locks:		Spinlocks protecting concurrent map writes
+ * @nd_btt:		Pointer to parent nd_btt structure.
+ * @list:		List head for list of arenas
+ * @debugfs_dir:	Debugfs dentry
+ * @flags:		Arena flags - may signify error states.
+ *
+ * arena_info is a per-arena handle. Once an arena is narrowed down for an
+ * IO, this struct is passed around for the duration of the IO.
+ */
+struct arena_info {
+	u64 size;			/* Total bytes for this arena */
+	u64 external_lba_start;
+	u32 internal_nlba;
+	u32 internal_lbasize;
+	u32 external_nlba;
+	u32 external_lbasize;
+	u32 nfree;
+	u16 version_major;
+	u16 version_minor;
+	/* Byte offsets to the different on-media structures */
+	u64 nextoff;
+	u64 infooff;
+	u64 dataoff;
+	u64 mapoff;
+	u64 logoff;
+	u64 info2off;
+	/* Pointers to other in-memory structures for this arena */
+	struct free_entry *freelist;
+	u32 *rtt;
+	struct aligned_lock *map_locks;
+	struct nd_btt *nd_btt;
+	struct list_head list;
+	struct dentry *debugfs_dir;
+	/* Arena flags */
+	u32 flags;
+};
+
+/**
+ * struct btt - handle for a BTT instance
+ * @btt_disk:		Pointer to the gendisk for BTT device
+ * @btt_queue:		Pointer to the request queue for the BTT device
+ * @arena_list:		Head of the list of arenas
+ * @debugfs_dir:	Debugfs dentry
+ * @nd_btt:		Parent nd_btt struct
+ * @nlba:		Number of logical blocks exposed to the upper layers
+ *			after removing the amount of space needed by metadata
+ * @rawsize:		Total size in bytes of the available backing device
+ * @lbasize:		LBA size as requested and presented to upper layers.
+ *			This is sector_size + size of any metadata.
+ * @sector_size:	The Linux sector size - 512 or 4096
+ * @nd_region:		Parent region, used to acquire and release lanes
+ * @init_lock:		Mutex used for the BTT initialization
+ * @init_state:		Flag describing the initialization state for the BTT
+ * @num_arenas:		Number of arenas in the BTT instance
+ */
+struct btt {
+	struct gendisk *btt_disk;
+	struct request_queue *btt_queue;
+	struct list_head arena_list;
+	struct dentry *debugfs_dir;
+	struct nd_btt *nd_btt;
+	u64 nlba;
+	unsigned long long rawsize;
+	u32 lbasize;
+	u32 sector_size;
+	struct nd_region *nd_region;
+	struct mutex init_lock;
+	int init_state;
+	int num_arenas;
+};
 #endif
diff --git a/drivers/nvdimm/btt_devs.c b/drivers/nvdimm/btt_devs.c
index 2148fd8f535b..c03c854f892b 100644
--- a/drivers/nvdimm/btt_devs.c
+++ b/drivers/nvdimm/btt_devs.c
@@ -351,7 +351,8 @@ struct nd_btt *nd_btt_create(struct nvdimm_bus *nvdimm_bus)
  */
 u64 nd_btt_sb_checksum(struct btt_sb *btt_sb)
 {
-	u64 sum, sum_save;
+	u64 sum;
+	__le64 sum_save;
 
 	sum_save = btt_sb->checksum;
 	btt_sb->checksum = 0;
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index 3bd8d650340e..b876b839f49a 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -22,6 +22,12 @@
 #include "label.h"
 
 enum {
+	/*
+	 * Limits the maximum number of block apertures a dimm can
+	 * support and is an input to the geometry/on-disk-format of a
+	 * BTT instance
+	 */
+	ND_MAX_LANES = 256,
 	SECTOR_SHIFT = 9,
 };
 
@@ -84,7 +90,7 @@ struct nd_region {
 	u16 ndr_mappings;
 	u64 ndr_size;
 	u64 ndr_start;
-	int id;
+	int id, num_lanes;
 	void *provider_data;
 	struct nd_interleave_set *nd_set;
 	struct nd_mapping mapping[0];
@@ -136,6 +142,8 @@ struct nd_btt *to_nd_btt(struct device *dev);
 struct btt_sb;
 u64 nd_btt_sb_checksum(struct btt_sb *btt_sb);
 struct nd_region *to_nd_region(struct device *dev);
+unsigned int nd_region_acquire_lane(struct nd_region *nd_region);
+void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane);
 int nd_region_to_nstype(struct nd_region *nd_region);
 int nd_region_register_namespaces(struct nd_region *nd_region, int *err);
 u64 nd_region_interleave_set_cookie(struct nd_region *nd_region);
diff --git a/drivers/nvdimm/region.c b/drivers/nvdimm/region.c
index 9aba44e483e0..aa617bf86506 100644
--- a/drivers/nvdimm/region.c
+++ b/drivers/nvdimm/region.c
@@ -10,18 +10,106 @@
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  * General Public License for more details.
  */
+#include <linux/cpumask.h>
 #include <linux/module.h>
 #include <linux/device.h>
 #include <linux/nd.h>
 #include "nd.h"
 
+struct nd_percpu_lane {
+	int count[CONFIG_ND_MAX_REGIONS];
+	spinlock_t lock[CONFIG_ND_MAX_REGIONS];
+};
+
+static DEFINE_PER_CPU(struct nd_percpu_lane, nd_percpu_lane);
+
+static void __init nd_region_init_locks(void)
+{
+	unsigned int i, j;
+
+	for (i = 0; i < nr_cpu_ids; i++)
+		for (j = 0; j < CONFIG_ND_MAX_REGIONS; j++) {
+			struct nd_percpu_lane *ndl;
+
+			ndl = per_cpu_ptr(&nd_percpu_lane, i);
+			spin_lock_init(&ndl->lock[j]);
+			ndl->count[j] = 0;
+		}
+}
+
+/**
+ * nd_region_acquire_lane - allocate and lock a lane
+ * @nd_region: region id and number of lanes possible
+ *
+ * A lane correlates to a BLK-data-window and/or a log slot in the BTT.
+ * We optimize for the common case where there are 256 lanes, one
+ * per-cpu.  For larger systems we need to lock to share lanes.  For now
+ * this implementation assumes the cost of maintaining an allocator for
+ * free lanes is on the order of the lock hold time, so it implements a
+ * static lane = cpu % num_lanes mapping.
+ *
+ * In the case of a BTT instance on top of a BLK namespace a lane may be
+ * acquired recursively.  We lock on the first instance.
+ *
+ * In the case of a BTT instance on top of PMEM, we only acquire a lane
+ * for the BTT metadata updates.
+ */
+unsigned int nd_region_acquire_lane(struct nd_region *nd_region)
+{
+	unsigned int cpu, lane;
+
+	cpu = get_cpu();
+	if (nd_region->num_lanes < nr_cpu_ids) {
+		struct nd_percpu_lane *ndl_lock, *ndl_count;
+		unsigned int id = nd_region->id;
+
+		lane = cpu % nd_region->num_lanes;
+		ndl_count = per_cpu_ptr(&nd_percpu_lane, cpu);
+		ndl_lock = per_cpu_ptr(&nd_percpu_lane, lane);
+		if (ndl_count->count[id]++ == 0)
+			spin_lock(&ndl_lock->lock[id]);
+	} else
+		lane = cpu;
+
+	return lane;
+}
+EXPORT_SYMBOL(nd_region_acquire_lane);
+
+void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane)
+{
+	if (nd_region->num_lanes < nr_cpu_ids) {
+		unsigned int cpu = get_cpu();
+		unsigned int id = nd_region->id;
+		struct nd_percpu_lane *ndl_lock, *ndl_count;
+
+		ndl_count = per_cpu_ptr(&nd_percpu_lane, cpu);
+		ndl_lock = per_cpu_ptr(&nd_percpu_lane, lane);
+		if (--ndl_count->count[id] == 0)
+			spin_unlock(&ndl_lock->lock[id]);
+		put_cpu();
+	}
+	put_cpu();
+}
+EXPORT_SYMBOL(nd_region_release_lane);
+
 static int nd_region_probe(struct device *dev)
 {
 	int err;
+	static unsigned long once;
 	struct nd_region_namespaces *num_ns;
 	struct nd_region *nd_region = to_nd_region(dev);
 	int rc = nd_region_register_namespaces(nd_region, &err);
 
+	if (nd_region->num_lanes > num_online_cpus()
+			&& nd_region->num_lanes < num_possible_cpus()
+			&& !test_and_set_bit(0, &once)) {
+		dev_info(dev, "online cpus (%d) < concurrent i/o lanes (%d) < possible cpus (%d)\n",
+				num_online_cpus(), nd_region->num_lanes,
+				num_possible_cpus());
+		dev_info(dev, "setting nr_cpus=%d may yield better libnvdimm device performance\n",
+				nd_region->num_lanes);
+	}
+
 	num_ns = devm_kzalloc(dev, sizeof(*num_ns), GFP_KERNEL);
 	if (!num_ns)
 		return -ENOMEM;
@@ -84,6 +172,7 @@ static struct nd_device_driver nd_region_driver = {
 
 int __init nd_region_init(void)
 {
+	nd_region_init_locks();
 	return nd_driver_register(&nd_region_driver);
 }
 
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index ac21ce419beb..cfa68d6590d6 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -532,6 +532,12 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
 	if (nd_region->id < 0) {
 		kfree(nd_region);
 		return NULL;
+	} else if (nd_region->id >= CONFIG_ND_MAX_REGIONS) {
+		dev_err(&nvdimm_bus->dev, "max region limit %d reached\n",
+				CONFIG_ND_MAX_REGIONS);
+		ida_simple_remove(&region_ida, nd_region->id);
+		kfree(nd_region);
+		return NULL;
 	}
 
 	memcpy(nd_region->mapping, ndr_desc->nd_mapping,
@@ -545,6 +551,7 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
 	nd_region->ndr_mappings = ndr_desc->num_mappings;
 	nd_region->provider_data = ndr_desc->provider_data;
 	nd_region->nd_set = ndr_desc->nd_set;
+	nd_region->num_lanes = ndr_desc->num_lanes;
 	ida_init(&nd_region->ns_ida);
 	dev = &nd_region->dev;
 	dev_set_name(dev, "region%d", nd_region->id);
@@ -561,6 +568,7 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
 struct nd_region *nvdimm_pmem_region_create(struct nvdimm_bus *nvdimm_bus,
 		struct nd_region_desc *ndr_desc)
 {
+	ndr_desc->num_lanes = ND_MAX_LANES;
 	return nd_region_create(nvdimm_bus, ndr_desc, &nd_pmem_device_type,
 			__func__);
 }
@@ -571,6 +579,7 @@ struct nd_region *nvdimm_blk_region_create(struct nvdimm_bus *nvdimm_bus,
 {
 	if (ndr_desc->num_mappings > 1)
 		return NULL;
+	ndr_desc->num_lanes = min(ndr_desc->num_lanes, ND_MAX_LANES);
 	return nd_region_create(nvdimm_bus, ndr_desc, &nd_blk_device_type,
 			__func__);
 }
@@ -579,6 +588,7 @@ EXPORT_SYMBOL_GPL(nvdimm_blk_region_create);
 struct nd_region *nvdimm_volatile_region_create(struct nvdimm_bus *nvdimm_bus,
 		struct nd_region_desc *ndr_desc)
 {
+	ndr_desc->num_lanes = ND_MAX_LANES;
 	return nd_region_create(nvdimm_bus, ndr_desc, &nd_volatile_device_type,
 			__func__);
 }
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index a59dca17b3aa..531d99dfac68 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -85,6 +85,7 @@ struct nd_region_desc {
 	const struct attribute_group **attr_groups;
 	struct nd_interleave_set *nd_set;
 	void *provider_data;
+	int num_lanes;
 };
 
 struct nvdimm_bus;

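Aside, not part of the patch: the num_arenas expression in btt_init() is
a ceiling division of rawsize by ARENA_MAX_SIZE.  A minimal userspace
sketch, assuming the 512 GB ARENA_MAX_SIZE from btt.h; the helper name
btt_num_arenas() is hypothetical and does not exist in the driver:

```c
#include <assert.h>
#include <stdint.h>

#define ARENA_MAX_SIZE (1ULL << 39)	/* 512 GB, as defined in btt.h */

/* hypothetical helper: mirrors the num_arenas expression in btt_init() */
static unsigned int btt_num_arenas(uint64_t rawsize)
{
	assert(rawsize > 0);
	/* whole arenas, plus one partial arena if there is a remainder */
	return (unsigned int)(rawsize / ARENA_MAX_SIZE) +
		((rawsize % ARENA_MAX_SIZE) ? 1 : 0);
}
```

A backing device of exactly 512 GB yields one arena, and a single byte
more yields two.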

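Aside, not part of the patch: the static "lane = cpu % num_lanes" policy
described in the nd_region_acquire_lane() comment can be modeled in
userspace roughly as below.  This is only a sketch under assumptions:
SKETCH_MAX_CPUS stands in for nr_cpu_ids, a plain flag stands in for the
spinlock, the caller passes the cpu explicitly instead of get_cpu(), and
all sketch_* names are hypothetical.

```c
#include <assert.h>

#define SKETCH_MAX_CPUS 8	/* assumption: stand-in for nr_cpu_ids */

struct lane_state {
	int count[SKETCH_MAX_CPUS];	/* per-cpu recursion depth */
	int locked[SKETCH_MAX_CPUS];	/* per-lane "lock held" flag */
};

/* returns the lane for @cpu; takes the lane lock only on first entry */
static unsigned int sketch_acquire_lane(struct lane_state *s,
		unsigned int cpu, unsigned int num_lanes)
{
	unsigned int lane;

	assert(num_lanes > 0 && cpu < SKETCH_MAX_CPUS);
	if (num_lanes >= SKETCH_MAX_CPUS)
		return cpu;	/* one lane per cpu: no sharing, no lock */

	lane = cpu % num_lanes;
	if (s->count[cpu]++ == 0)
		s->locked[lane] = 1;	/* spin_lock() in the kernel */
	return lane;
}

/* drops the lane lock only when the outermost acquire is released */
static void sketch_release_lane(struct lane_state *s, unsigned int cpu,
		unsigned int lane, unsigned int num_lanes)
{
	if (num_lanes >= SKETCH_MAX_CPUS)
		return;
	if (--s->count[cpu] == 0)
		s->locked[lane] = 0;	/* spin_unlock() in the kernel */
}
```

The per-cpu count is what lets a BTT on top of a BLK namespace acquire
the same lane twice without deadlocking on the second acquire.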

* [PATCH 04/15] libnvdimm, nfit, nd_blk: driver for BLK-mode access persistent memory
  2015-06-17 23:54 ` Dan Williams
@ 2015-06-17 23:55   ` Dan Williams
  -1 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-17 23:55 UTC (permalink / raw)
  To: axboe, linux-nvdimm
  Cc: boaz, toshi.kani, Rafael J. Wysocki, linux-kernel,
	Andy Lutomirski, Jens Axboe, linux-acpi, H. Peter Anvin,
	linux-fsdevel, Ross Zwisler, hch, mingo

From: Ross Zwisler <ross.zwisler@linux.intel.com>

The libnvdimm implementation handles allocating dimm address space (DPA)
between PMEM and BLK mode interfaces.  After DPA has been allocated from
a BLK-region to a BLK-namespace, the nd_blk driver attaches to handle I/O
as a struct bio based block device.  Unlike PMEM, BLK is required to
handle platform-specific details like mmio register formats and memory
controller interleave.  For this reason the generic libnvdimm nd_blk
driver calls back into the bus provider to carry out the I/O.

This initial implementation handles the BLK interface defined by the
ACPI 6 NFIT [1] and the NVDIMM DSM Interface Example [2].  That interface
is composed from the DCR (dimm control region), BDW (block data window),
and IDT (interleave descriptor) NFIT structures and the hardware
register format.

[1]: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
[2]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/acpi/nfit.c             |  445 ++++++++++++++++++++++++++++++++++++++-
 drivers/acpi/nfit.h             |   49 ++++
 drivers/nvdimm/Kconfig          |   12 +
 drivers/nvdimm/Makefile         |    3 
 drivers/nvdimm/blk.c            |  241 +++++++++++++++++++++
 drivers/nvdimm/dimm_devs.c      |    9 +
 drivers/nvdimm/namespace_devs.c |   60 +++++
 drivers/nvdimm/nd-core.h        |    5 
 drivers/nvdimm/nd.h             |   13 +
 drivers/nvdimm/region.c         |    8 +
 drivers/nvdimm/region_devs.c    |   83 +++++++
 include/linux/libnvdimm.h       |   27 ++
 12 files changed, 923 insertions(+), 32 deletions(-)
 create mode 100644 drivers/nvdimm/blk.c
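
Note, not part of the patch: the block control word that write_blk_ctl()
builds packs a cache-line-granular DPA, a length, and a read/write bit
into one 64-bit value.  A minimal userspace sketch of that packing,
assuming 64-byte cache lines (SKETCH_CACHE_SHIFT stands in for
L1_CACHE_SHIFT) and using the hypothetical name sketch_bcw():

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_CACHE_SHIFT 6	/* assumption: 64-byte cache lines */

#define BCW_OFFSET_MASK ((1ULL << 48) - 1)	/* bits 0-47: dpa in lines */
#define BCW_LEN_SHIFT 48			/* bits 48-55: length in lines */
#define BCW_LEN_MASK ((1ULL << 8) - 1)
#define BCW_CMD_SHIFT 56			/* bit 56: 1 = write, 0 = read */

/* hypothetical helper: same field layout as the enum in write_blk_ctl() */
static uint64_t sketch_bcw(uint64_t dpa, unsigned int len, unsigned int write)
{
	uint64_t cmd;

	assert(write <= 1);
	cmd = (dpa >> SKETCH_CACHE_SHIFT) & BCW_OFFSET_MASK;
	cmd |= ((uint64_t)(len >> SKETCH_CACHE_SHIFT) & BCW_LEN_MASK)
		<< BCW_LEN_SHIFT;
	cmd |= (uint64_t)write << BCW_CMD_SHIFT;
	return cmd;
}
```

For a 4096-byte write at DPA 0x1000, both the offset and length fields
hold 0x40 cache lines.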

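Note, not part of the patch: the interleave translation performed by
to_interleave_offset() splits a linear offset into a table number, a line
within the table, and a byte within the line.  A minimal userspace
sketch follows; struct sketch_mmio and sketch_interleave_offset() are
hypothetical names, plain division replaces div_u64_rem(), and the
line-offset product is widened to 64 bits here, whereas the patch keeps
it in a u32.

```c
#include <assert.h>
#include <stdint.h>

struct sketch_mmio {
	uint64_t base_offset;
	uint32_t line_size;
	uint32_t num_lines;
	uint64_t table_size;	/* num_lines * interleave_ways * line_size */
	const uint32_t *line_offset;	/* stand-in for idt->line_offset[] */
};

static uint64_t sketch_interleave_offset(uint64_t offset,
		const struct sketch_mmio *mmio)
{
	uint64_t line_no, table_skip_count;
	uint32_t sub_line_offset, line_index;

	assert(mmio->line_size && mmio->num_lines);
	line_no = offset / mmio->line_size;		/* which line overall */
	sub_line_offset = offset % mmio->line_size;	/* byte within line */
	table_skip_count = line_no / mmio->num_lines;	/* full tables passed */
	line_index = line_no % mmio->num_lines;		/* line in this table */

	return mmio->base_offset
		+ (uint64_t)mmio->line_offset[line_index] * mmio->line_size
		+ table_skip_count * mmio->table_size
		+ sub_line_offset;
}
```

With two 256-byte lines per table, 2-way interleave, and line offsets
{0, 2}, linear offset 300 lands at 2 * 256 + 44 = 556.
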
diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index fc38b49eff7d..3a77709fd394 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -13,12 +13,16 @@
 #include <linux/list_sort.h>
 #include <linux/libnvdimm.h>
 #include <linux/module.h>
+#include <linux/mutex.h>
 #include <linux/ndctl.h>
 #include <linux/list.h>
 #include <linux/acpi.h>
 #include <linux/sort.h>
+#include <linux/io.h>
 #include "nfit.h"
 
+#include <asm-generic/io-64-nonatomic-hi-lo.h>
+
 static bool force_enable_dimms;
 module_param(force_enable_dimms, bool, S_IRUGO|S_IWUSR);
 MODULE_PARM_DESC(force_enable_dimms, "Ignore _STA (ACPI DIMM device) status");
@@ -72,7 +76,7 @@ static int acpi_nfit_ctl(struct nvdimm_bus_descriptor *nd_desc,
 
 		if (!adev)
 			return -ENOTTY;
-		dimm_name = dev_name(&adev->dev);
+		dimm_name = nvdimm_name(nvdimm);
 		cmd_name = nvdimm_cmd_name(cmd);
 		dsm_mask = nfit_mem->dsm_mask;
 		desc = nd_cmd_dimm_desc(cmd);
@@ -279,6 +283,23 @@ static bool add_bdw(struct acpi_nfit_desc *acpi_desc,
 	return true;
 }
 
+static bool add_idt(struct acpi_nfit_desc *acpi_desc,
+		struct acpi_nfit_interleave *idt)
+{
+	struct device *dev = acpi_desc->dev;
+	struct nfit_idt *nfit_idt = devm_kzalloc(dev, sizeof(*nfit_idt),
+			GFP_KERNEL);
+
+	if (!nfit_idt)
+		return false;
+	INIT_LIST_HEAD(&nfit_idt->list);
+	nfit_idt->idt = idt;
+	list_add_tail(&nfit_idt->list, &acpi_desc->idts);
+	dev_dbg(dev, "%s: idt index: %d num_lines: %d\n", __func__,
+			idt->interleave_index, idt->line_count);
+	return true;
+}
+
 static void *add_table(struct acpi_nfit_desc *acpi_desc, void *table,
 		const void *end)
 {
@@ -307,9 +328,9 @@ static void *add_table(struct acpi_nfit_desc *acpi_desc, void *table,
 		if (!add_bdw(acpi_desc, table))
 			return err;
 		break;
-	/* TODO */
 	case ACPI_NFIT_TYPE_INTERLEAVE:
-		dev_dbg(dev, "%s: idt\n", __func__);
+		if (!add_idt(acpi_desc, table))
+			return err;
 		break;
 	case ACPI_NFIT_TYPE_FLUSH_ADDRESS:
 		dev_dbg(dev, "%s: flush\n", __func__);
@@ -362,8 +383,11 @@ static int nfit_mem_add(struct acpi_nfit_desc *acpi_desc,
 		struct nfit_mem *nfit_mem, struct acpi_nfit_system_address *spa)
 {
 	u16 dcr = __to_nfit_memdev(nfit_mem)->region_index;
+	struct nfit_memdev *nfit_memdev;
 	struct nfit_dcr *nfit_dcr;
 	struct nfit_bdw *nfit_bdw;
+	struct nfit_idt *nfit_idt;
+	u16 idt_idx, range_index;
 
 	list_for_each_entry(nfit_dcr, &acpi_desc->dcrs, list) {
 		if (nfit_dcr->dcr->region_index != dcr)
@@ -396,6 +420,26 @@ static int nfit_mem_add(struct acpi_nfit_desc *acpi_desc,
 		return 0;
 
 	nfit_mem_find_spa_bdw(acpi_desc, nfit_mem);
+
+	if (!nfit_mem->spa_bdw)
+		return 0;
+
+	range_index = nfit_mem->spa_bdw->range_index;
+	list_for_each_entry(nfit_memdev, &acpi_desc->memdevs, list) {
+		if (nfit_memdev->memdev->range_index != range_index ||
+				nfit_memdev->memdev->region_index != dcr)
+			continue;
+		nfit_mem->memdev_bdw = nfit_memdev->memdev;
+		idt_idx = nfit_memdev->memdev->interleave_index;
+		list_for_each_entry(nfit_idt, &acpi_desc->idts, list) {
+			if (nfit_idt->idt->interleave_index != idt_idx)
+				continue;
+			nfit_mem->idt_bdw = nfit_idt->idt;
+			break;
+		}
+		break;
+	}
+
 	return 0;
 }
 
@@ -439,9 +483,19 @@ static int nfit_mem_dcr_init(struct acpi_nfit_desc *acpi_desc,
 		}
 
 		if (type == NFIT_SPA_DCR) {
+			struct nfit_idt *nfit_idt;
+			u16 idt_idx;
+
 			/* multiple dimms may share a SPA when interleaved */
 			nfit_mem->spa_dcr = spa;
 			nfit_mem->memdev_dcr = nfit_memdev->memdev;
+			idt_idx = nfit_memdev->memdev->interleave_index;
+			list_for_each_entry(nfit_idt, &acpi_desc->idts, list) {
+				if (nfit_idt->idt->interleave_index != idt_idx)
+					continue;
+				nfit_mem->idt_dcr = nfit_idt->idt;
+				break;
+			}
 		} else {
 			/*
 			 * A single dimm may belong to multiple SPA-PM
@@ -871,6 +925,359 @@ static int acpi_nfit_init_interleave_set(struct acpi_nfit_desc *acpi_desc,
 	return 0;
 }
 
+static u64 to_interleave_offset(u64 offset, struct nfit_blk_mmio *mmio)
+{
+	struct acpi_nfit_interleave *idt = mmio->idt;
+	u32 sub_line_offset, line_index, line_offset;
+	u64 line_no, table_skip_count, table_offset;
+
+	line_no = div_u64_rem(offset, mmio->line_size, &sub_line_offset);
+	table_skip_count = div_u64_rem(line_no, mmio->num_lines, &line_index);
+	line_offset = idt->line_offset[line_index]
+		* mmio->line_size;
+	table_offset = table_skip_count * mmio->table_size;
+
+	return mmio->base_offset + line_offset + table_offset + sub_line_offset;
+}
+
+static u64 read_blk_stat(struct nfit_blk *nfit_blk, unsigned int bw)
+{
+	struct nfit_blk_mmio *mmio = &nfit_blk->mmio[DCR];
+	u64 offset = nfit_blk->stat_offset + mmio->size * bw;
+
+	if (mmio->num_lines)
+		offset = to_interleave_offset(offset, mmio);
+
+	return readq(mmio->base + offset);
+}
+
+static void write_blk_ctl(struct nfit_blk *nfit_blk, unsigned int bw,
+		resource_size_t dpa, unsigned int len, unsigned int write)
+{
+	u64 cmd, offset;
+	struct nfit_blk_mmio *mmio = &nfit_blk->mmio[DCR];
+
+	enum {
+		BCW_OFFSET_MASK = (1ULL << 48)-1,
+		BCW_LEN_SHIFT = 48,
+		BCW_LEN_MASK = (1ULL << 8) - 1,
+		BCW_CMD_SHIFT = 56,
+	};
+
+	cmd = (dpa >> L1_CACHE_SHIFT) & BCW_OFFSET_MASK;
+	len = len >> L1_CACHE_SHIFT;
+	cmd |= ((u64) len & BCW_LEN_MASK) << BCW_LEN_SHIFT;
+	cmd |= ((u64) write) << BCW_CMD_SHIFT;
+
+	offset = nfit_blk->cmd_offset + mmio->size * bw;
+	if (mmio->num_lines)
+		offset = to_interleave_offset(offset, mmio);
+
+	writeq(cmd, mmio->base + offset);
+	/* FIXME: conditionally perform read-back if mandated by firmware */
+}
+
+static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
+		resource_size_t dpa, void *iobuf, size_t len, int rw,
+		unsigned int lane)
+{
+	struct nfit_blk_mmio *mmio = &nfit_blk->mmio[BDW];
+	unsigned int copied = 0;
+	u64 base_offset;
+	int rc;
+
+	base_offset = nfit_blk->bdw_offset + dpa % L1_CACHE_BYTES
+		+ lane * mmio->size;
+	/* TODO: non-temporal access, flush hints, cache management etc... */
+	write_blk_ctl(nfit_blk, lane, dpa, len, rw);
+	while (len) {
+		unsigned int c;
+		u64 offset;
+
+		if (mmio->num_lines) {
+			u32 line_offset;
+
+			offset = to_interleave_offset(base_offset + copied,
+					mmio);
+			div_u64_rem(offset, mmio->line_size, &line_offset);
+			c = min_t(size_t, len, mmio->line_size - line_offset);
+		} else {
+			offset = base_offset + nfit_blk->bdw_offset;
+			c = len;
+		}
+
+		if (rw)
+			memcpy(mmio->aperture + offset, iobuf + copied, c);
+		else
+			memcpy(iobuf + copied, mmio->aperture + offset, c);
+
+		copied += c;
+		len -= c;
+	}
+	rc = read_blk_stat(nfit_blk, lane) ? -EIO : 0;
+	return rc;
+}
+
+static int acpi_nfit_blk_region_do_io(struct nd_blk_region *ndbr,
+		resource_size_t dpa, void *iobuf, u64 len, int rw)
+{
+	struct nfit_blk *nfit_blk = nd_blk_region_provider_data(ndbr);
+	struct nfit_blk_mmio *mmio = &nfit_blk->mmio[BDW];
+	struct nd_region *nd_region = nfit_blk->nd_region;
+	unsigned int lane, copied = 0;
+	int rc = 0;
+
+	lane = nd_region_acquire_lane(nd_region);
+	while (len) {
+		u64 c = min(len, mmio->size);
+
+		rc = acpi_nfit_blk_single_io(nfit_blk, dpa + copied,
+				iobuf + copied, c, rw, lane);
+		if (rc)
+			break;
+
+		copied += c;
+		len -= c;
+	}
+	nd_region_release_lane(nd_region, lane);
+
+	return rc;
+}
+
+static void nfit_spa_mapping_release(struct kref *kref)
+{
+	struct nfit_spa_mapping *spa_map = to_spa_map(kref);
+	struct acpi_nfit_system_address *spa = spa_map->spa;
+	struct acpi_nfit_desc *acpi_desc = spa_map->acpi_desc;
+
+	WARN_ON(!mutex_is_locked(&acpi_desc->spa_map_mutex));
+	dev_dbg(acpi_desc->dev, "%s: SPA%d\n", __func__, spa->range_index);
+	iounmap(spa_map->iomem);
+	release_mem_region(spa->address, spa->length);
+	list_del(&spa_map->list);
+	kfree(spa_map);
+}
+
+static struct nfit_spa_mapping *find_spa_mapping(
+		struct acpi_nfit_desc *acpi_desc,
+		struct acpi_nfit_system_address *spa)
+{
+	struct nfit_spa_mapping *spa_map;
+
+	WARN_ON(!mutex_is_locked(&acpi_desc->spa_map_mutex));
+	list_for_each_entry(spa_map, &acpi_desc->spa_maps, list)
+		if (spa_map->spa == spa)
+			return spa_map;
+
+	return NULL;
+}
+
+static void nfit_spa_unmap(struct acpi_nfit_desc *acpi_desc,
+		struct acpi_nfit_system_address *spa)
+{
+	struct nfit_spa_mapping *spa_map;
+
+	mutex_lock(&acpi_desc->spa_map_mutex);
+	spa_map = find_spa_mapping(acpi_desc, spa);
+
+	if (spa_map)
+		kref_put(&spa_map->kref, nfit_spa_mapping_release);
+	mutex_unlock(&acpi_desc->spa_map_mutex);
+}
+
+static void __iomem *__nfit_spa_map(struct acpi_nfit_desc *acpi_desc,
+		struct acpi_nfit_system_address *spa)
+{
+	resource_size_t start = spa->address;
+	resource_size_t n = spa->length;
+	struct nfit_spa_mapping *spa_map;
+	struct resource *res;
+
+	WARN_ON(!mutex_is_locked(&acpi_desc->spa_map_mutex));
+
+	spa_map = find_spa_mapping(acpi_desc, spa);
+	if (spa_map) {
+		kref_get(&spa_map->kref);
+		return spa_map->iomem;
+	}
+
+	spa_map = kzalloc(sizeof(*spa_map), GFP_KERNEL);
+	if (!spa_map)
+		return NULL;
+
+	INIT_LIST_HEAD(&spa_map->list);
+	spa_map->spa = spa;
+	kref_init(&spa_map->kref);
+	spa_map->acpi_desc = acpi_desc;
+
+	res = request_mem_region(start, n, dev_name(acpi_desc->dev));
+	if (!res)
+		goto err_mem;
+
+	/* TODO: cacheability based on the spa type */
+	spa_map->iomem = ioremap_nocache(start, n);
+	if (!spa_map->iomem)
+		goto err_map;
+
+	list_add_tail(&spa_map->list, &acpi_desc->spa_maps);
+	return spa_map->iomem;
+
+ err_map:
+	release_mem_region(start, n);
+ err_mem:
+	kfree(spa_map);
+	return NULL;
+}
+
+/**
+ * nfit_spa_map - interleave-aware managed-mappings of acpi_nfit_system_address ranges
+ * @acpi_desc: NFIT-bus descriptor that provided the spa table entry
+ * @spa: spa table to map
+ *
+ * In the case where block-data-window apertures and
+ * dimm-control-regions are interleaved they will end up sharing a
+ * single request_mem_region() + ioremap() for the address range.  In
+ * the style of devm nfit_spa_map() mappings are automatically dropped
+ * when all region devices referencing the same mapping are disabled /
+ * unbound.
+ */
+static void __iomem *nfit_spa_map(struct acpi_nfit_desc *acpi_desc,
+		struct acpi_nfit_system_address *spa)
+{
+	void __iomem *iomem;
+
+	mutex_lock(&acpi_desc->spa_map_mutex);
+	iomem = __nfit_spa_map(acpi_desc, spa);
+	mutex_unlock(&acpi_desc->spa_map_mutex);
+
+	return iomem;
+}
+
+static int nfit_blk_init_interleave(struct nfit_blk_mmio *mmio,
+		struct acpi_nfit_interleave *idt, u16 interleave_ways)
+{
+	if (idt) {
+		mmio->num_lines = idt->line_count;
+		mmio->line_size = idt->line_size;
+		if (interleave_ways == 0)
+			return -ENXIO;
+		mmio->table_size = mmio->num_lines * interleave_ways
+			* mmio->line_size;
+	}
+
+	return 0;
+}
+
+static int acpi_nfit_blk_region_enable(struct nvdimm_bus *nvdimm_bus,
+		struct device *dev)
+{
+	struct nvdimm_bus_descriptor *nd_desc = to_nd_desc(nvdimm_bus);
+	struct acpi_nfit_desc *acpi_desc = to_acpi_desc(nd_desc);
+	struct nd_blk_region *ndbr = to_nd_blk_region(dev);
+	struct nfit_blk_mmio *mmio;
+	struct nfit_blk *nfit_blk;
+	struct nfit_mem *nfit_mem;
+	struct nvdimm *nvdimm;
+	int rc;
+
+	nvdimm = nd_blk_region_to_dimm(ndbr);
+	nfit_mem = nvdimm_provider_data(nvdimm);
+	if (!nfit_mem || !nfit_mem->dcr || !nfit_mem->bdw) {
+		dev_dbg(dev, "%s: missing%s%s%s\n", __func__,
+				nfit_mem ? "" : " nfit_mem",
+				(nfit_mem && nfit_mem->dcr) ? "" : " dcr",
+				(nfit_mem && nfit_mem->bdw) ? "" : " bdw");
+		return -ENXIO;
+	}
+
+	nfit_blk = devm_kzalloc(dev, sizeof(*nfit_blk), GFP_KERNEL);
+	if (!nfit_blk)
+		return -ENOMEM;
+	nd_blk_region_set_provider_data(ndbr, nfit_blk);
+	nfit_blk->nd_region = to_nd_region(dev);
+
+	/* map block aperture memory */
+	nfit_blk->bdw_offset = nfit_mem->bdw->offset;
+	mmio = &nfit_blk->mmio[BDW];
+	mmio->base = nfit_spa_map(acpi_desc, nfit_mem->spa_bdw);
+	if (!mmio->base) {
+		dev_dbg(dev, "%s: %s failed to map bdw\n", __func__,
+				nvdimm_name(nvdimm));
+		return -ENOMEM;
+	}
+	mmio->size = nfit_mem->bdw->size;
+	mmio->base_offset = nfit_mem->memdev_bdw->region_offset;
+	mmio->idt = nfit_mem->idt_bdw;
+	mmio->spa = nfit_mem->spa_bdw;
+	rc = nfit_blk_init_interleave(mmio, nfit_mem->idt_bdw,
+			nfit_mem->memdev_bdw->interleave_ways);
+	if (rc) {
+		dev_dbg(dev, "%s: %s failed to init bdw interleave\n",
+				__func__, nvdimm_name(nvdimm));
+		return rc;
+	}
+
+	/* map block control memory */
+	nfit_blk->cmd_offset = nfit_mem->dcr->command_offset;
+	nfit_blk->stat_offset = nfit_mem->dcr->status_offset;
+	mmio = &nfit_blk->mmio[DCR];
+	mmio->base = nfit_spa_map(acpi_desc, nfit_mem->spa_dcr);
+	if (!mmio->base) {
+		dev_dbg(dev, "%s: %s failed to map dcr\n", __func__,
+				nvdimm_name(nvdimm));
+		return -ENOMEM;
+	}
+	mmio->size = nfit_mem->dcr->window_size;
+	mmio->base_offset = nfit_mem->memdev_dcr->region_offset;
+	mmio->idt = nfit_mem->idt_dcr;
+	mmio->spa = nfit_mem->spa_dcr;
+	rc = nfit_blk_init_interleave(mmio, nfit_mem->idt_dcr,
+			nfit_mem->memdev_dcr->interleave_ways);
+	if (rc) {
+		dev_dbg(dev, "%s: %s failed to init dcr interleave\n",
+				__func__, nvdimm_name(nvdimm));
+		return rc;
+	}
+
+	if (mmio->line_size == 0)
+		return 0;
+
+	if ((u32) nfit_blk->cmd_offset % mmio->line_size
+			+ 8 > mmio->line_size) {
+		dev_dbg(dev, "cmd_offset crosses interleave boundary\n");
+		return -ENXIO;
+	} else if ((u32) nfit_blk->stat_offset % mmio->line_size
+			+ 8 > mmio->line_size) {
+		dev_dbg(dev, "stat_offset crosses interleave boundary\n");
+		return -ENXIO;
+	}
+
+	return 0;
+}
+
+static void acpi_nfit_blk_region_disable(struct nvdimm_bus *nvdimm_bus,
+		struct device *dev)
+{
+	struct nvdimm_bus_descriptor *nd_desc = to_nd_desc(nvdimm_bus);
+	struct acpi_nfit_desc *acpi_desc = to_acpi_desc(nd_desc);
+	struct nd_blk_region *ndbr = to_nd_blk_region(dev);
+	struct nfit_blk *nfit_blk = nd_blk_region_provider_data(ndbr);
+	int i;
+
+	if (!nfit_blk)
+		return; /* never enabled */
+
+	/* auto-free BLK spa mappings */
+	for (i = 0; i < 2; i++) {
+		struct nfit_blk_mmio *mmio = &nfit_blk->mmio[i];
+
+		if (mmio->base)
+			nfit_spa_unmap(acpi_desc, mmio->spa);
+	}
+	nd_blk_region_set_provider_data(ndbr, NULL);
+	/* devm will free nfit_blk */
+}
+
 static int acpi_nfit_init_mapping(struct acpi_nfit_desc *acpi_desc,
 		struct nd_mapping *nd_mapping, struct nd_region_desc *ndr_desc,
 		struct acpi_nfit_memory_map *memdev,
@@ -878,6 +1285,7 @@ static int acpi_nfit_init_mapping(struct acpi_nfit_desc *acpi_desc,
 {
 	struct nvdimm *nvdimm = acpi_nfit_dimm_by_handle(acpi_desc,
 			memdev->device_handle);
+	struct nd_blk_region_desc *ndbr_desc;
 	struct nfit_mem *nfit_mem;
 	int blk_valid = 0;
 
@@ -908,6 +1316,10 @@ static int acpi_nfit_init_mapping(struct acpi_nfit_desc *acpi_desc,
 
 		ndr_desc->nd_mapping = nd_mapping;
 		ndr_desc->num_mappings = blk_valid;
+		ndbr_desc = to_blk_region_desc(ndr_desc);
+		ndbr_desc->enable = acpi_nfit_blk_region_enable;
+		ndbr_desc->disable = acpi_nfit_blk_region_disable;
+		ndbr_desc->do_io = acpi_nfit_blk_region_do_io;
 		if (!nvdimm_blk_region_create(acpi_desc->nvdimm_bus, ndr_desc))
 			return -ENOMEM;
 		break;
@@ -921,8 +1333,9 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
 {
 	static struct nd_mapping nd_mappings[ND_MAX_MAPPINGS];
 	struct acpi_nfit_system_address *spa = nfit_spa->spa;
+	struct nd_blk_region_desc ndbr_desc;
+	struct nd_region_desc *ndr_desc;
 	struct nfit_memdev *nfit_memdev;
-	struct nd_region_desc ndr_desc;
 	struct nvdimm_bus *nvdimm_bus;
 	struct resource res;
 	int count = 0, rc;
@@ -935,12 +1348,13 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
 
 	memset(&res, 0, sizeof(res));
 	memset(&nd_mappings, 0, sizeof(nd_mappings));
-	memset(&ndr_desc, 0, sizeof(ndr_desc));
+	memset(&ndbr_desc, 0, sizeof(ndbr_desc));
 	res.start = spa->address;
 	res.end = res.start + spa->length - 1;
-	ndr_desc.res = &res;
-	ndr_desc.provider_data = nfit_spa;
-	ndr_desc.attr_groups = acpi_nfit_region_attribute_groups;
+	ndr_desc = &ndbr_desc.ndr_desc;
+	ndr_desc->res = &res;
+	ndr_desc->provider_data = nfit_spa;
+	ndr_desc->attr_groups = acpi_nfit_region_attribute_groups;
 	list_for_each_entry(nfit_memdev, &acpi_desc->memdevs, list) {
 		struct acpi_nfit_memory_map *memdev = nfit_memdev->memdev;
 		struct nd_mapping *nd_mapping;
@@ -953,24 +1367,24 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
 			return -ENXIO;
 		}
 		nd_mapping = &nd_mappings[count++];
-		rc = acpi_nfit_init_mapping(acpi_desc, nd_mapping, &ndr_desc,
+		rc = acpi_nfit_init_mapping(acpi_desc, nd_mapping, ndr_desc,
 				memdev, spa);
 		if (rc)
 			return rc;
 	}
 
-	ndr_desc.nd_mapping = nd_mappings;
-	ndr_desc.num_mappings = count;
-	rc = acpi_nfit_init_interleave_set(acpi_desc, &ndr_desc, spa);
+	ndr_desc->nd_mapping = nd_mappings;
+	ndr_desc->num_mappings = count;
+	rc = acpi_nfit_init_interleave_set(acpi_desc, ndr_desc, spa);
 	if (rc)
 		return rc;
 
 	nvdimm_bus = acpi_desc->nvdimm_bus;
 	if (nfit_spa_type(spa) == NFIT_SPA_PM) {
-		if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc))
+		if (!nvdimm_pmem_region_create(nvdimm_bus, ndr_desc))
 			return -ENOMEM;
 	} else if (nfit_spa_type(spa) == NFIT_SPA_VOLATILE) {
-		if (!nvdimm_volatile_region_create(nvdimm_bus, &ndr_desc))
+		if (!nvdimm_volatile_region_create(nvdimm_bus, ndr_desc))
 			return -ENOMEM;
 	}
 	return 0;
@@ -996,11 +1410,14 @@ static int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
 	u8 *data;
 	int rc;
 
+	INIT_LIST_HEAD(&acpi_desc->spa_maps);
 	INIT_LIST_HEAD(&acpi_desc->spas);
 	INIT_LIST_HEAD(&acpi_desc->dcrs);
 	INIT_LIST_HEAD(&acpi_desc->bdws);
+	INIT_LIST_HEAD(&acpi_desc->idts);
 	INIT_LIST_HEAD(&acpi_desc->memdevs);
 	INIT_LIST_HEAD(&acpi_desc->dimms);
+	mutex_init(&acpi_desc->spa_map_mutex);
 
 	data = (u8 *) acpi_desc->nfit;
 	end = data + sz;
diff --git a/drivers/acpi/nfit.h b/drivers/acpi/nfit.h
index b76e33629098..7bd38b7baf39 100644
--- a/drivers/acpi/nfit.h
+++ b/drivers/acpi/nfit.h
@@ -52,6 +52,11 @@ struct nfit_bdw {
 	struct list_head list;
 };
 
+struct nfit_idt {
+	struct acpi_nfit_interleave *idt;
+	struct list_head list;
+};
+
 struct nfit_memdev {
 	struct acpi_nfit_memory_map *memdev;
 	struct list_head list;
@@ -62,10 +67,13 @@ struct nfit_mem {
 	struct nvdimm *nvdimm;
 	struct acpi_nfit_memory_map *memdev_dcr;
 	struct acpi_nfit_memory_map *memdev_pmem;
+	struct acpi_nfit_memory_map *memdev_bdw;
 	struct acpi_nfit_control_region *dcr;
 	struct acpi_nfit_data_region *bdw;
 	struct acpi_nfit_system_address *spa_dcr;
 	struct acpi_nfit_system_address *spa_bdw;
+	struct acpi_nfit_interleave *idt_dcr;
+	struct acpi_nfit_interleave *idt_bdw;
 	struct list_head list;
 	struct acpi_device *adev;
 	unsigned long dsm_mask;
@@ -74,16 +82,57 @@ struct nfit_mem {
 struct acpi_nfit_desc {
 	struct nvdimm_bus_descriptor nd_desc;
 	struct acpi_table_nfit *nfit;
+	struct mutex spa_map_mutex;
+	struct list_head spa_maps;
 	struct list_head memdevs;
 	struct list_head dimms;
 	struct list_head spas;
 	struct list_head dcrs;
 	struct list_head bdws;
+	struct list_head idts;
 	struct nvdimm_bus *nvdimm_bus;
 	struct device *dev;
 	unsigned long dimm_dsm_force_en;
 };
 
+enum nd_blk_mmio_selector {
+	BDW,
+	DCR,
+};
+
+struct nfit_blk {
+	struct nfit_blk_mmio {
+		union {
+			void __iomem *base;
+			void *aperture;
+		};
+		u64 size;
+		u64 base_offset;
+		u32 line_size;
+		u32 num_lines;
+		u32 table_size;
+		struct acpi_nfit_interleave *idt;
+		struct acpi_nfit_system_address *spa;
+	} mmio[2];
+	struct nd_region *nd_region;
+	u64 bdw_offset; /* post interleave offset */
+	u64 stat_offset;
+	u64 cmd_offset;
+};
+
+struct nfit_spa_mapping {
+	struct acpi_nfit_desc *acpi_desc;
+	struct acpi_nfit_system_address *spa;
+	struct list_head list;
+	struct kref kref;
+	void __iomem *iomem;
+};
+
+static inline struct nfit_spa_mapping *to_spa_map(struct kref *kref)
+{
+	return container_of(kref, struct nfit_spa_mapping, kref);
+}
+
 static inline struct acpi_nfit_memory_map *__to_nfit_memdev(
 		struct nfit_mem *nfit_mem)
 {
diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index a9def3839655..912cb36b8435 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -33,6 +33,18 @@ config BLK_DEV_PMEM
 
 	  Say Y if you want to use an NVDIMM
 
+config ND_BLK
+	tristate "BLK: Block data window (aperture) device support"
+	default LIBNVDIMM
+	help
+	  Support NVDIMMs, or other devices, that implement a BLK-mode
+	  access capability.  BLK-mode access uses memory-mapped-i/o
+	  apertures to access persistent media.
+
+	  Say Y if your platform firmware emits an ACPI.NFIT table
+	  (CONFIG_ACPI_NFIT), or otherwise exposes BLK-mode
+	  capabilities.
+
 config ND_BTT_DEVS
 	bool
 
diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
index aa5bb1acf831..b5682e70904a 100644
--- a/drivers/nvdimm/Makefile
+++ b/drivers/nvdimm/Makefile
@@ -1,11 +1,14 @@
 obj-$(CONFIG_LIBNVDIMM) += libnvdimm.o
 obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
 obj-$(CONFIG_ND_BTT) += nd_btt.o
+obj-$(CONFIG_ND_BLK) += nd_blk.o
 
 nd_pmem-y := pmem.o
 
 nd_btt-y := btt.o
 
+nd_blk-y := blk.o
+
 libnvdimm-y := core.o
 libnvdimm-y += bus.o
 libnvdimm-y += dimm_devs.o
diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
new file mode 100644
index 000000000000..a2749b5e43d7
--- /dev/null
+++ b/drivers/nvdimm/blk.c
@@ -0,0 +1,241 @@
+/*
+ * NVDIMM Block Window Driver
+ * Copyright (c) 2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/blkdev.h>
+#include <linux/fs.h>
+#include <linux/genhd.h>
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/nd.h>
+#include <linux/sizes.h>
+#include "nd.h"
+
+struct nd_blk_device {
+	struct request_queue *queue;
+	struct gendisk *disk;
+	struct nd_namespace_blk *nsblk;
+	struct nd_blk_region *ndbr;
+	size_t disk_size;
+};
+
+static int nd_blk_major;
+
+static resource_size_t to_dev_offset(struct nd_namespace_blk *nsblk,
+				resource_size_t ns_offset, unsigned int len)
+{
+	int i;
+
+	for (i = 0; i < nsblk->num_resources; i++) {
+		if (ns_offset < resource_size(nsblk->res[i])) {
+			if (ns_offset + len > resource_size(nsblk->res[i])) {
+				dev_WARN_ONCE(&nsblk->dev, 1,
+					"%s: illegal request\n", __func__);
+				return SIZE_MAX;
+			}
+			return nsblk->res[i]->start + ns_offset;
+		}
+		ns_offset -= resource_size(nsblk->res[i]);
+	}
+
+	dev_WARN_ONCE(&nsblk->dev, 1, "%s: request out of range\n", __func__);
+	return SIZE_MAX;
+}
+
+static void nd_blk_make_request(struct request_queue *q, struct bio *bio)
+{
+	struct block_device *bdev = bio->bi_bdev;
+	struct gendisk *disk = bdev->bd_disk;
+	struct nd_namespace_blk *nsblk;
+	struct nd_blk_device *blk_dev;
+	struct nd_blk_region *ndbr;
+	struct bvec_iter iter;
+	struct bio_vec bvec;
+	int err = 0, rw;
+	sector_t sector;
+
+	sector = bio->bi_iter.bi_sector;
+	if (bio_end_sector(bio) > get_capacity(disk)) {
+		err = -EIO;
+		goto out;
+	}
+
+	BUG_ON(bio->bi_rw & REQ_DISCARD);
+
+	rw = bio_data_dir(bio);
+
+	blk_dev = disk->private_data;
+	nsblk = blk_dev->nsblk;
+	ndbr = blk_dev->ndbr;
+	bio_for_each_segment(bvec, bio, iter) {
+		unsigned int len = bvec.bv_len;
+		resource_size_t	dev_offset;
+		void *iobuf;
+
+		BUG_ON(len > PAGE_SIZE);
+
+		dev_offset = to_dev_offset(nsblk, sector << SECTOR_SHIFT, len);
+		if (dev_offset == SIZE_MAX) {
+			err = -EIO;
+			goto out;
+		}
+
+		iobuf = kmap_atomic(bvec.bv_page);
+		err = ndbr->do_io(ndbr, dev_offset, iobuf + bvec.bv_offset,
+				len, rw);
+		kunmap_atomic(iobuf);
+		if (err)
+			goto out;
+
+		sector += len >> SECTOR_SHIFT;
+	}
+
+ out:
+	bio_endio(bio, err);
+}
+
+static int nd_blk_rw_bytes(struct gendisk *disk, resource_size_t offset,
+		void *iobuf, size_t n, int rw)
+{
+	struct nd_blk_device *blk_dev = disk->private_data;
+	struct nd_namespace_blk *nsblk = blk_dev->nsblk;
+	struct nd_blk_region *ndbr = blk_dev->ndbr;
+	resource_size_t	dev_offset;
+
+	if (unlikely(offset + n > blk_dev->disk_size)) {
+		dev_WARN_ONCE(disk_to_dev(disk), 1, "request out of range\n");
+		return -EFAULT;
+	}
+
+	/* translate only after the range check to avoid a second WARN */
+	dev_offset = to_dev_offset(nsblk, offset, n);
+	if (dev_offset == SIZE_MAX)
+		return -EIO;
+
+	return ndbr->do_io(ndbr, dev_offset, iobuf, n, rw);
+}
+
+static const struct block_device_operations nd_blk_fops = {
+	.owner = THIS_MODULE,
+	.rw_bytes = nd_blk_rw_bytes,
+};
+
+static int nd_blk_probe(struct device *dev)
+{
+	struct nd_namespace_blk *nsblk = to_nd_namespace_blk(dev);
+	struct nd_region *nd_region = to_nd_region(dev->parent);
+	struct nd_blk_device *blk_dev;
+	resource_size_t disk_size;
+	struct gendisk *disk;
+	int err;
+
+	disk_size = nd_namespace_blk_validate(nsblk);
+	if (disk_size < ND_MIN_NAMESPACE_SIZE)
+		return -ENXIO;
+
+	blk_dev = kzalloc(sizeof(*blk_dev), GFP_KERNEL);
+	if (!blk_dev)
+		return -ENOMEM;
+
+	blk_dev->disk_size	= disk_size;
+
+	blk_dev->queue = blk_alloc_queue(GFP_KERNEL);
+	if (!blk_dev->queue) {
+		err = -ENOMEM;
+		goto err_alloc_queue;
+	}
+
+	blk_queue_make_request(blk_dev->queue, nd_blk_make_request);
+	blk_queue_max_hw_sectors(blk_dev->queue, 1024);
+	blk_queue_bounce_limit(blk_dev->queue, BLK_BOUNCE_ANY);
+	blk_queue_logical_block_size(blk_dev->queue, nsblk->lbasize);
+
+	disk = blk_dev->disk = alloc_disk(0);
+	if (!disk) {
+		err = -ENOMEM;
+		goto err_alloc_disk;
+	}
+
+	blk_dev->ndbr = to_nd_blk_region(nsblk->dev.parent);
+	blk_dev->nsblk = nsblk;
+
+	disk->driverfs_dev	= dev;
+	disk->major		= nd_blk_major;
+	disk->first_minor	= 0;
+	disk->fops		= &nd_blk_fops;
+	disk->private_data	= blk_dev;
+	disk->queue		= blk_dev->queue;
+	disk->flags		= GENHD_FL_EXT_DEVT;
+	sprintf(disk->disk_name, "ndblk%d.%d", nd_region->id, nsblk->id);
+	set_capacity(disk, disk_size >> SECTOR_SHIFT);
+
+	dev_set_drvdata(dev, blk_dev);
+	nvdimm_bus_add_disk(disk);
+
+	return 0;
+
+ err_alloc_disk:
+	blk_cleanup_queue(blk_dev->queue);
+ err_alloc_queue:
+	kfree(blk_dev);
+	return err;
+}
+
+static int nd_blk_remove(struct device *dev)
+{
+	struct nd_blk_device *blk_dev = dev_get_drvdata(dev);
+
+	nvdimm_bus_remove_disk(blk_dev->disk);
+	blk_cleanup_queue(blk_dev->queue);
+	kfree(blk_dev);
+
+	return 0;
+}
+
+static struct nd_device_driver nd_blk_driver = {
+	.probe = nd_blk_probe,
+	.remove = nd_blk_remove,
+	.drv = {
+		.name = "nd_blk",
+	},
+	.type = ND_DRIVER_NAMESPACE_BLK,
+};
+
+static int __init nd_blk_init(void)
+{
+	int rc;
+
+	rc = register_blkdev(0, "nd_blk");
+	if (rc < 0)
+		return rc;
+
+	nd_blk_major = rc;
+	rc = nd_driver_register(&nd_blk_driver);
+
+	if (rc < 0)
+		unregister_blkdev(nd_blk_major, "nd_blk");
+
+	return rc;
+}
+
+static void __exit nd_blk_exit(void)
+{
+	driver_unregister(&nd_blk_driver.drv);
+	unregister_blkdev(nd_blk_major, "nd_blk");
+}
+
+MODULE_AUTHOR("Ross Zwisler <ross.zwisler@linux.intel.com>");
+MODULE_LICENSE("GPL v2");
+MODULE_ALIAS_ND_DEVICE(ND_DEVICE_NAMESPACE_BLK);
+module_init(nd_blk_init);
+module_exit(nd_blk_exit);
diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index 83b179ed6d61..c05eb807d674 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -209,6 +209,15 @@ struct nvdimm *to_nvdimm(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(to_nvdimm);
 
+struct nvdimm *nd_blk_region_to_dimm(struct nd_blk_region *ndbr)
+{
+	struct nd_region *nd_region = &ndbr->nd_region;
+	struct nd_mapping *nd_mapping = &nd_region->mapping[0];
+
+	return nd_mapping->nvdimm;
+}
+EXPORT_SYMBOL_GPL(nd_blk_region_to_dimm);
+
 struct nvdimm_drvdata *to_ndd(struct nd_mapping *nd_mapping)
 {
 	struct nvdimm *nvdimm = nd_mapping->nvdimm;
diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index 50b502b1908e..68780d768e7b 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -149,6 +149,66 @@ static resource_size_t nd_namespace_blk_size(struct nd_namespace_blk *nsblk)
 	return size;
 }
 
+static resource_size_t __nd_namespace_blk_validate(
+		struct nd_namespace_blk *nsblk)
+{
+	struct nd_region *nd_region = to_nd_region(nsblk->dev.parent);
+	struct nd_mapping *nd_mapping = &nd_region->mapping[0];
+	struct nvdimm_drvdata *ndd = to_ndd(nd_mapping);
+	struct nd_label_id label_id;
+	struct resource *res;
+	int count, i;
+
+	if (!nsblk->uuid || !nsblk->lbasize || !ndd)
+		return 0;
+
+	count = 0;
+	nd_label_gen_id(&label_id, nsblk->uuid, NSLABEL_FLAG_LOCAL);
+	for_each_dpa_resource(ndd, res) {
+		if (strcmp(res->name, label_id.id) != 0)
+			continue;
+		/*
+		 * Resources with unacknowledged adjustments indicate a
+		 * failure to update labels
+		 */
+		if (res->flags & DPA_RESOURCE_ADJUSTED)
+			return 0;
+		count++;
+	}
+
+	/* These values match after a successful label update */
+	if (count != nsblk->num_resources)
+		return 0;
+
+	for (i = 0; i < nsblk->num_resources; i++) {
+		struct resource *found = NULL;
+
+		for_each_dpa_resource(ndd, res)
+			if (res == nsblk->res[i]) {
+				found = res;
+				break;
+			}
+		/* stale resource */
+		if (!found)
+			return 0;
+	}
+
+	return nd_namespace_blk_size(nsblk);
+}
+
+resource_size_t nd_namespace_blk_validate(struct nd_namespace_blk *nsblk)
+{
+	resource_size_t size;
+
+	nvdimm_bus_lock(&nsblk->dev);
+	size = __nd_namespace_blk_validate(nsblk);
+	nvdimm_bus_unlock(&nsblk->dev);
+
+	return size;
+}
+EXPORT_SYMBOL(nd_namespace_blk_validate);
+
+
 static int nd_namespace_label_update(struct nd_region *nd_region,
 		struct device *dev)
 {
diff --git a/drivers/nvdimm/nd-core.h b/drivers/nvdimm/nd-core.h
index 1375d30b3da5..9a90915e6fd2 100644
--- a/drivers/nvdimm/nd-core.h
+++ b/drivers/nvdimm/nd-core.h
@@ -23,10 +23,7 @@ extern struct list_head nvdimm_bus_list;
 extern struct mutex nvdimm_bus_list_mutex;
 extern int nvdimm_major;
 
-struct block_device;
-struct nd_io_claim;
 struct nd_btt;
-struct nd_io;
 
 struct nvdimm_bus {
 	struct nvdimm_bus_descriptor *nd_desc;
@@ -49,8 +46,8 @@ struct nvdimm {
 };
 
 bool is_nvdimm(struct device *dev);
-bool is_nd_blk(struct device *dev);
 bool is_nd_pmem(struct device *dev);
+bool is_nd_blk(struct device *dev);
 struct gendisk;
 #if IS_ENABLED(CONFIG_ND_BTT_DEVS)
 bool is_nd_btt(struct device *dev);
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index b876b839f49a..2b7746e798fb 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -96,6 +96,15 @@ struct nd_region {
 	struct nd_mapping mapping[0];
 };
 
+struct nd_blk_region {
+	int (*enable)(struct nvdimm_bus *nvdimm_bus, struct device *dev);
+	void (*disable)(struct nvdimm_bus *nvdimm_bus, struct device *dev);
+	int (*do_io)(struct nd_blk_region *ndbr, resource_size_t dpa,
+			void *iobuf, u64 len, int rw);
+	void *blk_provider_data;
+	struct nd_region nd_region;
+};
+
 /*
  * Lookup next in the repeating sequence of 01, 10, and 11.
  */
@@ -142,8 +151,6 @@ struct nd_btt *to_nd_btt(struct device *dev);
 struct btt_sb;
 u64 nd_btt_sb_checksum(struct btt_sb *btt_sb);
 struct nd_region *to_nd_region(struct device *dev);
-unsigned int nd_region_acquire_lane(struct nd_region *nd_region);
-void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane);
 int nd_region_to_nstype(struct nd_region *nd_region);
 int nd_region_register_namespaces(struct nd_region *nd_region, int *err);
 u64 nd_region_interleave_set_cookie(struct nd_region *nd_region);
@@ -159,4 +166,6 @@ void nvdimm_free_dpa(struct nvdimm_drvdata *ndd, struct resource *res);
 struct resource *nvdimm_allocate_dpa(struct nvdimm_drvdata *ndd,
 		struct nd_label_id *label_id, resource_size_t start,
 		resource_size_t n);
+int nd_blk_region_init(struct nd_region *nd_region);
+resource_size_t nd_namespace_blk_validate(struct nd_namespace_blk *nsblk);
 #endif /* __ND_H__ */
diff --git a/drivers/nvdimm/region.c b/drivers/nvdimm/region.c
index aa617bf86506..d9d82e7a90fa 100644
--- a/drivers/nvdimm/region.c
+++ b/drivers/nvdimm/region.c
@@ -94,11 +94,10 @@ EXPORT_SYMBOL(nd_region_release_lane);
 
 static int nd_region_probe(struct device *dev)
 {
-	int err;
+	int err, rc;
 	static unsigned long once;
 	struct nd_region_namespaces *num_ns;
 	struct nd_region *nd_region = to_nd_region(dev);
-	int rc = nd_region_register_namespaces(nd_region, &err);
 
 	if (nd_region->num_lanes > num_online_cpus()
 			&& nd_region->num_lanes < num_possible_cpus()
@@ -110,6 +109,11 @@ static int nd_region_probe(struct device *dev)
 				nd_region->num_lanes);
 	}
 
+	rc = nd_blk_region_init(nd_region);
+	if (rc)
+		return rc;
+
+	rc = nd_region_register_namespaces(nd_region, &err);
 	num_ns = devm_kzalloc(dev, sizeof(*num_ns), GFP_KERNEL);
 	if (!num_ns)
 		return -ENOMEM;
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index cfa68d6590d6..b16ec20dbba2 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -11,6 +11,7 @@
  * General Public License for more details.
  */
 #include <linux/scatterlist.h>
+#include <linux/highmem.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
 #include <linux/sort.h>
@@ -33,7 +34,10 @@ static void nd_region_release(struct device *dev)
 		put_device(&nvdimm->dev);
 	}
 	ida_simple_remove(&region_ida, nd_region->id);
-	kfree(nd_region);
+	if (is_nd_blk(dev))
+		kfree(to_nd_blk_region(dev));
+	else
+		kfree(nd_region);
 }
 
 static struct device_type nd_blk_device_type = {
@@ -70,6 +74,33 @@ struct nd_region *to_nd_region(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(to_nd_region);
 
+struct nd_blk_region *to_nd_blk_region(struct device *dev)
+{
+	struct nd_region *nd_region = to_nd_region(dev);
+
+	WARN_ON(!is_nd_blk(dev));
+	return container_of(nd_region, struct nd_blk_region, nd_region);
+}
+EXPORT_SYMBOL_GPL(to_nd_blk_region);
+
+void *nd_region_provider_data(struct nd_region *nd_region)
+{
+	return nd_region->provider_data;
+}
+EXPORT_SYMBOL_GPL(nd_region_provider_data);
+
+void *nd_blk_region_provider_data(struct nd_blk_region *ndbr)
+{
+	return ndbr->blk_provider_data;
+}
+EXPORT_SYMBOL_GPL(nd_blk_region_provider_data);
+
+void nd_blk_region_set_provider_data(struct nd_blk_region *ndbr, void *data)
+{
+	ndbr->blk_provider_data = data;
+}
+EXPORT_SYMBOL_GPL(nd_blk_region_set_provider_data);
+
 /**
  * nd_region_to_nstype() - region to an integer namespace type
  * @nd_region: region-device to interrogate
@@ -345,7 +376,9 @@ u64 nd_region_interleave_set_cookie(struct nd_region *nd_region)
 
 /*
  * Upon successful probe/remove, take/release a reference on the
- * associated interleave set (if present)
+ * associated dimms in the interleave set; on successful probe of a BLK
+ * namespace, check if we need a new seed; and on remove of a BLK region,
+ * notify the provider to disable the region.
  */
 static void nd_region_notify_driver_action(struct nvdimm_bus *nvdimm_bus,
 		struct device *dev, bool probe)
@@ -365,6 +398,11 @@ static void nd_region_notify_driver_action(struct nvdimm_bus *nvdimm_bus,
 			nd_mapping->ndd = NULL;
 			atomic_dec(&nvdimm->busy);
 		}
+
+		if (is_nd_pmem(dev))
+			return;
+
+		to_nd_blk_region(dev)->disable(nvdimm_bus, dev);
 	} else if (dev->parent && is_nd_blk(dev->parent) && probe) {
 		struct nd_region *nd_region = to_nd_region(dev->parent);
 
@@ -497,11 +535,21 @@ struct attribute_group nd_mapping_attribute_group = {
 };
 EXPORT_SYMBOL_GPL(nd_mapping_attribute_group);
 
-void *nd_region_provider_data(struct nd_region *nd_region)
+int nd_blk_region_init(struct nd_region *nd_region)
 {
-	return nd_region->provider_data;
+	struct device *dev = &nd_region->dev;
+	struct nvdimm_bus *nvdimm_bus = walk_to_nvdimm_bus(dev);
+
+	if (!is_nd_blk(dev))
+		return 0;
+
+	if (nd_region->ndr_mappings < 1) {
+		dev_err(dev, "invalid BLK region\n");
+		return -ENXIO;
+	}
+
+	return to_nd_blk_region(dev)->enable(nvdimm_bus, dev);
 }
-EXPORT_SYMBOL_GPL(nd_region_provider_data);
 
 static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
 		struct nd_region_desc *ndr_desc, struct device_type *dev_type,
@@ -523,9 +571,28 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
 		}
 	}
 
-	nd_region = kzalloc(sizeof(struct nd_region)
-			+ sizeof(struct nd_mapping) * ndr_desc->num_mappings,
-			GFP_KERNEL);
+	if (dev_type == &nd_blk_device_type) {
+		struct nd_blk_region_desc *ndbr_desc;
+		struct nd_blk_region *ndbr;
+
+		ndbr_desc = to_blk_region_desc(ndr_desc);
+		ndbr = kzalloc(sizeof(*ndbr) + sizeof(struct nd_mapping)
+				* ndr_desc->num_mappings,
+				GFP_KERNEL);
+		if (ndbr) {
+			nd_region = &ndbr->nd_region;
+			ndbr->enable = ndbr_desc->enable;
+			ndbr->disable = ndbr_desc->disable;
+			ndbr->do_io = ndbr_desc->do_io;
+		} else
+			nd_region = NULL;
+	} else {
+		nd_region = kzalloc(sizeof(struct nd_region)
+				+ sizeof(struct nd_mapping)
+				* ndr_desc->num_mappings,
+				GFP_KERNEL);
+	}
+
 	if (!nd_region)
 		return NULL;
 	nd_region->id = ida_simple_get(&region_ida, 0, 0, GFP_KERNEL);
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index 531d99dfac68..7fc1b25bdb5d 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -14,6 +14,7 @@
  */
 #ifndef __LIBNVDIMM_H__
 #define __LIBNVDIMM_H__
+#include <linux/kernel.h>
 #include <linux/sizes.h>
 #include <linux/types.h>
 
@@ -89,8 +90,24 @@ struct nd_region_desc {
 };
 
 struct nvdimm_bus;
-struct device;
 struct module;
+struct device;
+struct nd_blk_region;
+struct nd_blk_region_desc {
+	int (*enable)(struct nvdimm_bus *nvdimm_bus, struct device *dev);
+	void (*disable)(struct nvdimm_bus *nvdimm_bus, struct device *dev);
+	int (*do_io)(struct nd_blk_region *ndbr, resource_size_t dpa,
+			void *iobuf, u64 len, int rw);
+	struct nd_region_desc ndr_desc;
+};
+
+static inline struct nd_blk_region_desc *to_blk_region_desc(
+		struct nd_region_desc *ndr_desc)
+{
+	return container_of(ndr_desc, struct nd_blk_region_desc, ndr_desc);
+
+}
+
 struct nvdimm_bus *__nvdimm_bus_register(struct device *parent,
 		struct nvdimm_bus_descriptor *nfit_desc, struct module *module);
 #define nvdimm_bus_register(parent, desc) \
@@ -99,10 +116,10 @@ void nvdimm_bus_unregister(struct nvdimm_bus *nvdimm_bus);
 struct nvdimm_bus *to_nvdimm_bus(struct device *dev);
 struct nvdimm *to_nvdimm(struct device *dev);
 struct nd_region *to_nd_region(struct device *dev);
+struct nd_blk_region *to_nd_blk_region(struct device *dev);
 struct nvdimm_bus_descriptor *to_nd_desc(struct nvdimm_bus *nvdimm_bus);
 const char *nvdimm_name(struct nvdimm *nvdimm);
 void *nvdimm_provider_data(struct nvdimm *nvdimm);
-void *nd_region_provider_data(struct nd_region *nd_region);
 struct nvdimm *nvdimm_create(struct nvdimm_bus *nvdimm_bus, void *provider_data,
 		const struct attribute_group **groups, unsigned long flags,
 		unsigned long *dsm_mask);
@@ -120,5 +137,11 @@ struct nd_region *nvdimm_blk_region_create(struct nvdimm_bus *nvdimm_bus,
 		struct nd_region_desc *ndr_desc);
 struct nd_region *nvdimm_volatile_region_create(struct nvdimm_bus *nvdimm_bus,
 		struct nd_region_desc *ndr_desc);
+void *nd_region_provider_data(struct nd_region *nd_region);
+void *nd_blk_region_provider_data(struct nd_blk_region *ndbr);
+void nd_blk_region_set_provider_data(struct nd_blk_region *ndbr, void *data);
+struct nvdimm *nd_blk_region_to_dimm(struct nd_blk_region *ndbr);
+unsigned int nd_region_acquire_lane(struct nd_region *nd_region);
+void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane);
 u64 nd_fletcher64(void *addr, size_t len, bool le);
 #endif /* __LIBNVDIMM_H__ */



* [PATCH 04/15] libnvdimm, nfit, nd_blk: driver for BLK-mode access persistent memory
@ 2015-06-17 23:55   ` Dan Williams
  0 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-17 23:55 UTC (permalink / raw)
  To: axboe, linux-nvdimm
  Cc: boaz, toshi.kani, Rafael J. Wysocki, linux-kernel,
	Andy Lutomirski, Jens Axboe, linux-acpi, H. Peter Anvin,
	linux-fsdevel, Ross Zwisler, hch, mingo

From: Ross Zwisler <ross.zwisler@linux.intel.com>

The libnvdimm implementation handles allocating dimm address space (DPA)
between the PMEM and BLK mode interfaces.  After DPA has been allocated
from a BLK-region to a BLK-namespace, the nd_blk driver attaches to handle
I/O as a struct bio based block device.  Unlike PMEM, BLK is required to
handle platform-specific details like mmio register formats and memory
controller interleave.  For this reason the generic libnvdimm nd_blk
driver calls back into the bus provider to carry out the I/O.

This initial implementation handles the BLK interface defined by the
ACPI 6 NFIT [1] and the NVDIMM DSM Interface Example [2].  That interface
is composed from the DCR (dimm control region), BDW (block data window),
and IDT (interleave descriptor) NFIT structures plus the hardware
register format.

[1]: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
[2]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
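Note on the interleave handling mentioned in the changelog: the sketch
below is a userspace model of the arithmetic in to_interleave_offset()
from this patch, offered only as an aid for review.  The struct and
function names (blk_mmio_model, model_interleave_offset) are hypothetical
and not part of the patch; table_size is taken as a plain input here, and
the per-line multiply is widened to 64 bits, which the kernel code does
not do.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the fields of struct nfit_blk_mmio
 * that to_interleave_offset() reads. */
struct blk_mmio_model {
	uint64_t base_offset;        /* region offset of this dimm */
	uint32_t line_size;          /* bytes per interleave line */
	uint32_t num_lines;          /* entries in the line-offset table */
	uint32_t table_size;         /* bytes covered by one pass over the table */
	const uint32_t *line_offset; /* per-line offsets, in units of line_size */
};

/* Translate a pre-interleave offset into a post-interleave offset. */
static uint64_t model_interleave_offset(uint64_t offset,
		const struct blk_mmio_model *mmio)
{
	uint64_t line_no, table_skip_count;
	uint32_t sub_line_offset, line_index;

	assert(mmio->line_size && mmio->num_lines);

	/* which line, and how far into that line */
	line_no = offset / mmio->line_size;
	sub_line_offset = offset % mmio->line_size;

	/* how many whole tables to skip, and which table entry applies */
	table_skip_count = line_no / mmio->num_lines;
	line_index = line_no % mmio->num_lines;

	return mmio->base_offset
		+ (uint64_t) mmio->line_offset[line_index] * mmio->line_size
		+ table_skip_count * mmio->table_size
		+ sub_line_offset;
}
```

With a two-entry table of {0, 2}, 256-byte lines, and a 1024-byte table
span, offset 300 falls 44 bytes into line 1, so it maps to
2 * 256 + 44 = 556.
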
 drivers/acpi/nfit.c             |  445 ++++++++++++++++++++++++++++++++++++++-
 drivers/acpi/nfit.h             |   49 ++++
 drivers/nvdimm/Kconfig          |   12 +
 drivers/nvdimm/Makefile         |    3 
 drivers/nvdimm/blk.c            |  241 +++++++++++++++++++++
 drivers/nvdimm/dimm_devs.c      |    9 +
 drivers/nvdimm/namespace_devs.c |   60 +++++
 drivers/nvdimm/nd-core.h        |    5 
 drivers/nvdimm/nd.h             |   13 +
 drivers/nvdimm/region.c         |    8 +
 drivers/nvdimm/region_devs.c    |   83 +++++++
 include/linux/libnvdimm.h       |   27 ++
 12 files changed, 923 insertions(+), 32 deletions(-)
 create mode 100644 drivers/nvdimm/blk.c
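
Note on the "hardware register format" mentioned in the changelog: the
sketch below is a userspace model of how write_blk_ctl() in this patch
packs the block control word (offset in bits 0-47, length in bits 48-55,
read/write command at bit 56).  The names model_blk_ctl_word and
MODEL_CACHE_SHIFT are hypothetical, and the shift of 6 assumes 64-byte
cache lines in place of the kernel's L1_CACHE_SHIFT.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_CACHE_SHIFT 6	/* assumption: 64-byte cache lines */

/* Pack dpa, length, and direction the way write_blk_ctl() builds cmd. */
static uint64_t model_blk_ctl_word(uint64_t dpa, unsigned int len,
		unsigned int write)
{
	const uint64_t offset_mask = (1ULL << 48) - 1;
	const uint64_t len_mask = (1ULL << 8) - 1;
	uint64_t cmd;

	assert(write <= 1);

	/* offset and length are expressed in cache-line units */
	cmd = (dpa >> MODEL_CACHE_SHIFT) & offset_mask;
	cmd |= ((uint64_t) (len >> MODEL_CACHE_SHIFT) & len_mask) << 48;
	cmd |= (uint64_t) write << 56;

	return cmd;
}
```

For example, a 4096-byte write at dpa 0x1000 encodes an offset of 0x40
lines and a length of 0x40 lines, with the command bit set.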

diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index fc38b49eff7d..3a77709fd394 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -13,12 +13,16 @@
 #include <linux/list_sort.h>
 #include <linux/libnvdimm.h>
 #include <linux/module.h>
+#include <linux/mutex.h>
 #include <linux/ndctl.h>
 #include <linux/list.h>
 #include <linux/acpi.h>
 #include <linux/sort.h>
+#include <linux/io.h>
 #include "nfit.h"
 
+#include <asm-generic/io-64-nonatomic-hi-lo.h>
+
 static bool force_enable_dimms;
 module_param(force_enable_dimms, bool, S_IRUGO|S_IWUSR);
 MODULE_PARM_DESC(force_enable_dimms, "Ignore _STA (ACPI DIMM device) status");
@@ -72,7 +76,7 @@ static int acpi_nfit_ctl(struct nvdimm_bus_descriptor *nd_desc,
 
 		if (!adev)
 			return -ENOTTY;
-		dimm_name = dev_name(&adev->dev);
+		dimm_name = nvdimm_name(nvdimm);
 		cmd_name = nvdimm_cmd_name(cmd);
 		dsm_mask = nfit_mem->dsm_mask;
 		desc = nd_cmd_dimm_desc(cmd);
@@ -279,6 +283,23 @@ static bool add_bdw(struct acpi_nfit_desc *acpi_desc,
 	return true;
 }
 
+static bool add_idt(struct acpi_nfit_desc *acpi_desc,
+		struct acpi_nfit_interleave *idt)
+{
+	struct device *dev = acpi_desc->dev;
+	struct nfit_idt *nfit_idt = devm_kzalloc(dev, sizeof(*nfit_idt),
+			GFP_KERNEL);
+
+	if (!nfit_idt)
+		return false;
+	INIT_LIST_HEAD(&nfit_idt->list);
+	nfit_idt->idt = idt;
+	list_add_tail(&nfit_idt->list, &acpi_desc->idts);
+	dev_dbg(dev, "%s: idt index: %d num_lines: %d\n", __func__,
+			idt->interleave_index, idt->line_count);
+	return true;
+}
+
 static void *add_table(struct acpi_nfit_desc *acpi_desc, void *table,
 		const void *end)
 {
@@ -307,9 +328,9 @@ static void *add_table(struct acpi_nfit_desc *acpi_desc, void *table,
 		if (!add_bdw(acpi_desc, table))
 			return err;
 		break;
-	/* TODO */
 	case ACPI_NFIT_TYPE_INTERLEAVE:
-		dev_dbg(dev, "%s: idt\n", __func__);
+		if (!add_idt(acpi_desc, table))
+			return err;
 		break;
 	case ACPI_NFIT_TYPE_FLUSH_ADDRESS:
 		dev_dbg(dev, "%s: flush\n", __func__);
@@ -362,8 +383,11 @@ static int nfit_mem_add(struct acpi_nfit_desc *acpi_desc,
 		struct nfit_mem *nfit_mem, struct acpi_nfit_system_address *spa)
 {
 	u16 dcr = __to_nfit_memdev(nfit_mem)->region_index;
+	struct nfit_memdev *nfit_memdev;
 	struct nfit_dcr *nfit_dcr;
 	struct nfit_bdw *nfit_bdw;
+	struct nfit_idt *nfit_idt;
+	u16 idt_idx, range_index;
 
 	list_for_each_entry(nfit_dcr, &acpi_desc->dcrs, list) {
 		if (nfit_dcr->dcr->region_index != dcr)
@@ -396,6 +420,26 @@ static int nfit_mem_add(struct acpi_nfit_desc *acpi_desc,
 		return 0;
 
 	nfit_mem_find_spa_bdw(acpi_desc, nfit_mem);
+
+	if (!nfit_mem->spa_bdw)
+		return 0;
+
+	range_index = nfit_mem->spa_bdw->range_index;
+	list_for_each_entry(nfit_memdev, &acpi_desc->memdevs, list) {
+		if (nfit_memdev->memdev->range_index != range_index ||
+				nfit_memdev->memdev->region_index != dcr)
+			continue;
+		nfit_mem->memdev_bdw = nfit_memdev->memdev;
+		idt_idx = nfit_memdev->memdev->interleave_index;
+		list_for_each_entry(nfit_idt, &acpi_desc->idts, list) {
+			if (nfit_idt->idt->interleave_index != idt_idx)
+				continue;
+			nfit_mem->idt_bdw = nfit_idt->idt;
+			break;
+		}
+		break;
+	}
+
 	return 0;
 }
 
@@ -439,9 +483,19 @@ static int nfit_mem_dcr_init(struct acpi_nfit_desc *acpi_desc,
 		}
 
 		if (type == NFIT_SPA_DCR) {
+			struct nfit_idt *nfit_idt;
+			u16 idt_idx;
+
 			/* multiple dimms may share a SPA when interleaved */
 			nfit_mem->spa_dcr = spa;
 			nfit_mem->memdev_dcr = nfit_memdev->memdev;
+			idt_idx = nfit_memdev->memdev->interleave_index;
+			list_for_each_entry(nfit_idt, &acpi_desc->idts, list) {
+				if (nfit_idt->idt->interleave_index != idt_idx)
+					continue;
+				nfit_mem->idt_dcr = nfit_idt->idt;
+				break;
+			}
 		} else {
 			/*
 			 * A single dimm may belong to multiple SPA-PM
@@ -871,6 +925,359 @@ static int acpi_nfit_init_interleave_set(struct acpi_nfit_desc *acpi_desc,
 	return 0;
 }
 
+static u64 to_interleave_offset(u64 offset, struct nfit_blk_mmio *mmio)
+{
+	struct acpi_nfit_interleave *idt = mmio->idt;
+	u32 sub_line_offset, line_index, line_offset;
+	u64 line_no, table_skip_count, table_offset;
+
+	line_no = div_u64_rem(offset, mmio->line_size, &sub_line_offset);
+	table_skip_count = div_u64_rem(line_no, mmio->num_lines, &line_index);
+	line_offset = idt->line_offset[line_index]
+		* mmio->line_size;
+	table_offset = table_skip_count * mmio->table_size;
+
+	return mmio->base_offset + line_offset + table_offset + sub_line_offset;
+}
+
+static u64 read_blk_stat(struct nfit_blk *nfit_blk, unsigned int bw)
+{
+	struct nfit_blk_mmio *mmio = &nfit_blk->mmio[DCR];
+	u64 offset = nfit_blk->stat_offset + mmio->size * bw;
+
+	if (mmio->num_lines)
+		offset = to_interleave_offset(offset, mmio);
+
+	return readq(mmio->base + offset);
+}
+
+static void write_blk_ctl(struct nfit_blk *nfit_blk, unsigned int bw,
+		resource_size_t dpa, unsigned int len, unsigned int write)
+{
+	u64 cmd, offset;
+	struct nfit_blk_mmio *mmio = &nfit_blk->mmio[DCR];
+
+	enum {
+		BCW_OFFSET_MASK = (1ULL << 48) - 1,
+		BCW_LEN_SHIFT = 48,
+		BCW_LEN_MASK = (1ULL << 8) - 1,
+		BCW_CMD_SHIFT = 56,
+	};
+
+	cmd = (dpa >> L1_CACHE_SHIFT) & BCW_OFFSET_MASK;
+	len = len >> L1_CACHE_SHIFT;
+	cmd |= ((u64) len & BCW_LEN_MASK) << BCW_LEN_SHIFT;
+	cmd |= ((u64) write) << BCW_CMD_SHIFT;
+
+	offset = nfit_blk->cmd_offset + mmio->size * bw;
+	if (mmio->num_lines)
+		offset = to_interleave_offset(offset, mmio);
+
+	writeq(cmd, mmio->base + offset);
+	/* FIXME: conditionally perform read-back if mandated by firmware */
+}
+
+static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
+		resource_size_t dpa, void *iobuf, size_t len, int rw,
+		unsigned int lane)
+{
+	struct nfit_blk_mmio *mmio = &nfit_blk->mmio[BDW];
+	unsigned int copied = 0;
+	u64 base_offset;
+	int rc;
+
+	base_offset = nfit_blk->bdw_offset + dpa % L1_CACHE_BYTES
+		+ lane * mmio->size;
+	/* TODO: non-temporal access, flush hints, cache management etc... */
+	write_blk_ctl(nfit_blk, lane, dpa, len, rw);
+	while (len) {
+		unsigned int c;
+		u64 offset;
+
+		if (mmio->num_lines) {
+			u32 line_offset;
+
+			offset = to_interleave_offset(base_offset + copied,
+					mmio);
+			div_u64_rem(offset, mmio->line_size, &line_offset);
+			c = min_t(size_t, len, mmio->line_size - line_offset);
+		} else {
+			offset = base_offset + nfit_blk->bdw_offset;
+			c = len;
+		}
+
+		if (rw)
+			memcpy(mmio->aperture + offset, iobuf + copied, c);
+		else
+			memcpy(iobuf + copied, mmio->aperture + offset, c);
+
+		copied += c;
+		len -= c;
+	}
+	rc = read_blk_stat(nfit_blk, lane) ? -EIO : 0;
+	return rc;
+}
+
+static int acpi_nfit_blk_region_do_io(struct nd_blk_region *ndbr,
+		resource_size_t dpa, void *iobuf, u64 len, int rw)
+{
+	struct nfit_blk *nfit_blk = nd_blk_region_provider_data(ndbr);
+	struct nfit_blk_mmio *mmio = &nfit_blk->mmio[BDW];
+	struct nd_region *nd_region = nfit_blk->nd_region;
+	unsigned int lane, copied = 0;
+	int rc = 0;
+
+	lane = nd_region_acquire_lane(nd_region);
+	while (len) {
+		u64 c = min(len, mmio->size);
+
+		rc = acpi_nfit_blk_single_io(nfit_blk, dpa + copied,
+				iobuf + copied, c, rw, lane);
+		if (rc)
+			break;
+
+		copied += c;
+		len -= c;
+	}
+	nd_region_release_lane(nd_region, lane);
+
+	return rc;
+}
+
+static void nfit_spa_mapping_release(struct kref *kref)
+{
+	struct nfit_spa_mapping *spa_map = to_spa_map(kref);
+	struct acpi_nfit_system_address *spa = spa_map->spa;
+	struct acpi_nfit_desc *acpi_desc = spa_map->acpi_desc;
+
+	WARN_ON(!mutex_is_locked(&acpi_desc->spa_map_mutex));
+	dev_dbg(acpi_desc->dev, "%s: SPA%d\n", __func__, spa->range_index);
+	iounmap(spa_map->iomem);
+	release_mem_region(spa->address, spa->length);
+	list_del(&spa_map->list);
+	kfree(spa_map);
+}
+
+static struct nfit_spa_mapping *find_spa_mapping(
+		struct acpi_nfit_desc *acpi_desc,
+		struct acpi_nfit_system_address *spa)
+{
+	struct nfit_spa_mapping *spa_map;
+
+	WARN_ON(!mutex_is_locked(&acpi_desc->spa_map_mutex));
+	list_for_each_entry(spa_map, &acpi_desc->spa_maps, list)
+		if (spa_map->spa == spa)
+			return spa_map;
+
+	return NULL;
+}
+
+static void nfit_spa_unmap(struct acpi_nfit_desc *acpi_desc,
+		struct acpi_nfit_system_address *spa)
+{
+	struct nfit_spa_mapping *spa_map;
+
+	mutex_lock(&acpi_desc->spa_map_mutex);
+	spa_map = find_spa_mapping(acpi_desc, spa);
+
+	if (spa_map)
+		kref_put(&spa_map->kref, nfit_spa_mapping_release);
+	mutex_unlock(&acpi_desc->spa_map_mutex);
+}
+
+static void __iomem *__nfit_spa_map(struct acpi_nfit_desc *acpi_desc,
+		struct acpi_nfit_system_address *spa)
+{
+	resource_size_t start = spa->address;
+	resource_size_t n = spa->length;
+	struct nfit_spa_mapping *spa_map;
+	struct resource *res;
+
+	WARN_ON(!mutex_is_locked(&acpi_desc->spa_map_mutex));
+
+	spa_map = find_spa_mapping(acpi_desc, spa);
+	if (spa_map) {
+		kref_get(&spa_map->kref);
+		return spa_map->iomem;
+	}
+
+	spa_map = kzalloc(sizeof(*spa_map), GFP_KERNEL);
+	if (!spa_map)
+		return NULL;
+
+	INIT_LIST_HEAD(&spa_map->list);
+	spa_map->spa = spa;
+	kref_init(&spa_map->kref);
+	spa_map->acpi_desc = acpi_desc;
+
+	res = request_mem_region(start, n, dev_name(acpi_desc->dev));
+	if (!res)
+		goto err_mem;
+
+	/* TODO: cacheability based on the spa type */
+	spa_map->iomem = ioremap_nocache(start, n);
+	if (!spa_map->iomem)
+		goto err_map;
+
+	list_add_tail(&spa_map->list, &acpi_desc->spa_maps);
+	return spa_map->iomem;
+
+ err_map:
+	release_mem_region(start, n);
+ err_mem:
+	kfree(spa_map);
+	return NULL;
+}
+
+/**
+ * nfit_spa_map - interleave-aware managed-mappings of acpi_nfit_system_address ranges
+ * @acpi_desc: NFIT-bus descriptor that provided the spa table entry
+ * @spa: spa table to map
+ *
+ * In the case where block-data-window apertures and
+ * dimm-control-regions are interleaved they will end up sharing a
+ * single request_mem_region() + ioremap() for the address range.  In
+ * the style of devm, nfit_spa_map() mappings are automatically dropped
+ * when all region devices referencing the same mapping are disabled /
+ * unbound.
+ */
+static void __iomem *nfit_spa_map(struct acpi_nfit_desc *acpi_desc,
+		struct acpi_nfit_system_address *spa)
+{
+	void __iomem *iomem;
+
+	mutex_lock(&acpi_desc->spa_map_mutex);
+	iomem = __nfit_spa_map(acpi_desc, spa);
+	mutex_unlock(&acpi_desc->spa_map_mutex);
+
+	return iomem;
+}
+
+static int nfit_blk_init_interleave(struct nfit_blk_mmio *mmio,
+		struct acpi_nfit_interleave *idt, u16 interleave_ways)
+{
+	if (idt) {
+		mmio->num_lines = idt->line_count;
+		mmio->line_size = idt->line_size;
+		if (interleave_ways == 0)
+			return -ENXIO;
+		mmio->table_size = mmio->num_lines * interleave_ways
+			* mmio->line_size;
+	}
+
+	return 0;
+}
+
+static int acpi_nfit_blk_region_enable(struct nvdimm_bus *nvdimm_bus,
+		struct device *dev)
+{
+	struct nvdimm_bus_descriptor *nd_desc = to_nd_desc(nvdimm_bus);
+	struct acpi_nfit_desc *acpi_desc = to_acpi_desc(nd_desc);
+	struct nd_blk_region *ndbr = to_nd_blk_region(dev);
+	struct nfit_blk_mmio *mmio;
+	struct nfit_blk *nfit_blk;
+	struct nfit_mem *nfit_mem;
+	struct nvdimm *nvdimm;
+	int rc;
+
+	nvdimm = nd_blk_region_to_dimm(ndbr);
+	nfit_mem = nvdimm_provider_data(nvdimm);
+	if (!nfit_mem || !nfit_mem->dcr || !nfit_mem->bdw) {
+		dev_dbg(dev, "%s: missing%s%s%s\n", __func__,
+				nfit_mem ? "" : " nfit_mem",
+				(nfit_mem && nfit_mem->dcr) ? "" : " dcr",
+				(nfit_mem && nfit_mem->bdw) ? "" : " bdw");
+		return -ENXIO;
+	}
+
+	nfit_blk = devm_kzalloc(dev, sizeof(*nfit_blk), GFP_KERNEL);
+	if (!nfit_blk)
+		return -ENOMEM;
+	nd_blk_region_set_provider_data(ndbr, nfit_blk);
+	nfit_blk->nd_region = to_nd_region(dev);
+
+	/* map block aperture memory */
+	nfit_blk->bdw_offset = nfit_mem->bdw->offset;
+	mmio = &nfit_blk->mmio[BDW];
+	mmio->base = nfit_spa_map(acpi_desc, nfit_mem->spa_bdw);
+	if (!mmio->base) {
+		dev_dbg(dev, "%s: %s failed to map bdw\n", __func__,
+				nvdimm_name(nvdimm));
+		return -ENOMEM;
+	}
+	mmio->size = nfit_mem->bdw->size;
+	mmio->base_offset = nfit_mem->memdev_bdw->region_offset;
+	mmio->idt = nfit_mem->idt_bdw;
+	mmio->spa = nfit_mem->spa_bdw;
+	rc = nfit_blk_init_interleave(mmio, nfit_mem->idt_bdw,
+			nfit_mem->memdev_bdw->interleave_ways);
+	if (rc) {
+		dev_dbg(dev, "%s: %s failed to init bdw interleave\n",
+				__func__, nvdimm_name(nvdimm));
+		return rc;
+	}
+
+	/* map block control memory */
+	nfit_blk->cmd_offset = nfit_mem->dcr->command_offset;
+	nfit_blk->stat_offset = nfit_mem->dcr->status_offset;
+	mmio = &nfit_blk->mmio[DCR];
+	mmio->base = nfit_spa_map(acpi_desc, nfit_mem->spa_dcr);
+	if (!mmio->base) {
+		dev_dbg(dev, "%s: %s failed to map dcr\n", __func__,
+				nvdimm_name(nvdimm));
+		return -ENOMEM;
+	}
+	mmio->size = nfit_mem->dcr->window_size;
+	mmio->base_offset = nfit_mem->memdev_dcr->region_offset;
+	mmio->idt = nfit_mem->idt_dcr;
+	mmio->spa = nfit_mem->spa_dcr;
+	rc = nfit_blk_init_interleave(mmio, nfit_mem->idt_dcr,
+			nfit_mem->memdev_dcr->interleave_ways);
+	if (rc) {
+		dev_dbg(dev, "%s: %s failed to init dcr interleave\n",
+				__func__, nvdimm_name(nvdimm));
+		return rc;
+	}
+
+	if (mmio->line_size == 0)
+		return 0;
+
+	if ((u32) nfit_blk->cmd_offset % mmio->line_size
+			+ 8 > mmio->line_size) {
+		dev_dbg(dev, "cmd_offset crosses interleave boundary\n");
+		return -ENXIO;
+	} else if ((u32) nfit_blk->stat_offset % mmio->line_size
+			+ 8 > mmio->line_size) {
+		dev_dbg(dev, "stat_offset crosses interleave boundary\n");
+		return -ENXIO;
+	}
+
+	return 0;
+}
+
+static void acpi_nfit_blk_region_disable(struct nvdimm_bus *nvdimm_bus,
+		struct device *dev)
+{
+	struct nvdimm_bus_descriptor *nd_desc = to_nd_desc(nvdimm_bus);
+	struct acpi_nfit_desc *acpi_desc = to_acpi_desc(nd_desc);
+	struct nd_blk_region *ndbr = to_nd_blk_region(dev);
+	struct nfit_blk *nfit_blk = nd_blk_region_provider_data(ndbr);
+	int i;
+
+	if (!nfit_blk)
+		return; /* never enabled */
+
+	/* auto-free BLK spa mappings */
+	for (i = 0; i < 2; i++) {
+		struct nfit_blk_mmio *mmio = &nfit_blk->mmio[i];
+
+		if (mmio->base)
+			nfit_spa_unmap(acpi_desc, mmio->spa);
+	}
+	nd_blk_region_set_provider_data(ndbr, NULL);
+	/* devm will free nfit_blk */
+}
+
 static int acpi_nfit_init_mapping(struct acpi_nfit_desc *acpi_desc,
 		struct nd_mapping *nd_mapping, struct nd_region_desc *ndr_desc,
 		struct acpi_nfit_memory_map *memdev,
@@ -878,6 +1285,7 @@ static int acpi_nfit_init_mapping(struct acpi_nfit_desc *acpi_desc,
 {
 	struct nvdimm *nvdimm = acpi_nfit_dimm_by_handle(acpi_desc,
 			memdev->device_handle);
+	struct nd_blk_region_desc *ndbr_desc;
 	struct nfit_mem *nfit_mem;
 	int blk_valid = 0;
 
@@ -908,6 +1316,10 @@ static int acpi_nfit_init_mapping(struct acpi_nfit_desc *acpi_desc,
 
 		ndr_desc->nd_mapping = nd_mapping;
 		ndr_desc->num_mappings = blk_valid;
+		ndbr_desc = to_blk_region_desc(ndr_desc);
+		ndbr_desc->enable = acpi_nfit_blk_region_enable;
+		ndbr_desc->disable = acpi_nfit_blk_region_disable;
+		ndbr_desc->do_io = acpi_nfit_blk_region_do_io;
 		if (!nvdimm_blk_region_create(acpi_desc->nvdimm_bus, ndr_desc))
 			return -ENOMEM;
 		break;
@@ -921,8 +1333,9 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
 {
 	static struct nd_mapping nd_mappings[ND_MAX_MAPPINGS];
 	struct acpi_nfit_system_address *spa = nfit_spa->spa;
+	struct nd_blk_region_desc ndbr_desc;
+	struct nd_region_desc *ndr_desc;
 	struct nfit_memdev *nfit_memdev;
-	struct nd_region_desc ndr_desc;
 	struct nvdimm_bus *nvdimm_bus;
 	struct resource res;
 	int count = 0, rc;
@@ -935,12 +1348,13 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
 
 	memset(&res, 0, sizeof(res));
 	memset(&nd_mappings, 0, sizeof(nd_mappings));
-	memset(&ndr_desc, 0, sizeof(ndr_desc));
+	memset(&ndbr_desc, 0, sizeof(ndbr_desc));
 	res.start = spa->address;
 	res.end = res.start + spa->length - 1;
-	ndr_desc.res = &res;
-	ndr_desc.provider_data = nfit_spa;
-	ndr_desc.attr_groups = acpi_nfit_region_attribute_groups;
+	ndr_desc = &ndbr_desc.ndr_desc;
+	ndr_desc->res = &res;
+	ndr_desc->provider_data = nfit_spa;
+	ndr_desc->attr_groups = acpi_nfit_region_attribute_groups;
 	list_for_each_entry(nfit_memdev, &acpi_desc->memdevs, list) {
 		struct acpi_nfit_memory_map *memdev = nfit_memdev->memdev;
 		struct nd_mapping *nd_mapping;
@@ -953,24 +1367,24 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
 			return -ENXIO;
 		}
 		nd_mapping = &nd_mappings[count++];
-		rc = acpi_nfit_init_mapping(acpi_desc, nd_mapping, &ndr_desc,
+		rc = acpi_nfit_init_mapping(acpi_desc, nd_mapping, ndr_desc,
 				memdev, spa);
 		if (rc)
 			return rc;
 	}
 
-	ndr_desc.nd_mapping = nd_mappings;
-	ndr_desc.num_mappings = count;
-	rc = acpi_nfit_init_interleave_set(acpi_desc, &ndr_desc, spa);
+	ndr_desc->nd_mapping = nd_mappings;
+	ndr_desc->num_mappings = count;
+	rc = acpi_nfit_init_interleave_set(acpi_desc, ndr_desc, spa);
 	if (rc)
 		return rc;
 
 	nvdimm_bus = acpi_desc->nvdimm_bus;
 	if (nfit_spa_type(spa) == NFIT_SPA_PM) {
-		if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc))
+		if (!nvdimm_pmem_region_create(nvdimm_bus, ndr_desc))
 			return -ENOMEM;
 	} else if (nfit_spa_type(spa) == NFIT_SPA_VOLATILE) {
-		if (!nvdimm_volatile_region_create(nvdimm_bus, &ndr_desc))
+		if (!nvdimm_volatile_region_create(nvdimm_bus, ndr_desc))
 			return -ENOMEM;
 	}
 	return 0;
@@ -996,11 +1410,14 @@ static int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
 	u8 *data;
 	int rc;
 
+	INIT_LIST_HEAD(&acpi_desc->spa_maps);
 	INIT_LIST_HEAD(&acpi_desc->spas);
 	INIT_LIST_HEAD(&acpi_desc->dcrs);
 	INIT_LIST_HEAD(&acpi_desc->bdws);
+	INIT_LIST_HEAD(&acpi_desc->idts);
 	INIT_LIST_HEAD(&acpi_desc->memdevs);
 	INIT_LIST_HEAD(&acpi_desc->dimms);
+	mutex_init(&acpi_desc->spa_map_mutex);
 
 	data = (u8 *) acpi_desc->nfit;
 	end = data + sz;
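
A note on the interleave arithmetic in to_interleave_offset() above: the
following is a minimal userspace sketch, not part of the patch, that
mirrors the same steps with plain division. The struct mmio_sketch type
and the sketch_interleave_offset() name are hypothetical stand-ins for
struct nfit_blk_mmio and the kernel helper; the table values in the usage
note are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical userspace stand-in for struct nfit_blk_mmio. */
struct mmio_sketch {
	uint64_t base_offset;		/* memdev region_offset */
	uint32_t line_size;		/* bytes per interleave line */
	uint32_t num_lines;		/* entries in line_offset[] */
	uint64_t table_size;		/* num_lines * interleave_ways * line_size */
	const uint32_t *line_offset;	/* per-line offsets, in units of line_size */
};

/* Translate a dimm-relative offset into an offset within the mapped SPA. */
static uint64_t sketch_interleave_offset(uint64_t offset,
		const struct mmio_sketch *m)
{
	uint64_t line_no = offset / m->line_size;
	uint32_t sub_line_offset = offset % m->line_size;
	uint64_t table_skip_count = line_no / m->num_lines;
	uint32_t line_index = line_no % m->num_lines;
	/* widen before multiplying so large tables do not wrap at 32 bits */
	uint64_t line_off = (uint64_t)m->line_offset[line_index] * m->line_size;

	return m->base_offset + line_off
		+ table_skip_count * m->table_size + sub_line_offset;
}
```

With line_size = 256, num_lines = 2, two interleave ways (table_size =
1024) and line_offset[] = { 0, 2 }, offset 300 falls in line 1 at byte 44
and maps to 2 * 256 + 44 = 556, while offset 522 wraps to the second pass
through the table and maps to 1024 + 10 = 1034.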
diff --git a/drivers/acpi/nfit.h b/drivers/acpi/nfit.h
index b76e33629098..7bd38b7baf39 100644
--- a/drivers/acpi/nfit.h
+++ b/drivers/acpi/nfit.h
@@ -52,6 +52,11 @@ struct nfit_bdw {
 	struct list_head list;
 };
 
+struct nfit_idt {
+	struct acpi_nfit_interleave *idt;
+	struct list_head list;
+};
+
 struct nfit_memdev {
 	struct acpi_nfit_memory_map *memdev;
 	struct list_head list;
@@ -62,10 +67,13 @@ struct nfit_mem {
 	struct nvdimm *nvdimm;
 	struct acpi_nfit_memory_map *memdev_dcr;
 	struct acpi_nfit_memory_map *memdev_pmem;
+	struct acpi_nfit_memory_map *memdev_bdw;
 	struct acpi_nfit_control_region *dcr;
 	struct acpi_nfit_data_region *bdw;
 	struct acpi_nfit_system_address *spa_dcr;
 	struct acpi_nfit_system_address *spa_bdw;
+	struct acpi_nfit_interleave *idt_dcr;
+	struct acpi_nfit_interleave *idt_bdw;
 	struct list_head list;
 	struct acpi_device *adev;
 	unsigned long dsm_mask;
@@ -74,16 +82,57 @@ struct nfit_mem {
 struct acpi_nfit_desc {
 	struct nvdimm_bus_descriptor nd_desc;
 	struct acpi_table_nfit *nfit;
+	struct mutex spa_map_mutex;
+	struct list_head spa_maps;
 	struct list_head memdevs;
 	struct list_head dimms;
 	struct list_head spas;
 	struct list_head dcrs;
 	struct list_head bdws;
+	struct list_head idts;
 	struct nvdimm_bus *nvdimm_bus;
 	struct device *dev;
 	unsigned long dimm_dsm_force_en;
 };
 
+enum nd_blk_mmio_selector {
+	BDW,
+	DCR,
+};
+
+struct nfit_blk {
+	struct nfit_blk_mmio {
+		union {
+			void __iomem *base;
+			void *aperture;
+		};
+		u64 size;
+		u64 base_offset;
+		u32 line_size;
+		u32 num_lines;
+		u32 table_size;
+		struct acpi_nfit_interleave *idt;
+		struct acpi_nfit_system_address *spa;
+	} mmio[2];
+	struct nd_region *nd_region;
+	u64 bdw_offset; /* post interleave offset */
+	u64 stat_offset;
+	u64 cmd_offset;
+};
+
+struct nfit_spa_mapping {
+	struct acpi_nfit_desc *acpi_desc;
+	struct acpi_nfit_system_address *spa;
+	struct list_head list;
+	struct kref kref;
+	void __iomem *iomem;
+};
+
+static inline struct nfit_spa_mapping *to_spa_map(struct kref *kref)
+{
+	return container_of(kref, struct nfit_spa_mapping, kref);
+}
+
 static inline struct acpi_nfit_memory_map *__to_nfit_memdev(
 		struct nfit_mem *nfit_mem)
 {
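
For reference, the command word that write_blk_ctl() in
drivers/acpi/nfit.c writes through the DCR mapping described by struct
nfit_blk can be sketched in userspace as below. This is an illustration
only: sketch_bcw_cmd() and SKETCH_CACHE_SHIFT are hypothetical names, and
the shift of 6 assumes 64-byte cache lines where the patch uses
L1_CACHE_SHIFT.

```c
#include <stdint.h>

/* Assumption: 64-byte cache lines; the patch uses L1_CACHE_SHIFT. */
#define SKETCH_CACHE_SHIFT 6

/* Pack a block control window command the way write_blk_ctl() does. */
static uint64_t sketch_bcw_cmd(uint64_t dpa, unsigned int len,
		unsigned int write)
{
	const uint64_t offset_mask = (1ULL << 48) - 1;	/* bits 47:0 */
	const uint64_t len_mask = (1ULL << 8) - 1;	/* bits 55:48 */
	uint64_t cmd;

	cmd = (dpa >> SKETCH_CACHE_SHIFT) & offset_mask;
	cmd |= ((uint64_t)(len >> SKETCH_CACHE_SHIFT) & len_mask) << 48;
	cmd |= (uint64_t)write << 56;			/* 1 = write, 0 = read */

	return cmd;
}
```

A 4096-byte write at dpa 0x1000 gives an offset field of 0x40, a length
field of 0x40 and the write bit set, i.e. 0x0140000000000040.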
diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index a9def3839655..912cb36b8435 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -33,6 +33,18 @@ config BLK_DEV_PMEM
 
 	  Say Y if you want to use an NVDIMM
 
+config ND_BLK
+	tristate "BLK: Block data window (aperture) device support"
+	default LIBNVDIMM
+	help
+	  Support NVDIMMs, or other devices, that implement a BLK-mode
+	  access capability.  BLK-mode access uses memory-mapped-i/o
+	  apertures to access persistent media.
+
+	  Say Y if your platform firmware emits an ACPI.NFIT table
+	  (CONFIG_ACPI_NFIT), or otherwise exposes BLK-mode
+	  capabilities.
+
 config ND_BTT_DEVS
 	bool
 
diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
index aa5bb1acf831..b5682e70904a 100644
--- a/drivers/nvdimm/Makefile
+++ b/drivers/nvdimm/Makefile
@@ -1,11 +1,14 @@
 obj-$(CONFIG_LIBNVDIMM) += libnvdimm.o
 obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
 obj-$(CONFIG_ND_BTT) += nd_btt.o
+obj-$(CONFIG_ND_BLK) += nd_blk.o
 
 nd_pmem-y := pmem.o
 
 nd_btt-y := btt.o
 
+nd_blk-y := blk.o
+
 libnvdimm-y := core.o
 libnvdimm-y += bus.o
 libnvdimm-y += dimm_devs.o
diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
new file mode 100644
index 000000000000..a2749b5e43d7
--- /dev/null
+++ b/drivers/nvdimm/blk.c
@@ -0,0 +1,241 @@
+/*
+ * NVDIMM Block Window Driver
+ * Copyright (c) 2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/blkdev.h>
+#include <linux/fs.h>
+#include <linux/genhd.h>
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/nd.h>
+#include <linux/sizes.h>
+#include "nd.h"
+
+struct nd_blk_device {
+	struct request_queue *queue;
+	struct gendisk *disk;
+	struct nd_namespace_blk *nsblk;
+	struct nd_blk_region *ndbr;
+	size_t disk_size;
+};
+
+static int nd_blk_major;
+
+static resource_size_t to_dev_offset(struct nd_namespace_blk *nsblk,
+				resource_size_t ns_offset, unsigned int len)
+{
+	int i;
+
+	for (i = 0; i < nsblk->num_resources; i++) {
+		if (ns_offset < resource_size(nsblk->res[i])) {
+			if (ns_offset + len > resource_size(nsblk->res[i])) {
+				dev_WARN_ONCE(&nsblk->dev, 1,
+					"%s: illegal request\n", __func__);
+				return SIZE_MAX;
+			}
+			return nsblk->res[i]->start + ns_offset;
+		}
+		ns_offset -= resource_size(nsblk->res[i]);
+	}
+
+	dev_WARN_ONCE(&nsblk->dev, 1, "%s: request out of range\n", __func__);
+	return SIZE_MAX;
+}
+
+static void nd_blk_make_request(struct request_queue *q, struct bio *bio)
+{
+	struct block_device *bdev = bio->bi_bdev;
+	struct gendisk *disk = bdev->bd_disk;
+	struct nd_namespace_blk *nsblk;
+	struct nd_blk_device *blk_dev;
+	struct nd_blk_region *ndbr;
+	struct bvec_iter iter;
+	struct bio_vec bvec;
+	int err = 0, rw;
+	sector_t sector;
+
+	sector = bio->bi_iter.bi_sector;
+	if (bio_end_sector(bio) > get_capacity(disk)) {
+		err = -EIO;
+		goto out;
+	}
+
+	BUG_ON(bio->bi_rw & REQ_DISCARD);
+
+	rw = bio_data_dir(bio);
+
+	blk_dev = disk->private_data;
+	nsblk = blk_dev->nsblk;
+	ndbr = blk_dev->ndbr;
+	bio_for_each_segment(bvec, bio, iter) {
+		unsigned int len = bvec.bv_len;
+		resource_size_t	dev_offset;
+		void *iobuf;
+
+		BUG_ON(len > PAGE_SIZE);
+
+		dev_offset = to_dev_offset(nsblk, sector << SECTOR_SHIFT, len);
+		if (dev_offset == SIZE_MAX) {
+			err = -EIO;
+			goto out;
+		}
+
+		iobuf = kmap_atomic(bvec.bv_page);
+		err = ndbr->do_io(ndbr, dev_offset, iobuf + bvec.bv_offset,
+				len, rw);
+		kunmap_atomic(iobuf);
+		if (err)
+			goto out;
+
+		sector += len >> SECTOR_SHIFT;
+	}
+
+ out:
+	bio_endio(bio, err);
+}
+
+static int nd_blk_rw_bytes(struct gendisk *disk, resource_size_t offset,
+		void *iobuf, size_t n, int rw)
+{
+	struct nd_blk_device *blk_dev = disk->private_data;
+	struct nd_namespace_blk *nsblk = blk_dev->nsblk;
+	struct nd_blk_region *ndbr = blk_dev->ndbr;
+	resource_size_t	dev_offset;
+
+	if (unlikely(offset + n > blk_dev->disk_size)) {
+		dev_WARN_ONCE(disk_to_dev(disk), 1, "request out of range\n");
+		return -EFAULT;
+	}
+
+	dev_offset = to_dev_offset(nsblk, offset, n);
+
+	if (dev_offset == SIZE_MAX)
+		return -EIO;
+
+	return ndbr->do_io(ndbr, dev_offset, iobuf, n, rw);
+}
+
+static const struct block_device_operations nd_blk_fops = {
+	.owner = THIS_MODULE,
+	.rw_bytes = nd_blk_rw_bytes,
+};
+
+static int nd_blk_probe(struct device *dev)
+{
+	struct nd_namespace_blk *nsblk = to_nd_namespace_blk(dev);
+	struct nd_region *nd_region = to_nd_region(dev->parent);
+	struct nd_blk_device *blk_dev;
+	resource_size_t disk_size;
+	struct gendisk *disk;
+	int err;
+
+	disk_size = nd_namespace_blk_validate(nsblk);
+	if (disk_size < ND_MIN_NAMESPACE_SIZE)
+		return -ENXIO;
+
+	blk_dev = kzalloc(sizeof(*blk_dev), GFP_KERNEL);
+	if (!blk_dev)
+		return -ENOMEM;
+
+	blk_dev->disk_size	= disk_size;
+
+	blk_dev->queue = blk_alloc_queue(GFP_KERNEL);
+	if (!blk_dev->queue) {
+		err = -ENOMEM;
+		goto err_alloc_queue;
+	}
+
+	blk_queue_make_request(blk_dev->queue, nd_blk_make_request);
+	blk_queue_max_hw_sectors(blk_dev->queue, 1024);
+	blk_queue_bounce_limit(blk_dev->queue, BLK_BOUNCE_ANY);
+	blk_queue_logical_block_size(blk_dev->queue, nsblk->lbasize);
+
+	disk = blk_dev->disk = alloc_disk(0);
+	if (!disk) {
+		err = -ENOMEM;
+		goto err_alloc_disk;
+	}
+
+	blk_dev->ndbr = to_nd_blk_region(nsblk->dev.parent);
+	blk_dev->nsblk = nsblk;
+
+	disk->driverfs_dev	= dev;
+	disk->major		= nd_blk_major;
+	disk->first_minor	= 0;
+	disk->fops		= &nd_blk_fops;
+	disk->private_data	= blk_dev;
+	disk->queue		= blk_dev->queue;
+	disk->flags		= GENHD_FL_EXT_DEVT;
+	sprintf(disk->disk_name, "ndblk%d.%d", nd_region->id, nsblk->id);
+	set_capacity(disk, disk_size >> SECTOR_SHIFT);
+
+	dev_set_drvdata(dev, blk_dev);
+	nvdimm_bus_add_disk(disk);
+
+	return 0;
+
+ err_alloc_disk:
+	blk_cleanup_queue(blk_dev->queue);
+ err_alloc_queue:
+	kfree(blk_dev);
+	return err;
+}
+
+static int nd_blk_remove(struct device *dev)
+{
+	struct nd_blk_device *blk_dev = dev_get_drvdata(dev);
+
+	nvdimm_bus_remove_disk(blk_dev->disk);
+	blk_cleanup_queue(blk_dev->queue);
+	kfree(blk_dev);
+
+	return 0;
+}
+
+static struct nd_device_driver nd_blk_driver = {
+	.probe = nd_blk_probe,
+	.remove = nd_blk_remove,
+	.drv = {
+		.name = "nd_blk",
+	},
+	.type = ND_DRIVER_NAMESPACE_BLK,
+};
+
+static int __init nd_blk_init(void)
+{
+	int rc;
+
+	rc = register_blkdev(0, "nd_blk");
+	if (rc < 0)
+		return rc;
+
+	nd_blk_major = rc;
+	rc = nd_driver_register(&nd_blk_driver);
+
+	if (rc < 0)
+		unregister_blkdev(nd_blk_major, "nd_blk");
+
+	return rc;
+}
+
+static void __exit nd_blk_exit(void)
+{
+	driver_unregister(&nd_blk_driver.drv);
+	unregister_blkdev(nd_blk_major, "nd_blk");
+}
+
+MODULE_AUTHOR("Ross Zwisler <ross.zwisler@linux.intel.com>");
+MODULE_LICENSE("GPL v2");
+MODULE_ALIAS_ND_DEVICE(ND_DEVICE_NAMESPACE_BLK);
+module_init(nd_blk_init);
+module_exit(nd_blk_exit);
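
The namespace-offset to DPA translation done by to_dev_offset() in blk.c
can be sketched in userspace as follows. This is a hedged illustration,
not part of the patch: struct res_sketch, SKETCH_BAD_OFFSET and
sketch_to_dev_offset() are hypothetical names standing in for struct
resource, SIZE_MAX and the kernel helper.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one DPA extent (struct resource in the patch). */
struct res_sketch {
	uint64_t start;
	uint64_t size;
};

#define SKETCH_BAD_OFFSET UINT64_MAX	/* plays the role of SIZE_MAX */

/* Walk the extents in order until the one containing ns_offset is found. */
static uint64_t sketch_to_dev_offset(const struct res_sketch *res, size_t num,
		uint64_t ns_offset, unsigned int len)
{
	size_t i;

	for (i = 0; i < num; i++) {
		if (ns_offset < res[i].size) {
			/* a single request may not straddle two extents */
			if (ns_offset + len > res[i].size)
				return SKETCH_BAD_OFFSET;
			return res[i].start + ns_offset;
		}
		ns_offset -= res[i].size;
	}

	return SKETCH_BAD_OFFSET;	/* beyond the end of the namespace */
}
```

With extents { 0x1000, 0x2000 } and { 0x8000, 0x1000 }, namespace offset
0x2000 is the first byte of the second extent and maps to DPA 0x8000; a
512-byte request at 0x1f00 crosses the extent boundary and is rejected.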
diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index 83b179ed6d61..c05eb807d674 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -209,6 +209,15 @@ struct nvdimm *to_nvdimm(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(to_nvdimm);
 
+struct nvdimm *nd_blk_region_to_dimm(struct nd_blk_region *ndbr)
+{
+	struct nd_region *nd_region = &ndbr->nd_region;
+	struct nd_mapping *nd_mapping = &nd_region->mapping[0];
+
+	return nd_mapping->nvdimm;
+}
+EXPORT_SYMBOL_GPL(nd_blk_region_to_dimm);
+
 struct nvdimm_drvdata *to_ndd(struct nd_mapping *nd_mapping)
 {
 	struct nvdimm *nvdimm = nd_mapping->nvdimm;
diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index 50b502b1908e..68780d768e7b 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -149,6 +149,66 @@ static resource_size_t nd_namespace_blk_size(struct nd_namespace_blk *nsblk)
 	return size;
 }
 
+static resource_size_t __nd_namespace_blk_validate(
+		struct nd_namespace_blk *nsblk)
+{
+	struct nd_region *nd_region = to_nd_region(nsblk->dev.parent);
+	struct nd_mapping *nd_mapping = &nd_region->mapping[0];
+	struct nvdimm_drvdata *ndd = to_ndd(nd_mapping);
+	struct nd_label_id label_id;
+	struct resource *res;
+	int count, i;
+
+	if (!nsblk->uuid || !nsblk->lbasize || !ndd)
+		return 0;
+
+	count = 0;
+	nd_label_gen_id(&label_id, nsblk->uuid, NSLABEL_FLAG_LOCAL);
+	for_each_dpa_resource(ndd, res) {
+		if (strcmp(res->name, label_id.id) != 0)
+			continue;
+		/*
+		 * Resources with unacknowledged adjustments indicate a
+		 * failure to update labels
+		 */
+		if (res->flags & DPA_RESOURCE_ADJUSTED)
+			return 0;
+		count++;
+	}
+
+	/* These values match after a successful label update */
+	if (count != nsblk->num_resources)
+		return 0;
+
+	for (i = 0; i < nsblk->num_resources; i++) {
+		struct resource *found = NULL;
+
+		for_each_dpa_resource(ndd, res)
+			if (res == nsblk->res[i]) {
+				found = res;
+				break;
+			}
+		/* stale resource */
+		if (!found)
+			return 0;
+	}
+
+	return nd_namespace_blk_size(nsblk);
+}
+
+resource_size_t nd_namespace_blk_validate(struct nd_namespace_blk *nsblk)
+{
+	resource_size_t size;
+
+	nvdimm_bus_lock(&nsblk->dev);
+	size = __nd_namespace_blk_validate(nsblk);
+	nvdimm_bus_unlock(&nsblk->dev);
+
+	return size;
+}
+EXPORT_SYMBOL(nd_namespace_blk_validate);
+
+
 static int nd_namespace_label_update(struct nd_region *nd_region,
 		struct device *dev)
 {
diff --git a/drivers/nvdimm/nd-core.h b/drivers/nvdimm/nd-core.h
index 1375d30b3da5..9a90915e6fd2 100644
--- a/drivers/nvdimm/nd-core.h
+++ b/drivers/nvdimm/nd-core.h
@@ -23,10 +23,7 @@ extern struct list_head nvdimm_bus_list;
 extern struct mutex nvdimm_bus_list_mutex;
 extern int nvdimm_major;
 
-struct block_device;
-struct nd_io_claim;
 struct nd_btt;
-struct nd_io;
 
 struct nvdimm_bus {
 	struct nvdimm_bus_descriptor *nd_desc;
@@ -49,8 +46,8 @@ struct nvdimm {
 };
 
 bool is_nvdimm(struct device *dev);
-bool is_nd_blk(struct device *dev);
 bool is_nd_pmem(struct device *dev);
+bool is_nd_blk(struct device *dev);
 struct gendisk;
 #if IS_ENABLED(CONFIG_ND_BTT_DEVS)
 bool is_nd_btt(struct device *dev);
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index b876b839f49a..2b7746e798fb 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -96,6 +96,15 @@ struct nd_region {
 	struct nd_mapping mapping[0];
 };
 
+struct nd_blk_region {
+	int (*enable)(struct nvdimm_bus *nvdimm_bus, struct device *dev);
+	void (*disable)(struct nvdimm_bus *nvdimm_bus, struct device *dev);
+	int (*do_io)(struct nd_blk_region *ndbr, resource_size_t dpa,
+			void *iobuf, u64 len, int rw);
+	void *blk_provider_data;
+	struct nd_region nd_region;
+};
+
 /*
  * Lookup next in the repeating sequence of 01, 10, and 11.
  */
@@ -142,8 +151,6 @@ struct nd_btt *to_nd_btt(struct device *dev);
 struct btt_sb;
 u64 nd_btt_sb_checksum(struct btt_sb *btt_sb);
 struct nd_region *to_nd_region(struct device *dev);
-unsigned int nd_region_acquire_lane(struct nd_region *nd_region);
-void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane);
 int nd_region_to_nstype(struct nd_region *nd_region);
 int nd_region_register_namespaces(struct nd_region *nd_region, int *err);
 u64 nd_region_interleave_set_cookie(struct nd_region *nd_region);
@@ -159,4 +166,6 @@ void nvdimm_free_dpa(struct nvdimm_drvdata *ndd, struct resource *res);
 struct resource *nvdimm_allocate_dpa(struct nvdimm_drvdata *ndd,
 		struct nd_label_id *label_id, resource_size_t start,
 		resource_size_t n);
+int nd_blk_region_init(struct nd_region *nd_region);
+resource_size_t nd_namespace_blk_validate(struct nd_namespace_blk *nsblk);
 #endif /* __ND_H__ */
diff --git a/drivers/nvdimm/region.c b/drivers/nvdimm/region.c
index aa617bf86506..d9d82e7a90fa 100644
--- a/drivers/nvdimm/region.c
+++ b/drivers/nvdimm/region.c
@@ -94,11 +94,10 @@ EXPORT_SYMBOL(nd_region_release_lane);
 
 static int nd_region_probe(struct device *dev)
 {
-	int err;
+	int err, rc;
 	static unsigned long once;
 	struct nd_region_namespaces *num_ns;
 	struct nd_region *nd_region = to_nd_region(dev);
-	int rc = nd_region_register_namespaces(nd_region, &err);
 
 	if (nd_region->num_lanes > num_online_cpus()
 			&& nd_region->num_lanes < num_possible_cpus()
@@ -110,6 +109,11 @@ static int nd_region_probe(struct device *dev)
 				nd_region->num_lanes);
 	}
 
+	rc = nd_blk_region_init(nd_region);
+	if (rc)
+		return rc;
+
+	rc = nd_region_register_namespaces(nd_region, &err);
 	num_ns = devm_kzalloc(dev, sizeof(*num_ns), GFP_KERNEL);
 	if (!num_ns)
 		return -ENOMEM;
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index cfa68d6590d6..b16ec20dbba2 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -11,6 +11,7 @@
  * General Public License for more details.
  */
 #include <linux/scatterlist.h>
+#include <linux/highmem.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
 #include <linux/sort.h>
@@ -33,7 +34,10 @@ static void nd_region_release(struct device *dev)
 		put_device(&nvdimm->dev);
 	}
 	ida_simple_remove(&region_ida, nd_region->id);
-	kfree(nd_region);
+	if (is_nd_blk(dev))
+		kfree(to_nd_blk_region(dev));
+	else
+		kfree(nd_region);
 }
 
 static struct device_type nd_blk_device_type = {
@@ -70,6 +74,33 @@ struct nd_region *to_nd_region(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(to_nd_region);
 
+struct nd_blk_region *to_nd_blk_region(struct device *dev)
+{
+	struct nd_region *nd_region = to_nd_region(dev);
+
+	WARN_ON(!is_nd_blk(dev));
+	return container_of(nd_region, struct nd_blk_region, nd_region);
+}
+EXPORT_SYMBOL_GPL(to_nd_blk_region);
+
+void *nd_region_provider_data(struct nd_region *nd_region)
+{
+	return nd_region->provider_data;
+}
+EXPORT_SYMBOL_GPL(nd_region_provider_data);
+
+void *nd_blk_region_provider_data(struct nd_blk_region *ndbr)
+{
+	return ndbr->blk_provider_data;
+}
+EXPORT_SYMBOL_GPL(nd_blk_region_provider_data);
+
+void nd_blk_region_set_provider_data(struct nd_blk_region *ndbr, void *data)
+{
+	ndbr->blk_provider_data = data;
+}
+EXPORT_SYMBOL_GPL(nd_blk_region_set_provider_data);
+
 /**
  * nd_region_to_nstype() - region to an integer namespace type
  * @nd_region: region-device to interrogate
@@ -345,7 +376,9 @@ u64 nd_region_interleave_set_cookie(struct nd_region *nd_region)
 
 /*
  * Upon successful probe/remove, take/release a reference on the
- * associated interleave set (if present)
+ * associated dimms in the interleave set; on successful probe of a BLK
+ * namespace, check if we need a new seed; and on remove of a BLK region,
+ * notify the provider to disable the region.
  */
 static void nd_region_notify_driver_action(struct nvdimm_bus *nvdimm_bus,
 		struct device *dev, bool probe)
@@ -365,6 +398,11 @@ static void nd_region_notify_driver_action(struct nvdimm_bus *nvdimm_bus,
 			nd_mapping->ndd = NULL;
 			atomic_dec(&nvdimm->busy);
 		}
+
+		if (is_nd_pmem(dev))
+			return;
+
+		to_nd_blk_region(dev)->disable(nvdimm_bus, dev);
 	} else if (dev->parent && is_nd_blk(dev->parent) && probe) {
 		struct nd_region *nd_region = to_nd_region(dev->parent);
 
@@ -497,11 +535,21 @@ struct attribute_group nd_mapping_attribute_group = {
 };
 EXPORT_SYMBOL_GPL(nd_mapping_attribute_group);
 
-void *nd_region_provider_data(struct nd_region *nd_region)
+int nd_blk_region_init(struct nd_region *nd_region)
 {
-	return nd_region->provider_data;
+	struct device *dev = &nd_region->dev;
+	struct nvdimm_bus *nvdimm_bus = walk_to_nvdimm_bus(dev);
+
+	if (!is_nd_blk(dev))
+		return 0;
+
+	if (nd_region->ndr_mappings < 1) {
+		dev_err(dev, "invalid BLK region\n");
+		return -ENXIO;
+	}
+
+	return to_nd_blk_region(dev)->enable(nvdimm_bus, dev);
 }
-EXPORT_SYMBOL_GPL(nd_region_provider_data);
 
 static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
 		struct nd_region_desc *ndr_desc, struct device_type *dev_type,
@@ -523,9 +571,28 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
 		}
 	}
 
-	nd_region = kzalloc(sizeof(struct nd_region)
-			+ sizeof(struct nd_mapping) * ndr_desc->num_mappings,
-			GFP_KERNEL);
+	if (dev_type == &nd_blk_device_type) {
+		struct nd_blk_region_desc *ndbr_desc;
+		struct nd_blk_region *ndbr;
+
+		ndbr_desc = to_blk_region_desc(ndr_desc);
+		ndbr = kzalloc(sizeof(*ndbr) + sizeof(struct nd_mapping)
+				* ndr_desc->num_mappings,
+				GFP_KERNEL);
+		if (ndbr) {
+			nd_region = &ndbr->nd_region;
+			ndbr->enable = ndbr_desc->enable;
+			ndbr->disable = ndbr_desc->disable;
+			ndbr->do_io = ndbr_desc->do_io;
+		} else
+			nd_region = NULL;
+	} else {
+		nd_region = kzalloc(sizeof(struct nd_region)
+				+ sizeof(struct nd_mapping)
+				* ndr_desc->num_mappings,
+				GFP_KERNEL);
+	}
+
 	if (!nd_region)
 		return NULL;
 	nd_region->id = ida_simple_get(&region_ida, 0, 0, GFP_KERNEL);
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index 531d99dfac68..7fc1b25bdb5d 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -14,6 +14,7 @@
  */
 #ifndef __LIBNVDIMM_H__
 #define __LIBNVDIMM_H__
+#include <linux/kernel.h>
 #include <linux/sizes.h>
 #include <linux/types.h>
 
@@ -89,8 +90,24 @@ struct nd_region_desc {
 };
 
 struct nvdimm_bus;
-struct device;
 struct module;
+struct device;
+struct nd_blk_region;
+struct nd_blk_region_desc {
+	int (*enable)(struct nvdimm_bus *nvdimm_bus, struct device *dev);
+	void (*disable)(struct nvdimm_bus *nvdimm_bus, struct device *dev);
+	int (*do_io)(struct nd_blk_region *ndbr, resource_size_t dpa,
+			void *iobuf, u64 len, int rw);
+	struct nd_region_desc ndr_desc;
+};
+
+static inline struct nd_blk_region_desc *to_blk_region_desc(
+		struct nd_region_desc *ndr_desc)
+{
+	return container_of(ndr_desc, struct nd_blk_region_desc, ndr_desc);
+
+}
+
 struct nvdimm_bus *__nvdimm_bus_register(struct device *parent,
 		struct nvdimm_bus_descriptor *nfit_desc, struct module *module);
 #define nvdimm_bus_register(parent, desc) \
@@ -99,10 +116,10 @@ void nvdimm_bus_unregister(struct nvdimm_bus *nvdimm_bus);
 struct nvdimm_bus *to_nvdimm_bus(struct device *dev);
 struct nvdimm *to_nvdimm(struct device *dev);
 struct nd_region *to_nd_region(struct device *dev);
+struct nd_blk_region *to_nd_blk_region(struct device *dev);
 struct nvdimm_bus_descriptor *to_nd_desc(struct nvdimm_bus *nvdimm_bus);
 const char *nvdimm_name(struct nvdimm *nvdimm);
 void *nvdimm_provider_data(struct nvdimm *nvdimm);
-void *nd_region_provider_data(struct nd_region *nd_region);
 struct nvdimm *nvdimm_create(struct nvdimm_bus *nvdimm_bus, void *provider_data,
 		const struct attribute_group **groups, unsigned long flags,
 		unsigned long *dsm_mask);
@@ -120,5 +137,11 @@ struct nd_region *nvdimm_blk_region_create(struct nvdimm_bus *nvdimm_bus,
 		struct nd_region_desc *ndr_desc);
 struct nd_region *nvdimm_volatile_region_create(struct nvdimm_bus *nvdimm_bus,
 		struct nd_region_desc *ndr_desc);
+void *nd_region_provider_data(struct nd_region *nd_region);
+void *nd_blk_region_provider_data(struct nd_blk_region *ndbr);
+void nd_blk_region_set_provider_data(struct nd_blk_region *ndbr, void *data);
+struct nvdimm *nd_blk_region_to_dimm(struct nd_blk_region *ndbr);
+unsigned int nd_region_acquire_lane(struct nd_region *nd_region);
+void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane);
 u64 nd_fletcher64(void *addr, size_t len, bool le);
 #endif /* __LIBNVDIMM_H__ */



* [PATCH 05/15] tools/testing/nvdimm: libnvdimm unit test infrastructure
  2015-06-17 23:54 ` Dan Williams
@ 2015-06-17 23:55   ` Dan Williams
  -1 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-17 23:55 UTC (permalink / raw)
  To: axboe, linux-nvdimm
  Cc: mingo, boaz, toshi.kani, Rafael J. Wysocki, Robert Moore,
	linux-kernel, linux-acpi, Lv Zheng, linux-fsdevel, hch

'libnvdimm' is the first driver sub-system in the kernel to implement
mocking for unit test coverage.  The nfit_test module gets built as an
external module and arranges for external module replacements of nfit,
libnvdimm, nd_pmem, and nd_blk.  These replacements use the linker
--wrap option to redirect calls to ioremap() + request_mem_region() to
custom defined unit test resources.  The end result is a fully
functional nvdimm_bus, as far as userspace is concerned, but with the
capability to perform otherwise destructive tests on emulated resources.
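
As a rough userspace illustration of the request_mem_region() half of
that redirection, the sketch below claims and releases ranges against a
single emulated resource.  The names (struct range, mock_request_region,
mock_release_region) are hypothetical and the logic is simplified; it is
not the code in test/iomap.c, which falls back to the genuine routine
for addresses that are not test resources.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct resource. */
struct range {
	uint64_t start;
	uint64_t end;	/* inclusive */
};

/* The emulated resource, plus one slot that records the active claim. */
static const struct range test_range = { 0x1000, 0x1fff };
static struct range claim;
static int claimed;

/* Claim [start, start + n) if it fits inside the emulated resource. */
static struct range *mock_request_region(uint64_t start, uint64_t n)
{
	assert(n > 0);
	if (start < test_range.start || start > test_range.end)
		return NULL;	/* not a test address; a real mock would fall back */
	if (n - 1 > test_range.end - start)
		return NULL;	/* request runs past the end of the resource */
	claim.start = start;
	claim.end = start + n - 1;
	claimed = 1;
	return &claim;
}

/* Drop the claim only when the caller names exactly what it requested. */
static int mock_release_region(uint64_t start, uint64_t n)
{
	if (!claimed || start != claim.start
			|| claim.end - claim.start + 1 != n)
		return -1;
	memset(&claim, 0, sizeof(claim));
	claimed = 0;
	return 0;
}
```

Because the claim is only bookkeeping against test memory, a test can
request, mis-release, and re-request the range without touching any
real platform resource.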

Q: Why not use QEMU for this emulation?
QEMU is not suitable for unit testing.  QEMU's role is to faithfully
emulate the platform.  A unit test's role is to unfaithfully implement
the platform with the goal of triggering bugs in the corners of the
sub-system implementation.  As bugs are discovered in platforms, or the
sub-system itself, the unit tests are extended to backstop a fix with a
reproducer unit test.

Another problem with QEMU is that it would require coordination of 3
software projects instead of 2 (kernel + libndctl [1]) to maintain and
execute the tests.  The chances for bit rot and the difficulty of
getting the tests running go up non-linearly as more components are
involved.


Q: Why submit this to the kernel tree instead of external modules in
   libndctl?
Simple, to alleviate the same risk that out-of-tree external modules
face.  Updates to drivers/nvdimm/ can be immediately evaluated to see if
they have any impact on tools/testing/nvdimm/.


Q: What are the negative implications of merging this?
It is a unique maintenance burden because mocking an interface for a
unit test purposefully short-circuits the semantics of the routine
being mocked.  For example
__wrap_ioremap_cache() fakes the pmem driver into "ioremap()'ing" a test
resource buffer allocated by dma_alloc_coherent().  The future
maintenance burden hits when someone changes the semantics of
ioremap_cache() and wonders what the implications are for the unit test.
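
The short circuit can be sketched in plain userspace C, assuming a
single test range and hypothetical names (struct test_resource, lookup,
real_map, mock_map).  It only mirrors the shape of the mock: addresses
inside a registered test range resolve into an ordinary buffer, and
everything else is handed to the real routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for a registered test resource. */
struct test_resource {
	uintptr_t start;	/* first fake "physical" address */
	size_t size;		/* length of the range in bytes */
	unsigned char *buf;	/* ordinary memory backing the range */
};

static unsigned char backing[64];
static struct test_resource test_res = { 0x1000, sizeof(backing), backing };

/* Return the test resource that contains addr, or NULL. */
static struct test_resource *lookup(uintptr_t addr)
{
	if (addr >= test_res.start && addr - test_res.start < test_res.size)
		return &test_res;
	return NULL;
}

/* Stand-in for the real mapping routine; it maps nothing here. */
static void *real_map(uintptr_t addr, size_t size)
{
	(void)addr;
	(void)size;
	return NULL;
}

/* Mocked mapping: test ranges resolve into the buffer, all else falls back. */
static void *mock_map(uintptr_t addr, size_t size,
		void *(*fallback)(uintptr_t, size_t))
{
	struct test_resource *res = lookup(addr);

	assert(fallback != NULL);
	if (res)
		return res->buf + (addr - res->start);
	return fallback(addr, size);
}
```

Note that the mock never consults the real routine for test addresses,
which is exactly why a change to the real routine's semantics can go
unnoticed by the unit test.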

[1]: https://github.com/pmem/ndctl

Cc: <linux-acpi@vger.kernel.org>
Cc: Lv Zheng <lv.zheng@intel.com>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/acpi/nfit.c                   |   12 
 drivers/acpi/nfit.h                   |    6 
 tools/testing/nvdimm/Kbuild           |   40 +
 tools/testing/nvdimm/Makefile         |    7 
 tools/testing/nvdimm/config_check.c   |   15 
 tools/testing/nvdimm/test/Kbuild      |    8 
 tools/testing/nvdimm/test/iomap.c     |  151 ++++
 tools/testing/nvdimm/test/nfit.c      | 1112 +++++++++++++++++++++++++++++++++
 tools/testing/nvdimm/test/nfit_test.h |   29 +
 9 files changed, 1376 insertions(+), 4 deletions(-)
 create mode 100644 tools/testing/nvdimm/Kbuild
 create mode 100644 tools/testing/nvdimm/Makefile
 create mode 100644 tools/testing/nvdimm/config_check.c
 create mode 100644 tools/testing/nvdimm/test/Kbuild
 create mode 100644 tools/testing/nvdimm/test/iomap.c
 create mode 100644 tools/testing/nvdimm/test/nfit.c
 create mode 100644 tools/testing/nvdimm/test/nfit_test.h

diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index 3a77709fd394..9363e4b0e6a7 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -29,10 +29,11 @@ MODULE_PARM_DESC(force_enable_dimms, "Ignore _STA (ACPI DIMM device) status");
 
 static u8 nfit_uuid[NFIT_UUID_MAX][16];
 
-static const u8 *to_nfit_uuid(enum nfit_uuids id)
+const u8 *to_nfit_uuid(enum nfit_uuids id)
 {
 	return nfit_uuid[id];
 }
+EXPORT_SYMBOL(to_nfit_uuid);
 
 static struct acpi_nfit_desc *to_acpi_nfit_desc(
 		struct nvdimm_bus_descriptor *nd_desc)
@@ -577,11 +578,12 @@ static struct attribute_group acpi_nfit_attribute_group = {
 	.attrs = acpi_nfit_attributes,
 };
 
-static const struct attribute_group *acpi_nfit_attribute_groups[] = {
+const struct attribute_group *acpi_nfit_attribute_groups[] = {
 	&nvdimm_bus_attribute_group,
 	&acpi_nfit_attribute_group,
 	NULL,
 };
+EXPORT_SYMBOL_GPL(acpi_nfit_attribute_groups);
 
 static struct acpi_nfit_memory_map *to_nfit_memdev(struct device *dev)
 {
@@ -1319,7 +1321,7 @@ static int acpi_nfit_init_mapping(struct acpi_nfit_desc *acpi_desc,
 		ndbr_desc = to_blk_region_desc(ndr_desc);
 		ndbr_desc->enable = acpi_nfit_blk_region_enable;
 		ndbr_desc->disable = acpi_nfit_blk_region_disable;
-		ndbr_desc->do_io = acpi_nfit_blk_region_do_io;
+		ndbr_desc->do_io = acpi_desc->blk_do_io;
 		if (!nvdimm_blk_region_create(acpi_desc->nvdimm_bus, ndr_desc))
 			return -ENOMEM;
 		break;
@@ -1403,7 +1405,7 @@ static int acpi_nfit_register_regions(struct acpi_nfit_desc *acpi_desc)
 	return 0;
 }
 
-static int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
+int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
 {
 	struct device *dev = acpi_desc->dev;
 	const void *end;
@@ -1442,6 +1444,7 @@ static int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
 
 	return acpi_nfit_register_regions(acpi_desc);
 }
+EXPORT_SYMBOL_GPL(acpi_nfit_init);
 
 static int acpi_nfit_add(struct acpi_device *adev)
 {
@@ -1466,6 +1469,7 @@ static int acpi_nfit_add(struct acpi_device *adev)
 	dev_set_drvdata(dev, acpi_desc);
 	acpi_desc->dev = dev;
 	acpi_desc->nfit = (struct acpi_table_nfit *) tbl;
+	acpi_desc->blk_do_io = acpi_nfit_blk_region_do_io;
 	nd_desc = &acpi_desc->nd_desc;
 	nd_desc->provider_name = "ACPI.NFIT";
 	nd_desc->ndctl = acpi_nfit_ctl;
diff --git a/drivers/acpi/nfit.h b/drivers/acpi/nfit.h
index 7bd38b7baf39..c62fffea8423 100644
--- a/drivers/acpi/nfit.h
+++ b/drivers/acpi/nfit.h
@@ -93,6 +93,8 @@ struct acpi_nfit_desc {
 	struct nvdimm_bus *nvdimm_bus;
 	struct device *dev;
 	unsigned long dimm_dsm_force_en;
+	int (*blk_do_io)(struct nd_blk_region *ndbr, resource_size_t dpa,
+			void *iobuf, u64 len, int rw);
 };
 
 enum nd_blk_mmio_selector {
@@ -146,4 +148,8 @@ static inline struct acpi_nfit_desc *to_acpi_desc(
 {
 	return container_of(nd_desc, struct acpi_nfit_desc, nd_desc);
 }
+
+const u8 *to_nfit_uuid(enum nfit_uuids id);
+int acpi_nfit_init(struct acpi_nfit_desc *nfit, acpi_size sz);
+extern const struct attribute_group *acpi_nfit_attribute_groups[];
 #endif /* __NFIT_H__ */
diff --git a/tools/testing/nvdimm/Kbuild b/tools/testing/nvdimm/Kbuild
new file mode 100644
index 000000000000..df624d7ea652
--- /dev/null
+++ b/tools/testing/nvdimm/Kbuild
@@ -0,0 +1,40 @@
+ldflags-y += --wrap=ioremap_cache
+ldflags-y += --wrap=ioremap_nocache
+ldflags-y += --wrap=iounmap
+ldflags-y += --wrap=__request_region
+ldflags-y += --wrap=__release_region
+
+DRIVERS := ../../../drivers
+NVDIMM_SRC := $(DRIVERS)/nvdimm
+ACPI_SRC := $(DRIVERS)/acpi
+
+obj-$(CONFIG_LIBNVDIMM) += libnvdimm.o
+obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
+obj-$(CONFIG_ND_BTT) += nd_btt.o
+obj-$(CONFIG_ND_BLK) += nd_blk.o
+obj-$(CONFIG_ACPI_NFIT) += nfit.o
+
+nfit-y := $(ACPI_SRC)/nfit.o
+nfit-y += config_check.o
+
+nd_pmem-y := $(NVDIMM_SRC)/pmem.o
+nd_pmem-y += config_check.o
+
+nd_btt-y := $(NVDIMM_SRC)/btt.o
+nd_btt-y += config_check.o
+
+nd_blk-y := $(NVDIMM_SRC)/blk.o
+nd_blk-y += config_check.o
+
+libnvdimm-y := $(NVDIMM_SRC)/core.o
+libnvdimm-y += $(NVDIMM_SRC)/bus.o
+libnvdimm-y += $(NVDIMM_SRC)/dimm_devs.o
+libnvdimm-y += $(NVDIMM_SRC)/dimm.o
+libnvdimm-y += $(NVDIMM_SRC)/region_devs.o
+libnvdimm-y += $(NVDIMM_SRC)/region.o
+libnvdimm-y += $(NVDIMM_SRC)/namespace_devs.o
+libnvdimm-y += $(NVDIMM_SRC)/label.o
+libnvdimm-$(CONFIG_ND_BTT_DEVS) += $(NVDIMM_SRC)/btt_devs.o
+libnvdimm-y += config_check.o
+
+obj-m += test/
diff --git a/tools/testing/nvdimm/Makefile b/tools/testing/nvdimm/Makefile
new file mode 100644
index 000000000000..3dfe024b4e7e
--- /dev/null
+++ b/tools/testing/nvdimm/Makefile
@@ -0,0 +1,7 @@
+KDIR ?= ../../../
+
+default:
+	$(MAKE) -C $(KDIR) M=$$PWD
+
+install: default
+	$(MAKE) -C $(KDIR) M=$$PWD modules_install
diff --git a/tools/testing/nvdimm/config_check.c b/tools/testing/nvdimm/config_check.c
new file mode 100644
index 000000000000..f2c7615554eb
--- /dev/null
+++ b/tools/testing/nvdimm/config_check.c
@@ -0,0 +1,15 @@
+#include <linux/kconfig.h>
+#include <linux/bug.h>
+
+void check(void)
+{
+	/*
+	 * These kconfig symbols must be set to "m" for nfit_test to
+	 * load and operate.
+	 */
+	BUILD_BUG_ON(!IS_MODULE(CONFIG_LIBNVDIMM));
+	BUILD_BUG_ON(!IS_MODULE(CONFIG_BLK_DEV_PMEM));
+	BUILD_BUG_ON(!IS_MODULE(CONFIG_ND_BTT));
+	BUILD_BUG_ON(!IS_MODULE(CONFIG_ND_BLK));
+	BUILD_BUG_ON(!IS_MODULE(CONFIG_ACPI_NFIT));
+}
diff --git a/tools/testing/nvdimm/test/Kbuild b/tools/testing/nvdimm/test/Kbuild
new file mode 100644
index 000000000000..9241064970fe
--- /dev/null
+++ b/tools/testing/nvdimm/test/Kbuild
@@ -0,0 +1,8 @@
+ccflags-y := -I$(src)/../../../../drivers/nvdimm/
+ccflags-y += -I$(src)/../../../../drivers/acpi/
+
+obj-m += nfit_test.o
+obj-m += nfit_test_iomap.o
+
+nfit_test-y := nfit.o
+nfit_test_iomap-y := iomap.o
diff --git a/tools/testing/nvdimm/test/iomap.c b/tools/testing/nvdimm/test/iomap.c
new file mode 100644
index 000000000000..c85a6f6ba559
--- /dev/null
+++ b/tools/testing/nvdimm/test/iomap.c
@@ -0,0 +1,151 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/rculist.h>
+#include <linux/export.h>
+#include <linux/ioport.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/io.h>
+#include "nfit_test.h"
+
+static LIST_HEAD(iomap_head);
+
+static struct iomap_ops {
+	nfit_test_lookup_fn nfit_test_lookup;
+	struct list_head list;
+} iomap_ops = {
+	.list = LIST_HEAD_INIT(iomap_ops.list),
+};
+
+void nfit_test_setup(nfit_test_lookup_fn lookup)
+{
+	iomap_ops.nfit_test_lookup = lookup;
+	list_add_rcu(&iomap_ops.list, &iomap_head);
+}
+EXPORT_SYMBOL(nfit_test_setup);
+
+void nfit_test_teardown(void)
+{
+	list_del_rcu(&iomap_ops.list);
+	synchronize_rcu();
+}
+EXPORT_SYMBOL(nfit_test_teardown);
+
+static struct nfit_test_resource *get_nfit_res(resource_size_t resource)
+{
+	struct iomap_ops *ops;
+
+	ops = list_first_or_null_rcu(&iomap_head, typeof(*ops), list);
+	if (ops)
+		return ops->nfit_test_lookup(resource);
+	return NULL;
+}
+
+void __iomem *__nfit_test_ioremap(resource_size_t offset, unsigned long size,
+		void __iomem *(*fallback_fn)(resource_size_t, unsigned long))
+{
+	struct nfit_test_resource *nfit_res;
+
+	rcu_read_lock();
+	nfit_res = get_nfit_res(offset);
+	rcu_read_unlock();
+	if (nfit_res)
+		return (void __iomem *) nfit_res->buf + offset
+			- nfit_res->res->start;
+	return fallback_fn(offset, size);
+}
+
+void __iomem *__wrap_ioremap_cache(resource_size_t offset, unsigned long size)
+{
+	return __nfit_test_ioremap(offset, size, ioremap_cache);
+}
+EXPORT_SYMBOL(__wrap_ioremap_cache);
+
+void __iomem *__wrap_ioremap_nocache(resource_size_t offset, unsigned long size)
+{
+	return __nfit_test_ioremap(offset, size, ioremap_nocache);
+}
+EXPORT_SYMBOL(__wrap_ioremap_nocache);
+
+void __wrap_iounmap(volatile void __iomem *addr)
+{
+	struct nfit_test_resource *nfit_res;
+
+	rcu_read_lock();
+	nfit_res = get_nfit_res((unsigned long) addr);
+	rcu_read_unlock();
+	if (nfit_res)
+		return;
+	return iounmap(addr);
+}
+EXPORT_SYMBOL(__wrap_iounmap);
+
+struct resource *__wrap___request_region(struct resource *parent,
+		resource_size_t start, resource_size_t n, const char *name,
+		int flags)
+{
+	struct nfit_test_resource *nfit_res;
+
+	if (parent == &iomem_resource) {
+		rcu_read_lock();
+		nfit_res = get_nfit_res(start);
+		rcu_read_unlock();
+		if (nfit_res) {
+			struct resource *res = nfit_res->res + 1;
+
+			if (start + n > nfit_res->res->start
+					+ resource_size(nfit_res->res)) {
+				pr_debug("%s: start: %llx n: %llx overflow: %pr\n",
+						__func__, start, n,
+						nfit_res->res);
+				return NULL;
+			}
+
+			res->start = start;
+			res->end = start + n - 1;
+			res->name = name;
+			res->flags = resource_type(parent);
+			res->flags |= IORESOURCE_BUSY | flags;
+			pr_debug("%s: %pr\n", __func__, res);
+			return res;
+		}
+	}
+	return __request_region(parent, start, n, name, flags);
+}
+EXPORT_SYMBOL(__wrap___request_region);
+
+void __wrap___release_region(struct resource *parent, resource_size_t start,
+				resource_size_t n)
+{
+	struct nfit_test_resource *nfit_res;
+
+	if (parent == &iomem_resource) {
+		rcu_read_lock();
+		nfit_res = get_nfit_res(start);
+		rcu_read_unlock();
+		if (nfit_res) {
+			struct resource *res = nfit_res->res + 1;
+
+			if (start != res->start || resource_size(res) != n)
+				pr_info("%s: start: %llx n: %llx mismatch: %pr\n",
+						__func__, start, n, res);
+			else
+				memset(res, 0, sizeof(*res));
+			return;
+		}
+	}
+	__release_region(parent, start, n);
+}
+EXPORT_SYMBOL(__wrap___release_region);
+
+MODULE_LICENSE("GPL v2");
diff --git a/tools/testing/nvdimm/test/nfit.c b/tools/testing/nvdimm/test/nfit.c
new file mode 100644
index 000000000000..416f8fbf9881
--- /dev/null
+++ b/tools/testing/nvdimm/test/nfit.c
@@ -0,0 +1,1112 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/platform_device.h>
+#include <linux/dma-mapping.h>
+#include <linux/libnvdimm.h>
+#include <linux/device.h>
+#include <linux/module.h>
+#include <linux/ndctl.h>
+#include <linux/sizes.h>
+#include <linux/slab.h>
+#include <nfit.h>
+#include <nd.h>
+#include "nfit_test.h"
+
+/*
+ * Generate an NFIT table to describe the following topology:
+ *
+ * BUS0: Interleaved PMEM regions, and aliasing with BLK regions
+ *
+ *                     (a)                       (b)            DIMM   BLK-REGION
+ *           +----------+--------------+----------+---------+
+ * +------+  |  blk2.0  |     pm0.0    |  blk2.1  |  pm1.0  |    0      region2
+ * | imc0 +--+- - - - - region0 - - - -+----------+         +
+ * +--+---+  |  blk3.0  |     pm0.0    |  blk3.1  |  pm1.0  |    1      region3
+ *    |      +----------+--------------v----------v         v
+ * +--+---+                            |                    |
+ * | cpu0 |                                    region1
+ * +--+---+                            |                    |
+ *    |      +-------------------------^----------^         ^
+ * +--+---+  |                 blk4.0             |  pm1.0  |    2      region4
+ * | imc1 +--+-------------------------+----------+         +
+ * +------+  |                 blk5.0             |  pm1.0  |    3      region5
+ *           +-------------------------+----------+-+-------+
+ *
+ * *) In this layout we have four dimms and two memory controllers in one
+ *    socket.  Each unique interface (BLK or PMEM) to DPA space
+ *    is identified by a region device with a dynamically assigned id.
+ *
+ * *) The first portion of dimm0 and dimm1 are interleaved as REGION0.
+ *    A single PMEM namespace "pm0.0" is created using half of the
+ *    REGION0 SPA-range.  REGION0 spans dimm0 and dimm1.  PMEM namespaces
+ *    allocate from the bottom of a region.  The unallocated
+ *    portion of REGION0 aliases with REGION2 and REGION3.  That
+ *    unallocated capacity is reclaimed as BLK namespaces ("blk2.0" and
+ *    "blk3.0") starting at the base of each DIMM to offset (a) in those
+ *    DIMMs.  "pm0.0", "blk2.0" and "blk3.0" are free-form readable
+ *    names that can be assigned to a namespace.
+ *
+ * *) In the last portion of dimm0 and dimm1 we have an interleaved
+ *    SPA range, REGION1, that spans those two dimms as well as dimm2
+ *    and dimm3.  Some of REGION1 is allocated to a PMEM namespace named
+ *    "pm1.0"; the rest is reclaimed in 4 BLK namespaces (one for each
+ *    dimm in the interleave set), "blk2.1", "blk3.1", "blk4.0", and
+ *    "blk5.0".
+ *
+ * *) The portions of dimm2 and dimm3 that do not participate in the
+ *    REGION1 interleaved SPA range (i.e. the DPA addresses below offset
+ *    (b)) are also included in the "blk4.0" and "blk5.0" namespaces.
+ *    Note, that BLK namespaces need not be contiguous in DPA-space, and
+ *    can consume aliased capacity from multiple interleave sets.
+ *
+ * BUS1: Legacy NVDIMM (single contiguous range)
+ *
+ *  region2
+ * +---------------------+
+ * |---------------------|
+ * ||       pm2.0       ||
+ * |---------------------|
+ * +---------------------+
+ *
+ * *) An NFIT-table may describe a simple system-physical-address range
+ *    with no BLK aliasing.  This type of region may optionally
+ *    reference an NVDIMM.
+ */
+enum {
+	NUM_PM  = 2,
+	NUM_DCR = 4,
+	NUM_BDW = NUM_DCR,
+	NUM_SPA = NUM_PM + NUM_DCR + NUM_BDW,
+	NUM_MEM = NUM_DCR + NUM_BDW + 2 /* spa0 iset */ + 4 /* spa1 iset */,
+	DIMM_SIZE = SZ_32M,
+	LABEL_SIZE = SZ_128K,
+	SPA0_SIZE = DIMM_SIZE,
+	SPA1_SIZE = DIMM_SIZE*2,
+	SPA2_SIZE = DIMM_SIZE,
+	BDW_SIZE = 64 << 8,
+	DCR_SIZE = 12,
+	NUM_NFITS = 2, /* permit testing multiple NFITs per system */
+};
+
+struct nfit_test_dcr {
+	__le64 bdw_addr;
+	__le32 bdw_status;
+	__u8 aperature[BDW_SIZE];
+};
+
+#define NFIT_DIMM_HANDLE(node, socket, imc, chan, dimm) \
+	(((node & 0xfff) << 16) | ((socket & 0xf) << 12) \
+	 | ((imc & 0xf) << 8) | ((chan & 0xf) << 4) | (dimm & 0xf))
+
+static u32 handle[NUM_DCR] = {
+	[0] = NFIT_DIMM_HANDLE(0, 0, 0, 0, 0),
+	[1] = NFIT_DIMM_HANDLE(0, 0, 0, 0, 1),
+	[2] = NFIT_DIMM_HANDLE(0, 0, 1, 0, 0),
+	[3] = NFIT_DIMM_HANDLE(0, 0, 1, 0, 1),
+};
+
+struct nfit_test {
+	struct acpi_nfit_desc acpi_desc;
+	struct platform_device pdev;
+	struct list_head resources;
+	void *nfit_buf;
+	dma_addr_t nfit_dma;
+	size_t nfit_size;
+	int num_dcr;
+	int num_pm;
+	void **dimm;
+	dma_addr_t *dimm_dma;
+	void **label;
+	dma_addr_t *label_dma;
+	void **spa_set;
+	dma_addr_t *spa_set_dma;
+	struct nfit_test_dcr **dcr;
+	dma_addr_t *dcr_dma;
+	int (*alloc)(struct nfit_test *t);
+	void (*setup)(struct nfit_test *t);
+};
+
+static struct nfit_test *to_nfit_test(struct device *dev)
+{
+	struct platform_device *pdev = to_platform_device(dev);
+
+	return container_of(pdev, struct nfit_test, pdev);
+}
+
+static int nfit_test_ctl(struct nvdimm_bus_descriptor *nd_desc,
+		struct nvdimm *nvdimm, unsigned int cmd, void *buf,
+		unsigned int buf_len)
+{
+	struct acpi_nfit_desc *acpi_desc = to_acpi_desc(nd_desc);
+	struct nfit_test *t = container_of(acpi_desc, typeof(*t), acpi_desc);
+	struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm);
+	int i, rc;
+
+	if (!nfit_mem || !test_bit(cmd, &nfit_mem->dsm_mask))
+		return -ENXIO;
+
+	/* lookup label space for the given dimm */
+	for (i = 0; i < ARRAY_SIZE(handle); i++)
+		if (__to_nfit_memdev(nfit_mem)->device_handle == handle[i])
+			break;
+	if (i >= ARRAY_SIZE(handle))
+		return -ENXIO;
+
+	switch (cmd) {
+	case ND_CMD_GET_CONFIG_SIZE: {
+		struct nd_cmd_get_config_size *nd_cmd = buf;
+
+		if (buf_len < sizeof(*nd_cmd))
+			return -EINVAL;
+		nd_cmd->status = 0;
+		nd_cmd->config_size = LABEL_SIZE;
+		nd_cmd->max_xfer = SZ_4K;
+		rc = 0;
+		break;
+	}
+	case ND_CMD_GET_CONFIG_DATA: {
+		struct nd_cmd_get_config_data_hdr *nd_cmd = buf;
+		unsigned int len, offset = nd_cmd->in_offset;
+
+		if (buf_len < sizeof(*nd_cmd))
+			return -EINVAL;
+		if (offset >= LABEL_SIZE)
+			return -EINVAL;
+		if (nd_cmd->in_length + sizeof(*nd_cmd) > buf_len)
+			return -EINVAL;
+
+		nd_cmd->status = 0;
+		len = min(nd_cmd->in_length, LABEL_SIZE - offset);
+		memcpy(nd_cmd->out_buf, t->label[i] + offset, len);
+		rc = buf_len - sizeof(*nd_cmd) - len;
+		break;
+	}
+	case ND_CMD_SET_CONFIG_DATA: {
+		struct nd_cmd_set_config_hdr *nd_cmd = buf;
+		unsigned int len, offset = nd_cmd->in_offset;
+		u32 *status;
+
+		if (buf_len < sizeof(*nd_cmd))
+			return -EINVAL;
+		if (offset >= LABEL_SIZE)
+			return -EINVAL;
+		if (nd_cmd->in_length + sizeof(*nd_cmd) + 4 > buf_len)
+			return -EINVAL;
+
+		status = buf + nd_cmd->in_length + sizeof(*nd_cmd);
+		*status = 0;
+		len = min(nd_cmd->in_length, LABEL_SIZE - offset);
+		memcpy(t->label[i] + offset, nd_cmd->in_buf, len);
+		rc = buf_len - sizeof(*nd_cmd) - (len + 4);
+		break;
+	}
+	default:
+		return -ENOTTY;
+	}
+
+	return rc;
+}
+
+static DEFINE_SPINLOCK(nfit_test_lock);
+static struct nfit_test *instances[NUM_NFITS];
+
+static void release_nfit_res(void *data)
+{
+	struct nfit_test_resource *nfit_res = data;
+	struct resource *res = nfit_res->res;
+
+	spin_lock(&nfit_test_lock);
+	list_del(&nfit_res->list);
+	spin_unlock(&nfit_test_lock);
+
+	if (is_vmalloc_addr(nfit_res->buf))
+		vfree(nfit_res->buf);
+	else
+		dma_free_coherent(nfit_res->dev, resource_size(res),
+				nfit_res->buf, res->start);
+	kfree(res);
+	kfree(nfit_res);
+}
+
+static void *__test_alloc(struct nfit_test *t, size_t size, dma_addr_t *dma,
+		void *buf)
+{
+	struct device *dev = &t->pdev.dev;
+	struct resource *res = kzalloc(sizeof(*res) * 2, GFP_KERNEL);
+	struct nfit_test_resource *nfit_res = kzalloc(sizeof(*nfit_res),
+			GFP_KERNEL);
+	int rc;
+
+	if (!res || !buf || !nfit_res)
+		goto err;
+	rc = devm_add_action(dev, release_nfit_res, nfit_res);
+	if (rc)
+		goto err;
+	INIT_LIST_HEAD(&nfit_res->list);
+	memset(buf, 0, size);
+	nfit_res->dev = dev;
+	nfit_res->buf = buf;
+	nfit_res->res = res;
+	res->start = *dma;
+	res->end = *dma + size - 1;
+	res->name = "NFIT";
+	spin_lock(&nfit_test_lock);
+	list_add(&nfit_res->list, &t->resources);
+	spin_unlock(&nfit_test_lock);
+
+	return nfit_res->buf;
+ err:
+	if (buf && !is_vmalloc_addr(buf))
+		dma_free_coherent(dev, size, buf, *dma);
+	else if (buf)
+		vfree(buf);
+	kfree(res);
+	kfree(nfit_res);
+	return NULL;
+}
+
+static void *test_alloc(struct nfit_test *t, size_t size, dma_addr_t *dma)
+{
+	void *buf = vmalloc(size);
+
+	*dma = (unsigned long) buf;
+	return __test_alloc(t, size, dma, buf);
+}
+
+static void *test_alloc_coherent(struct nfit_test *t, size_t size,
+		dma_addr_t *dma)
+{
+	struct device *dev = &t->pdev.dev;
+	void *buf = dma_alloc_coherent(dev, size, dma, GFP_KERNEL);
+
+	return __test_alloc(t, size, dma, buf);
+}
+
+static struct nfit_test_resource *nfit_test_lookup(resource_size_t addr)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(instances); i++) {
+		struct nfit_test_resource *n, *nfit_res = NULL;
+		struct nfit_test *t = instances[i];
+
+		if (!t)
+			continue;
+		spin_lock(&nfit_test_lock);
+		list_for_each_entry(n, &t->resources, list) {
+			if (addr >= n->res->start && (addr < n->res->start
+						+ resource_size(n->res))) {
+				nfit_res = n;
+				break;
+			} else if (addr >= (unsigned long) n->buf
+					&& (addr < (unsigned long) n->buf
+						+ resource_size(n->res))) {
+				nfit_res = n;
+				break;
+			}
+		}
+		spin_unlock(&nfit_test_lock);
+		if (nfit_res)
+			return nfit_res;
+	}
+
+	return NULL;
+}
+
+static int nfit_test0_alloc(struct nfit_test *t)
+{
+	size_t nfit_size = sizeof(struct acpi_table_nfit)
+			+ sizeof(struct acpi_nfit_system_address) * NUM_SPA
+			+ sizeof(struct acpi_nfit_memory_map) * NUM_MEM
+			+ sizeof(struct acpi_nfit_control_region) * NUM_DCR
+			+ sizeof(struct acpi_nfit_data_region) * NUM_BDW;
+	int i;
+
+	t->nfit_buf = test_alloc(t, nfit_size, &t->nfit_dma);
+	if (!t->nfit_buf)
+		return -ENOMEM;
+	t->nfit_size = nfit_size;
+
+	t->spa_set[0] = test_alloc_coherent(t, SPA0_SIZE, &t->spa_set_dma[0]);
+	if (!t->spa_set[0])
+		return -ENOMEM;
+
+	t->spa_set[1] = test_alloc_coherent(t, SPA1_SIZE, &t->spa_set_dma[1]);
+	if (!t->spa_set[1])
+		return -ENOMEM;
+
+	for (i = 0; i < NUM_DCR; i++) {
+		t->dimm[i] = test_alloc(t, DIMM_SIZE, &t->dimm_dma[i]);
+		if (!t->dimm[i])
+			return -ENOMEM;
+
+		t->label[i] = test_alloc(t, LABEL_SIZE, &t->label_dma[i]);
+		if (!t->label[i])
+			return -ENOMEM;
+		sprintf(t->label[i], "label%d", i);
+	}
+
+	for (i = 0; i < NUM_DCR; i++) {
+		t->dcr[i] = test_alloc(t, LABEL_SIZE, &t->dcr_dma[i]);
+		if (!t->dcr[i])
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static int nfit_test1_alloc(struct nfit_test *t)
+{
+	size_t nfit_size = sizeof(struct acpi_table_nfit)
+		+ sizeof(struct acpi_nfit_system_address)
+		+ sizeof(struct acpi_nfit_memory_map)
+		+ sizeof(struct acpi_nfit_control_region);
+
+	t->nfit_buf = test_alloc(t, nfit_size, &t->nfit_dma);
+	if (!t->nfit_buf)
+		return -ENOMEM;
+	t->nfit_size = nfit_size;
+
+	t->spa_set[0] = test_alloc_coherent(t, SPA2_SIZE, &t->spa_set_dma[0]);
+	if (!t->spa_set[0])
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void nfit_test_init_header(struct acpi_table_nfit *nfit, size_t size)
+{
+	memcpy(nfit->header.signature, ACPI_SIG_NFIT, 4);
+	nfit->header.length = size;
+	nfit->header.revision = 1;
+	memcpy(nfit->header.oem_id, "LIBND", 6);
+	memcpy(nfit->header.oem_table_id, "TEST", 5);
+	nfit->header.oem_revision = 1;
+	memcpy(nfit->header.asl_compiler_id, "TST", 4);
+	nfit->header.asl_compiler_revision = 1;
+}
+
+static void nfit_test0_setup(struct nfit_test *t)
+{
+	struct nvdimm_bus_descriptor *nd_desc;
+	struct acpi_nfit_desc *acpi_desc;
+	struct acpi_nfit_memory_map *memdev;
+	void *nfit_buf = t->nfit_buf;
+	size_t size = t->nfit_size;
+	struct acpi_nfit_system_address *spa;
+	struct acpi_nfit_control_region *dcr;
+	struct acpi_nfit_data_region *bdw;
+	unsigned int offset;
+
+	nfit_test_init_header(nfit_buf, size);
+
+	/*
+	 * spa0 (interleave first half of dimm0 and dimm1, note storage
+	 * does not actually alias the related block-data-window
+	 * regions)
+	 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit);
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_PM), 16);
+	spa->range_index = 0+1;
+	spa->address = t->spa_set_dma[0];
+	spa->length = SPA0_SIZE;
+
+	/*
+	 * spa1 (interleave last half of the 4 DIMMS, note storage
+	 * does not actually alias the related block-data-window
+	 * regions)
+	 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa);
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_PM), 16);
+	spa->range_index = 1+1;
+	spa->address = t->spa_set_dma[1];
+	spa->length = SPA1_SIZE;
+
+	/* spa2 (dcr0) dimm0 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 2;
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_DCR), 16);
+	spa->range_index = 2+1;
+	spa->address = t->dcr_dma[0];
+	spa->length = DCR_SIZE;
+
+	/* spa3 (dcr1) dimm1 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 3;
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_DCR), 16);
+	spa->range_index = 3+1;
+	spa->address = t->dcr_dma[1];
+	spa->length = DCR_SIZE;
+
+	/* spa4 (dcr2) dimm2 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 4;
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_DCR), 16);
+	spa->range_index = 4+1;
+	spa->address = t->dcr_dma[2];
+	spa->length = DCR_SIZE;
+
+	/* spa5 (dcr3) dimm3 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 5;
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_DCR), 16);
+	spa->range_index = 5+1;
+	spa->address = t->dcr_dma[3];
+	spa->length = DCR_SIZE;
+
+	/* spa6 (bdw for dcr0) dimm0 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 6;
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_BDW), 16);
+	spa->range_index = 6+1;
+	spa->address = t->dimm_dma[0];
+	spa->length = DIMM_SIZE;
+
+	/* spa7 (bdw for dcr1) dimm1 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 7;
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_BDW), 16);
+	spa->range_index = 7+1;
+	spa->address = t->dimm_dma[1];
+	spa->length = DIMM_SIZE;
+
+	/* spa8 (bdw for dcr2) dimm2 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 8;
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_BDW), 16);
+	spa->range_index = 8+1;
+	spa->address = t->dimm_dma[2];
+	spa->length = DIMM_SIZE;
+
+	/* spa9 (bdw for dcr3) dimm3 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 9;
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_BDW), 16);
+	spa->range_index = 9+1;
+	spa->address = t->dimm_dma[3];
+	spa->length = DIMM_SIZE;
+
+	offset = sizeof(struct acpi_table_nfit) + sizeof(*spa) * 10;
+	/* mem-region0 (spa0, dimm0) */
+	memdev = nfit_buf + offset;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[0];
+	memdev->physical_id = 0;
+	memdev->region_id = 0;
+	memdev->range_index = 0+1;
+	memdev->region_index = 0+1;
+	memdev->region_size = SPA0_SIZE/2;
+	memdev->region_offset = t->spa_set_dma[0];
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 2;
+
+	/* mem-region1 (spa0, dimm1) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map);
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[1];
+	memdev->physical_id = 1;
+	memdev->region_id = 0;
+	memdev->range_index = 0+1;
+	memdev->region_index = 1+1;
+	memdev->region_size = SPA0_SIZE/2;
+	memdev->region_offset = t->spa_set_dma[0] + SPA0_SIZE/2;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 2;
+
+	/* mem-region2 (spa1, dimm0) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 2;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[0];
+	memdev->physical_id = 0;
+	memdev->region_id = 1;
+	memdev->range_index = 1+1;
+	memdev->region_index = 0+1;
+	memdev->region_size = SPA1_SIZE/4;
+	memdev->region_offset = t->spa_set_dma[1];
+	memdev->address = SPA0_SIZE/2;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 4;
+
+	/* mem-region3 (spa1, dimm1) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 3;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[1];
+	memdev->physical_id = 1;
+	memdev->region_id = 1;
+	memdev->range_index = 1+1;
+	memdev->region_index = 1+1;
+	memdev->region_size = SPA1_SIZE/4;
+	memdev->region_offset = t->spa_set_dma[1] + SPA1_SIZE/4;
+	memdev->address = SPA0_SIZE/2;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 4;
+
+	/* mem-region4 (spa1, dimm2) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 4;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[2];
+	memdev->physical_id = 2;
+	memdev->region_id = 0;
+	memdev->range_index = 1+1;
+	memdev->region_index = 2+1;
+	memdev->region_size = SPA1_SIZE/4;
+	memdev->region_offset = t->spa_set_dma[1] + 2*SPA1_SIZE/4;
+	memdev->address = SPA0_SIZE/2;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 4;
+
+	/* mem-region5 (spa1, dimm3) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 5;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[3];
+	memdev->physical_id = 3;
+	memdev->region_id = 0;
+	memdev->range_index = 1+1;
+	memdev->region_index = 3+1;
+	memdev->region_size = SPA1_SIZE/4;
+	memdev->region_offset = t->spa_set_dma[1] + 3*SPA1_SIZE/4;
+	memdev->address = SPA0_SIZE/2;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 4;
+
+	/* mem-region6 (spa/dcr0, dimm0) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 6;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[0];
+	memdev->physical_id = 0;
+	memdev->region_id = 0;
+	memdev->range_index = 2+1;
+	memdev->region_index = 0+1;
+	memdev->region_size = 0;
+	memdev->region_offset = 0;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 1;
+
+	/* mem-region7 (spa/dcr1, dimm1) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 7;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[1];
+	memdev->physical_id = 1;
+	memdev->region_id = 0;
+	memdev->range_index = 3+1;
+	memdev->region_index = 1+1;
+	memdev->region_size = 0;
+	memdev->region_offset = 0;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 1;
+
+	/* mem-region8 (spa/dcr2, dimm2) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 8;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[2];
+	memdev->physical_id = 2;
+	memdev->region_id = 0;
+	memdev->range_index = 4+1;
+	memdev->region_index = 2+1;
+	memdev->region_size = 0;
+	memdev->region_offset = 0;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 1;
+
+	/* mem-region9 (spa/dcr3, dimm3) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 9;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[3];
+	memdev->physical_id = 3;
+	memdev->region_id = 0;
+	memdev->range_index = 5+1;
+	memdev->region_index = 3+1;
+	memdev->region_size = 0;
+	memdev->region_offset = 0;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 1;
+
+	/* mem-region10 (spa/bdw0, dimm0) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 10;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[0];
+	memdev->physical_id = 0;
+	memdev->region_id = 0;
+	memdev->range_index = 6+1;
+	memdev->region_index = 0+1;
+	memdev->region_size = 0;
+	memdev->region_offset = 0;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 1;
+
+	/* mem-region11 (spa/bdw1, dimm1) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 11;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[1];
+	memdev->physical_id = 1;
+	memdev->region_id = 0;
+	memdev->range_index = 7+1;
+	memdev->region_index = 1+1;
+	memdev->region_size = 0;
+	memdev->region_offset = 0;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 1;
+
+	/* mem-region12 (spa/bdw2, dimm2) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 12;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[2];
+	memdev->physical_id = 2;
+	memdev->region_id = 0;
+	memdev->range_index = 8+1;
+	memdev->region_index = 2+1;
+	memdev->region_size = 0;
+	memdev->region_offset = 0;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 1;
+
+	/* mem-region13 (spa/bdw3, dimm3) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 13;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[3];
+	memdev->physical_id = 3;
+	memdev->region_id = 0;
+	memdev->range_index = 9+1;
+	memdev->region_index = 3+1;
+	memdev->region_size = 0;
+	memdev->region_offset = 0;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 1;
+
+	offset = offset + sizeof(struct acpi_nfit_memory_map) * 14;
+	/* dcr-descriptor0 */
+	dcr = nfit_buf + offset;
+	dcr->header.type = ACPI_NFIT_TYPE_CONTROL_REGION;
+	dcr->header.length = sizeof(struct acpi_nfit_control_region);
+	dcr->region_index = 0+1;
+	dcr->vendor_id = 0xabcd;
+	dcr->device_id = 0;
+	dcr->revision_id = 1;
+	dcr->serial_number = ~handle[0];
+	dcr->windows = 1;
+	dcr->window_size = DCR_SIZE;
+	dcr->command_offset = 0;
+	dcr->command_size = 8;
+	dcr->status_offset = 8;
+	dcr->status_size = 4;
+
+	/* dcr-descriptor1 */
+	dcr = nfit_buf + offset + sizeof(struct acpi_nfit_control_region);
+	dcr->header.type = ACPI_NFIT_TYPE_CONTROL_REGION;
+	dcr->header.length = sizeof(struct acpi_nfit_control_region);
+	dcr->region_index = 1+1;
+	dcr->vendor_id = 0xabcd;
+	dcr->device_id = 0;
+	dcr->revision_id = 1;
+	dcr->serial_number = ~handle[1];
+	dcr->windows = 1;
+	dcr->window_size = DCR_SIZE;
+	dcr->command_offset = 0;
+	dcr->command_size = 8;
+	dcr->status_offset = 8;
+	dcr->status_size = 4;
+
+	/* dcr-descriptor2 */
+	dcr = nfit_buf + offset + sizeof(struct acpi_nfit_control_region) * 2;
+	dcr->header.type = ACPI_NFIT_TYPE_CONTROL_REGION;
+	dcr->header.length = sizeof(struct acpi_nfit_control_region);
+	dcr->region_index = 2+1;
+	dcr->vendor_id = 0xabcd;
+	dcr->device_id = 0;
+	dcr->revision_id = 1;
+	dcr->serial_number = ~handle[2];
+	dcr->windows = 1;
+	dcr->window_size = DCR_SIZE;
+	dcr->command_offset = 0;
+	dcr->command_size = 8;
+	dcr->status_offset = 8;
+	dcr->status_size = 4;
+
+	/* dcr-descriptor3 */
+	dcr = nfit_buf + offset + sizeof(struct acpi_nfit_control_region) * 3;
+	dcr->header.type = ACPI_NFIT_TYPE_CONTROL_REGION;
+	dcr->header.length = sizeof(struct acpi_nfit_control_region);
+	dcr->region_index = 3+1;
+	dcr->vendor_id = 0xabcd;
+	dcr->device_id = 0;
+	dcr->revision_id = 1;
+	dcr->serial_number = ~handle[3];
+	dcr->windows = 1;
+	dcr->window_size = DCR_SIZE;
+	dcr->command_offset = 0;
+	dcr->command_size = 8;
+	dcr->status_offset = 8;
+	dcr->status_size = 4;
+
+	offset = offset + sizeof(struct acpi_nfit_control_region) * 4;
+	/* bdw0 (spa/dcr0, dimm0) */
+	bdw = nfit_buf + offset;
+	bdw->header.type = ACPI_NFIT_TYPE_DATA_REGION;
+	bdw->header.length = sizeof(struct acpi_nfit_data_region);
+	bdw->region_index = 0+1;
+	bdw->windows = 1;
+	bdw->offset = 0;
+	bdw->size = BDW_SIZE;
+	bdw->capacity = DIMM_SIZE;
+	bdw->start_address = 0;
+
+	/* bdw1 (spa/dcr1, dimm1) */
+	bdw = nfit_buf + offset + sizeof(struct acpi_nfit_data_region);
+	bdw->header.type = ACPI_NFIT_TYPE_DATA_REGION;
+	bdw->header.length = sizeof(struct acpi_nfit_data_region);
+	bdw->region_index = 1+1;
+	bdw->windows = 1;
+	bdw->offset = 0;
+	bdw->size = BDW_SIZE;
+	bdw->capacity = DIMM_SIZE;
+	bdw->start_address = 0;
+
+	/* bdw2 (spa/dcr2, dimm2) */
+	bdw = nfit_buf + offset + sizeof(struct acpi_nfit_data_region) * 2;
+	bdw->header.type = ACPI_NFIT_TYPE_DATA_REGION;
+	bdw->header.length = sizeof(struct acpi_nfit_data_region);
+	bdw->region_index = 2+1;
+	bdw->windows = 1;
+	bdw->offset = 0;
+	bdw->size = BDW_SIZE;
+	bdw->capacity = DIMM_SIZE;
+	bdw->start_address = 0;
+
+	/* bdw3 (spa/dcr3, dimm3) */
+	bdw = nfit_buf + offset + sizeof(struct acpi_nfit_data_region) * 3;
+	bdw->header.type = ACPI_NFIT_TYPE_DATA_REGION;
+	bdw->header.length = sizeof(struct acpi_nfit_data_region);
+	bdw->region_index = 3+1;
+	bdw->windows = 1;
+	bdw->offset = 0;
+	bdw->size = BDW_SIZE;
+	bdw->capacity = DIMM_SIZE;
+	bdw->start_address = 0;
+
+	acpi_desc = &t->acpi_desc;
+	set_bit(ND_CMD_GET_CONFIG_SIZE, &acpi_desc->dimm_dsm_force_en);
+	set_bit(ND_CMD_GET_CONFIG_DATA, &acpi_desc->dimm_dsm_force_en);
+	set_bit(ND_CMD_SET_CONFIG_DATA, &acpi_desc->dimm_dsm_force_en);
+	nd_desc = &acpi_desc->nd_desc;
+	nd_desc->ndctl = nfit_test_ctl;
+}
+
+static void nfit_test1_setup(struct nfit_test *t)
+{
+	size_t size = t->nfit_size, offset;
+	void *nfit_buf = t->nfit_buf;
+	struct acpi_nfit_memory_map *memdev;
+	struct acpi_nfit_control_region *dcr;
+	struct acpi_nfit_system_address *spa;
+
+	nfit_test_init_header(nfit_buf, size);
+
+	offset = sizeof(struct acpi_table_nfit);
+	/* spa0 (flat range with no bdw aliasing) */
+	spa = nfit_buf + offset;
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_PM), 16);
+	spa->range_index = 0+1;
+	spa->address = t->spa_set_dma[0];
+	spa->length = SPA2_SIZE;
+
+	offset += sizeof(*spa);
+	/* mem-region0 (spa0, dimm0) */
+	memdev = nfit_buf + offset;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = 0;
+	memdev->physical_id = 0;
+	memdev->region_id = 0;
+	memdev->range_index = 0+1;
+	memdev->region_index = 0+1;
+	memdev->region_size = SPA2_SIZE;
+	memdev->region_offset = 0;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 1;
+
+	offset += sizeof(*memdev);
+	/* dcr-descriptor0 */
+	dcr = nfit_buf + offset;
+	dcr->header.type = ACPI_NFIT_TYPE_CONTROL_REGION;
+	dcr->header.length = sizeof(struct acpi_nfit_control_region);
+	dcr->region_index = 0+1;
+	dcr->vendor_id = 0xabcd;
+	dcr->device_id = 0;
+	dcr->revision_id = 1;
+	dcr->serial_number = ~0;
+	dcr->code = 0x201;
+	dcr->windows = 0;
+	dcr->window_size = 0;
+	dcr->command_offset = 0;
+	dcr->command_size = 0;
+	dcr->status_offset = 0;
+	dcr->status_size = 0;
+}
+
+static int nfit_test_blk_do_io(struct nd_blk_region *ndbr, resource_size_t dpa,
+		void *iobuf, u64 len, int rw)
+{
+	struct nfit_blk *nfit_blk = ndbr->blk_provider_data;
+	struct nfit_blk_mmio *mmio = &nfit_blk->mmio[BDW];
+	struct nd_region *nd_region = &ndbr->nd_region;
+	unsigned int lane;
+
+	lane = nd_region_acquire_lane(nd_region);
+	if (rw)
+		memcpy(mmio->base + dpa, iobuf, len);
+	else
+		memcpy(iobuf, mmio->base + dpa, len);
+	nd_region_release_lane(nd_region, lane);
+
+	return 0;
+}
+
+static int nfit_test_probe(struct platform_device *pdev)
+{
+	struct nvdimm_bus_descriptor *nd_desc;
+	struct acpi_nfit_desc *acpi_desc;
+	struct device *dev = &pdev->dev;
+	struct nfit_test *nfit_test;
+	int rc;
+
+	nfit_test = to_nfit_test(&pdev->dev);
+
+	/* common alloc */
+	if (nfit_test->num_dcr) {
+		int num = nfit_test->num_dcr;
+
+		nfit_test->dimm = devm_kcalloc(dev, num, sizeof(void *),
+				GFP_KERNEL);
+		nfit_test->dimm_dma = devm_kcalloc(dev, num, sizeof(dma_addr_t),
+				GFP_KERNEL);
+		nfit_test->label = devm_kcalloc(dev, num, sizeof(void *),
+				GFP_KERNEL);
+		nfit_test->label_dma = devm_kcalloc(dev, num,
+				sizeof(dma_addr_t), GFP_KERNEL);
+		nfit_test->dcr = devm_kcalloc(dev, num,
+				sizeof(struct nfit_test_dcr *), GFP_KERNEL);
+		nfit_test->dcr_dma = devm_kcalloc(dev, num,
+				sizeof(dma_addr_t), GFP_KERNEL);
+		if (nfit_test->dimm && nfit_test->dimm_dma && nfit_test->label
+				&& nfit_test->label_dma && nfit_test->dcr
+				&& nfit_test->dcr_dma)
+			/* pass */;
+		else
+			return -ENOMEM;
+	}
+
+	if (nfit_test->num_pm) {
+		int num = nfit_test->num_pm;
+
+		nfit_test->spa_set = devm_kcalloc(dev, num, sizeof(void *),
+				GFP_KERNEL);
+		nfit_test->spa_set_dma = devm_kcalloc(dev, num,
+				sizeof(dma_addr_t), GFP_KERNEL);
+		if (nfit_test->spa_set && nfit_test->spa_set_dma)
+			/* pass */;
+		else
+			return -ENOMEM;
+	}
+
+	/* per-nfit specific alloc */
+	if (nfit_test->alloc(nfit_test))
+		return -ENOMEM;
+
+	nfit_test->setup(nfit_test);
+	acpi_desc = &nfit_test->acpi_desc;
+	acpi_desc->dev = &pdev->dev;
+	acpi_desc->nfit = nfit_test->nfit_buf;
+	acpi_desc->blk_do_io = nfit_test_blk_do_io;
+	nd_desc = &acpi_desc->nd_desc;
+	nd_desc->attr_groups = acpi_nfit_attribute_groups;
+	acpi_desc->nvdimm_bus = nvdimm_bus_register(&pdev->dev, nd_desc);
+	if (!acpi_desc->nvdimm_bus)
+		return -ENXIO;
+
+	rc = acpi_nfit_init(acpi_desc, nfit_test->nfit_size);
+	if (rc) {
+		nvdimm_bus_unregister(acpi_desc->nvdimm_bus);
+		return rc;
+	}
+
+	return 0;
+}
+
+static int nfit_test_remove(struct platform_device *pdev)
+{
+	struct nfit_test *nfit_test = to_nfit_test(&pdev->dev);
+	struct acpi_nfit_desc *acpi_desc = &nfit_test->acpi_desc;
+
+	nvdimm_bus_unregister(acpi_desc->nvdimm_bus);
+
+	return 0;
+}
+
+static void nfit_test_release(struct device *dev)
+{
+	struct nfit_test *nfit_test = to_nfit_test(dev);
+
+	kfree(nfit_test);
+}
+
+static const struct platform_device_id nfit_test_id[] = {
+	{ KBUILD_MODNAME },
+	{ },
+};
+
+static struct platform_driver nfit_test_driver = {
+	.probe = nfit_test_probe,
+	.remove = nfit_test_remove,
+	.driver = {
+		.name = KBUILD_MODNAME,
+	},
+	.id_table = nfit_test_id,
+};
+
+#ifdef CONFIG_CMA_SIZE_MBYTES
+#define CMA_SIZE_MBYTES CONFIG_CMA_SIZE_MBYTES
+#else
+#define CMA_SIZE_MBYTES 0
+#endif
+
+static __init int nfit_test_init(void)
+{
+	int rc, i;
+
+	nfit_test_setup(nfit_test_lookup);
+
+	for (i = 0; i < NUM_NFITS; i++) {
+		struct nfit_test *nfit_test;
+		struct platform_device *pdev;
+		static int once;
+
+		nfit_test = kzalloc(sizeof(*nfit_test), GFP_KERNEL);
+		if (!nfit_test) {
+			rc = -ENOMEM;
+			goto err_register;
+		}
+		INIT_LIST_HEAD(&nfit_test->resources);
+		switch (i) {
+		case 0:
+			nfit_test->num_pm = NUM_PM;
+			nfit_test->num_dcr = NUM_DCR;
+			nfit_test->alloc = nfit_test0_alloc;
+			nfit_test->setup = nfit_test0_setup;
+			break;
+		case 1:
+			nfit_test->num_pm = 1;
+			nfit_test->alloc = nfit_test1_alloc;
+			nfit_test->setup = nfit_test1_setup;
+			break;
+		default:
+			rc = -EINVAL;
+			goto err_register;
+		}
+		pdev = &nfit_test->pdev;
+		pdev->name = KBUILD_MODNAME;
+		pdev->id = i;
+		pdev->dev.release = nfit_test_release;
+		rc = platform_device_register(pdev);
+		if (rc) {
+			put_device(&pdev->dev);
+			goto err_register;
+		}
+
+		rc = dma_coerce_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
+		if (rc)
+			goto err_register;
+
+		instances[i] = nfit_test;
+
+		if (!once++) {
+			dma_addr_t dma;
+			void *buf;
+
+			buf = dma_alloc_coherent(&pdev->dev, SZ_128M, &dma,
+					GFP_KERNEL);
+			if (!buf) {
+				rc = -ENOMEM;
+				dev_warn(&pdev->dev, "need 128M of free cma\n");
+				goto err_register;
+			}
+			dma_free_coherent(&pdev->dev, SZ_128M, buf, dma);
+		}
+	}
+
+	rc = platform_driver_register(&nfit_test_driver);
+	if (rc)
+		goto err_register;
+	return 0;
+
+ err_register:
+	for (i = 0; i < NUM_NFITS; i++)
+		if (instances[i])
+			platform_device_unregister(&instances[i]->pdev);
+	nfit_test_teardown();
+	return rc;
+}
+
+static __exit void nfit_test_exit(void)
+{
+	int i;
+
+	platform_driver_unregister(&nfit_test_driver);
+	for (i = 0; i < NUM_NFITS; i++)
+		platform_device_unregister(&instances[i]->pdev);
+	nfit_test_teardown();
+}
+
+module_init(nfit_test_init);
+module_exit(nfit_test_exit);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Intel Corporation");
diff --git a/tools/testing/nvdimm/test/nfit_test.h b/tools/testing/nvdimm/test/nfit_test.h
new file mode 100644
index 000000000000..96c5e16d7db9
--- /dev/null
+++ b/tools/testing/nvdimm/test/nfit_test.h
@@ -0,0 +1,29 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#ifndef __NFIT_TEST_H__
+#define __NFIT_TEST_H__
+
+struct nfit_test_resource {
+	struct list_head list;
+	struct resource *res;
+	struct device *dev;
+	void *buf;
+};
+
+typedef struct nfit_test_resource *(*nfit_test_lookup_fn)(resource_size_t);
+void __iomem *__wrap_ioremap_nocache(resource_size_t offset,
+		unsigned long size);
+void __wrap_iounmap(volatile void __iomem *addr);
+void nfit_test_setup(nfit_test_lookup_fn lookup);
+void nfit_test_teardown(void);
+#endif



* [PATCH 05/15] tools/testing/nvdimm: libnvdimm unit test infrastructure
@ 2015-06-17 23:55   ` Dan Williams
  0 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-17 23:55 UTC (permalink / raw)
  To: axboe, linux-nvdimm
  Cc: mingo, boaz, toshi.kani, Rafael J. Wysocki, Robert Moore,
	linux-kernel, linux-acpi, Lv Zheng, linux-fsdevel, hch

'libnvdimm' is the first driver sub-system in the kernel to implement
mocking for unit test coverage.  The nfit_test module gets built as an
external module and arranges for external module replacements of nfit,
libnvdimm, nd_pmem, and nd_blk.  These replacements use the linker
--wrap option to redirect calls to ioremap() + request_mem_region() to
custom-defined unit test resources.  The end result is a fully
functional nvdimm_bus, as far as userspace is concerned, but with the
capability to perform otherwise destructive tests on emulated resources.

Q: Why not use QEMU for this emulation?
QEMU is not suitable for unit testing.  QEMU's role is to faithfully
emulate the platform.  A unit test's role is to unfaithfully implement
the platform with the goal of triggering bugs in the corners of the
sub-system implementation.  As bugs are discovered in platforms, or the
sub-system itself, the unit tests are extended to backstop a fix with a
reproducer unit test.

Another problem with QEMU is that it would require coordination of 3
software projects instead of 2 (kernel + libndctl [1]) to maintain and
execute the tests.  The chances for bit rot and the difficulty of
getting the tests running go up non-linearly the more components are
involved.


Q: Why submit this to the kernel tree instead of external modules in
   libndctl?
Simple: to alleviate the same risk that out-of-tree external modules
face.  Updates to drivers/nvdimm/ can be immediately evaluated to see if
they have any impact on tools/testing/nvdimm/.


Q: What are the negative implications of merging this?
It is a unique maintenance burden because the purpose of mocking an
interface to enable a unit test is to purposefully short-circuit the
semantics of a routine to enable testing.  For example,
__wrap_ioremap_cache() fakes the pmem driver into "ioremap()'ing" a test
resource buffer allocated by dma_alloc_coherent().  The future
maintenance burden hits when someone changes the semantics of
ioremap_cache() and wonders what the implications are for the unit test.
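
To make the short circuit concrete, here is a minimal userspace sketch
of the lookup-or-fallback pattern that the wrapped ioremap routines
follow.  It is a model only, not the code in this patch: test_resource,
lookup(), real_map() and mock_map() are hypothetical stand-ins, the
single static buffer replaces the dma_alloc_coherent() allocation, and
the bounds assert is an extra sanity check that the patch does not
perform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct nfit_test_resource. */
struct test_resource {
	uint64_t start;		/* first fake "physical" address */
	uint64_t size;		/* length of the range in bytes */
	unsigned char *buf;	/* ordinary memory backing the range */
};

static unsigned char backing[64];
static struct test_resource test_res = { 0x1000, sizeof(backing), backing };

/* Counts how often the mock declined an address and deferred. */
static int fallback_calls;

/* Hypothetical stand-in for the real mapping routine. */
static void *real_map(uint64_t offset, size_t size)
{
	(void)offset;
	(void)size;
	fallback_calls++;
	return NULL;	/* this model has no real mapping to hand back */
}

/* Return the test resource that contains addr, or NULL if none does. */
static struct test_resource *lookup(uint64_t addr)
{
	if (addr >= test_res.start && addr - test_res.start < test_res.size)
		return &test_res;
	return NULL;
}

/*
 * Addresses inside a registered test resource are "mapped" by pointing
 * into its buffer; everything else goes to the real routine.
 */
static void *mock_map(uint64_t offset, size_t size)
{
	struct test_resource *res = lookup(offset);

	if (res) {
		/* Assumption: added bounds check, not part of the patch. */
		assert(size <= res->size - (offset - res->start));
		return res->buf + (offset - res->start);
	}
	return real_map(offset, size);
}
```

In this model mock_map(0x1010, 16) returns a pointer 0x10 bytes into
the backing buffer, while an address outside the test range, such as
0x9000, falls through to real_map().  The maintenance concern above is
that any change to what the real routine promises has to be mirrored
by hand in the mock.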

[1]: https://github.com/pmem/ndctl

Cc: <linux-acpi@vger.kernel.org>
Cc: Lv Zheng <lv.zheng@intel.com>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/acpi/nfit.c                   |   12 
 drivers/acpi/nfit.h                   |    6 
 tools/testing/nvdimm/Kbuild           |   40 +
 tools/testing/nvdimm/Makefile         |    7 
 tools/testing/nvdimm/config_check.c   |   15 
 tools/testing/nvdimm/test/Kbuild      |    8 
 tools/testing/nvdimm/test/iomap.c     |  151 ++++
 tools/testing/nvdimm/test/nfit.c      | 1112 +++++++++++++++++++++++++++++++++
 tools/testing/nvdimm/test/nfit_test.h |   29 +
 9 files changed, 1376 insertions(+), 4 deletions(-)
 create mode 100644 tools/testing/nvdimm/Kbuild
 create mode 100644 tools/testing/nvdimm/Makefile
 create mode 100644 tools/testing/nvdimm/config_check.c
 create mode 100644 tools/testing/nvdimm/test/Kbuild
 create mode 100644 tools/testing/nvdimm/test/iomap.c
 create mode 100644 tools/testing/nvdimm/test/nfit.c
 create mode 100644 tools/testing/nvdimm/test/nfit_test.h

diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index 3a77709fd394..9363e4b0e6a7 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -29,10 +29,11 @@ MODULE_PARM_DESC(force_enable_dimms, "Ignore _STA (ACPI DIMM device) status");
 
 static u8 nfit_uuid[NFIT_UUID_MAX][16];
 
-static const u8 *to_nfit_uuid(enum nfit_uuids id)
+const u8 *to_nfit_uuid(enum nfit_uuids id)
 {
 	return nfit_uuid[id];
 }
+EXPORT_SYMBOL(to_nfit_uuid);
 
 static struct acpi_nfit_desc *to_acpi_nfit_desc(
 		struct nvdimm_bus_descriptor *nd_desc)
@@ -577,11 +578,12 @@ static struct attribute_group acpi_nfit_attribute_group = {
 	.attrs = acpi_nfit_attributes,
 };
 
-static const struct attribute_group *acpi_nfit_attribute_groups[] = {
+const struct attribute_group *acpi_nfit_attribute_groups[] = {
 	&nvdimm_bus_attribute_group,
 	&acpi_nfit_attribute_group,
 	NULL,
 };
+EXPORT_SYMBOL_GPL(acpi_nfit_attribute_groups);
 
 static struct acpi_nfit_memory_map *to_nfit_memdev(struct device *dev)
 {
@@ -1319,7 +1321,7 @@ static int acpi_nfit_init_mapping(struct acpi_nfit_desc *acpi_desc,
 		ndbr_desc = to_blk_region_desc(ndr_desc);
 		ndbr_desc->enable = acpi_nfit_blk_region_enable;
 		ndbr_desc->disable = acpi_nfit_blk_region_disable;
-		ndbr_desc->do_io = acpi_nfit_blk_region_do_io;
+		ndbr_desc->do_io = acpi_desc->blk_do_io;
 		if (!nvdimm_blk_region_create(acpi_desc->nvdimm_bus, ndr_desc))
 			return -ENOMEM;
 		break;
@@ -1403,7 +1405,7 @@ static int acpi_nfit_register_regions(struct acpi_nfit_desc *acpi_desc)
 	return 0;
 }
 
-static int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
+int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
 {
 	struct device *dev = acpi_desc->dev;
 	const void *end;
@@ -1442,6 +1444,7 @@ static int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
 
 	return acpi_nfit_register_regions(acpi_desc);
 }
+EXPORT_SYMBOL_GPL(acpi_nfit_init);
 
 static int acpi_nfit_add(struct acpi_device *adev)
 {
@@ -1466,6 +1469,7 @@ static int acpi_nfit_add(struct acpi_device *adev)
 	dev_set_drvdata(dev, acpi_desc);
 	acpi_desc->dev = dev;
 	acpi_desc->nfit = (struct acpi_table_nfit *) tbl;
+	acpi_desc->blk_do_io = acpi_nfit_blk_region_do_io;
 	nd_desc = &acpi_desc->nd_desc;
 	nd_desc->provider_name = "ACPI.NFIT";
 	nd_desc->ndctl = acpi_nfit_ctl;
diff --git a/drivers/acpi/nfit.h b/drivers/acpi/nfit.h
index 7bd38b7baf39..c62fffea8423 100644
--- a/drivers/acpi/nfit.h
+++ b/drivers/acpi/nfit.h
@@ -93,6 +93,8 @@ struct acpi_nfit_desc {
 	struct nvdimm_bus *nvdimm_bus;
 	struct device *dev;
 	unsigned long dimm_dsm_force_en;
+	int (*blk_do_io)(struct nd_blk_region *ndbr, resource_size_t dpa,
+			void *iobuf, u64 len, int rw);
 };
 
 enum nd_blk_mmio_selector {
@@ -146,4 +148,8 @@ static inline struct acpi_nfit_desc *to_acpi_desc(
 {
 	return container_of(nd_desc, struct acpi_nfit_desc, nd_desc);
 }
+
+const u8 *to_nfit_uuid(enum nfit_uuids id);
+int acpi_nfit_init(struct acpi_nfit_desc *nfit, acpi_size sz);
+extern const struct attribute_group *acpi_nfit_attribute_groups[];
 #endif /* __NFIT_H__ */
diff --git a/tools/testing/nvdimm/Kbuild b/tools/testing/nvdimm/Kbuild
new file mode 100644
index 000000000000..df624d7ea652
--- /dev/null
+++ b/tools/testing/nvdimm/Kbuild
@@ -0,0 +1,40 @@
+ldflags-y += --wrap=ioremap_cache
+ldflags-y += --wrap=ioremap_nocache
+ldflags-y += --wrap=iounmap
+ldflags-y += --wrap=__request_region
+ldflags-y += --wrap=__release_region
+
+DRIVERS := ../../../drivers
+NVDIMM_SRC := $(DRIVERS)/nvdimm
+ACPI_SRC := $(DRIVERS)/acpi
+
+obj-$(CONFIG_LIBNVDIMM) += libnvdimm.o
+obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
+obj-$(CONFIG_ND_BTT) += nd_btt.o
+obj-$(CONFIG_ND_BLK) += nd_blk.o
+obj-$(CONFIG_ACPI_NFIT) += nfit.o
+
+nfit-y := $(ACPI_SRC)/nfit.o
+nfit-y += config_check.o
+
+nd_pmem-y := $(NVDIMM_SRC)/pmem.o
+nd_pmem-y += config_check.o
+
+nd_btt-y := $(NVDIMM_SRC)/btt.o
+nd_btt-y += config_check.o
+
+nd_blk-y := $(NVDIMM_SRC)/blk.o
+nd_blk-y += config_check.o
+
+libnvdimm-y := $(NVDIMM_SRC)/core.o
+libnvdimm-y += $(NVDIMM_SRC)/bus.o
+libnvdimm-y += $(NVDIMM_SRC)/dimm_devs.o
+libnvdimm-y += $(NVDIMM_SRC)/dimm.o
+libnvdimm-y += $(NVDIMM_SRC)/region_devs.o
+libnvdimm-y += $(NVDIMM_SRC)/region.o
+libnvdimm-y += $(NVDIMM_SRC)/namespace_devs.o
+libnvdimm-y += $(NVDIMM_SRC)/label.o
+libnvdimm-$(CONFIG_ND_BTT_DEVS) += $(NVDIMM_SRC)/btt_devs.o
+libnvdimm-y += config_check.o
+
+obj-m += test/
diff --git a/tools/testing/nvdimm/Makefile b/tools/testing/nvdimm/Makefile
new file mode 100644
index 000000000000..3dfe024b4e7e
--- /dev/null
+++ b/tools/testing/nvdimm/Makefile
@@ -0,0 +1,7 @@
+KDIR ?= ../../../
+
+default:
+	$(MAKE) -C $(KDIR) M=$$PWD
+
+install: default
+	$(MAKE) -C $(KDIR) M=$$PWD modules_install
diff --git a/tools/testing/nvdimm/config_check.c b/tools/testing/nvdimm/config_check.c
new file mode 100644
index 000000000000..f2c7615554eb
--- /dev/null
+++ b/tools/testing/nvdimm/config_check.c
@@ -0,0 +1,15 @@
+#include <linux/kconfig.h>
+#include <linux/bug.h>
+
+void check(void)
+{
+	/*
+	 * These kconfig symbols must be set to "m" for nfit_test to
+	 * load and operate.
+	 */
+	BUILD_BUG_ON(!IS_MODULE(CONFIG_LIBNVDIMM));
+	BUILD_BUG_ON(!IS_MODULE(CONFIG_BLK_DEV_PMEM));
+	BUILD_BUG_ON(!IS_MODULE(CONFIG_ND_BTT));
+	BUILD_BUG_ON(!IS_MODULE(CONFIG_ND_BLK));
+	BUILD_BUG_ON(!IS_MODULE(CONFIG_ACPI_NFIT));
+}
diff --git a/tools/testing/nvdimm/test/Kbuild b/tools/testing/nvdimm/test/Kbuild
new file mode 100644
index 000000000000..9241064970fe
--- /dev/null
+++ b/tools/testing/nvdimm/test/Kbuild
@@ -0,0 +1,8 @@
+ccflags-y := -I$(src)/../../../../drivers/nvdimm/
+ccflags-y += -I$(src)/../../../../drivers/acpi/
+
+obj-m += nfit_test.o
+obj-m += nfit_test_iomap.o
+
+nfit_test-y := nfit.o
+nfit_test_iomap-y := iomap.o
diff --git a/tools/testing/nvdimm/test/iomap.c b/tools/testing/nvdimm/test/iomap.c
new file mode 100644
index 000000000000..c85a6f6ba559
--- /dev/null
+++ b/tools/testing/nvdimm/test/iomap.c
@@ -0,0 +1,151 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/rculist.h>
+#include <linux/export.h>
+#include <linux/ioport.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/io.h>
+#include "nfit_test.h"
+
+static LIST_HEAD(iomap_head);
+
+static struct iomap_ops {
+	nfit_test_lookup_fn nfit_test_lookup;
+	struct list_head list;
+} iomap_ops = {
+	.list = LIST_HEAD_INIT(iomap_ops.list),
+};
+
+void nfit_test_setup(nfit_test_lookup_fn lookup)
+{
+	iomap_ops.nfit_test_lookup = lookup;
+	list_add_rcu(&iomap_ops.list, &iomap_head);
+}
+EXPORT_SYMBOL(nfit_test_setup);
+
+void nfit_test_teardown(void)
+{
+	list_del_rcu(&iomap_ops.list);
+	synchronize_rcu();
+}
+EXPORT_SYMBOL(nfit_test_teardown);
+
+static struct nfit_test_resource *get_nfit_res(resource_size_t resource)
+{
+	struct iomap_ops *ops;
+
+	ops = list_first_or_null_rcu(&iomap_head, typeof(*ops), list);
+	if (ops)
+		return ops->nfit_test_lookup(resource);
+	return NULL;
+}
+
+void __iomem *__nfit_test_ioremap(resource_size_t offset, unsigned long size,
+		void __iomem *(*fallback_fn)(resource_size_t, unsigned long))
+{
+	struct nfit_test_resource *nfit_res;
+
+	rcu_read_lock();
+	nfit_res = get_nfit_res(offset);
+	rcu_read_unlock();
+	if (nfit_res)
+		return (void __iomem *) nfit_res->buf + offset
+			- nfit_res->res->start;
+	return fallback_fn(offset, size);
+}
+
+void __iomem *__wrap_ioremap_cache(resource_size_t offset, unsigned long size)
+{
+	return __nfit_test_ioremap(offset, size, ioremap_cache);
+}
+EXPORT_SYMBOL(__wrap_ioremap_cache);
+
+void __iomem *__wrap_ioremap_nocache(resource_size_t offset, unsigned long size)
+{
+	return __nfit_test_ioremap(offset, size, ioremap_nocache);
+}
+EXPORT_SYMBOL(__wrap_ioremap_nocache);
+
+void __wrap_iounmap(volatile void __iomem *addr)
+{
+	struct nfit_test_resource *nfit_res;
+
+	rcu_read_lock();
+	nfit_res = get_nfit_res((unsigned long) addr);
+	rcu_read_unlock();
+	if (nfit_res)
+		return;
+	return iounmap(addr);
+}
+EXPORT_SYMBOL(__wrap_iounmap);
+
+struct resource *__wrap___request_region(struct resource *parent,
+		resource_size_t start, resource_size_t n, const char *name,
+		int flags)
+{
+	struct nfit_test_resource *nfit_res;
+
+	if (parent == &iomem_resource) {
+		rcu_read_lock();
+		nfit_res = get_nfit_res(start);
+		rcu_read_unlock();
+		if (nfit_res) {
+			struct resource *res = nfit_res->res + 1;
+
+			if (start + n > nfit_res->res->start
+					+ resource_size(nfit_res->res)) {
+				pr_debug("%s: start: %llx n: %llx overflow: %pr\n",
+						__func__, start, n,
+						nfit_res->res);
+				return NULL;
+			}
+
+			res->start = start;
+			res->end = start + n - 1;
+			res->name = name;
+			res->flags = resource_type(parent);
+			res->flags |= IORESOURCE_BUSY | flags;
+			pr_debug("%s: %pr\n", __func__, res);
+			return res;
+		}
+	}
+	return __request_region(parent, start, n, name, flags);
+}
+EXPORT_SYMBOL(__wrap___request_region);
+
+void __wrap___release_region(struct resource *parent, resource_size_t start,
+				resource_size_t n)
+{
+	struct nfit_test_resource *nfit_res;
+
+	if (parent == &iomem_resource) {
+		rcu_read_lock();
+		nfit_res = get_nfit_res(start);
+		rcu_read_unlock();
+		if (nfit_res) {
+			struct resource *res = nfit_res->res + 1;
+
+			if (start != res->start || resource_size(res) != n)
+				pr_info("%s: start: %llx n: %llx mismatch: %pr\n",
+						__func__, start, n, res);
+			else
+				memset(res, 0, sizeof(*res));
+			return;
+		}
+	}
+	__release_region(parent, start, n);
+}
+EXPORT_SYMBOL(__wrap___release_region);
+
+MODULE_LICENSE("GPL v2");
diff --git a/tools/testing/nvdimm/test/nfit.c b/tools/testing/nvdimm/test/nfit.c
new file mode 100644
index 000000000000..416f8fbf9881
--- /dev/null
+++ b/tools/testing/nvdimm/test/nfit.c
@@ -0,0 +1,1112 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/platform_device.h>
+#include <linux/dma-mapping.h>
+#include <linux/libnvdimm.h>
+#include <linux/device.h>
+#include <linux/module.h>
+#include <linux/ndctl.h>
+#include <linux/sizes.h>
+#include <linux/slab.h>
+#include <nfit.h>
+#include <nd.h>
+#include "nfit_test.h"
+
+/*
+ * Generate an NFIT table to describe the following topology:
+ *
+ * BUS0: Interleaved PMEM regions, and aliasing with BLK regions
+ *
+ *                     (a)                       (b)            DIMM   BLK-REGION
+ *           +----------+--------------+----------+---------+
+ * +------+  |  blk2.0  |     pm0.0    |  blk2.1  |  pm1.0  |    0      region2
+ * | imc0 +--+- - - - - region0 - - - -+----------+         +
+ * +--+---+  |  blk3.0  |     pm0.0    |  blk3.1  |  pm1.0  |    1      region3
+ *    |      +----------+--------------v----------v         v
+ * +--+---+                            |                    |
+ * | cpu0 |                                    region1
+ * +--+---+                            |                    |
+ *    |      +-------------------------^----------^         ^
+ * +--+---+  |                 blk4.0             |  pm1.0  |    2      region4
+ * | imc1 +--+-------------------------+----------+         +
+ * +------+  |                 blk5.0             |  pm1.0  |    3      region5
+ *           +-------------------------+----------+-+-------+
+ *
+ * *) In this layout we have four dimms and two memory controllers in one
+ *    socket.  Each unique interface (BLK or PMEM) to DPA space
+ *    is identified by a region device with a dynamically assigned id.
+ *
+ * *) The first portion of dimm0 and dimm1 is interleaved as REGION0.
+ *    A single PMEM namespace "pm0.0" is created using half of the
+ *    REGION0 SPA-range.  REGION0 spans dimm0 and dimm1.  PMEM namespaces
+ *    allocate from the bottom of a region.  The unallocated
+ *    portion of REGION0 aliases with REGION2 and REGION3.  That
+ *    unallocated capacity is reclaimed as BLK namespaces ("blk2.0" and
+ *    "blk3.0") starting at the base of each DIMM to offset (a) in those
+ *    DIMMs.  "pm0.0", "blk2.0" and "blk3.0" are free-form readable
+ *    names that can be assigned to a namespace.
+ *
+ * *) In the last portion of dimm0 and dimm1 we have an interleaved
+ *    SPA range, REGION1, that spans those two dimms as well as dimm2
+ *    and dimm3.  Some of REGION1 is allocated to a PMEM namespace named
+ *    "pm1.0"; the rest is reclaimed in 4 BLK namespaces (one for each
+ *    dimm in the interleave set), "blk2.1", "blk3.1", "blk4.0", and
+ *    "blk5.0".
+ *
+ * *) The portions of dimm2 and dimm3 that do not participate in the
+ *    REGION1 interleaved SPA range (i.e. the DPA addresses below offset
+ *    (b)) are also included in the "blk4.0" and "blk5.0" namespaces.
+ *    Note that BLK namespaces need not be contiguous in DPA-space, and
+ *    can consume aliased capacity from multiple interleave sets.
+ *
+ * BUS1: Legacy NVDIMM (single contiguous range)
+ *
+ *  region2
+ * +---------------------+
+ * |---------------------|
+ * ||       pm2.0       ||
+ * |---------------------|
+ * +---------------------+
+ *
+ * *) An NFIT-table may describe a simple system-physical-address range
+ *    with no BLK aliasing.  This type of region may optionally
+ *    reference an NVDIMM.
+ */
+enum {
+	NUM_PM  = 2,
+	NUM_DCR = 4,
+	NUM_BDW = NUM_DCR,
+	NUM_SPA = NUM_PM + NUM_DCR + NUM_BDW,
+	NUM_MEM = NUM_DCR + NUM_BDW + 2 /* spa0 iset */ + 4 /* spa1 iset */,
+	DIMM_SIZE = SZ_32M,
+	LABEL_SIZE = SZ_128K,
+	SPA0_SIZE = DIMM_SIZE,
+	SPA1_SIZE = DIMM_SIZE*2,
+	SPA2_SIZE = DIMM_SIZE,
+	BDW_SIZE = 64 << 8,
+	DCR_SIZE = 12,
+	NUM_NFITS = 2, /* permit testing multiple NFITs per system */
+};
+
+struct nfit_test_dcr {
+	__le64 bdw_addr;
+	__le32 bdw_status;
+	__u8 aperature[BDW_SIZE];
+};
+
+#define NFIT_DIMM_HANDLE(node, socket, imc, chan, dimm) \
+	(((node & 0xfff) << 16) | ((socket & 0xf) << 12) \
+	 | ((imc & 0xf) << 8) | ((chan & 0xf) << 4) | (dimm & 0xf))
+
+static u32 handle[NUM_DCR] = {
+	[0] = NFIT_DIMM_HANDLE(0, 0, 0, 0, 0),
+	[1] = NFIT_DIMM_HANDLE(0, 0, 0, 0, 1),
+	[2] = NFIT_DIMM_HANDLE(0, 0, 1, 0, 0),
+	[3] = NFIT_DIMM_HANDLE(0, 0, 1, 0, 1),
+};
+
+struct nfit_test {
+	struct acpi_nfit_desc acpi_desc;
+	struct platform_device pdev;
+	struct list_head resources;
+	void *nfit_buf;
+	dma_addr_t nfit_dma;
+	size_t nfit_size;
+	int num_dcr;
+	int num_pm;
+	void **dimm;
+	dma_addr_t *dimm_dma;
+	void **label;
+	dma_addr_t *label_dma;
+	void **spa_set;
+	dma_addr_t *spa_set_dma;
+	struct nfit_test_dcr **dcr;
+	dma_addr_t *dcr_dma;
+	int (*alloc)(struct nfit_test *t);
+	void (*setup)(struct nfit_test *t);
+};
+
+static struct nfit_test *to_nfit_test(struct device *dev)
+{
+	struct platform_device *pdev = to_platform_device(dev);
+
+	return container_of(pdev, struct nfit_test, pdev);
+}
+
+static int nfit_test_ctl(struct nvdimm_bus_descriptor *nd_desc,
+		struct nvdimm *nvdimm, unsigned int cmd, void *buf,
+		unsigned int buf_len)
+{
+	struct acpi_nfit_desc *acpi_desc = to_acpi_desc(nd_desc);
+	struct nfit_test *t = container_of(acpi_desc, typeof(*t), acpi_desc);
+	struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm);
+	int i, rc;
+
+	if (!nfit_mem || !test_bit(cmd, &nfit_mem->dsm_mask))
+		return -ENXIO;
+
+	/* lookup label space for the given dimm */
+	for (i = 0; i < ARRAY_SIZE(handle); i++)
+		if (__to_nfit_memdev(nfit_mem)->device_handle == handle[i])
+			break;
+	if (i >= ARRAY_SIZE(handle))
+		return -ENXIO;
+
+	switch (cmd) {
+	case ND_CMD_GET_CONFIG_SIZE: {
+		struct nd_cmd_get_config_size *nd_cmd = buf;
+
+		if (buf_len < sizeof(*nd_cmd))
+			return -EINVAL;
+		nd_cmd->status = 0;
+		nd_cmd->config_size = LABEL_SIZE;
+		nd_cmd->max_xfer = SZ_4K;
+		rc = 0;
+		break;
+	}
+	case ND_CMD_GET_CONFIG_DATA: {
+		struct nd_cmd_get_config_data_hdr *nd_cmd = buf;
+		unsigned int len, offset = nd_cmd->in_offset;
+
+		if (buf_len < sizeof(*nd_cmd))
+			return -EINVAL;
+		if (offset >= LABEL_SIZE)
+			return -EINVAL;
+		if (nd_cmd->in_length + sizeof(*nd_cmd) > buf_len)
+			return -EINVAL;
+
+		nd_cmd->status = 0;
+		len = min(nd_cmd->in_length, LABEL_SIZE - offset);
+		memcpy(nd_cmd->out_buf, t->label[i] + offset, len);
+		rc = buf_len - sizeof(*nd_cmd) - len;
+		break;
+	}
+	case ND_CMD_SET_CONFIG_DATA: {
+		struct nd_cmd_set_config_hdr *nd_cmd = buf;
+		unsigned int len, offset = nd_cmd->in_offset;
+		u32 *status;
+
+		if (buf_len < sizeof(*nd_cmd))
+			return -EINVAL;
+		if (offset >= LABEL_SIZE)
+			return -EINVAL;
+		if (nd_cmd->in_length + sizeof(*nd_cmd) + 4 > buf_len)
+			return -EINVAL;
+
+		status = buf + nd_cmd->in_length + sizeof(*nd_cmd);
+		*status = 0;
+		len = min(nd_cmd->in_length, LABEL_SIZE - offset);
+		memcpy(t->label[i] + offset, nd_cmd->in_buf, len);
+		rc = buf_len - sizeof(*nd_cmd) - (len + 4);
+		break;
+	}
+	default:
+		return -ENOTTY;
+	}
+
+	return rc;
+}
+
+static DEFINE_SPINLOCK(nfit_test_lock);
+static struct nfit_test *instances[NUM_NFITS];
+
+static void release_nfit_res(void *data)
+{
+	struct nfit_test_resource *nfit_res = data;
+	struct resource *res = nfit_res->res;
+
+	spin_lock(&nfit_test_lock);
+	list_del(&nfit_res->list);
+	spin_unlock(&nfit_test_lock);
+
+	if (is_vmalloc_addr(nfit_res->buf))
+		vfree(nfit_res->buf);
+	else
+		dma_free_coherent(nfit_res->dev, resource_size(res),
+				nfit_res->buf, res->start);
+	kfree(res);
+	kfree(nfit_res);
+}
+
+static void *__test_alloc(struct nfit_test *t, size_t size, dma_addr_t *dma,
+		void *buf)
+{
+	struct device *dev = &t->pdev.dev;
+	struct resource *res = kzalloc(sizeof(*res) * 2, GFP_KERNEL);
+	struct nfit_test_resource *nfit_res = kzalloc(sizeof(*nfit_res),
+			GFP_KERNEL);
+	int rc;
+
+	if (!res || !buf || !nfit_res)
+		goto err;
+	rc = devm_add_action(dev, release_nfit_res, nfit_res);
+	if (rc)
+		goto err;
+	INIT_LIST_HEAD(&nfit_res->list);
+	memset(buf, 0, size);
+	nfit_res->dev = dev;
+	nfit_res->buf = buf;
+	nfit_res->res = res;
+	res->start = *dma;
+	res->end = *dma + size - 1;
+	res->name = "NFIT";
+	spin_lock(&nfit_test_lock);
+	list_add(&nfit_res->list, &t->resources);
+	spin_unlock(&nfit_test_lock);
+
+	return nfit_res->buf;
+ err:
+	if (buf && !is_vmalloc_addr(buf))
+		dma_free_coherent(dev, size, buf, *dma);
+	else if (buf)
+		vfree(buf);
+	kfree(res);
+	kfree(nfit_res);
+	return NULL;
+}
+
+static void *test_alloc(struct nfit_test *t, size_t size, dma_addr_t *dma)
+{
+	void *buf = vmalloc(size);
+
+	*dma = (unsigned long) buf;
+	return __test_alloc(t, size, dma, buf);
+}
+
+static void *test_alloc_coherent(struct nfit_test *t, size_t size,
+		dma_addr_t *dma)
+{
+	struct device *dev = &t->pdev.dev;
+	void *buf = dma_alloc_coherent(dev, size, dma, GFP_KERNEL);
+
+	return __test_alloc(t, size, dma, buf);
+}
+
+static struct nfit_test_resource *nfit_test_lookup(resource_size_t addr)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(instances); i++) {
+		struct nfit_test_resource *n, *nfit_res = NULL;
+		struct nfit_test *t = instances[i];
+
+		if (!t)
+			continue;
+		spin_lock(&nfit_test_lock);
+		list_for_each_entry(n, &t->resources, list) {
+			if (addr >= n->res->start && (addr < n->res->start
+						+ resource_size(n->res))) {
+				nfit_res = n;
+				break;
+			} else if (addr >= (unsigned long) n->buf
+					&& (addr < (unsigned long) n->buf
+						+ resource_size(n->res))) {
+				nfit_res = n;
+				break;
+			}
+		}
+		spin_unlock(&nfit_test_lock);
+		if (nfit_res)
+			return nfit_res;
+	}
+
+	return NULL;
+}
+
+static int nfit_test0_alloc(struct nfit_test *t)
+{
+	size_t nfit_size = sizeof(struct acpi_table_nfit)
+			+ sizeof(struct acpi_nfit_system_address) * NUM_SPA
+			+ sizeof(struct acpi_nfit_memory_map) * NUM_MEM
+			+ sizeof(struct acpi_nfit_control_region) * NUM_DCR
+			+ sizeof(struct acpi_nfit_data_region) * NUM_BDW;
+	int i;
+
+	t->nfit_buf = test_alloc(t, nfit_size, &t->nfit_dma);
+	if (!t->nfit_buf)
+		return -ENOMEM;
+	t->nfit_size = nfit_size;
+
+	t->spa_set[0] = test_alloc_coherent(t, SPA0_SIZE, &t->spa_set_dma[0]);
+	if (!t->spa_set[0])
+		return -ENOMEM;
+
+	t->spa_set[1] = test_alloc_coherent(t, SPA1_SIZE, &t->spa_set_dma[1]);
+	if (!t->spa_set[1])
+		return -ENOMEM;
+
+	for (i = 0; i < NUM_DCR; i++) {
+		t->dimm[i] = test_alloc(t, DIMM_SIZE, &t->dimm_dma[i]);
+		if (!t->dimm[i])
+			return -ENOMEM;
+
+		t->label[i] = test_alloc(t, LABEL_SIZE, &t->label_dma[i]);
+		if (!t->label[i])
+			return -ENOMEM;
+		sprintf(t->label[i], "label%d", i);
+	}
+
+	for (i = 0; i < NUM_DCR; i++) {
+		t->dcr[i] = test_alloc(t, LABEL_SIZE, &t->dcr_dma[i]);
+		if (!t->dcr[i])
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static int nfit_test1_alloc(struct nfit_test *t)
+{
+	size_t nfit_size = sizeof(struct acpi_table_nfit)
+		+ sizeof(struct acpi_nfit_system_address)
+		+ sizeof(struct acpi_nfit_memory_map)
+		+ sizeof(struct acpi_nfit_control_region);
+
+	t->nfit_buf = test_alloc(t, nfit_size, &t->nfit_dma);
+	if (!t->nfit_buf)
+		return -ENOMEM;
+	t->nfit_size = nfit_size;
+
+	t->spa_set[0] = test_alloc_coherent(t, SPA2_SIZE, &t->spa_set_dma[0]);
+	if (!t->spa_set[0])
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void nfit_test_init_header(struct acpi_table_nfit *nfit, size_t size)
+{
+	memcpy(nfit->header.signature, ACPI_SIG_NFIT, 4);
+	nfit->header.length = size;
+	nfit->header.revision = 1;
+	memcpy(nfit->header.oem_id, "LIBND", 6);
+	memcpy(nfit->header.oem_table_id, "TEST", 5);
+	nfit->header.oem_revision = 1;
+	memcpy(nfit->header.asl_compiler_id, "TST", 4);
+	nfit->header.asl_compiler_revision = 1;
+}
+
+static void nfit_test0_setup(struct nfit_test *t)
+{
+	struct nvdimm_bus_descriptor *nd_desc;
+	struct acpi_nfit_desc *acpi_desc;
+	struct acpi_nfit_memory_map *memdev;
+	void *nfit_buf = t->nfit_buf;
+	size_t size = t->nfit_size;
+	struct acpi_nfit_system_address *spa;
+	struct acpi_nfit_control_region *dcr;
+	struct acpi_nfit_data_region *bdw;
+	unsigned int offset;
+
+	nfit_test_init_header(nfit_buf, size);
+
+	/*
+	 * spa0 (interleave first half of dimm0 and dimm1, note storage
+	 * does not actually alias the related block-data-window
+	 * regions)
+	 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit);
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_PM), 16);
+	spa->range_index = 0+1;
+	spa->address = t->spa_set_dma[0];
+	spa->length = SPA0_SIZE;
+
+	/*
+	 * spa1 (interleave last half of the 4 DIMMS, note storage
+	 * does not actually alias the related block-data-window
+	 * regions)
+	 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa);
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_PM), 16);
+	spa->range_index = 1+1;
+	spa->address = t->spa_set_dma[1];
+	spa->length = SPA1_SIZE;
+
+	/* spa2 (dcr0) dimm0 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 2;
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_DCR), 16);
+	spa->range_index = 2+1;
+	spa->address = t->dcr_dma[0];
+	spa->length = DCR_SIZE;
+
+	/* spa3 (dcr1) dimm1 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 3;
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_DCR), 16);
+	spa->range_index = 3+1;
+	spa->address = t->dcr_dma[1];
+	spa->length = DCR_SIZE;
+
+	/* spa4 (dcr2) dimm2 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 4;
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_DCR), 16);
+	spa->range_index = 4+1;
+	spa->address = t->dcr_dma[2];
+	spa->length = DCR_SIZE;
+
+	/* spa5 (dcr3) dimm3 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 5;
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_DCR), 16);
+	spa->range_index = 5+1;
+	spa->address = t->dcr_dma[3];
+	spa->length = DCR_SIZE;
+
+	/* spa6 (bdw for dcr0) dimm0 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 6;
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_BDW), 16);
+	spa->range_index = 6+1;
+	spa->address = t->dimm_dma[0];
+	spa->length = DIMM_SIZE;
+
+	/* spa7 (bdw for dcr1) dimm1 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 7;
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_BDW), 16);
+	spa->range_index = 7+1;
+	spa->address = t->dimm_dma[1];
+	spa->length = DIMM_SIZE;
+
+	/* spa8 (bdw for dcr2) dimm2 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 8;
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_BDW), 16);
+	spa->range_index = 8+1;
+	spa->address = t->dimm_dma[2];
+	spa->length = DIMM_SIZE;
+
+	/* spa9 (bdw for dcr3) dimm3 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 9;
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_BDW), 16);
+	spa->range_index = 9+1;
+	spa->address = t->dimm_dma[3];
+	spa->length = DIMM_SIZE;
+
+	offset = sizeof(struct acpi_table_nfit) + sizeof(*spa) * 10;
+	/* mem-region0 (spa0, dimm0) */
+	memdev = nfit_buf + offset;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[0];
+	memdev->physical_id = 0;
+	memdev->region_id = 0;
+	memdev->range_index = 0+1;
+	memdev->region_index = 0+1;
+	memdev->region_size = SPA0_SIZE/2;
+	memdev->region_offset = t->spa_set_dma[0];
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 2;
+
+	/* mem-region1 (spa0, dimm1) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map);
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[1];
+	memdev->physical_id = 1;
+	memdev->region_id = 0;
+	memdev->range_index = 0+1;
+	memdev->region_index = 1+1;
+	memdev->region_size = SPA0_SIZE/2;
+	memdev->region_offset = t->spa_set_dma[0] + SPA0_SIZE/2;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 2;
+
+	/* mem-region2 (spa1, dimm0) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 2;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[0];
+	memdev->physical_id = 0;
+	memdev->region_id = 1;
+	memdev->range_index = 1+1;
+	memdev->region_index = 0+1;
+	memdev->region_size = SPA1_SIZE/4;
+	memdev->region_offset = t->spa_set_dma[1];
+	memdev->address = SPA0_SIZE/2;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 4;
+
+	/* mem-region3 (spa1, dimm1) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 3;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[1];
+	memdev->physical_id = 1;
+	memdev->region_id = 1;
+	memdev->range_index = 1+1;
+	memdev->region_index = 1+1;
+	memdev->region_size = SPA1_SIZE/4;
+	memdev->region_offset = t->spa_set_dma[1] + SPA1_SIZE/4;
+	memdev->address = SPA0_SIZE/2;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 4;
+
+	/* mem-region4 (spa1, dimm2) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 4;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[2];
+	memdev->physical_id = 2;
+	memdev->region_id = 0;
+	memdev->range_index = 1+1;
+	memdev->region_index = 2+1;
+	memdev->region_size = SPA1_SIZE/4;
+	memdev->region_offset = t->spa_set_dma[1] + 2*SPA1_SIZE/4;
+	memdev->address = SPA0_SIZE/2;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 4;
+
+	/* mem-region5 (spa1, dimm3) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 5;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[3];
+	memdev->physical_id = 3;
+	memdev->region_id = 0;
+	memdev->range_index = 1+1;
+	memdev->region_index = 3+1;
+	memdev->region_size = SPA1_SIZE/4;
+	memdev->region_offset = t->spa_set_dma[1] + 3*SPA1_SIZE/4;
+	memdev->address = SPA0_SIZE/2;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 4;
+
+	/* mem-region6 (spa/dcr0, dimm0) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 6;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[0];
+	memdev->physical_id = 0;
+	memdev->region_id = 0;
+	memdev->range_index = 2+1;
+	memdev->region_index = 0+1;
+	memdev->region_size = 0;
+	memdev->region_offset = 0;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 1;
+
+	/* mem-region7 (spa/dcr1, dimm1) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 7;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[1];
+	memdev->physical_id = 1;
+	memdev->region_id = 0;
+	memdev->range_index = 3+1;
+	memdev->region_index = 1+1;
+	memdev->region_size = 0;
+	memdev->region_offset = 0;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 1;
+
+	/* mem-region8 (spa/dcr2, dimm2) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 8;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[2];
+	memdev->physical_id = 2;
+	memdev->region_id = 0;
+	memdev->range_index = 4+1;
+	memdev->region_index = 2+1;
+	memdev->region_size = 0;
+	memdev->region_offset = 0;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 1;
+
+	/* mem-region9 (spa/dcr3, dimm3) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 9;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[3];
+	memdev->physical_id = 3;
+	memdev->region_id = 0;
+	memdev->range_index = 5+1;
+	memdev->region_index = 3+1;
+	memdev->region_size = 0;
+	memdev->region_offset = 0;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 1;
+
+	/* mem-region10 (spa/bdw0, dimm0) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 10;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[0];
+	memdev->physical_id = 0;
+	memdev->region_id = 0;
+	memdev->range_index = 6+1;
+	memdev->region_index = 0+1;
+	memdev->region_size = 0;
+	memdev->region_offset = 0;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 1;
+
+	/* mem-region11 (spa/bdw1, dimm1) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 11;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[1];
+	memdev->physical_id = 1;
+	memdev->region_id = 0;
+	memdev->range_index = 7+1;
+	memdev->region_index = 1+1;
+	memdev->region_size = 0;
+	memdev->region_offset = 0;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 1;
+
+	/* mem-region12 (spa/bdw2, dimm2) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 12;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[2];
+	memdev->physical_id = 2;
+	memdev->region_id = 0;
+	memdev->range_index = 8+1;
+	memdev->region_index = 2+1;
+	memdev->region_size = 0;
+	memdev->region_offset = 0;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 1;
+
+	/* mem-region13 (spa/bdw3, dimm3) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 13;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[3];
+	memdev->physical_id = 3;
+	memdev->region_id = 0;
+	memdev->range_index = 9+1;
+	memdev->region_index = 3+1;
+	memdev->region_size = 0;
+	memdev->region_offset = 0;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 1;
+
+	offset = offset + sizeof(struct acpi_nfit_memory_map) * 14;
+	/* dcr-descriptor0 */
+	dcr = nfit_buf + offset;
+	dcr->header.type = ACPI_NFIT_TYPE_CONTROL_REGION;
+	dcr->header.length = sizeof(struct acpi_nfit_control_region);
+	dcr->region_index = 0+1;
+	dcr->vendor_id = 0xabcd;
+	dcr->device_id = 0;
+	dcr->revision_id = 1;
+	dcr->serial_number = ~handle[0];
+	dcr->windows = 1;
+	dcr->window_size = DCR_SIZE;
+	dcr->command_offset = 0;
+	dcr->command_size = 8;
+	dcr->status_offset = 8;
+	dcr->status_size = 4;
+
+	/* dcr-descriptor1 */
+	dcr = nfit_buf + offset + sizeof(struct acpi_nfit_control_region);
+	dcr->header.type = ACPI_NFIT_TYPE_CONTROL_REGION;
+	dcr->header.length = sizeof(struct acpi_nfit_control_region);
+	dcr->region_index = 1+1;
+	dcr->vendor_id = 0xabcd;
+	dcr->device_id = 0;
+	dcr->revision_id = 1;
+	dcr->serial_number = ~handle[1];
+	dcr->windows = 1;
+	dcr->window_size = DCR_SIZE;
+	dcr->command_offset = 0;
+	dcr->command_size = 8;
+	dcr->status_offset = 8;
+	dcr->status_size = 4;
+
+	/* dcr-descriptor2 */
+	dcr = nfit_buf + offset + sizeof(struct acpi_nfit_control_region) * 2;
+	dcr->header.type = ACPI_NFIT_TYPE_CONTROL_REGION;
+	dcr->header.length = sizeof(struct acpi_nfit_control_region);
+	dcr->region_index = 2+1;
+	dcr->vendor_id = 0xabcd;
+	dcr->device_id = 0;
+	dcr->revision_id = 1;
+	dcr->serial_number = ~handle[2];
+	dcr->windows = 1;
+	dcr->window_size = DCR_SIZE;
+	dcr->command_offset = 0;
+	dcr->command_size = 8;
+	dcr->status_offset = 8;
+	dcr->status_size = 4;
+
+	/* dcr-descriptor3 */
+	dcr = nfit_buf + offset + sizeof(struct acpi_nfit_control_region) * 3;
+	dcr->header.type = ACPI_NFIT_TYPE_CONTROL_REGION;
+	dcr->header.length = sizeof(struct acpi_nfit_control_region);
+	dcr->region_index = 3+1;
+	dcr->vendor_id = 0xabcd;
+	dcr->device_id = 0;
+	dcr->revision_id = 1;
+	dcr->serial_number = ~handle[3];
+	dcr->windows = 1;
+	dcr->window_size = DCR_SIZE;
+	dcr->command_offset = 0;
+	dcr->command_size = 8;
+	dcr->status_offset = 8;
+	dcr->status_size = 4;
+
+	offset = offset + sizeof(struct acpi_nfit_control_region) * 4;
+	/* bdw0 (spa/dcr0, dimm0) */
+	bdw = nfit_buf + offset;
+	bdw->header.type = ACPI_NFIT_TYPE_DATA_REGION;
+	bdw->header.length = sizeof(struct acpi_nfit_data_region);
+	bdw->region_index = 0+1;
+	bdw->windows = 1;
+	bdw->offset = 0;
+	bdw->size = BDW_SIZE;
+	bdw->capacity = DIMM_SIZE;
+	bdw->start_address = 0;
+
+	/* bdw1 (spa/dcr1, dimm1) */
+	bdw = nfit_buf + offset + sizeof(struct acpi_nfit_data_region);
+	bdw->header.type = ACPI_NFIT_TYPE_DATA_REGION;
+	bdw->header.length = sizeof(struct acpi_nfit_data_region);
+	bdw->region_index = 1+1;
+	bdw->windows = 1;
+	bdw->offset = 0;
+	bdw->size = BDW_SIZE;
+	bdw->capacity = DIMM_SIZE;
+	bdw->start_address = 0;
+
+	/* bdw2 (spa/dcr2, dimm2) */
+	bdw = nfit_buf + offset + sizeof(struct acpi_nfit_data_region) * 2;
+	bdw->header.type = ACPI_NFIT_TYPE_DATA_REGION;
+	bdw->header.length = sizeof(struct acpi_nfit_data_region);
+	bdw->region_index = 2+1;
+	bdw->windows = 1;
+	bdw->offset = 0;
+	bdw->size = BDW_SIZE;
+	bdw->capacity = DIMM_SIZE;
+	bdw->start_address = 0;
+
+	/* bdw3 (spa/dcr3, dimm3) */
+	bdw = nfit_buf + offset + sizeof(struct acpi_nfit_data_region) * 3;
+	bdw->header.type = ACPI_NFIT_TYPE_DATA_REGION;
+	bdw->header.length = sizeof(struct acpi_nfit_data_region);
+	bdw->region_index = 3+1;
+	bdw->windows = 1;
+	bdw->offset = 0;
+	bdw->size = BDW_SIZE;
+	bdw->capacity = DIMM_SIZE;
+	bdw->start_address = 0;
+
+	acpi_desc = &t->acpi_desc;
+	set_bit(ND_CMD_GET_CONFIG_SIZE, &acpi_desc->dimm_dsm_force_en);
+	set_bit(ND_CMD_GET_CONFIG_DATA, &acpi_desc->dimm_dsm_force_en);
+	set_bit(ND_CMD_SET_CONFIG_DATA, &acpi_desc->dimm_dsm_force_en);
+	nd_desc = &acpi_desc->nd_desc;
+	nd_desc->ndctl = nfit_test_ctl;
+}
+
+static void nfit_test1_setup(struct nfit_test *t)
+{
+	size_t size = t->nfit_size, offset;
+	void *nfit_buf = t->nfit_buf;
+	struct acpi_nfit_memory_map *memdev;
+	struct acpi_nfit_control_region *dcr;
+	struct acpi_nfit_system_address *spa;
+
+	nfit_test_init_header(nfit_buf, size);
+
+	offset = sizeof(struct acpi_table_nfit);
+	/* spa0 (flat range with no bdw aliasing) */
+	spa = nfit_buf + offset;
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_PM), 16);
+	spa->range_index = 0+1;
+	spa->address = t->spa_set_dma[0];
+	spa->length = SPA2_SIZE;
+
+	offset += sizeof(*spa);
+	/* mem-region0 (spa0, dimm0) */
+	memdev = nfit_buf + offset;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = 0;
+	memdev->physical_id = 0;
+	memdev->region_id = 0;
+	memdev->range_index = 0+1;
+	memdev->region_index = 0+1;
+	memdev->region_size = SPA2_SIZE;
+	memdev->region_offset = 0;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 1;
+
+	offset += sizeof(*memdev);
+	/* dcr-descriptor0 */
+	dcr = nfit_buf + offset;
+	dcr->header.type = ACPI_NFIT_TYPE_CONTROL_REGION;
+	dcr->header.length = sizeof(struct acpi_nfit_control_region);
+	dcr->region_index = 0+1;
+	dcr->vendor_id = 0xabcd;
+	dcr->device_id = 0;
+	dcr->revision_id = 1;
+	dcr->serial_number = ~0;
+	dcr->code = 0x201;
+	dcr->windows = 0;
+	dcr->window_size = 0;
+	dcr->command_offset = 0;
+	dcr->command_size = 0;
+	dcr->status_offset = 0;
+	dcr->status_size = 0;
+}
+
+static int nfit_test_blk_do_io(struct nd_blk_region *ndbr, resource_size_t dpa,
+		void *iobuf, u64 len, int rw)
+{
+	struct nfit_blk *nfit_blk = ndbr->blk_provider_data;
+	struct nfit_blk_mmio *mmio = &nfit_blk->mmio[BDW];
+	struct nd_region *nd_region = &ndbr->nd_region;
+	unsigned int lane;
+
+	lane = nd_region_acquire_lane(nd_region);
+	if (rw)
+		memcpy(mmio->base + dpa, iobuf, len);
+	else
+		memcpy(iobuf, mmio->base + dpa, len);
+	nd_region_release_lane(nd_region, lane);
+
+	return 0;
+}
+
+static int nfit_test_probe(struct platform_device *pdev)
+{
+	struct nvdimm_bus_descriptor *nd_desc;
+	struct acpi_nfit_desc *acpi_desc;
+	struct device *dev = &pdev->dev;
+	struct nfit_test *nfit_test;
+	int rc;
+
+	nfit_test = to_nfit_test(&pdev->dev);
+
+	/* common alloc */
+	if (nfit_test->num_dcr) {
+		int num = nfit_test->num_dcr;
+
+		nfit_test->dimm = devm_kcalloc(dev, num, sizeof(void *),
+				GFP_KERNEL);
+		nfit_test->dimm_dma = devm_kcalloc(dev, num, sizeof(dma_addr_t),
+				GFP_KERNEL);
+		nfit_test->label = devm_kcalloc(dev, num, sizeof(void *),
+				GFP_KERNEL);
+		nfit_test->label_dma = devm_kcalloc(dev, num,
+				sizeof(dma_addr_t), GFP_KERNEL);
+		nfit_test->dcr = devm_kcalloc(dev, num,
+				sizeof(struct nfit_test_dcr *), GFP_KERNEL);
+		nfit_test->dcr_dma = devm_kcalloc(dev, num,
+				sizeof(dma_addr_t), GFP_KERNEL);
+		if (nfit_test->dimm && nfit_test->dimm_dma && nfit_test->label
+				&& nfit_test->label_dma && nfit_test->dcr
+				&& nfit_test->dcr_dma)
+			/* pass */;
+		else
+			return -ENOMEM;
+	}
+
+	if (nfit_test->num_pm) {
+		int num = nfit_test->num_pm;
+
+		nfit_test->spa_set = devm_kcalloc(dev, num, sizeof(void *),
+				GFP_KERNEL);
+		nfit_test->spa_set_dma = devm_kcalloc(dev, num,
+				sizeof(dma_addr_t), GFP_KERNEL);
+		if (nfit_test->spa_set && nfit_test->spa_set_dma)
+			/* pass */;
+		else
+			return -ENOMEM;
+	}
+
+	/* per-nfit specific alloc */
+	if (nfit_test->alloc(nfit_test))
+		return -ENOMEM;
+
+	nfit_test->setup(nfit_test);
+	acpi_desc = &nfit_test->acpi_desc;
+	acpi_desc->dev = &pdev->dev;
+	acpi_desc->nfit = nfit_test->nfit_buf;
+	acpi_desc->blk_do_io = nfit_test_blk_do_io;
+	nd_desc = &acpi_desc->nd_desc;
+	nd_desc->attr_groups = acpi_nfit_attribute_groups;
+	acpi_desc->nvdimm_bus = nvdimm_bus_register(&pdev->dev, nd_desc);
+	if (!acpi_desc->nvdimm_bus)
+		return -ENXIO;
+
+	rc = acpi_nfit_init(acpi_desc, nfit_test->nfit_size);
+	if (rc) {
+		nvdimm_bus_unregister(acpi_desc->nvdimm_bus);
+		return rc;
+	}
+
+	return 0;
+}
+
+static int nfit_test_remove(struct platform_device *pdev)
+{
+	struct nfit_test *nfit_test = to_nfit_test(&pdev->dev);
+	struct acpi_nfit_desc *acpi_desc = &nfit_test->acpi_desc;
+
+	nvdimm_bus_unregister(acpi_desc->nvdimm_bus);
+
+	return 0;
+}
+
+static void nfit_test_release(struct device *dev)
+{
+	struct nfit_test *nfit_test = to_nfit_test(dev);
+
+	kfree(nfit_test);
+}
+
+static const struct platform_device_id nfit_test_id[] = {
+	{ KBUILD_MODNAME },
+	{ },
+};
+
+static struct platform_driver nfit_test_driver = {
+	.probe = nfit_test_probe,
+	.remove = nfit_test_remove,
+	.driver = {
+		.name = KBUILD_MODNAME,
+	},
+	.id_table = nfit_test_id,
+};
+
+#ifdef CONFIG_CMA_SIZE_MBYTES
+#define CMA_SIZE_MBYTES CONFIG_CMA_SIZE_MBYTES
+#else
+#define CMA_SIZE_MBYTES 0
+#endif
+
+static __init int nfit_test_init(void)
+{
+	int rc, i;
+
+	nfit_test_setup(nfit_test_lookup);
+
+	for (i = 0; i < NUM_NFITS; i++) {
+		struct nfit_test *nfit_test;
+		struct platform_device *pdev;
+		static int once;
+
+		nfit_test = kzalloc(sizeof(*nfit_test), GFP_KERNEL);
+		if (!nfit_test) {
+			rc = -ENOMEM;
+			goto err_register;
+		}
+		INIT_LIST_HEAD(&nfit_test->resources);
+		switch (i) {
+		case 0:
+			nfit_test->num_pm = NUM_PM;
+			nfit_test->num_dcr = NUM_DCR;
+			nfit_test->alloc = nfit_test0_alloc;
+			nfit_test->setup = nfit_test0_setup;
+			break;
+		case 1:
+			nfit_test->num_pm = 1;
+			nfit_test->alloc = nfit_test1_alloc;
+			nfit_test->setup = nfit_test1_setup;
+			break;
+		default:
+			rc = -EINVAL;
+			goto err_register;
+		}
+		pdev = &nfit_test->pdev;
+		pdev->name = KBUILD_MODNAME;
+		pdev->id = i;
+		pdev->dev.release = nfit_test_release;
+		rc = platform_device_register(pdev);
+		if (rc) {
+			put_device(&pdev->dev);
+			goto err_register;
+		}
+
+		rc = dma_coerce_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
+		if (rc)
+			goto err_register;
+
+		instances[i] = nfit_test;
+
+		if (!once++) {
+			dma_addr_t dma;
+			void *buf;
+
+			buf = dma_alloc_coherent(&pdev->dev, SZ_128M, &dma,
+					GFP_KERNEL);
+			if (!buf) {
+				rc = -ENOMEM;
+				dev_warn(&pdev->dev, "need 128M of free cma\n");
+				goto err_register;
+			}
+			dma_free_coherent(&pdev->dev, SZ_128M, buf, dma);
+		}
+	}
+
+	rc = platform_driver_register(&nfit_test_driver);
+	if (rc)
+		goto err_register;
+	return 0;
+
+ err_register:
+	for (i = 0; i < NUM_NFITS; i++)
+		if (instances[i])
+			platform_device_unregister(&instances[i]->pdev);
+	nfit_test_teardown();
+	return rc;
+}
+
+static __exit void nfit_test_exit(void)
+{
+	int i;
+
+	platform_driver_unregister(&nfit_test_driver);
+	for (i = 0; i < NUM_NFITS; i++)
+		platform_device_unregister(&instances[i]->pdev);
+	nfit_test_teardown();
+}
+
+module_init(nfit_test_init);
+module_exit(nfit_test_exit);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Intel Corporation");
diff --git a/tools/testing/nvdimm/test/nfit_test.h b/tools/testing/nvdimm/test/nfit_test.h
new file mode 100644
index 000000000000..96c5e16d7db9
--- /dev/null
+++ b/tools/testing/nvdimm/test/nfit_test.h
@@ -0,0 +1,29 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#ifndef __NFIT_TEST_H__
+#define __NFIT_TEST_H__
+
+struct nfit_test_resource {
+	struct list_head list;
+	struct resource *res;
+	struct device *dev;
+	void *buf;
+};
+
+typedef struct nfit_test_resource *(*nfit_test_lookup_fn)(resource_size_t);
+void __iomem *__wrap_ioremap_nocache(resource_size_t offset,
+		unsigned long size);
+void __wrap_iounmap(volatile void __iomem *addr);
+void nfit_test_setup(nfit_test_lookup_fn lookup);
+void nfit_test_teardown(void);
+#endif
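
For readers unfamiliar with the nfit_test_lookup_fn hook declared in
nfit_test.h above, the following userspace sketch shows one way a lookup
callback of that shape could let a wrapped ioremap hand back ordinary
buffers for test address ranges.  It is an illustration under
assumptions, not the code from this patch: fake_resource, test_lookup,
test_setup, and wrapped_ioremap are hypothetical names, and a real
wrapper would presumably fall through to the genuine ioremap when the
lookup misses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct nfit_test_resource. */
struct fake_resource {
	uint64_t start;	/* first fake physical address */
	uint64_t size;	/* length of the range in bytes */
	void *buf;	/* ordinary memory backing the range */
};

/* Same shape as nfit_test_lookup_fn: address in, resource (or NULL) out. */
typedef struct fake_resource *(*lookup_fn)(uint64_t addr);

static unsigned char backing[4096];
static struct fake_resource resources[] = {
	{ .start = 0x100000000ULL, .size = sizeof(backing), .buf = backing },
};

/* Return the resource containing addr, or NULL if none does. */
static struct fake_resource *test_lookup(uint64_t addr)
{
	size_t i;

	for (i = 0; i < sizeof(resources) / sizeof(resources[0]); i++) {
		struct fake_resource *res = &resources[i];

		if (addr >= res->start && addr - res->start < res->size)
			return res;
	}
	return NULL;
}

static lookup_fn active_lookup;

/* Assumed analogue of nfit_test_setup(): install the lookup callback. */
static void test_setup(lookup_fn lookup)
{
	assert(lookup != NULL);
	active_lookup = lookup;
}

/* Assumed analogue of a wrapped ioremap: redirect known ranges to buffers. */
static void *wrapped_ioremap(uint64_t offset, unsigned long size)
{
	struct fake_resource *res = active_lookup ? active_lookup(offset) : NULL;
	uint64_t delta;

	if (!res)
		return NULL;	/* a real wrapper would call the real ioremap here */
	delta = offset - res->start;
	if (size > res->size - delta)
		return NULL;	/* request runs past the end of the buffer */
	return (unsigned char *)res->buf + delta;
}
```

With test_setup(test_lookup) installed, wrapped_ioremap(0x100000010, 16)
resolves to backing + 0x10, while an address outside every registered
range yields NULL.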



* [PATCH 06/15] libnvdimm: Non-Volatile Devices
  2015-06-17 23:54 ` Dan Williams
@ 2015-06-17 23:55   ` Dan Williams
  -1 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-17 23:55 UTC (permalink / raw)
  To: axboe, linux-nvdimm
  Cc: boaz, toshi.kani, Neil Brown, Greg KH, linux-kernel,
	Andy Lutomirski, Jens Axboe, linux-acpi, H. Peter Anvin,
	linux-fsdevel, hch, mingo

Maintainer information and documentation for drivers/nvdimm

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Neil Brown <neilb@suse.de>
Cc: Greg KH <gregkh@linuxfoundation.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 Documentation/nvdimm/nvdimm.txt |  805 +++++++++++++++++++++++++++++++++++++++
 MAINTAINERS                     |   39 ++
 2 files changed, 838 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/nvdimm/nvdimm.txt

diff --git a/Documentation/nvdimm/nvdimm.txt b/Documentation/nvdimm/nvdimm.txt
new file mode 100644
index 000000000000..854f0a7c96c5
--- /dev/null
+++ b/Documentation/nvdimm/nvdimm.txt
@@ -0,0 +1,805 @@
+			  LIBNVDIMM: Non-Volatile Devices
+	      libnvdimm - kernel / libndctl - userspace helper library
+			   linux-nvdimm@lists.01.org
+				      v12
+
+
+	Glossary
+	Overview
+	    Supporting Documents
+	    Git Trees
+	LIBNVDIMM PMEM and BLK
+	Why BLK?
+	    PMEM vs BLK
+	        BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX
+	Example NVDIMM Platform
+	LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API
+	    LIBNDCTL: Context
+	        libndctl: instantiate a new library context example
+	    LIBNVDIMM/LIBNDCTL: Bus
+	        libnvdimm: control class device in /sys/class
+	        libnvdimm: bus
+	        libndctl: bus enumeration example
+	    LIBNVDIMM/LIBNDCTL: DIMM (NMEM)
+	        libnvdimm: DIMM (NMEM)
+	        libndctl: DIMM enumeration example
+	    LIBNVDIMM/LIBNDCTL: Region
+	        libnvdimm: region
+	        libndctl: region enumeration example
+	        Why Not Encode the Region Type into the Region Name?
+	        How Do I Determine the Major Type of a Region?
+	    LIBNVDIMM/LIBNDCTL: Namespace
+	        libnvdimm: namespace
+	        libndctl: namespace enumeration example
+	        libndctl: namespace creation example
+	        Why the Term "namespace"?
+	    LIBNVDIMM/LIBNDCTL: Block Translation Table "btt"
+	        libnvdimm: btt layout
+	        libndctl: btt creation example
+	Summary LIBNDCTL Diagram
+
+
+Glossary
+--------
+
+PMEM: A system-physical-address range where writes are persistent.  A
+block device composed of PMEM is capable of DAX.  A PMEM address range
+may span an interleave of several DIMMs.
+
+BLK: A set of one or more programmable memory mapped apertures provided
+by a DIMM to access its media.  This indirection precludes the
+performance benefit of interleaving, but enables DIMM-bounded failure
+modes.
+
+DPA: DIMM Physical Address, is a DIMM-relative offset.  With one DIMM in
+the system there would be a 1:1 system-physical-address:DPA association.
+Once more DIMMs are added a memory controller interleave must be
+decoded to determine the DPA associated with a given
+system-physical-address.  BLK capacity always has a 1:1 relationship
+with a single-DIMM's DPA range.
+
+DAX: File system extensions to bypass the page cache and block layer to
+mmap persistent memory, from a PMEM block device, directly into a
+process address space.
+
+BTT: Block Translation Table: Persistent memory is byte addressable.
+Existing software may have an expectation that the power-fail-atomicity
+of writes is at least one sector, 512 bytes.  The BTT is an indirection
+table with atomic update semantics to front a PMEM/BLK block device
+driver and present arbitrary atomic sector sizes.
+
+LABEL: Metadata stored on a DIMM device that partitions and identifies
+(persistently names) storage between PMEM and BLK.  It also partitions
+BLK storage to host BTTs with different parameters per BLK-partition.
+Note that traditional partition tables, GPT/MBR, are layered on top of a
+BLK or PMEM device.
+
+
+Overview
+--------
+
+The LIBNVDIMM subsystem provides support for three types of NVDIMMs, namely,
+PMEM, BLK, and NVDIMM devices that can simultaneously support both PMEM
+and BLK mode access.  These three modes of operation are described by
+the "NVDIMM Firmware Interface Table" (NFIT) in ACPI 6.  While the LIBNVDIMM
+implementation is generic and supports pre-NFIT platforms, it was guided
+by the superset of capabilities needed to support this ACPI 6 definition
+for NVDIMM resources.  The bulk of the kernel implementation is in place
+to handle the case where DPA accessible via PMEM is aliased with DPA
+accessible via BLK.  When that occurs a LABEL is needed to reserve DPA
+for exclusive access via one mode at a time.
+
+Supporting Documents
+ACPI 6: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
+NVDIMM Namespace: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
+DSM Interface Example: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
+Driver Writer's Guide: http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
+
+Git Trees
+LIBNVDIMM: https://git.kernel.org/cgit/linux/kernel/git/djbw/nvdimm.git
+LIBNDCTL: https://github.com/pmem/ndctl.git
+PMEM: https://github.com/01org/prd
+
+
+LIBNVDIMM PMEM and BLK
+----------------------
+
+Prior to the arrival of the NFIT, non-volatile memory was described to a
+system in various ad-hoc ways.  Usually only the bare minimum was
+provided, namely, a single system-physical-address range where writes
+are expected to be durable after a system power loss.  Now, the NFIT
+specification standardizes not only the description of PMEM, but also
+BLK and platform message-passing entry points for control and
+configuration.
+
+For each NVDIMM access method (PMEM, BLK), LIBNVDIMM provides a block
+device driver:
+
+    1. PMEM (nd_pmem.ko): Drives a system-physical-address range.  This
+    range is contiguous in system memory and may be interleaved (hardware
+    memory controller striped) across multiple DIMMs.  When interleaved the
+    platform may optionally provide details of which DIMMs are participating
+    in the interleave.
+
+    Note that while LIBNVDIMM describes system-physical-address ranges that may
+    alias with BLK access as ND_NAMESPACE_PMEM ranges and those without
+    alias as ND_NAMESPACE_IO ranges, to the nd_pmem driver there is no
+    distinction.  The different device-types are an implementation detail
+    that userspace can exploit to implement policies like "only interface
+    with address ranges from certain DIMMs".  It is worth noting that when
+    aliasing is present and a DIMM lacks a label, then no block device can
+    be created by default as userspace needs to do at least one allocation
+    of DPA to the PMEM range.  In contrast ND_NAMESPACE_IO ranges, once
+    registered, can be immediately attached to nd_pmem.
+
+    2. BLK (nd_blk.ko): This driver performs I/O using a set of platform
+    defined apertures.  A set of apertures will all access just one DIMM.
+    Multiple windows allow multiple concurrent accesses, much like
+    tagged-command-queuing, and would likely be used by different threads or
+    different CPUs.
+
+    The NFIT specification defines a standard format for a BLK-aperture, but
+    the spec also allows for vendor specific layouts, and non-NFIT BLK
+    implementations may use other designs for BLK I/O.  For this reason "nd_blk"
+    calls back into platform-specific code to perform the I/O.  One such
+    implementation is defined in the "Driver Writer's Guide" and "DSM
+    Interface Example".
+
+
+Why BLK?
+--------
+
+While PMEM provides direct byte-addressable CPU-load/store access to
+NVDIMM storage, it does not provide the best system RAS (recovery,
+availability, and serviceability) model.  An access to a corrupted
+system-physical-address causes a CPU exception while an access
+to a corrupted address through a BLK-aperture causes that block window
+to raise an error status in a register.  The latter is more aligned with
+the standard error model that host-bus-adapter attached disks present.
+Also, if an administrator ever wants to replace memory it is easier to
+service a system at DIMM module boundaries.  Compare this to PMEM where
+data could be interleaved in an opaque hardware specific manner across
+several DIMMs.
+
+PMEM vs BLK
+BLK-apertures solve this RAS problem, but their presence is also the
+major contributing factor to the complexity of the ND subsystem.  They
+complicate the implementation because PMEM and BLK alias in DPA space.
+Any given DIMM's DPA-range may contribute to one or more
+system-physical-address sets of interleaved DIMMs, *and* may also be
+accessed in its entirety through its BLK-aperture.  Accessing a DPA
+through a system-physical-address while simultaneously accessing the
+same DPA through a BLK-aperture has undefined results.  For this reason,
+DIMMs with this dual interface configuration include a DSM function to
+store/retrieve a LABEL.  The LABEL effectively partitions the DPA-space
+into exclusive system-physical-address and BLK-aperture accessible
+regions.  For simplicity a DIMM is allowed a PMEM "region" per each
+interleave set in which it is a member.  The remaining DPA space can be
+carved into an arbitrary number of BLK devices with discontiguous
+extents.
+
+BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX
+--------------------------------------------------
+
+One of the few
+reasons to allow multiple BLK namespaces per REGION is so that each
+BLK-namespace can be configured with a BTT with unique atomic sector
+sizes.  While a PMEM device can host a BTT the LABEL specification does
+not provide for a sector size to be specified for a PMEM namespace.
+This is due to the expectation that the primary usage model for PMEM is
+via DAX, and the BTT is incompatible with DAX.  However, for the cases
+where an application or filesystem still needs atomic sector update
+guarantees it can register a BTT on a PMEM device or partition.  See
+LIBNVDIMM/LIBNDCTL: Block Translation Table "btt".
+
+
+Example NVDIMM Platform
+-----------------------
+
+For the remainder of this document the following diagram will be
+referenced for any example sysfs layouts.
+
+
+                             (a)               (b)           DIMM   BLK-REGION
+          +-------------------+--------+--------+--------+
++------+  |       pm0.0       | blk2.0 | pm1.0  | blk2.1 |    0      region2
+| imc0 +--+- - - region0- - - +--------+        +--------+
++--+---+  |       pm0.0       | blk3.0 | pm1.0  | blk3.1 |    1      region3
+   |      +-------------------+--------v        v--------+
++--+---+                               |                 |
+| cpu0 |                                     region1
++--+---+                               |                 |
+   |      +----------------------------^        ^--------+
++--+---+  |           blk4.0           | pm1.0  | blk4.0 |    2      region4
+| imc1 +--+----------------------------|        +--------+
++------+  |           blk5.0           | pm1.0  | blk5.0 |    3      region5
+          +----------------------------+--------+--------+
+
+In this platform we have four DIMMs and two memory controllers in one
+socket.  Each unique interface (BLK or PMEM) to DPA space is identified
+by a region device with a dynamically assigned id (REGION0 - REGION5).
+
+    1. The first portion of DIMM0 and DIMM1 are interleaved as REGION0. A
+    single PMEM namespace is created in the REGION0-SPA-range that spans
+    DIMM0 and DIMM1 with a user-specified name of "pm0.0". Some of that
+    interleaved system-physical-address range is reclaimed as BLK-aperture
+    accessed space starting at DPA-offset (a) into each DIMM.  In that
+    reclaimed space we create two BLK-aperture "namespaces" from REGION2 and
+    REGION3 where "blk2.0" and "blk3.0" are just human readable names that
+    could be set to any user-desired name in the LABEL.
+
+    2. In the last portion of DIMM0 and DIMM1 we have an interleaved
+    system-physical-address range, REGION1, that spans those two DIMMs as
+    well as DIMM2 and DIMM3.  Some of REGION1 is allocated to a PMEM
+    namespace named "pm1.0"; the rest is reclaimed in 4 BLK-aperture
+    namespaces (one for each DIMM in the interleave set), "blk2.1", "blk3.1",
+    "blk4.0", and "blk5.0".
+
+    3. The portions of DIMM2 and DIMM3 that do not participate in the REGION1
+    interleaved system-physical-address range (i.e. the DPA addresses below
+    offset (b)) are also included in the "blk4.0" and "blk5.0" namespaces.
+    Note that this example shows that BLK-aperture namespaces don't need to
+    be contiguous in DPA-space.
+
+    This bus is provided by the kernel under the device
+    /sys/devices/platform/nfit_test.0 when CONFIG_NFIT_TEST is enabled and
+    the nfit_test.ko module is loaded.  This not only tests LIBNVDIMM but the
+    acpi_nfit.ko driver as well.
+
+
+LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API
+--------------------------------------------------------
+
+What follows is a description of the LIBNVDIMM sysfs layout and a
+corresponding object hierarchy diagram as viewed through the LIBNDCTL
+api.  The example sysfs paths and diagrams are relative to the Example
+NVDIMM Platform which is also the LIBNVDIMM bus used in the LIBNDCTL unit
+test.
+
+LIBNDCTL: Context
+Every api call in the LIBNDCTL library requires a context that holds the
+logging parameters and other library instance state.  The library is
+based on the libabc template:
+https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git/
+
+LIBNDCTL: instantiate a new library context example
+
+	struct ndctl_ctx *ctx;
+
+	if (ndctl_new(&ctx) == 0)
+		return ctx;
+	else
+		return NULL;
+
+LIBNVDIMM/LIBNDCTL: Bus
+-----------------------
+
+A bus has a 1:1 relationship with an NFIT.  The current expectation for
+ACPI based systems is that there is only ever one platform-global NFIT.
+That said, it is trivial to register multiple NFITs; the specification
+does not preclude it.  The infrastructure supports multiple busses and
+we use this capability to test multiple NFIT configurations in the
+unit test.
+
+LIBNVDIMM: control class device in /sys/class
+
+This character device accepts DSM messages to be passed to a DIMM
+identified by its NFIT handle.
+
+	/sys/class/nd/ndctl0
+	|-- dev
+	|-- device -> ../../../ndbus0
+	`-- subsystem -> ../../../../../../../class/nd
+
+
+
+LIBNVDIMM: bus
+
+	struct nvdimm_bus *nvdimm_bus_register(struct device *parent,
+	       struct nvdimm_bus_descriptor *nfit_desc);
+
+	/sys/devices/platform/nfit_test.0/ndbus0
+	|-- btt0
+	|-- btt_seed
+	|-- commands
+	|-- nd
+	|-- nfit
+	|-- nmem0
+	|-- nmem1
+	|-- nmem2
+	|-- nmem3
+	|-- power
+	|-- provider
+	|-- region0
+	|-- region1
+	|-- region2
+	|-- region3
+	|-- region4
+	|-- region5
+	|-- uevent
+	`-- wait_probe
+
+LIBNDCTL: bus enumeration example
+Find the bus handle that describes the bus from Example NVDIMM Platform
+
+	static struct ndctl_bus *get_bus_by_provider(struct ndctl_ctx *ctx,
+			const char *provider)
+	{
+		struct ndctl_bus *bus;
+
+		ndctl_bus_foreach(ctx, bus)
+			if (strcmp(provider, ndctl_bus_get_provider(bus)) == 0)
+				return bus;
+
+		return NULL;
+	}
+
+	bus = get_bus_by_provider(ctx, "nfit_test.0");
+
+
+LIBNVDIMM/LIBNDCTL: DIMM (NMEM)
+-------------------------------
+
+The DIMM device provides a character device for sending commands to
+hardware, and it is a container for LABELs.  If the DIMM is defined by
+NFIT then an optional 'nfit' attribute sub-directory is available to add
+NFIT-specifics.
+
+Note that the kernel device name for "DIMMs" is "nmemX".  The NFIT
+describes these devices via "Memory Device to System Physical Address
+Range Mapping Structure", and there is no requirement that they actually
+be physical DIMMs, so we use a more generic name.
+
+LIBNVDIMM: DIMM (NMEM)
+
+	struct nvdimm *nvdimm_create(struct nvdimm_bus *nvdimm_bus, void *provider_data,
+			const struct attribute_group **groups, unsigned long flags,
+			unsigned long *dsm_mask);
+
+	/sys/devices/platform/nfit_test.0/ndbus0
+	|-- nmem0
+	|   |-- available_slots
+	|   |-- commands
+	|   |-- dev
+	|   |-- devtype
+	|   |-- driver -> ../../../../../bus/nd/drivers/nvdimm
+	|   |-- modalias
+	|   |-- nfit
+	|   |   |-- device
+	|   |   |-- format
+	|   |   |-- handle
+	|   |   |-- phys_id
+	|   |   |-- rev_id
+	|   |   |-- serial
+	|   |   `-- vendor
+	|   |-- state
+	|   |-- subsystem -> ../../../../../bus/nd
+	|   `-- uevent
+	|-- nmem1
+	[..]
+
+
+LIBNDCTL: DIMM enumeration example
+
+Note, in this example we are assuming NFIT-defined DIMMs which are
+identified by an "nfit_handle", a 32-bit value where:
+Bit 3:0 DIMM number within the memory channel
+Bit 7:4 memory channel number
+Bit 11:8 memory controller ID
+Bit 15:12 socket ID (within scope of a Node controller if node controller is present)
+Bit 27:16 Node Controller ID
+Bit 31:28 Reserved
+
+	static struct ndctl_dimm *get_dimm_by_handle(struct ndctl_bus *bus,
+	       unsigned int handle)
+	{
+		struct ndctl_dimm *dimm;
+
+		ndctl_dimm_foreach(bus, dimm)
+			if (ndctl_dimm_get_handle(dimm) == handle)
+				return dimm;
+
+		return NULL;
+	}
+
+	#define DIMM_HANDLE(n, s, i, c, d) \
+		((((n) & 0xfff) << 16) | (((s) & 0xf) << 12) | (((i) & 0xf) << 8) \
+		 | (((c) & 0xf) << 4) | ((d) & 0xf))
+
+	dimm = get_dimm_by_handle(bus, DIMM_HANDLE(0, 0, 0, 0, 0));
+
+LIBNVDIMM/LIBNDCTL: Region
+--------------------------
+
+A generic REGION device is registered for each PMEM range or BLK-aperture
+set.  Per the example there are 6 regions: 2 PMEM and 4 BLK-aperture
+sets on the "nfit_test.0" bus.  The primary role of a region is to be a
+container of "mappings".  A mapping is a tuple of <DIMM,
+DPA-start-offset, length>.
+
+LIBNVDIMM provides a built-in driver for these REGION devices.  This driver
+is responsible for reconciling the aliased DPA mappings across all
+regions, parsing the LABEL, if present, and then emitting NAMESPACE
+devices with the resolved/exclusive DPA-boundaries for the nd_pmem or
+nd_blk device driver to consume.
+
+In addition to the generic attributes of "mapping"s, "interleave_ways",
+and "size", the REGION device also exports some convenience attributes.
+"nstype" indicates the integer type of namespace-device this region
+emits, "devtype" duplicates the DEVTYPE variable stored by udev at the
+'add' event, "modalias" duplicates the MODALIAS variable stored by udev
+at the 'add' event, and finally, the optional "spa_index" is provided in
+the case where the region is defined by a SPA.
+
+LIBNVDIMM: region
+
+	struct nd_region *nvdimm_pmem_region_create(struct nvdimm_bus *nvdimm_bus,
+			struct nd_region_desc *ndr_desc);
+	struct nd_region *nvdimm_blk_region_create(struct nvdimm_bus *nvdimm_bus,
+			struct nd_region_desc *ndr_desc);
+
+	/sys/devices/platform/nfit_test.0/ndbus0
+	|-- region0
+	|   |-- available_size
+	|   |-- devtype
+	|   |-- driver -> ../../../../../bus/nd/drivers/nd_region
+	|   |-- init_namespaces
+	|   |-- mapping0
+	|   |-- mapping1
+	|   |-- mappings
+	|   |-- modalias
+	|   |-- namespace0.0
+	|   |-- namespace_seed
+	|   |-- nfit
+	|   |   `-- spa_index
+	|   |-- nstype
+	|   |-- set_cookie
+	|   |-- size
+	|   |-- subsystem -> ../../../../../bus/nd
+	|   `-- uevent
+	|-- region1
+	[..]
+
+LIBNDCTL: region enumeration example
+
+Sample region retrieval routines based on NFIT-unique data like
+"spa_index" (interleave set id) for PMEM and "nfit_handle" (dimm id) for
+BLK.
+
+	static struct ndctl_region *get_pmem_region_by_spa_index(struct ndctl_bus *bus,
+			unsigned int spa_index)
+	{
+		struct ndctl_region *region;
+
+		ndctl_region_foreach(bus, region) {
+			if (ndctl_region_get_type(region) != ND_DEVICE_REGION_PMEM)
+				continue;
+			if (ndctl_region_get_spa_index(region) == spa_index)
+				return region;
+		}
+		return NULL;
+	}
+
+	static struct ndctl_region *get_blk_region_by_dimm_handle(struct ndctl_bus *bus,
+			unsigned int handle)
+	{
+		struct ndctl_region *region;
+
+		ndctl_region_foreach(bus, region) {
+			struct ndctl_mapping *map;
+
+			if (ndctl_region_get_type(region) != ND_DEVICE_REGION_BLOCK)
+				continue;
+			ndctl_mapping_foreach(region, map) {
+				struct ndctl_dimm *dimm = ndctl_mapping_get_dimm(map);
+
+				if (ndctl_dimm_get_handle(dimm) == handle)
+					return region;
+			}
+		}
+		return NULL;
+	}
+
+
+Why Not Encode the Region Type into the Region Name?
+----------------------------------------------------
+
+At first glance it seems since NFIT defines just PMEM and BLK interface
+types that we should simply name REGION devices with something derived
+from those type names.  However, the ND subsystem explicitly keeps the
+REGION name generic and expects userspace to always consider the
+region-attributes for 4 reasons:
+
+    1. There are already more than two REGION and "namespace" types.  For
+    PMEM there are two subtypes.  As mentioned previously we have PMEM where
+    the constituent DIMM devices are known and anonymous PMEM.  For BLK
+    regions the NFIT specification already anticipates vendor specific
+    implementations.  The exact distinction of what a region contains is in
+    the region-attributes not the region-name or the region-devtype.
+
+    2. A region with zero child-namespaces is a possible configuration.  For
+    example, the NFIT allows for a DCR to be published without a
+    corresponding BLK-aperture.  This equates to a DIMM that can only accept
+    control/configuration messages, but no i/o through a descendant block
+    device.  Again, this "type" is advertised in the attributes ('mappings'
+    == 0) and the name does not tell you much.
+
+    3. What if a third major interface type arises in the future?  Outside
+    of vendor specific implementations, it's not difficult to envision a
+    third class of interface type beyond BLK and PMEM.  With a generic name
+    for the REGION level of the device-hierarchy old userspace
+    implementations can still make sense of new kernel advertised
+    region-types.  Userspace can always rely on the generic region
+    attributes like "mappings", "size", etc and the expected child devices
+    named "namespace".  This generic format of the device-model hierarchy
+    allows the LIBNVDIMM and LIBNDCTL implementations to be more uniform and
+    future-proof.
+
+    4. There are more robust mechanisms for determining the major type of a
+    region than a device name.  See the next section, How Do I Determine the
+    Major Type of a Region?
+
+How Do I Determine the Major Type of a Region?
+----------------------------------------------
+
+Outside of the blanket recommendation of "use libndctl", or simply
+looking at the kernel header (/usr/include/linux/ndctl.h) to decode the
+"nstype" integer attribute, here are some other options.
+
+    1. module alias lookup:
+
+    The whole point of region/namespace device type differentiation is to
+    decide which block-device driver will attach to a given LIBNVDIMM namespace.
+    One can simply use the modalias to look up the resulting module.  It's
+    important to note that this method is robust in the presence of a
+    vendor-specific driver down the road.  If a vendor-specific
+    implementation wants to supplant the standard nd_blk driver it can with
+    minimal impact to the rest of LIBNVDIMM.
+
+    In fact, a vendor may also want to have a vendor-specific region-driver
+    (outside of nd_region).  For example, if a vendor defined its own LABEL
+    format it would need its own region driver to parse that LABEL and emit
+    the resulting namespaces.  The output from module resolution is more
+    accurate than a region-name or region-devtype.
+
+    2. udev:
+
+    The kernel "devtype" is registered in the udev database
+    # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region0
+    P: /devices/platform/nfit_test.0/ndbus0/region0
+    E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region0
+    E: DEVTYPE=nd_pmem
+    E: MODALIAS=nd:t2
+    E: SUBSYSTEM=nd
+
+    # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region4
+    P: /devices/platform/nfit_test.0/ndbus0/region4
+    E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region4
+    E: DEVTYPE=nd_blk
+    E: MODALIAS=nd:t3
+    E: SUBSYSTEM=nd
+
+    ...and is available as a region attribute, but keep in mind that the
+    "devtype" does not indicate sub-type variations and scripts should
+    really consult the other attributes.
+
+    3. type specific attributes:
+
+    As it currently stands a BLK-aperture region will never have a
+    "nfit/spa_index" attribute, but neither will a non-NFIT PMEM region.  A
+    BLK region with a "mappings" value of 0 is, as mentioned above, a DIMM
+    that does not allow I/O.  A PMEM region with a "mappings" value of zero
+    is a simple system-physical-address range.
+
+
+LIBNVDIMM/LIBNDCTL: Namespace
+-----------------------------
+
+A REGION, after resolving DPA aliasing and LABEL specified boundaries,
+surfaces one or more "namespace" devices.  The arrival of a "namespace"
+device currently triggers either the nd_blk or nd_pmem driver to load
+and register a disk/block device.
+
+LIBNVDIMM: namespace
+Here is a sample layout from the three major types of NAMESPACE where
+namespace0.0 represents DIMM-info-backed PMEM (note that it has a 'uuid'
+attribute), namespace2.0 represents a BLK namespace (note that it has a
+'sector_size' attribute), and namespace6.0 represents an anonymous PMEM
+namespace (note that it has no 'uuid' attribute because it does not
+support a LABEL).
+
+	/sys/devices/platform/nfit_test.0/ndbus0/region0/namespace0.0
+	|-- alt_name
+	|-- devtype
+	|-- dpa_extents
+	|-- modalias
+	|-- resource
+	|-- size
+	|-- subsystem -> ../../../../../../bus/nd
+	|-- type
+	|-- uevent
+	`-- uuid
+	/sys/devices/platform/nfit_test.0/ndbus0/region2/namespace2.0
+	|-- alt_name
+	|-- devtype
+	|-- dpa_extents
+	|-- modalias
+	|-- sector_size
+	|-- size
+	|-- subsystem -> ../../../../../../bus/nd
+	|-- type
+	|-- uevent
+	`-- uuid
+	/sys/devices/platform/nfit_test.1/ndbus1/region6/namespace6.0
+	|-- block
+	|   `-- pmem0
+	|-- devtype
+	|-- driver -> ../../../../../../bus/nd/drivers/pmem
+	|-- modalias
+	|-- resource
+	|-- size
+	|-- subsystem -> ../../../../../../bus/nd
+	|-- type
+	`-- uevent
+
+LIBNDCTL: namespace enumeration example
+Namespaces are indexed relative to their parent region, as in the example
+below.  These indexes are mostly static from boot to boot, but the
+subsystem makes no guarantees in this regard.  For a static namespace
+identifier use its 'uuid' attribute.
+
+static struct ndctl_namespace *get_namespace_by_id(struct ndctl_region *region,
+                unsigned int id)
+{
+        struct ndctl_namespace *ndns;
+
+        ndctl_namespace_foreach(region, ndns)
+                if (ndctl_namespace_get_id(ndns) == id)
+                        return ndns;
+
+        return NULL;
+}
+
+LIBNDCTL: namespace creation example
+Idle namespaces are automatically created by the kernel if a given
+region has enough available capacity to create a new namespace.
+Namespace instantiation involves finding an idle namespace and
+configuring it.  For the most part the namespace attributes can be set
+in any order; the only constraint is that 'uuid' must be set before
+'size'.  This enables the kernel to track DPA allocations internally
+with a static identifier.
+
+static int configure_namespace(struct ndctl_region *region,
+                struct ndctl_namespace *ndns,
+                struct namespace_parameters *parameters)
+{
+        char devname[50];
+
+        snprintf(devname, sizeof(devname), "namespace%d.%d",
+                        ndctl_region_get_id(region), parameters->id);
+
+        ndctl_namespace_set_alt_name(ndns, devname);
+        /* 'uuid' must be set prior to setting size! */
+        ndctl_namespace_set_uuid(ndns, parameters->uuid);
+        ndctl_namespace_set_size(ndns, parameters->size);
+        /* unlike pmem namespaces, blk namespaces have a sector size */
+        if (parameters->lbasize)
+                ndctl_namespace_set_sector_size(ndns, parameters->lbasize);
+        return ndctl_namespace_enable(ndns);
+}
+
+
+Why the Term "namespace"?
+
+    1. Why not "volume" for instance?  "volume" ran the risk of confusing ND
+    with a volume manager like device-mapper.
+
+    2. The term originated to describe the sub-devices that can be created
+    within an NVME controller (see the nvme specification:
+    http://www.nvmexpress.org/specifications/), and NFIT namespaces are
+    meant to parallel the capabilities and configurability of
+    NVME-namespaces.
+
+
+LIBNVDIMM/LIBNDCTL: Block Translation Table "btt"
+---------------------------------------------
+
+A BTT (design document: http://pmem.io/2014/09/23/btt.html) is a stacked
+block device driver that fronts either the whole block device or a
+partition of a block device emitted by either a PMEM or BLK NAMESPACE.
+
+LIBNVDIMM: btt layout
+Every bus will start out with at least one BTT device which is the seed
+device.  To activate it set the "backing_dev", "uuid", and "sector_size"
+attributes and then bind the device to the nd_btt driver.
+
+	/sys/devices/platform/nfit_test.1/ndbus0/btt0/
+	|-- backing_dev
+	|-- delete
+	|-- devtype
+	|-- modalias
+	|-- sector_size
+	|-- subsystem -> ../../../../../bus/nd
+	|-- uevent
+	`-- uuid
+
+LIBNDCTL: btt creation example
+Similar to namespaces an idle BTT device is automatically created per
+bus.  Each time this "seed" btt device is configured and enabled a new
+seed is created.  Creating a BTT configuration involves two steps:
+finding an idle BTT and assigning it to front a PMEM or BLK namespace.
+
+	static struct ndctl_btt *get_idle_btt(struct ndctl_bus *bus)
+	{
+		struct ndctl_btt *btt;
+
+		ndctl_btt_foreach(bus, btt)
+			if (!ndctl_btt_is_enabled(btt) && !ndctl_btt_is_configured(btt))
+				return btt;
+
+		return NULL;
+	}
+
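+	static int configure_btt(struct ndctl_bus *bus, struct btt_parameters *parameters)
+	{
+		struct ndctl_btt *btt = get_idle_btt(bus);
+		char bdevpath[50];
+
+		/* no idle seed device is available on this bus */
+		if (!btt)
+			return -ENXIO;
+
+		/* build the /dev path of the namespace's block device */
+		snprintf(bdevpath, sizeof(bdevpath), "/dev/%s",
+				ndctl_namespace_get_block_device(parameters->ndns));
+		ndctl_btt_set_uuid(btt, parameters->uuid);
+		ndctl_btt_set_sector_size(btt, parameters->sector_size);
+		ndctl_btt_set_backing_dev(btt, bdevpath);
+		return ndctl_btt_enable(btt);
+	}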
+
+Once instantiated a "nd_btt" link will be created under the
+"backing_dev" (pmem0) block device:
+
+	/sys/block/pmem0/
+	|-- alignment_offset
+	|-- bdi -> ../../../../../../../virtual/bdi/259:0
+	|-- capability
+	|-- dev
+	|-- device -> ../../../namespace0.0
+	|-- discard_alignment
+	|-- ext_range
+	|-- holders
+	|-- inflight
+	|-- nd_btt -> ../../../../btt0
+
+...and a new inactive seed device will appear on the bus.
+
+Once a "backing_dev" is disabled its associated BTT will be
+automatically deleted.  This deletion is only at the device model level.
+In order to destroy a BTT the "info block" needs to be destroyed.
+
+
+Summary LIBNDCTL Diagram
+------------------------
+
+For the given example above, here is the view of the objects as seen by the LIBNDCTL api:
+            +---+
+            |CTX|    +---------+   +--------------+  +---------------+
+            +-+-+  +-> REGION0 +---> NAMESPACE0.0 +--> PMEM8 "pm0.0" |
+              |    | +---------+   +--------------+  +---------------+
++-------+     |    | +---------+   +--------------+  +---------------+
+| DIMM0 <-+   |    +-> REGION1 +---> NAMESPACE1.0 +--> PMEM6 "pm1.0" |
++-------+ |   |    | +---------+   +--------------+  +---------------+
+| DIMM1 <-+ +-v--+ | +---------+   +--------------+  +---------------+
++-------+ +-+BUS0+---> REGION2 +-+-> NAMESPACE2.0 +--> ND6  "blk2.0" |
+| DIMM2 <-+ +----+ | +---------+ | +--------------+  +----------------------+
++-------+ |        |             +-> NAMESPACE2.1 +--> ND5  "blk2.1" | BTT2 |
+| DIMM3 <-+        |               +--------------+  +----------------------+
++-------+          | +---------+   +--------------+  +---------------+
+                   +-> REGION3 +-+-> NAMESPACE3.0 +--> ND4  "blk3.0" |
+                   | +---------+ | +--------------+  +----------------------+
+                   |             +-> NAMESPACE3.1 +--> ND3  "blk3.1" | BTT1 |
+                   |               +--------------+  +----------------------+
+                   | +---------+   +--------------+  +---------------+
+                   +-> REGION4 +---> NAMESPACE4.0 +--> ND2  "blk4.0" |
+                   | +---------+   +--------------+  +---------------+
+                   | +---------+   +--------------+  +----------------------+
+                   +-> REGION5 +---> NAMESPACE5.0 +--> ND1  "blk5.0" | BTT0 |
+                     +---------+   +--------------+  +---------------+------+
+
+
diff --git a/MAINTAINERS b/MAINTAINERS
index 590304b96b03..c2f55aed811d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5912,6 +5912,39 @@ M:	Sasha Levin <sasha.levin@oracle.com>
 S:	Maintained
 F:	tools/lib/lockdep/
 
+LIBNVDIMM: NON-VOLATILE MEMORY DEVICE SUBSYSTEM
+M:	Dan Williams <dan.j.williams@intel.com>
+L:	linux-nvdimm@lists.01.org
+Q:	https://patchwork.kernel.org/project/linux-nvdimm/list/
+S:	Supported
+F:	drivers/nvdimm/*
+F:	include/linux/nd.h
+F:	include/linux/libnvdimm.h
+F:	include/uapi/linux/ndctl.h
+
+LIBNVDIMM BLK: MMIO-APERTURE DRIVER
+M:	Ross Zwisler <ross.zwisler@linux.intel.com>
+L:	linux-nvdimm@lists.01.org
+Q:	https://patchwork.kernel.org/project/linux-nvdimm/list/
+S:	Supported
+F:	drivers/nvdimm/blk.c
+F:	drivers/nvdimm/region_devs.c
+F:	drivers/acpi/nfit*
+
+LIBNVDIMM BTT: BLOCK TRANSLATION TABLE
+M:	Vishal Verma <vishal.l.verma@intel.com>
+L:	linux-nvdimm@lists.01.org
+Q:	https://patchwork.kernel.org/project/linux-nvdimm/list/
+S:	Supported
+F:	drivers/nvdimm/btt*
+
+LIBNVDIMM PMEM: PERSISTENT MEMORY DRIVER
+M:	Ross Zwisler <ross.zwisler@linux.intel.com>
+L:	linux-nvdimm@lists.01.org
+Q:	https://patchwork.kernel.org/project/linux-nvdimm/list/
+S:	Supported
+F:	drivers/nvdimm/pmem.c
+
 LINUX FOR IBM pSERIES (RS/6000)
 M:	Paul Mackerras <paulus@au.ibm.com>
 W:	http://www.ibm.com/linux/ltc/projects/ppc
@@ -8136,12 +8169,6 @@ S:	Maintained
 F:	Documentation/blockdev/ramdisk.txt
 F:	drivers/block/brd.c
 
-PERSISTENT MEMORY DRIVER
-M:	Ross Zwisler <ross.zwisler@linux.intel.com>
-L:	linux-nvdimm@lists.01.org
-S:	Supported
-F:	drivers/block/pmem.c
-
 RANDOM NUMBER DRIVER
 M:	"Theodore Ts'o" <tytso@mit.edu>
 S:	Maintained



* [PATCH 06/15] libnvdimm: Non-Volatile Devices
@ 2015-06-17 23:55   ` Dan Williams
  0 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-17 23:55 UTC (permalink / raw)
  To: axboe, linux-nvdimm
  Cc: boaz, toshi.kani, Neil Brown, Greg KH, linux-kernel,
	Andy Lutomirski, Jens Axboe, linux-acpi, H. Peter Anvin,
	linux-fsdevel, hch, mingo

Maintainer information and documentation for drivers/nvdimm

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Neil Brown <neilb@suse.de>
Cc: Greg KH <gregkh@linuxfoundation.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 Documentation/nvdimm/nvdimm.txt |  805 +++++++++++++++++++++++++++++++++++++++
 MAINTAINERS                     |   39 ++
 2 files changed, 838 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/nvdimm/nvdimm.txt

diff --git a/Documentation/nvdimm/nvdimm.txt b/Documentation/nvdimm/nvdimm.txt
new file mode 100644
index 000000000000..854f0a7c96c5
--- /dev/null
+++ b/Documentation/nvdimm/nvdimm.txt
@@ -0,0 +1,805 @@
+			  LIBNVDIMM: Non-Volatile Devices
+	      libnvdimm - kernel / libndctl - userspace helper library
+			   linux-nvdimm@lists.01.org
+				      v12
+
+
+	Glossary
+	Overview
+	    Supporting Documents
+	    Git Trees
+	LIBNVDIMM PMEM and BLK
+	Why BLK?
+	    PMEM vs BLK
+	        BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX
+	Example NVDIMM Platform
+	LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API
+	    LIBNDCTL: Context
+	        libndctl: instantiate a new library context example
+	    LIBNVDIMM/LIBNDCTL: Bus
+	        libnvdimm: control class device in /sys/class
+	        libnvdimm: bus
+	        libndctl: bus enumeration example
+	    LIBNVDIMM/LIBNDCTL: DIMM (NMEM)
+	        libnvdimm: DIMM (NMEM)
+	        libndctl: DIMM enumeration example
+	    LIBNVDIMM/LIBNDCTL: Region
+	        libnvdimm: region
+	        libndctl: region enumeration example
+	        Why Not Encode the Region Type into the Region Name?
+	        How Do I Determine the Major Type of a Region?
+	    LIBNVDIMM/LIBNDCTL: Namespace
+	        libnvdimm: namespace
+	        libndctl: namespace enumeration example
+	        libndctl: namespace creation example
+	        Why the Term "namespace"?
+	    LIBNVDIMM/LIBNDCTL: Block Translation Table "btt"
+	        libnvdimm: btt layout
+	        libndctl: btt creation example
+	Summary LIBNDCTL Diagram
+
+
+Glossary
+--------
+
+PMEM: A system-physical-address range where writes are persistent.  A
+block device composed of PMEM is capable of DAX.  A PMEM address range
+may span an interleave of several DIMMs.
+
+BLK: A set of one or more programmable memory mapped apertures provided
+by a DIMM to access its media.  This indirection precludes the
+performance benefit of interleaving, but enables DIMM-bounded failure
+modes.
+
+DPA: DIMM Physical Address, is a DIMM-relative offset.  With one DIMM in
+the system there would be a 1:1 system-physical-address:DPA association.
+Once more DIMMs are added a memory controller interleave must be
+decoded to determine the DPA associated with a given
+system-physical-address.  BLK capacity always has a 1:1 relationship
+with a single-DIMM's DPA range.
+
+DAX: File system extensions to bypass the page cache and block layer to
+mmap persistent memory, from a PMEM block device, directly into a
+process address space.
+
+BTT: Block Translation Table: Persistent memory is byte addressable.
+Existing software may have an expectation that the power-fail-atomicity
+of writes is at least one sector, 512 bytes.  The BTT is an indirection
+table with atomic update semantics to front a PMEM/BLK block device
+driver and present arbitrary atomic sector sizes.
+
+LABEL: Metadata stored on a DIMM device that partitions and identifies
+(persistently names) storage between PMEM and BLK.  It also partitions
+BLK storage to host BTTs with different parameters per BLK-partition.
+Note that traditional partition tables, GPT/MBR, are layered on top of a
+BLK or PMEM device.
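
To make the DPA entry above concrete, the sketch below decodes an offset
into an interleaved system-physical-address range into a (DIMM, DPA)
pair.  It assumes a plain round-robin interleave with a fixed line size,
which is an illustrative simplification: the real decode is done by the
memory controller and is platform specific.  All names here are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct dpa_location {
	unsigned int dimm;	/* index of the DIMM within the interleave set */
	uint64_t dpa;		/* DIMM-relative offset */
};

/*
 * Decode an offset into an interleaved range.
 *   ways:      number of DIMMs in the interleave set (assumed >= 1)
 *   line_size: bytes placed on one DIMM before moving to the next
 *   dpa_base:  DPA at which this range starts on every DIMM
 */
static struct dpa_location decode_spa_offset(uint64_t spa_offset,
		unsigned int ways, uint64_t line_size, uint64_t dpa_base)
{
	uint64_t line = spa_offset / line_size;
	struct dpa_location loc;

	/* consecutive lines rotate across the DIMMs */
	loc.dimm = (unsigned int)(line % ways);
	/* each DIMM holds every ways-th line, packed back to back */
	loc.dpa = dpa_base + (line / ways) * line_size
		+ spa_offset % line_size;
	return loc;
}
```

With ways == 1 and dpa_base == 0 the function degenerates to the 1:1
system-physical-address:DPA association described for a single-DIMM
system.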
+
+
+Overview
+--------
+
+The LIBNVDIMM subsystem provides support for three types of NVDIMMs, namely,
+PMEM, BLK, and NVDIMM devices that can simultaneously support both PMEM
+and BLK mode access.  These three modes of operation are described by
+the "NVDIMM Firmware Interface Table" (NFIT) in ACPI 6.  While the LIBNVDIMM
+implementation is generic and supports pre-NFIT platforms, it was guided
+by the superset of capabilities needed to support this ACPI 6 definition
+for NVDIMM resources.  The bulk of the kernel implementation is in place
+to handle the case where DPA accessible via PMEM is aliased with DPA
+accessible via BLK.  When that occurs a LABEL is needed to reserve DPA
+for exclusive access via one mode at a time.
+
+Supporting Documents
+ACPI 6: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
+NVDIMM Namespace: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
+DSM Interface Example: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
+Driver Writer's Guide: http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
+
+Git Trees
+LIBNVDIMM: https://git.kernel.org/cgit/linux/kernel/git/djbw/nvdimm.git
+LIBNDCTL: https://github.com/pmem/ndctl.git
+PMEM: https://github.com/01org/prd
+
+
+LIBNVDIMM PMEM and BLK
+------------------
+
+Prior to the arrival of the NFIT, non-volatile memory was described to a
+system in various ad-hoc ways.  Usually only the bare minimum was
+provided, namely, a single system-physical-address range where writes
+are expected to be durable after a system power loss.  Now, the NFIT
+specification standardizes not only the description of PMEM, but also
+BLK and platform message-passing entry points for control and
+configuration.
+
+For each NVDIMM access method (PMEM, BLK), LIBNVDIMM provides a block
+device driver:
+
+    1. PMEM (nd_pmem.ko): Drives a system-physical-address range.  This
+    range is contiguous in system memory and may be interleaved (hardware
+    memory controller striped) across multiple DIMMs.  When interleaved the
+    platform may optionally provide details of which DIMMs are participating
+    in the interleave.
+
+    Note that while LIBNVDIMM describes system-physical-address ranges that may
+    alias with BLK access as ND_NAMESPACE_PMEM ranges and those without
+    alias as ND_NAMESPACE_IO ranges, to the nd_pmem driver there is no
+    distinction.  The different device-types are an implementation detail
+    that userspace can exploit to implement policies like "only interface
+    with address ranges from certain DIMMs".  It is worth noting that when
+    aliasing is present and a DIMM lacks a label, then no block device can
+    be created by default as userspace needs to do at least one allocation
+    of DPA to the PMEM range.  In contrast ND_NAMESPACE_IO ranges, once
+    registered, can be immediately attached to nd_pmem.
+
+    2. BLK (nd_blk.ko): This driver performs I/O using a set of platform
+    defined apertures.  A set of apertures will all access just one DIMM.
+    Multiple windows allow multiple concurrent accesses, much like
+    tagged-command-queuing, and would likely be used by different threads or
+    different CPUs.
+
+    The NFIT specification defines a standard format for a BLK-aperture, but
+    the spec also allows for vendor specific layouts, and non-NFIT BLK
+    implementations may have other designs for BLK I/O.  For this reason "nd_blk"
+    calls back into platform-specific code to perform the I/O.  One such
+    implementation is defined in the "Driver Writer's Guide" and "DSM
+    Interface Example".
+
+
+Why BLK?
+--------
+
+While PMEM provides direct byte-addressable CPU-load/store access to
+NVDIMM storage, it does not provide the best system RAS (reliability,
+availability, and serviceability) model.  An access to a corrupted
+system-physical-address causes a cpu exception while an access to a
+corrupted address through a BLK-aperture causes that block window to
+raise an error status in a register.  The latter is more aligned with
+the standard error model that host-bus-adapter attached disks present.
+Also, if an administrator ever wants to replace a memory module it is
+easier to service a system at DIMM module boundaries.  Compare this to
+PMEM where data could be interleaved in an opaque hardware specific
+manner across several DIMMs.
+
+PMEM vs BLK
+BLK-apertures solve this RAS problem, but their presence is also the
+major contributing factor to the complexity of the ND subsystem.  They
+complicate the implementation because PMEM and BLK alias in DPA space.
+Any given DIMM's DPA-range may contribute to one or more
+system-physical-address sets of interleaved DIMMs, *and* may also be
+accessed in its entirety through its BLK-aperture.  Accessing a DPA
+through a system-physical-address while simultaneously accessing the
+same DPA through a BLK-aperture has undefined results.  For this reason,
+DIMMs with this dual interface configuration include a DSM function to
+store/retrieve a LABEL.  The LABEL effectively partitions the DPA-space
+into exclusive system-physical-address and BLK-aperture accessible
+regions.  For simplicity a DIMM is allowed a PMEM "region" per each
+interleave set in which it is a member.  The remaining DPA space can be
+carved into an arbitrary number of BLK devices with discontiguous
+extents.
+
+BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX
+--------------------------------------------------
+
+One of the few reasons to allow multiple BLK namespaces per REGION is so
+that each BLK-namespace can be configured with a BTT with unique atomic
+sector sizes.  While a PMEM device can host a BTT the LABEL specification
+does not provide for a sector size to be specified for a PMEM namespace.
+This is due to the expectation that the primary usage model for PMEM is
+via DAX, and the BTT is incompatible with DAX.  However, for the cases
+where an application or filesystem still needs atomic sector update
+guarantees it can register a BTT on a PMEM device or partition.  See
+LIBNVDIMM/LIBNDCTL: Block Translation Table "btt".
+
+
+Example NVDIMM Platform
+-----------------------
+
+For the remainder of this document the following diagram will be
+referenced for any example sysfs layouts.
+
+
+                             (a)               (b)           DIMM   BLK-REGION
+          +-------------------+--------+--------+--------+
++------+  |       pm0.0       | blk2.0 | pm1.0  | blk2.1 |    0      region2
+| imc0 +--+- - - region0- - - +--------+        +--------+
++--+---+  |       pm0.0       | blk3.0 | pm1.0  | blk3.1 |    1      region3
+   |      +-------------------+--------v        v--------+
++--+---+                               |                 |
+| cpu0 |                                     region1
++--+---+                               |                 |
+   |      +----------------------------^        ^--------+
++--+---+  |           blk4.0           | pm1.0  | blk4.0 |    2      region4
+| imc1 +--+----------------------------|        +--------+
++------+  |           blk5.0           | pm1.0  | blk5.0 |    3      region5
+          +----------------------------+--------+--------+
+
+In this platform we have four DIMMs and two memory controllers in one
+socket.  Each unique interface (BLK or PMEM) to DPA space is identified
+by a region device with a dynamically assigned id (REGION0 - REGION5).
+
+    1. The first portion of DIMM0 and DIMM1 are interleaved as REGION0. A
+    single PMEM namespace is created in the REGION0-SPA-range that spans
+    DIMM0 and DIMM1 with a user-specified name of "pm0.0". Some of that
+    interleaved system-physical-address range is reclaimed as BLK-aperture
+    accessed space starting at DPA-offset (a) into each DIMM.  In that
+    reclaimed space we create two BLK-aperture "namespaces" from REGION2 and
+    REGION3 where "blk2.0" and "blk3.0" are just human readable names that
+    could be set to any user-desired name in the LABEL.
+
+    2. In the last portion of DIMM0 and DIMM1 we have an interleaved
+    system-physical-address range, REGION1, that spans those two DIMMs as
+    well as DIMM2 and DIMM3.  Some of REGION1 is allocated to a PMEM
+    namespace named "pm1.0"; the rest is reclaimed in 4 BLK-aperture
+    namespaces (one for each DIMM in the interleave set), "blk2.1",
+    "blk3.1", "blk4.0", and "blk5.0".
+
+    3. The portions of DIMM2 and DIMM3 that do not participate in the
+    REGION1 interleaved system-physical-address range (i.e. the DPA
+    range below offset (b)) are also included in the "blk4.0" and
+    "blk5.0" namespaces.  Note that this example shows that BLK-aperture
+    namespaces don't need to be contiguous in DPA-space.
+
+    This bus is provided by the kernel under the device
+    /sys/devices/platform/nfit_test.0 when CONFIG_NFIT_TEST is enabled and
+    the nfit_test.ko module is loaded.  This tests not only LIBNVDIMM but
+    the acpi_nfit.ko driver as well.
+
+
+LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API
+----------------------------------------------------
+
+What follows is a description of the LIBNVDIMM sysfs layout and a
+corresponding object hierarchy diagram as viewed through the LIBNDCTL
+api.  The example sysfs paths and diagrams are relative to the Example
+NVDIMM Platform which is also the LIBNVDIMM bus used in the LIBNDCTL unit
+test.
+
+LIBNDCTL: Context
+Every api call in the LIBNDCTL library requires a context that holds the
+logging parameters and other library instance state.  The library is
+based on the libabc template:
+https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git/
+
+LIBNDCTL: instantiate a new library context example
+
+	struct ndctl_ctx *ctx;
+
+	if (ndctl_new(&ctx) == 0)
+		return ctx;
+	else
+		return NULL;
+
+LIBNVDIMM/LIBNDCTL: Bus
+-------------------
+
+A bus has a 1:1 relationship with an NFIT.  The current expectation for
+ACPI based systems is that there is only ever one platform-global NFIT.
+That said, it is trivial to register multiple NFITs; the specification
+does not preclude it.  The infrastructure supports multiple busses and
+we use this capability to test multiple NFIT configurations in the
+unit test.
+
+LIBNVDIMM: control class device in /sys/class
+
+This character device accepts DSM messages to be passed to a DIMM
+identified by its NFIT handle.
+
+	/sys/class/nd/ndctl0
+	|-- dev
+	|-- device -> ../../../ndbus0
+	|-- subsystem -> ../../../../../../../class/nd
+
+
+
+LIBNVDIMM: bus
+
+	struct nvdimm_bus *nvdimm_bus_register(struct device *parent,
+	       struct nvdimm_bus_descriptor *nfit_desc);
+
+	/sys/devices/platform/nfit_test.0/ndbus0
+	|-- btt0
+	|-- btt_seed
+	|-- commands
+	|-- nd
+	|-- nfit
+	|-- nmem0
+	|-- nmem1
+	|-- nmem2
+	|-- nmem3
+	|-- power
+	|-- provider
+	|-- region0
+	|-- region1
+	|-- region2
+	|-- region3
+	|-- region4
+	|-- region5
+	|-- uevent
+	`-- wait_probe
+
+LIBNDCTL: bus enumeration example
+Find the bus handle that describes the bus from Example NVDIMM Platform
+
+	static struct ndctl_bus *get_bus_by_provider(struct ndctl_ctx *ctx,
+			const char *provider)
+	{
+		struct ndctl_bus *bus;
+
+		ndctl_bus_foreach(ctx, bus)
+			if (strcmp(provider, ndctl_bus_get_provider(bus)) == 0)
+				return bus;
+
+		return NULL;
+	}
+
+	bus = get_bus_by_provider(ctx, "nfit_test.0");
+
+
+LIBNVDIMM/LIBNDCTL: DIMM (NMEM)
+---------------------------
+
+The DIMM device provides a character device for sending commands to
+hardware, and it is a container for LABELs.  If the DIMM is defined by
+NFIT then an optional 'nfit' attribute sub-directory is available to add
+NFIT-specifics.
+
+Note that the kernel device name for "DIMMs" is "nmemX".  The NFIT
+describes these devices via "Memory Device to System Physical Address
+Range Mapping Structure", and there is no requirement that they actually
+be physical DIMMs, so we use a more generic name.
+
+LIBNVDIMM: DIMM (NMEM)
+
+	struct nvdimm *nvdimm_create(struct nvdimm_bus *nvdimm_bus, void *provider_data,
+			const struct attribute_group **groups, unsigned long flags,
+			unsigned long *dsm_mask);
+
+	/sys/devices/platform/nfit_test.0/ndbus0
+	|-- nmem0
+	|   |-- available_slots
+	|   |-- commands
+	|   |-- dev
+	|   |-- devtype
+	|   |-- driver -> ../../../../../bus/nd/drivers/nvdimm
+	|   |-- modalias
+	|   |-- nfit
+	|   |   |-- device
+	|   |   |-- format
+	|   |   |-- handle
+	|   |   |-- phys_id
+	|   |   |-- rev_id
+	|   |   |-- serial
+	|   |   `-- vendor
+	|   |-- state
+	|   |-- subsystem -> ../../../../../bus/nd
+	|   `-- uevent
+	|-- nmem1
+	[..]
+
+
+LIBNDCTL: DIMM enumeration example
+
+Note, in this example we are assuming NFIT-defined DIMMs which are
+identified by an "nfit_handle", a 32-bit value where:
+Bit 3:0 DIMM number within the memory channel
+Bit 7:4 memory channel number
+Bit 11:8 memory controller ID
+Bit 15:12 socket ID (within scope of a Node controller if node controller is present)
+Bit 27:16 Node Controller ID
+Bit 31:28 Reserved
+
+	static struct ndctl_dimm *get_dimm_by_handle(struct ndctl_bus *bus,
+	       unsigned int handle)
+	{
+		struct ndctl_dimm *dimm;
+
+		ndctl_dimm_foreach(bus, dimm)
+			if (ndctl_dimm_get_handle(dimm) == handle)
+				return dimm;
+
+		return NULL;
+	}
+
+	#define DIMM_HANDLE(n, s, i, c, d) \
+		((((n) & 0xfff) << 16) | (((s) & 0xf) << 12) | (((i) & 0xf) << 8) \
+		 | (((c) & 0xf) << 4) | ((d) & 0xf))
+
+	dimm = get_dimm_by_handle(bus, DIMM_HANDLE(0, 0, 0, 0, 0));
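
Going the other way, the bit layout listed above can be unpacked with a
few shifts and masks.  The following is a self-contained sketch; the
struct and helper names are hypothetical and are not part of libndctl.

```c
#include <assert.h>
#include <stdint.h>

/* The five documented fields of an nfit_handle (bits 31:28 are reserved). */
struct handle_fields {
	uint32_t node;		/* bits 27:16, node controller ID */
	uint32_t socket;	/* bits 15:12, socket ID */
	uint32_t imc;		/* bits 11:8, memory controller ID */
	uint32_t channel;	/* bits 7:4, memory channel number */
	uint32_t dimm;		/* bits 3:0, DIMM number within the channel */
};

/* Pack the fields in the same order as DIMM_HANDLE(n, s, i, c, d). */
static uint32_t pack_handle(uint32_t node, uint32_t socket, uint32_t imc,
		uint32_t channel, uint32_t dimm)
{
	return ((node & 0xfff) << 16) | ((socket & 0xf) << 12)
		| ((imc & 0xf) << 8) | ((channel & 0xf) << 4) | (dimm & 0xf);
}

/* Split a handle back into its fields. */
static struct handle_fields unpack_handle(uint32_t handle)
{
	struct handle_fields f = {
		.node    = (handle >> 16) & 0xfff,
		.socket  = (handle >> 12) & 0xf,
		.imc     = (handle >> 8) & 0xf,
		.channel = (handle >> 4) & 0xf,
		.dimm    = handle & 0xf,
	};

	return f;
}
```

For instance, a handle of 0x101 unpacks to memory controller 1, channel
0, DIMM 1 on socket 0.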
+
+LIBNVDIMM/LIBNDCTL: Region
+----------------------
+
+A generic REGION device is registered for each PMEM range or BLK-aperture
+set.  Per the example there are 6 regions: 2 PMEM and 4 BLK-aperture
+sets on the "nfit_test.0" bus.  The primary role of a region is to be a
+container of "mappings".  A mapping is a tuple of <DIMM,
+DPA-start-offset, length>.
+
+LIBNVDIMM provides a built-in driver for these REGION devices.  This driver
+is responsible for reconciling the aliased DPA mappings across all
+regions, parsing the LABEL, if present, and then emitting NAMESPACE
+devices with the resolved/exclusive DPA-boundaries for the nd_pmem or
+nd_blk device driver to consume.
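
The aliasing that the region driver has to reconcile can be pictured with
the mapping tuple alone: two mappings alias when they name the same DIMM
and their DPA extents overlap.  The sketch below is only a model of that
idea; the struct and helper are hypothetical and do not mirror the
kernel's data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One region mapping: <DIMM, DPA-start-offset, length> */
struct mapping {
	unsigned int dimm;
	uint64_t dpa_start;
	uint64_t length;
};

/* True when both mappings cover at least one common DPA on the same DIMM. */
static bool mappings_alias(const struct mapping *a, const struct mapping *b)
{
	if (a->dimm != b->dimm)
		return false;
	/* an empty extent covers nothing and cannot alias */
	if (!a->length || !b->length)
		return false;
	/* half-open intervals [start, start + length) intersect */
	return a->dpa_start < b->dpa_start + b->length &&
	       b->dpa_start < a->dpa_start + a->length;
}
```

Where such an overlap exists, the LABEL decides which DPA ranges are
exposed through PMEM and which through BLK.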
+
+In addition to the generic attributes of "mapping"s, "interleave_ways"
+and "size" the REGION device also exports some convenience attributes.
+"nstype" indicates the integer type of namespace-device this region
+emits, "devtype" duplicates the DEVTYPE variable stored by udev at the
+'add' event, "modalias" duplicates the MODALIAS variable stored by udev
+at the 'add' event, and finally, the optional "spa_index" is provided in
+the case where the region is defined by a SPA.
+
+LIBNVDIMM: region
+
+	struct nd_region *nvdimm_pmem_region_create(struct nvdimm_bus *nvdimm_bus,
+			struct nd_region_desc *ndr_desc);
+	struct nd_region *nvdimm_blk_region_create(struct nvdimm_bus *nvdimm_bus,
+			struct nd_region_desc *ndr_desc);
+
+	/sys/devices/platform/nfit_test.0/ndbus0
+	|-- region0
+	|   |-- available_size
+	|   |-- devtype
+	|   |-- driver -> ../../../../../bus/nd/drivers/nd_region
+	|   |-- init_namespaces
+	|   |-- mapping0
+	|   |-- mapping1
+	|   |-- mappings
+	|   |-- modalias
+	|   |-- namespace0.0
+	|   |-- namespace_seed
+	|   |-- nfit
+	|   |   `-- spa_index
+	|   |-- nstype
+	|   |-- set_cookie
+	|   |-- size
+	|   |-- subsystem -> ../../../../../bus/nd
+	|   `-- uevent
+	|-- region1
+	[..]
+
+LIBNDCTL: region enumeration example
+
+Sample region retrieval routines based on NFIT-unique data like
+"spa_index" (interleave set id) for PMEM and "nfit_handle" (dimm id) for
+BLK.
+
+	static struct ndctl_region *get_pmem_region_by_spa_index(struct ndctl_bus *bus,
+			unsigned int spa_index)
+	{
+		struct ndctl_region *region;
+
+		ndctl_region_foreach(bus, region) {
+			if (ndctl_region_get_type(region) != ND_DEVICE_REGION_PMEM)
+				continue;
+			if (ndctl_region_get_spa_index(region) == spa_index)
+				return region;
+		}
+		return NULL;
+	}
+
+	static struct ndctl_region *get_blk_region_by_dimm_handle(struct ndctl_bus *bus,
+			unsigned int handle)
+	{
+		struct ndctl_region *region;
+
+		ndctl_region_foreach(bus, region) {
+			struct ndctl_mapping *map;
+
+			if (ndctl_region_get_type(region) != ND_DEVICE_REGION_BLK)
+				continue;
+			ndctl_mapping_foreach(region, map) {
+				struct ndctl_dimm *dimm = ndctl_mapping_get_dimm(map);
+
+				if (ndctl_dimm_get_handle(dimm) == handle)
+					return region;
+			}
+		}
+		return NULL;
+	}
+
+
+Why Not Encode the Region Type into the Region Name?
+----------------------------------------------------
+
+At first glance it seems that, since NFIT defines just PMEM and BLK
+interface types, we should simply name REGION devices with something
+derived from those type names.  However, the ND subsystem explicitly
+keeps the REGION name generic and expects userspace to always consider
+the region-attributes for 4 reasons:
+
+    1. There are already more than two REGION and "namespace" types.  For
+    PMEM there are two subtypes.  As mentioned previously we have PMEM where
+    the constituent DIMM devices are known and anonymous PMEM.  For BLK
+    regions the NFIT specification already anticipates vendor specific
+    implementations.  The exact distinction of what a region contains is in
+    the region-attributes not the region-name or the region-devtype.
+
+    2. A region with zero child-namespaces is a possible configuration.  For
+    example, the NFIT allows for a DCR to be published without a
+    corresponding BLK-aperture.  This equates to a DIMM that can only accept
+    control/configuration messages, but no i/o through a descendant block
+    device.  Again, this "type" is advertised in the attributes ('mappings'
+    == 0) and the name does not tell you much.
+
+    3. What if a third major interface type arises in the future?  Outside
+    of vendor specific implementations, it's not difficult to envision a
+    third class of interface type beyond BLK and PMEM.  With a generic name
+    for the REGION level of the device-hierarchy old userspace
+    implementations can still make sense of new kernel advertised
+    region-types.  Userspace can always rely on the generic region
+    attributes like "mappings", "size", etc and the expected child devices
+    named "namespace".  This generic format of the device-model hierarchy
+    allows the LIBNVDIMM and LIBNDCTL implementations to be more uniform and
+    future-proof.
+
+    4. There are more robust mechanisms for determining the major type of a
+    region than a device name.  See the next section, How Do I Determine the
+    Major Type of a Region?
+
+How Do I Determine the Major Type of a Region?
+----------------------------------------------
+
+Outside of the blanket recommendation of "use libndctl", or simply
+looking at the kernel header (/usr/include/linux/ndctl.h) to decode the
+"nstype" integer attribute, here are some other options.
+
+    1. module alias lookup:
+
+    The whole point of region/namespace device type differentiation is to
+    decide which block-device driver will attach to a given LIBNVDIMM namespace.
+    One can simply use the modalias to look up the resulting module.  It's
+    important to note that this method is robust in the presence of a
+    vendor-specific driver down the road.  If a vendor-specific
+    implementation wants to supplant the standard nd_blk driver it can with
+    minimal impact to the rest of LIBNVDIMM.
+
+    In fact, a vendor may also want to have a vendor-specific region-driver
+    (outside of nd_region).  For example, if a vendor defined its own LABEL
+    format it would need its own region driver to parse that LABEL and emit
+    the resulting namespaces.  The output from module resolution is more
+    accurate than a region-name or region-devtype.
+
+    2. udev:
+
+    The kernel "devtype" is registered in the udev database:
+    # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region0
+    P: /devices/platform/nfit_test.0/ndbus0/region0
+    E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region0
+    E: DEVTYPE=nd_pmem
+    E: MODALIAS=nd:t2
+    E: SUBSYSTEM=nd
+
+    # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region4
+    P: /devices/platform/nfit_test.0/ndbus0/region4
+    E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region4
+    E: DEVTYPE=nd_blk
+    E: MODALIAS=nd:t3
+    E: SUBSYSTEM=nd
+
+    ...and is available as a region attribute, but keep in mind that the
+    "devtype" does not indicate sub-type variations and scripts should
+    really consult the other attributes.
+
+    3. type specific attributes:
+
+    As it currently stands a BLK-aperture region will never have a
+    "nfit/spa_index" attribute, but neither will a non-NFIT PMEM region.  A
+    BLK region with a "mappings" value of 0 is, as mentioned above, a DIMM
+    that does not allow I/O.  A PMEM region with a "mappings" value of zero
+    is a simple system-physical-address range.
+
+
+LIBNVDIMM/LIBNDCTL: Namespace
+-----------------------------
+
+A REGION, after resolving DPA aliasing and LABEL specified boundaries,
+surfaces one or more "namespace" devices.  The arrival of a "namespace"
+device currently triggers either the nd_blk or nd_pmem driver to load
+and register a disk/block device.
+
+LIBNVDIMM: namespace
+Here is a sample layout from the three major types of NAMESPACE where
+namespace0.0 represents DIMM-info-backed PMEM (note that it has a 'uuid'
+attribute), namespace2.0 represents a BLK namespace (note it has a
+'sector_size' attribute), and namespace6.0 represents an anonymous
+PMEM namespace (note that it has no 'uuid' attribute because it does
+not support a LABEL).
+
+	/sys/devices/platform/nfit_test.0/ndbus0/region0/namespace0.0
+	|-- alt_name
+	|-- devtype
+	|-- dpa_extents
+	|-- modalias
+	|-- resource
+	|-- size
+	|-- subsystem -> ../../../../../../bus/nd
+	|-- type
+	|-- uevent
+	`-- uuid
+	/sys/devices/platform/nfit_test.0/ndbus0/region2/namespace2.0
+	|-- alt_name
+	|-- devtype
+	|-- dpa_extents
+	|-- modalias
+	|-- sector_size
+	|-- size
+	|-- subsystem -> ../../../../../../bus/nd
+	|-- type
+	|-- uevent
+	`-- uuid
+	/sys/devices/platform/nfit_test.1/ndbus1/region6/namespace6.0
+	|-- block
+	|   `-- pmem0
+	|-- devtype
+	|-- driver -> ../../../../../../bus/nd/drivers/pmem
+	|-- modalias
+	|-- resource
+	|-- size
+	|-- subsystem -> ../../../../../../bus/nd
+	|-- type
+	`-- uevent
+
+LIBNDCTL: namespace enumeration example
+Namespaces are indexed relative to their parent region; see the example
+below.  These indexes are mostly static from boot to boot, but the
+subsystem makes no guarantees in this regard.  For a static namespace
+identifier use its 'uuid' attribute.
+
+static struct ndctl_namespace *get_namespace_by_id(struct ndctl_region *region,
+                unsigned int id)
+{
+        struct ndctl_namespace *ndns;
+
+        ndctl_namespace_foreach(region, ndns)
+                if (ndctl_namespace_get_id(ndns) == id)
+                        return ndns;
+
+        return NULL;
+}
+
+LIBNDCTL: namespace creation example
+Idle namespaces are automatically created by the kernel if a given
+region has enough available capacity to create a new namespace.
+Namespace instantiation involves finding an idle namespace and
+configuring it.  For the most part the setting of namespace attributes
+can occur in any order, the only constraint is that 'uuid' must be set
+before 'size'.  This enables the kernel to track DPA allocations
+internally with a static identifier.
+
+static int configure_namespace(struct ndctl_region *region,
+                struct ndctl_namespace *ndns,
+                struct namespace_parameters *parameters)
+{
+        char devname[50];
+
+        snprintf(devname, sizeof(devname), "namespace%d.%d",
+                        ndctl_region_get_id(region), parameters->id);
+
+        ndctl_namespace_set_alt_name(ndns, devname);
+        /* 'uuid' must be set prior to setting size! */
+        ndctl_namespace_set_uuid(ndns, parameters->uuid);
+        ndctl_namespace_set_size(ndns, parameters->size);
+        /* unlike pmem namespaces, blk namespaces have a sector size */
+        if (parameters->lbasize)
+                ndctl_namespace_set_sector_size(ndns, parameters->lbasize);
+        return ndctl_namespace_enable(ndns);
+}
+
+
+Why the Term "namespace"?
+
+    1. Why not "volume" for instance?  "volume" ran the risk of confusing ND
+    as a volume manager like device-mapper.
+
+    2. The term originated to describe the sub-devices that can be created
+    within an NVME controller (see the NVME specification:
+    http://www.nvmexpress.org/specifications/), and NFIT namespaces are
+    meant to parallel the capabilities and configurability of
+    NVME-namespaces.
+
+
+LIBNVDIMM/LIBNDCTL: Block Translation Table "btt"
+-------------------------------------------------
+
+A BTT (design document: http://pmem.io/2014/09/23/btt.html) is a stacked
+block device driver that fronts either the whole block device or a
+partition of a block device emitted by either a PMEM or BLK NAMESPACE.
+
+LIBNVDIMM: btt layout
+Every bus will start out with at least one BTT device which is the seed
+device.  To activate it set the "backing_dev", "uuid", and "sector_size"
+attributes and then bind the device to the nd_btt driver.
+
+	/sys/devices/platform/nfit_test.1/ndbus0/btt0/
+	|-- backing_dev
+	|-- delete
+	|-- devtype
+	|-- modalias
+	|-- sector_size
+	|-- subsystem -> ../../../../../bus/nd
+	|-- uevent
+	`-- uuid
+
+LIBNDCTL: btt creation example
+Similar to namespaces, an idle BTT device is automatically created per
+bus.  Each time this "seed" btt device is configured and enabled, a new
+seed is created.  Creating a BTT configuration involves two steps:
+finding an idle BTT and assigning it to front a PMEM or BLK namespace.
+
+	static struct ndctl_btt *get_idle_btt(struct ndctl_bus *bus)
+	{
+		struct ndctl_btt *btt;
+
+		ndctl_btt_foreach(bus, btt)
+			if (!ndctl_btt_is_enabled(btt) && !ndctl_btt_is_configured(btt))
+				return btt;
+
+		return NULL;
+	}
+
+	static int configure_btt(struct ndctl_bus *bus, struct btt_parameters *parameters)
+	{
+		struct ndctl_btt *btt = get_idle_btt(bus);
+		char bdevpath[50];
+
+		/* no idle seed device available on this bus */
+		if (!btt)
+			return -ENXIO;
+
+		snprintf(bdevpath, sizeof(bdevpath), "/dev/%s",
+				ndctl_namespace_get_block_device(parameters->ndns));
+		ndctl_btt_set_uuid(btt, parameters->uuid);
+		ndctl_btt_set_sector_size(btt, parameters->sector_size);
+		ndctl_btt_set_backing_dev(btt, bdevpath);
+		return ndctl_btt_enable(btt);
+	}
+
+Once instantiated a "nd_btt" link will be created under the
+"backing_dev" (pmem0) block device:
+
+	/sys/block/pmem0/
+	|-- alignment_offset
+	|-- bdi -> ../../../../../../../virtual/bdi/259:0
+	|-- capability
+	|-- dev
+	|-- device -> ../../../namespace0.0
+	|-- discard_alignment
+	|-- ext_range
+	|-- holders
+	|-- inflight
+	|-- nd_btt -> ../../../../btt0
+
+...and a new inactive seed device will appear on the bus.
+
+Once a "backing_dev" is disabled, its associated BTT will be
+automatically deleted.  This deletion is only at the device model level.
+In order to destroy a BTT, the "info block" needs to be destroyed.
+
+
+Summary LIBNDCTL Diagram
+------------------------
+
+For the given example above, here is the view of the objects as seen by
+the LIBNDCTL API:
+            +---+
+            |CTX|    +---------+   +--------------+  +---------------+
+            +-+-+  +-> REGION0 +---> NAMESPACE0.0 +--> PMEM8 "pm0.0" |
+              |    | +---------+   +--------------+  +---------------+
++-------+     |    | +---------+   +--------------+  +---------------+
+| DIMM0 <-+   |    +-> REGION1 +---> NAMESPACE1.0 +--> PMEM6 "pm1.0" |
++-------+ |   |    | +---------+   +--------------+  +---------------+
+| DIMM1 <-+ +-v--+ | +---------+   +--------------+  +---------------+
++-------+ +-+BUS0+---> REGION2 +-+-> NAMESPACE2.0 +--> ND6  "blk2.0" |
+| DIMM2 <-+ +----+ | +---------+ | +--------------+  +----------------------+
++-------+ |        |             +-> NAMESPACE2.1 +--> ND5  "blk2.1" | BTT2 |
+| DIMM3 <-+        |               +--------------+  +----------------------+
++-------+          | +---------+   +--------------+  +---------------+
+                   +-> REGION3 +-+-> NAMESPACE3.0 +--> ND4  "blk3.0" |
+                   | +---------+ | +--------------+  +----------------------+
+                   |             +-> NAMESPACE3.1 +--> ND3  "blk3.1" | BTT1 |
+                   |               +--------------+  +----------------------+
+                   | +---------+   +--------------+  +---------------+
+                   +-> REGION4 +---> NAMESPACE4.0 +--> ND2  "blk4.0" |
+                   | +---------+   +--------------+  +---------------+
+                   | +---------+   +--------------+  +----------------------+
+                   +-> REGION5 +---> NAMESPACE5.0 +--> ND1  "blk5.0" | BTT0 |
+                     +---------+   +--------------+  +---------------+------+
+
+
diff --git a/MAINTAINERS b/MAINTAINERS
index 590304b96b03..c2f55aed811d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5912,6 +5912,39 @@ M:	Sasha Levin <sasha.levin@oracle.com>
 S:	Maintained
 F:	tools/lib/lockdep/
 
+LIBNVDIMM: NON-VOLATILE MEMORY DEVICE SUBSYSTEM
+M:	Dan Williams <dan.j.williams@intel.com>
+L:	linux-nvdimm@lists.01.org
+Q:	https://patchwork.kernel.org/project/linux-nvdimm/list/
+S:	Supported
+F:	drivers/nvdimm/*
+F:	include/linux/nd.h
+F:	include/linux/libnvdimm.h
+F:	include/uapi/linux/ndctl.h
+
+LIBNVDIMM BLK: MMIO-APERTURE DRIVER
+M:	Ross Zwisler <ross.zwisler@linux.intel.com>
+L:	linux-nvdimm@lists.01.org
+Q:	https://patchwork.kernel.org/project/linux-nvdimm/list/
+S:	Supported
+F:	drivers/nvdimm/blk.c
+F:	drivers/nvdimm/region_devs.c
+F:	drivers/acpi/nfit*
+
+LIBNVDIMM BTT: BLOCK TRANSLATION TABLE
+M:	Vishal Verma <vishal.l.verma@intel.com>
+L:	linux-nvdimm@lists.01.org
+Q:	https://patchwork.kernel.org/project/linux-nvdimm/list/
+S:	Supported
+F:	drivers/nvdimm/btt*
+
+LIBNVDIMM PMEM: PERSISTENT MEMORY DRIVER
+M:	Ross Zwisler <ross.zwisler@linux.intel.com>
+L:	linux-nvdimm@lists.01.org
+Q:	https://patchwork.kernel.org/project/linux-nvdimm/list/
+S:	Supported
+F:	drivers/nvdimm/pmem.c
+
 LINUX FOR IBM pSERIES (RS/6000)
 M:	Paul Mackerras <paulus@au.ibm.com>
 W:	http://www.ibm.com/linux/ltc/projects/ppc
@@ -8136,12 +8169,6 @@ S:	Maintained
 F:	Documentation/blockdev/ramdisk.txt
 F:	drivers/block/brd.c
 
-PERSISTENT MEMORY DRIVER
-M:	Ross Zwisler <ross.zwisler@linux.intel.com>
-L:	linux-nvdimm@lists.01.org
-S:	Supported
-F:	drivers/block/pmem.c
-
 RANDOM NUMBER DRIVER
 M:	"Theodore Ts'o" <tytso@mit.edu>
 S:	Maintained
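
A note on the "module alias lookup" and "udev" options in the
documentation added by this patch: the region type can be recovered from
the MODALIAS string with a few lines of userspace C.  The following is
only a minimal sketch, not libndctl code.  nd_modalias_type() and
nd_region_devtype() are hypothetical helper names, and the only type
numbers mapped are the two visible in the udevadm output quoted in the
document (nd:t2 with DEVTYPE=nd_pmem, nd:t3 with DEVTYPE=nd_blk).
Anything else is reported as unknown rather than guessed.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: extract N from a "nd:tN" modalias string.
 * Returns -1 if the string does not have that shape.
 */
static int nd_modalias_type(const char *modalias)
{
	static const char prefix[] = "nd:t";
	char *end;
	long type;

	if (!modalias || strncmp(modalias, prefix, strlen(prefix)) != 0)
		return -1;
	modalias += strlen(prefix);
	/* require a digit so strtol() cannot skip whitespace or a sign */
	if (*modalias < '0' || *modalias > '9')
		return -1;

	errno = 0;
	type = strtol(modalias, &end, 10);
	/* tolerate the trailing newline that sysfs attribute reads include */
	if (*end == '\n')
		end++;
	if (errno || *end != '\0' || type > INT_MAX)
		return -1;
	return (int)type;
}

/*
 * Hypothetical helper: map a region type number to the devtype seen in
 * the udev output above.  Only the two documented pairs are known here.
 */
static const char *nd_region_devtype(int type)
{
	switch (type) {
	case 2:
		return "nd_pmem";
	case 3:
		return "nd_blk";
	default:
		return "unknown";
	}
}
```

A script would feed the contents of a region's "modalias" attribute to
nd_modalias_type() and still consult "mappings" and the other
attributes, since the type number alone does not describe sub-types.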


^ permalink raw reply related	[flat|nested] 164+ messages in thread

* [PATCH 07/15] fs/block_dev.c: skip rw_page if bdev has integrity
  2015-06-17 23:54 ` Dan Williams
@ 2015-06-17 23:55   ` Dan Williams
  -1 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-17 23:55 UTC (permalink / raw)
  To: axboe, linux-nvdimm
  Cc: boaz, toshi.kani, Martin K. Petersen, Vishal Verma, linux-kernel,
	mingo, Jens Axboe, linux-acpi, Matthew Wilcox, linux-fsdevel,
	hch

From: Vishal Verma <vishal.l.verma@linux.intel.com>

If a block device has bio integrity enabled, rw_page will bypass the
integrity payload, which is undesirable. Skip rw_page if this is the
case.

Currently brd and zram provide rw_page, and the proposed 'nd' drivers
will too.
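
The effect of the new check can be modeled outside the kernel.  The
sketch below is a userspace mock, not kernel code: struct mock_bdev,
mock_rw_page() and mock_bdev_read_page() are hypothetical stand-ins, and
the has_integrity flag models a non-NULL return from
bdev_get_integrity().  The point it illustrates is that the page-based
fast path is declined with -EOPNOTSUPP whenever an integrity profile is
present, which lets the caller take its regular bio path where a payload
can be attached.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a block device; not the kernel layout. */
struct mock_bdev {
	int (*rw_page)(struct mock_bdev *bdev, unsigned long sector);
	bool has_integrity;	/* models bdev_get_integrity(bdev) != NULL */
	int rw_page_calls;	/* counts how often the fast path ran */
};

/* Hypothetical driver op: records the call and reports success. */
static int mock_rw_page(struct mock_bdev *bdev, unsigned long sector)
{
	(void)sector;
	bdev->rw_page_calls++;
	return 0;
}

/*
 * Mirrors the condition added to bdev_read_page() and
 * bdev_write_page(): refuse the fast path when the driver has no
 * rw_page op or when the device carries an integrity profile.
 */
static int mock_bdev_read_page(struct mock_bdev *bdev, unsigned long sector)
{
	if (!bdev->rw_page || bdev->has_integrity)
		return -EOPNOTSUPP;
	return bdev->rw_page(bdev, sector);
}
```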

Cc: Jens Axboe <axboe@fb.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Suggested-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/block_dev.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index c7e4163ede87..054ef1bbb821 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -376,7 +376,7 @@ int bdev_read_page(struct block_device *bdev, sector_t sector,
 			struct page *page)
 {
 	const struct block_device_operations *ops = bdev->bd_disk->fops;
-	if (!ops->rw_page)
+	if (!ops->rw_page || bdev_get_integrity(bdev))
 		return -EOPNOTSUPP;
 	return ops->rw_page(bdev, sector + get_start_sect(bdev), page, READ);
 }
@@ -407,7 +407,7 @@ int bdev_write_page(struct block_device *bdev, sector_t sector,
 	int result;
 	int rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC : WRITE;
 	const struct block_device_operations *ops = bdev->bd_disk->fops;
-	if (!ops->rw_page)
+	if (!ops->rw_page || bdev_get_integrity(bdev))
 		return -EOPNOTSUPP;
 	set_page_writeback(page);
 	result = ops->rw_page(bdev, sector + get_start_sect(bdev), page, rw);


^ permalink raw reply related	[flat|nested] 164+ messages in thread

* [PATCH 08/15] libnvdimm, btt: add support for blk integrity
  2015-06-17 23:54 ` Dan Williams
@ 2015-06-17 23:55   ` Dan Williams
  -1 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-17 23:55 UTC (permalink / raw)
  To: axboe, linux-nvdimm
  Cc: boaz, toshi.kani, Vishal Verma, linux-kernel, hch, linux-acpi,
	linux-fsdevel, mingo

From: Vishal Verma <vishal.l.verma@linux.intel.com>

Support multiple block sizes (sector + metadata) using the blk integrity
framework. This registers a new integrity template that defines the
protection information tuple size based on the configured metadata size,
and simply acts as a passthrough for protection information generated by
another layer. The metadata is written to the storage as-is, and read back
with each sector.
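
As a worked example of the size arithmetic, the sketch below reproduces
the subtraction done by btt_meta_size() and the offset used by
btt_rw_integrity() in plain C.  All sketch_* names are hypothetical.
One assumption is not visible in this patch: that the data portion of an
LBA is 4096 bytes when lbasize >= 4096 and 512 bytes otherwise.  The
helpers are only meaningful for the sizes listed in
btt_lbasize_supported[].

```c
#include <stdint.h>

#define SKETCH_SECTOR_SHIFT 9	/* set_capacity() counts 512-byte units */

/*
 * Assumption (not shown in this patch): the data portion is 4096 bytes
 * for lbasize >= 4096, otherwise 512 bytes.  Requires lbasize >= 512.
 */
static uint32_t sketch_sector_size(uint32_t lbasize)
{
	return lbasize >= 4096 ? 4096 : 512;
}

/* Same subtraction as btt_meta_size(): metadata bytes per LBA. */
static uint32_t sketch_meta_size(uint32_t lbasize)
{
	return lbasize - sketch_sector_size(lbasize);
}

/*
 * The metadata sits directly after the sector data, matching
 * meta_nsoff = to_namespace_offset(...) + btt->sector_size.
 */
static uint64_t sketch_meta_offset(uint64_t data_nsoff, uint32_t lbasize)
{
	return data_nsoff + sketch_sector_size(lbasize);
}

/* Disk capacity in 512-byte sectors: only the data portion counts. */
static uint64_t sketch_capacity(uint64_t nlba, uint32_t lbasize)
{
	return (nlba * sketch_sector_size(lbasize)) >> SKETCH_SECTOR_SHIFT;
}
```

Under that assumption a 520-byte LBA carries 8 bytes of metadata and a
4160-byte LBA carries 64, while 512 and 4096 carry none, which is why
nd_integrity_init() is only called when btt_meta_size() is non-zero.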

Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/nvdimm/btt.c      |  126 +++++++++++++++++++++++++++++++++++++++------
 drivers/nvdimm/btt.h      |    2 -
 drivers/nvdimm/btt_devs.c |    3 +
 drivers/nvdimm/core.c     |   37 +++++++++++++
 drivers/nvdimm/nd.h       |    1 
 5 files changed, 151 insertions(+), 18 deletions(-)

diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 58becbd69ae1..c337b7abfb43 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -837,6 +837,11 @@ static int btt_meta_init(struct btt *btt)
 	return ret;
 }
 
+static u32 btt_meta_size(struct btt *btt)
+{
+	return btt->lbasize - btt->sector_size;
+}
+
 /*
  * This function calculates the arena in which the given LBA lies
  * by doing a linear walk. This is acceptable since we expect only
@@ -921,8 +926,63 @@ static void zero_fill_data(struct page *page, unsigned int off, u32 len)
 	kunmap_atomic(mem);
 }
 
-static int btt_read_pg(struct btt *btt, struct page *page, unsigned int off,
-			sector_t sector, unsigned int len)
+#ifdef CONFIG_BLK_DEV_INTEGRITY
+static int btt_rw_integrity(struct btt *btt, struct bio_integrity_payload *bip,
+			struct arena_info *arena, u32 postmap, int rw)
+{
+	unsigned int len = btt_meta_size(btt);
+	u64 meta_nsoff;
+	int ret = 0;
+
+	if (bip == NULL)
+		return 0;
+
+	meta_nsoff = to_namespace_offset(arena, postmap) + btt->sector_size;
+
+	while (len) {
+		unsigned int cur_len;
+		struct bio_vec bv;
+		void *mem;
+
+		bv = bvec_iter_bvec(bip->bip_vec, bip->bip_iter);
+		/*
+		 * The 'bv' obtained from bvec_iter_bvec has its .bv_len and
+		 * .bv_offset already adjusted for iter->bi_bvec_done, and we
+		 * can use those directly
+		 */
+
+		cur_len = min(len, bv.bv_len);
+		mem = kmap_atomic(bv.bv_page);
+		if (rw)
+			ret = arena_write_bytes(arena, meta_nsoff,
+					mem + bv.bv_offset, cur_len);
+		else
+			ret = arena_read_bytes(arena, meta_nsoff,
+					mem + bv.bv_offset, cur_len);
+
+		kunmap_atomic(mem);
+		if (ret)
+			return ret;
+
+		len -= cur_len;
+		meta_nsoff += cur_len;
+		bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
+	}
+
+	return ret;
+}
+
+#else /* CONFIG_BLK_DEV_INTEGRITY */
+static int btt_rw_integrity(struct btt *btt, struct bio_integrity_payload *bip,
+			struct arena_info *arena, u32 postmap, int rw)
+{
+	return 0;
+}
+#endif
+
+static int btt_read_pg(struct btt *btt, struct bio_integrity_payload *bip,
+			struct page *page, unsigned int off, sector_t sector,
+			unsigned int len)
 {
 	int ret = 0;
 	int t_flag, e_flag;
@@ -984,6 +1044,12 @@ static int btt_read_pg(struct btt *btt, struct page *page, unsigned int off,
 		if (ret)
 			goto out_rtt;
 
+		if (bip) {
+			ret = btt_rw_integrity(btt, bip, arena, postmap, READ);
+			if (ret)
+				goto out_rtt;
+		}
+
 		arena->rtt[lane] = RTT_INVALID;
 		nd_region_release_lane(btt->nd_region, lane);
 
@@ -1001,8 +1067,9 @@ static int btt_read_pg(struct btt *btt, struct page *page, unsigned int off,
 	return ret;
 }
 
-static int btt_write_pg(struct btt *btt, sector_t sector, struct page *page,
-		unsigned int off, unsigned int len)
+static int btt_write_pg(struct btt *btt, struct bio_integrity_payload *bip,
+			sector_t sector, struct page *page, unsigned int off,
+			unsigned int len)
 {
 	int ret = 0;
 	struct arena_info *arena = NULL;
@@ -1036,12 +1103,19 @@ static int btt_write_pg(struct btt *btt, sector_t sector, struct page *page,
 		if (new_postmap >= arena->internal_nlba) {
 			ret = -EIO;
 			goto out_lane;
-		} else
-			ret = btt_data_write(arena, new_postmap, page,
-						off, cur_len);
+		}
+
+		ret = btt_data_write(arena, new_postmap, page, off, cur_len);
 		if (ret)
 			goto out_lane;
 
+		if (bip) {
+			ret = btt_rw_integrity(btt, bip, arena, new_postmap,
+						WRITE);
+			if (ret)
+				goto out_lane;
+		}
+
 		lock_map(arena, premap);
 		ret = btt_map_read(arena, premap, &old_postmap, NULL, NULL);
 		if (ret)
@@ -1081,18 +1155,18 @@ static int btt_write_pg(struct btt *btt, sector_t sector, struct page *page,
 	return ret;
 }
 
-static int btt_do_bvec(struct btt *btt, struct page *page,
-			unsigned int len, unsigned int off, int rw,
-			sector_t sector)
+static int btt_do_bvec(struct btt *btt, struct bio_integrity_payload *bip,
+			struct page *page, unsigned int len, unsigned int off,
+			int rw, sector_t sector)
 {
 	int ret;
 
 	if (rw == READ) {
-		ret = btt_read_pg(btt, page, off, sector, len);
+		ret = btt_read_pg(btt, bip, page, off, sector, len);
 		flush_dcache_page(page);
 	} else {
 		flush_dcache_page(page);
-		ret = btt_write_pg(btt, sector, page, off, len);
+		ret = btt_write_pg(btt, bip, sector, page, off, len);
 	}
 
 	return ret;
@@ -1100,6 +1174,7 @@ static int btt_do_bvec(struct btt *btt, struct page *page,
 
 static void btt_make_request(struct request_queue *q, struct bio *bio)
 {
+	struct bio_integrity_payload *bip = bio_integrity(bio);
 	struct block_device *bdev = bio->bi_bdev;
 	struct btt *btt = q->queuedata;
 	int rw;
@@ -1120,6 +1195,17 @@ static void btt_make_request(struct request_queue *q, struct bio *bio)
 	if (rw == READA)
 		rw = READ;
 
+	/*
+	 * bio_integrity_enabled also checks if the bio already has an
+	 * integrity payload attached. If it does, we *don't* do a
+	 * bio_integrity_prep here - the payload has been generated by
+	 * another kernel subsystem, and we just pass it through.
+	 */
+	if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
+		err = -EIO;
+		goto out;
+	}
+
 	bio_for_each_segment(bvec, bio, iter) {
 		unsigned int len = bvec.bv_len;
 
@@ -1129,7 +1215,7 @@ static void btt_make_request(struct request_queue *q, struct bio *bio)
 		BUG_ON(len < btt->sector_size);
 		BUG_ON(len % btt->sector_size);
 
-		err = btt_do_bvec(btt, bvec.bv_page, len, bvec.bv_offset,
+		err = btt_do_bvec(btt, bip, bvec.bv_page, len, bvec.bv_offset,
 				rw, sector);
 		if (err) {
 			dev_info(&btt->nd_btt->dev,
@@ -1150,7 +1236,7 @@ static int btt_rw_page(struct block_device *bdev, sector_t sector,
 {
 	struct btt *btt = bdev->bd_disk->private_data;
 
-	btt_do_bvec(btt, page, PAGE_CACHE_SIZE, 0, rw, sector);
+	btt_do_bvec(btt, NULL, page, PAGE_CACHE_SIZE, 0, rw, sector);
 	page_endio(page, rw & WRITE, 0);
 	return 0;
 }
@@ -1204,10 +1290,17 @@ static int btt_blk_init(struct btt *btt)
 	blk_queue_logical_block_size(btt->btt_queue, btt->sector_size);
 	btt->btt_queue->queuedata = btt;
 
-	set_capacity(btt->btt_disk,
-			btt->nlba * btt->sector_size >> SECTOR_SHIFT);
+	set_capacity(btt->btt_disk, 0);
 	add_disk(btt->btt_disk);
 
+	if (btt_meta_size(btt)) {
+		ret = nd_integrity_init(btt->btt_disk, btt_meta_size(btt));
+		if (ret)
+			goto out_free_queue;
+	}
+
+	set_capacity(btt->btt_disk, btt->nlba * btt->sector_size >> 9);
+
 	return 0;
 
 out_free_queue:
@@ -1217,6 +1310,7 @@ out_free_queue:
 
 static void btt_blk_cleanup(struct btt *btt)
 {
+	blk_integrity_unregister(btt->btt_disk);
 	del_gendisk(btt->btt_disk);
 	put_disk(btt->btt_disk);
 	blk_cleanup_queue(btt->btt_queue);
diff --git a/drivers/nvdimm/btt.h b/drivers/nvdimm/btt.h
index 8c95a7792c3e..2caa0ef7e67a 100644
--- a/drivers/nvdimm/btt.h
+++ b/drivers/nvdimm/btt.h
@@ -31,7 +31,7 @@
 #define ARENA_MAX_SIZE (1ULL << 39)	/* 512 GB */
 #define RTT_VALID (1UL << 31)
 #define RTT_INVALID 0
-#define INT_LBASIZE_ALIGNMENT 256
+#define INT_LBASIZE_ALIGNMENT 64
 #define BTT_PG_SIZE 4096
 #define BTT_DEFAULT_NFREE ND_MAX_LANES
 #define LOG_SEQ_INIT 1
diff --git a/drivers/nvdimm/btt_devs.c b/drivers/nvdimm/btt_devs.c
index c03c854f892b..02e125b91e77 100644
--- a/drivers/nvdimm/btt_devs.c
+++ b/drivers/nvdimm/btt_devs.c
@@ -53,7 +53,8 @@ struct nd_btt *to_nd_btt(struct device *dev)
 }
 EXPORT_SYMBOL(to_nd_btt);
 
-static const unsigned long btt_lbasize_supported[] = { 512, 4096, 0 };
+static const unsigned long btt_lbasize_supported[] = { 512, 520, 528,
+	4096, 4104, 4160, 4224, 0 };
 
 static ssize_t sector_size_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
diff --git a/drivers/nvdimm/core.c b/drivers/nvdimm/core.c
index 0fa9b6225450..36f112995c0c 100644
--- a/drivers/nvdimm/core.c
+++ b/drivers/nvdimm/core.c
@@ -13,6 +13,7 @@
 #include <linux/libnvdimm.h>
 #include <linux/export.h>
 #include <linux/module.h>
+#include <linux/blkdev.h>
 #include <linux/device.h>
 #include <linux/ctype.h>
 #include <linux/ndctl.h>
@@ -381,6 +382,42 @@ void nvdimm_bus_unregister(struct nvdimm_bus *nvdimm_bus)
 }
 EXPORT_SYMBOL_GPL(nvdimm_bus_unregister);
 
+#ifdef CONFIG_BLK_DEV_INTEGRITY
+static int nd_pi_nop_generate_verify(struct blk_integrity_iter *iter)
+{
+	return 0;
+}
+
+int nd_integrity_init(struct gendisk *disk, unsigned long meta_size)
+{
+	struct blk_integrity integrity = {
+		.name = "ND-PI-NOP",
+		.generate_fn = nd_pi_nop_generate_verify,
+		.verify_fn = nd_pi_nop_generate_verify,
+		.tuple_size = meta_size,
+		.tag_size = meta_size,
+	};
+	int ret;
+
+	ret = blk_integrity_register(disk, &integrity);
+	if (ret)
+		return ret;
+
+	blk_queue_max_integrity_segments(disk->queue, 1);
+
+	return 0;
+}
+EXPORT_SYMBOL(nd_integrity_init);
+
+#else /* CONFIG_BLK_DEV_INTEGRITY */
+int nd_integrity_init(struct gendisk *disk, unsigned long meta_size)
+{
+	return 0;
+}
+EXPORT_SYMBOL(nd_integrity_init);
+
+#endif
+
 static __init int libnvdimm_init(void)
 {
 	int rc;
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index 2b7746e798fb..1cea3f191a83 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -128,6 +128,7 @@ enum nd_async_mode {
 	ND_ASYNC,
 };
 
+int nd_integrity_init(struct gendisk *disk, unsigned long meta_size);
 void wait_nvdimm_bus_probe_idle(struct device *dev);
 void nd_device_register(struct device *dev);
 void nd_device_unregister(struct device *dev, enum nd_async_mode mode);


^ permalink raw reply related	[flat|nested] 164+ messages in thread

* [PATCH 08/15] libnvdimm, btt: add support for blk integrity
@ 2015-06-17 23:55   ` Dan Williams
  0 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-17 23:55 UTC (permalink / raw)
  To: axboe, linux-nvdimm
  Cc: boaz, toshi.kani, Vishal Verma, linux-kernel, hch, linux-acpi,
	linux-fsdevel, mingo

From: Vishal Verma <vishal.l.verma@linux.intel.com>

Support multiple block sizes (sector + metadata) using the blk integrity
framework. This registers a new integrity template that defines the
protection information tuple size based on the configured metadata size,
and simply acts as a passthrough for protection information generated by
another layer. The metadata is written to the storage as-is, and read back
with each sector.

Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/nvdimm/btt.c      |  126 +++++++++++++++++++++++++++++++++++++++------
 drivers/nvdimm/btt.h      |    2 -
 drivers/nvdimm/btt_devs.c |    3 +
 drivers/nvdimm/core.c     |   37 +++++++++++++
 drivers/nvdimm/nd.h       |    1 
 5 files changed, 151 insertions(+), 18 deletions(-)

diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 58becbd69ae1..c337b7abfb43 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -837,6 +837,11 @@ static int btt_meta_init(struct btt *btt)
 	return ret;
 }
 
+static u32 btt_meta_size(struct btt *btt)
+{
+	return btt->lbasize - btt->sector_size;
+}
+
 /*
  * This function calculates the arena in which the given LBA lies
  * by doing a linear walk. This is acceptable since we expect only
@@ -921,8 +926,63 @@ static void zero_fill_data(struct page *page, unsigned int off, u32 len)
 	kunmap_atomic(mem);
 }
 
-static int btt_read_pg(struct btt *btt, struct page *page, unsigned int off,
-			sector_t sector, unsigned int len)
+#ifdef CONFIG_BLK_DEV_INTEGRITY
+static int btt_rw_integrity(struct btt *btt, struct bio_integrity_payload *bip,
+			struct arena_info *arena, u32 postmap, int rw)
+{
+	unsigned int len = btt_meta_size(btt);
+	u64 meta_nsoff;
+	int ret = 0;
+
+	if (bip == NULL)
+		return 0;
+
+	meta_nsoff = to_namespace_offset(arena, postmap) + btt->sector_size;
+
+	while (len) {
+		unsigned int cur_len;
+		struct bio_vec bv;
+		void *mem;
+
+		bv = bvec_iter_bvec(bip->bip_vec, bip->bip_iter);
+		/*
+		 * The 'bv' obtained from bvec_iter_bvec has its .bv_len and
+		 * .bv_offset already adjusted for iter->bi_bvec_done, and we
+		 * can use those directly
+		 */
+
+		cur_len = min(len, bv.bv_len);
+		mem = kmap_atomic(bv.bv_page);
+		if (rw)
+			ret = arena_write_bytes(arena, meta_nsoff,
+					mem + bv.bv_offset, cur_len);
+		else
+			ret = arena_read_bytes(arena, meta_nsoff,
+					mem + bv.bv_offset, cur_len);
+
+		kunmap_atomic(mem);
+		if (ret)
+			return ret;
+
+		len -= cur_len;
+		meta_nsoff += cur_len;
+		bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
+	}
+
+	return ret;
+}
+
+#else /* CONFIG_BLK_DEV_INTEGRITY */
+static int btt_rw_integrity(struct btt *btt, struct bio_integrity_payload *bip,
+			struct arena_info *arena, u32 postmap, int rw)
+{
+	return 0;
+}
+#endif
+
+static int btt_read_pg(struct btt *btt, struct bio_integrity_payload *bip,
+			struct page *page, unsigned int off, sector_t sector,
+			unsigned int len)
 {
 	int ret = 0;
 	int t_flag, e_flag;
@@ -984,6 +1044,12 @@ static int btt_read_pg(struct btt *btt, struct page *page, unsigned int off,
 		if (ret)
 			goto out_rtt;
 
+		if (bip) {
+			ret = btt_rw_integrity(btt, bip, arena, postmap, READ);
+			if (ret)
+				goto out_rtt;
+		}
+
 		arena->rtt[lane] = RTT_INVALID;
 		nd_region_release_lane(btt->nd_region, lane);
 
@@ -1001,8 +1067,9 @@ static int btt_read_pg(struct btt *btt, struct page *page, unsigned int off,
 	return ret;
 }
 
-static int btt_write_pg(struct btt *btt, sector_t sector, struct page *page,
-		unsigned int off, unsigned int len)
+static int btt_write_pg(struct btt *btt, struct bio_integrity_payload *bip,
+			sector_t sector, struct page *page, unsigned int off,
+			unsigned int len)
 {
 	int ret = 0;
 	struct arena_info *arena = NULL;
@@ -1036,12 +1103,19 @@ static int btt_write_pg(struct btt *btt, sector_t sector, struct page *page,
 		if (new_postmap >= arena->internal_nlba) {
 			ret = -EIO;
 			goto out_lane;
-		} else
-			ret = btt_data_write(arena, new_postmap, page,
-						off, cur_len);
+		}
+
+		ret = btt_data_write(arena, new_postmap, page, off, cur_len);
 		if (ret)
 			goto out_lane;
 
+		if (bip) {
+			ret = btt_rw_integrity(btt, bip, arena, new_postmap,
+						WRITE);
+			if (ret)
+				goto out_lane;
+		}
+
 		lock_map(arena, premap);
 		ret = btt_map_read(arena, premap, &old_postmap, NULL, NULL);
 		if (ret)
@@ -1081,18 +1155,18 @@ static int btt_write_pg(struct btt *btt, sector_t sector, struct page *page,
 	return ret;
 }
 
-static int btt_do_bvec(struct btt *btt, struct page *page,
-			unsigned int len, unsigned int off, int rw,
-			sector_t sector)
+static int btt_do_bvec(struct btt *btt, struct bio_integrity_payload *bip,
+			struct page *page, unsigned int len, unsigned int off,
+			int rw, sector_t sector)
 {
 	int ret;
 
 	if (rw == READ) {
-		ret = btt_read_pg(btt, page, off, sector, len);
+		ret = btt_read_pg(btt, bip, page, off, sector, len);
 		flush_dcache_page(page);
 	} else {
 		flush_dcache_page(page);
-		ret = btt_write_pg(btt, sector, page, off, len);
+		ret = btt_write_pg(btt, bip, sector, page, off, len);
 	}
 
 	return ret;
@@ -1100,6 +1174,7 @@ static int btt_do_bvec(struct btt *btt, struct page *page,
 
 static void btt_make_request(struct request_queue *q, struct bio *bio)
 {
+	struct bio_integrity_payload *bip = bio_integrity(bio);
 	struct block_device *bdev = bio->bi_bdev;
 	struct btt *btt = q->queuedata;
 	int rw;
@@ -1120,6 +1195,17 @@ static void btt_make_request(struct request_queue *q, struct bio *bio)
 	if (rw == READA)
 		rw = READ;
 
+	/*
+	 * bio_integrity_enabled also checks if the bio already has an
+	 * integrity payload attached. If it does, we *don't* do a
+	 * bio_integrity_prep here - the payload has been generated by
+	 * another kernel subsystem, and we just pass it through.
+	 */
+	if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
+		err = -EIO;
+		goto out;
+	}
+
 	bio_for_each_segment(bvec, bio, iter) {
 		unsigned int len = bvec.bv_len;
 
@@ -1129,7 +1215,7 @@ static void btt_make_request(struct request_queue *q, struct bio *bio)
 		BUG_ON(len < btt->sector_size);
 		BUG_ON(len % btt->sector_size);
 
-		err = btt_do_bvec(btt, bvec.bv_page, len, bvec.bv_offset,
+		err = btt_do_bvec(btt, bip, bvec.bv_page, len, bvec.bv_offset,
 				rw, sector);
 		if (err) {
 			dev_info(&btt->nd_btt->dev,
@@ -1150,7 +1236,7 @@ static int btt_rw_page(struct block_device *bdev, sector_t sector,
 {
 	struct btt *btt = bdev->bd_disk->private_data;
 
-	btt_do_bvec(btt, page, PAGE_CACHE_SIZE, 0, rw, sector);
+	btt_do_bvec(btt, NULL, page, PAGE_CACHE_SIZE, 0, rw, sector);
 	page_endio(page, rw & WRITE, 0);
 	return 0;
 }
@@ -1204,10 +1290,17 @@ static int btt_blk_init(struct btt *btt)
 	blk_queue_logical_block_size(btt->btt_queue, btt->sector_size);
 	btt->btt_queue->queuedata = btt;
 
-	set_capacity(btt->btt_disk,
-			btt->nlba * btt->sector_size >> SECTOR_SHIFT);
+	set_capacity(btt->btt_disk, 0);
 	add_disk(btt->btt_disk);
 
+	if (btt_meta_size(btt)) {
+		ret = nd_integrity_init(btt->btt_disk, btt_meta_size(btt));
+		if (ret)
+			goto out_free_queue;
+	}
+
+	set_capacity(btt->btt_disk, btt->nlba * btt->sector_size >> 9);
+
 	return 0;
 
 out_free_queue:
@@ -1217,6 +1310,7 @@ out_free_queue:
 
 static void btt_blk_cleanup(struct btt *btt)
 {
+	blk_integrity_unregister(btt->btt_disk);
 	del_gendisk(btt->btt_disk);
 	put_disk(btt->btt_disk);
 	blk_cleanup_queue(btt->btt_queue);
diff --git a/drivers/nvdimm/btt.h b/drivers/nvdimm/btt.h
index 8c95a7792c3e..2caa0ef7e67a 100644
--- a/drivers/nvdimm/btt.h
+++ b/drivers/nvdimm/btt.h
@@ -31,7 +31,7 @@
 #define ARENA_MAX_SIZE (1ULL << 39)	/* 512 GB */
 #define RTT_VALID (1UL << 31)
 #define RTT_INVALID 0
-#define INT_LBASIZE_ALIGNMENT 256
+#define INT_LBASIZE_ALIGNMENT 64
 #define BTT_PG_SIZE 4096
 #define BTT_DEFAULT_NFREE ND_MAX_LANES
 #define LOG_SEQ_INIT 1
diff --git a/drivers/nvdimm/btt_devs.c b/drivers/nvdimm/btt_devs.c
index c03c854f892b..02e125b91e77 100644
--- a/drivers/nvdimm/btt_devs.c
+++ b/drivers/nvdimm/btt_devs.c
@@ -53,7 +53,8 @@ struct nd_btt *to_nd_btt(struct device *dev)
 }
 EXPORT_SYMBOL(to_nd_btt);
 
-static const unsigned long btt_lbasize_supported[] = { 512, 4096, 0 };
+static const unsigned long btt_lbasize_supported[] = { 512, 520, 528,
+	4096, 4104, 4160, 4224, 0 };
 
 static ssize_t sector_size_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
diff --git a/drivers/nvdimm/core.c b/drivers/nvdimm/core.c
index 0fa9b6225450..36f112995c0c 100644
--- a/drivers/nvdimm/core.c
+++ b/drivers/nvdimm/core.c
@@ -13,6 +13,7 @@
 #include <linux/libnvdimm.h>
 #include <linux/export.h>
 #include <linux/module.h>
+#include <linux/blkdev.h>
 #include <linux/device.h>
 #include <linux/ctype.h>
 #include <linux/ndctl.h>
@@ -381,6 +382,42 @@ void nvdimm_bus_unregister(struct nvdimm_bus *nvdimm_bus)
 }
 EXPORT_SYMBOL_GPL(nvdimm_bus_unregister);
 
+#ifdef CONFIG_BLK_DEV_INTEGRITY
+static int nd_pi_nop_generate_verify(struct blk_integrity_iter *iter)
+{
+	return 0;
+}
+
+int nd_integrity_init(struct gendisk *disk, unsigned long meta_size)
+{
+	struct blk_integrity integrity = {
+		.name = "ND-PI-NOP",
+		.generate_fn = nd_pi_nop_generate_verify,
+		.verify_fn = nd_pi_nop_generate_verify,
+		.tuple_size = meta_size,
+		.tag_size = meta_size,
+	};
+	int ret;
+
+	ret = blk_integrity_register(disk, &integrity);
+	if (ret)
+		return ret;
+
+	blk_queue_max_integrity_segments(disk->queue, 1);
+
+	return 0;
+}
+EXPORT_SYMBOL(nd_integrity_init);
+
+#else /* CONFIG_BLK_DEV_INTEGRITY */
+int nd_integrity_init(struct gendisk *disk, unsigned long meta_size)
+{
+	return 0;
+}
+EXPORT_SYMBOL(nd_integrity_init);
+
+#endif
+
 static __init int libnvdimm_init(void)
 {
 	int rc;
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index 2b7746e798fb..1cea3f191a83 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -128,6 +128,7 @@ enum nd_async_mode {
 	ND_ASYNC,
 };
 
+int nd_integrity_init(struct gendisk *disk, unsigned long meta_size);
 void wait_nvdimm_bus_probe_idle(struct device *dev);
 void nd_device_register(struct device *dev);
 void nd_device_unregister(struct device *dev, enum nd_async_mode mode);


^ permalink raw reply related	[flat|nested] 164+ messages in thread

* [PATCH 09/15] libnvdimm, blk: add support for blk integrity
  2015-06-17 23:54 ` Dan Williams
@ 2015-06-17 23:55   ` Dan Williams
  -1 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-17 23:55 UTC (permalink / raw)
  To: axboe, linux-nvdimm
  Cc: boaz, toshi.kani, Vishal Verma, linux-kernel, hch, linux-acpi,
	linux-fsdevel, mingo

From: Vishal Verma <vishal.l.verma@linux.intel.com>

Support multiple block sizes (sector + metadata) for nd_blk in the
same way as done for the BTT. Add the idea of an 'internal' lbasize,
which is properly aligned and padded, and store metadata in this space.
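
As a rough illustration of the layout described above (not part of the
patch), the user-space sketch below mirrors the arithmetic used in
nd_blk_probe() and nd_blk_rw_integrity(). The struct and helper names
(blk_layout, blk_layout_init, data_offset, meta_offset,
capacity_sectors) are hypothetical and exist only for this example; the
formulas are the ones in the diff that follows.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9
#define INT_LBASIZE_ALIGNMENT 64

/* Hypothetical user-space model of the per-device sizes kept by nd_blk. */
struct blk_layout {
	uint32_t lbasize;          /* external size: sector + metadata */
	uint32_t sector_size;      /* data portion, 512 or 4096 */
	uint32_t meta_size;        /* integrity metadata bytes per sector */
	uint32_t internal_lbasize; /* lbasize padded to the alignment */
};

static uint32_t roundup_u32(uint32_t x, uint32_t align)
{
	assert(align != 0);
	return ((x + align - 1) / align) * align;
}

static struct blk_layout blk_layout_init(uint32_t lbasize)
{
	struct blk_layout l;

	l.lbasize = lbasize;
	l.sector_size = (lbasize >= 4096) ? 4096 : 512;
	l.meta_size = lbasize - l.sector_size;
	l.internal_lbasize = roundup_u32(lbasize, INT_LBASIZE_ALIGNMENT);
	return l;
}

/* Namespace offset of the data for a given lba. */
static uint64_t data_offset(const struct blk_layout *l, uint64_t lba)
{
	return lba * l->internal_lbasize;
}

/* Namespace offset of the metadata: it directly follows the data. */
static uint64_t meta_offset(const struct blk_layout *l, uint64_t lba)
{
	return lba * l->internal_lbasize + l->sector_size;
}

/* Capacity in 512-byte units once padding and metadata are subtracted. */
static uint64_t capacity_sectors(const struct blk_layout *l,
				 uint64_t disk_size)
{
	uint64_t internal_nlba = disk_size / l->internal_lbasize;

	return (internal_nlba * l->sector_size) >> SECTOR_SHIFT;
}
```

For example, an lbasize of 520 yields a 512-byte sector, 8 bytes of
metadata and an internal lbasize of 576, so 56 bytes of each internal
block are padding. An lbasize of 4104 likewise pads out to 4160.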

Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/nvdimm/blk.c            |  168 ++++++++++++++++++++++++++++++++++-----
 drivers/nvdimm/btt.h            |    1 
 drivers/nvdimm/bus.c            |   28 +++++--
 drivers/nvdimm/core.c           |    3 +
 drivers/nvdimm/namespace_devs.c |    3 -
 drivers/nvdimm/nd.h             |    3 +
 6 files changed, 174 insertions(+), 32 deletions(-)

diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
index a2749b5e43d7..feddad325f97 100644
--- a/drivers/nvdimm/blk.c
+++ b/drivers/nvdimm/blk.c
@@ -27,10 +27,17 @@ struct nd_blk_device {
 	struct nd_namespace_blk *nsblk;
 	struct nd_blk_region *ndbr;
 	size_t disk_size;
+	u32 sector_size;
+	u32 internal_lbasize;
 };
 
 static int nd_blk_major;
 
+static u32 nd_blk_meta_size(struct nd_blk_device *blk_dev)
+{
+	return blk_dev->nsblk->lbasize - blk_dev->sector_size;
+}
+
 static resource_size_t to_dev_offset(struct nd_namespace_blk *nsblk,
 				resource_size_t ns_offset, unsigned int len)
 {
@@ -52,13 +59,114 @@ static resource_size_t to_dev_offset(struct nd_namespace_blk *nsblk,
 	return SIZE_MAX;
 }
 
+#ifdef CONFIG_BLK_DEV_INTEGRITY
+static int nd_blk_rw_integrity(struct nd_blk_device *blk_dev,
+				struct bio_integrity_payload *bip, u64 lba,
+				int rw)
+{
+	unsigned int len = nd_blk_meta_size(blk_dev);
+	resource_size_t	dev_offset, ns_offset;
+	struct nd_namespace_blk *nsblk;
+	struct nd_blk_region *ndbr;
+	int err = 0;
+
+	nsblk = blk_dev->nsblk;
+	ndbr = blk_dev->ndbr;
+	ns_offset = lba * blk_dev->internal_lbasize + blk_dev->sector_size;
+	dev_offset = to_dev_offset(nsblk, ns_offset, len);
+	if (dev_offset == SIZE_MAX)
+		return -EIO;
+
+	while (len) {
+		unsigned int cur_len;
+		struct bio_vec bv;
+		void *iobuf;
+
+		bv = bvec_iter_bvec(bip->bip_vec, bip->bip_iter);
+		/*
+		 * The 'bv' obtained from bvec_iter_bvec has its .bv_len and
+		 * .bv_offset already adjusted for iter->bi_bvec_done, and we
+		 * can use those directly
+		 */
+
+		cur_len = min(len, bv.bv_len);
+		iobuf = kmap_atomic(bv.bv_page);
+		err = ndbr->do_io(ndbr, dev_offset, iobuf + bv.bv_offset,
+				cur_len, rw);
+		kunmap_atomic(iobuf);
+		if (err)
+			return err;
+
+		len -= cur_len;
+		dev_offset += cur_len;
+		bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
+	}
+
+	return err;
+}
+
+#else /* CONFIG_BLK_DEV_INTEGRITY */
+static int nd_blk_rw_integrity(struct nd_blk_device *blk_dev,
+				struct bio_integrity_payload *bip, u64 lba,
+				int rw)
+{
+	return 0;
+}
+#endif
+
+static int nd_blk_do_bvec(struct nd_blk_device *blk_dev,
+			struct bio_integrity_payload *bip, struct page *page,
+			unsigned int len, unsigned int off, int rw,
+			sector_t sector)
+{
+	struct nd_blk_region *ndbr = blk_dev->ndbr;
+	resource_size_t	dev_offset, ns_offset;
+	int err = 0;
+	void *iobuf;
+	u64 lba;
+
+	while (len) {
+		unsigned int cur_len;
+
+		/*
+		 * If we don't have an integrity payload, we don't have to
+		 * split the bvec into sectors, as this would cause unnecessary
+		 * Block Window setup/move steps. The do_io routine is capable
+		 * of handling len <= PAGE_SIZE.
+		 */
+		cur_len = bip ? min(len, blk_dev->sector_size) : len;
+
+		lba = div_u64(sector << SECTOR_SHIFT, blk_dev->sector_size);
+		ns_offset = lba * blk_dev->internal_lbasize;
+		dev_offset = to_dev_offset(blk_dev->nsblk, ns_offset, cur_len);
+		if (dev_offset == SIZE_MAX)
+			return -EIO;
+
+		iobuf = kmap_atomic(page);
+		err = ndbr->do_io(ndbr, dev_offset, iobuf + off, cur_len, rw);
+		kunmap_atomic(iobuf);
+		if (err)
+			return err;
+
+		if (bip) {
+			err = nd_blk_rw_integrity(blk_dev, bip, lba, rw);
+			if (err)
+				return err;
+		}
+		len -= cur_len;
+		off += cur_len;
+		sector += blk_dev->sector_size >> SECTOR_SHIFT;
+	}
+
+	return err;
+}
+
 static void nd_blk_make_request(struct request_queue *q, struct bio *bio)
 {
 	struct block_device *bdev = bio->bi_bdev;
 	struct gendisk *disk = bdev->bd_disk;
-	struct nd_namespace_blk *nsblk;
+	struct bio_integrity_payload *bip;
 	struct nd_blk_device *blk_dev;
-	struct nd_blk_region *ndbr;
 	struct bvec_iter iter;
 	struct bio_vec bvec;
 	int err = 0, rw;
@@ -74,29 +182,33 @@ static void nd_blk_make_request(struct request_queue *q, struct bio *bio)
 
 	rw = bio_data_dir(bio);
 
+	/*
+	 * bio_integrity_enabled also checks if the bio already has an
+	 * integrity payload attached. If it does, we *don't* do a
+	 * bio_integrity_prep here - the payload has been generated by
+	 * another kernel subsystem, and we just pass it through.
+	 */
+	if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
+		err = -EIO;
+		goto out;
+	}
+
+	bip = bio_integrity(bio);
 	blk_dev = disk->private_data;
-	nsblk = blk_dev->nsblk;
-	ndbr = blk_dev->ndbr;
+
 	bio_for_each_segment(bvec, bio, iter) {
 		unsigned int len = bvec.bv_len;
-		resource_size_t	dev_offset;
-		void *iobuf;
 
 		BUG_ON(len > PAGE_SIZE);
-
-		dev_offset = to_dev_offset(nsblk, sector << SECTOR_SHIFT, len);
-		if (dev_offset == SIZE_MAX) {
-			err = -EIO;
+		err = nd_blk_do_bvec(blk_dev, bip, bvec.bv_page, len,
+					bvec.bv_offset, rw, sector);
+		if (err) {
+			dev_info(&blk_dev->nsblk->dev,
+					"io error in %s sector %llu, len %u\n",
+					(rw == READ) ? "READ" : "WRITE",
+					(unsigned long long) sector, len);
 			goto out;
 		}
-
-		iobuf = kmap_atomic(bvec.bv_page);
-		err = ndbr->do_io(ndbr, dev_offset, iobuf + bvec.bv_offset,
-				len, rw);
-		kunmap_atomic(iobuf);
-		if (err)
-			goto out;
-
 		sector += len >> SECTOR_SHIFT;
 	}
 
@@ -135,7 +247,8 @@ static int nd_blk_probe(struct device *dev)
 	struct nd_namespace_blk *nsblk = to_nd_namespace_blk(dev);
 	struct nd_region *nd_region = to_nd_region(dev->parent);
 	struct nd_blk_device *blk_dev;
-	resource_size_t disk_size;
+	resource_size_t disk_size, available_disk_size;
+	u64 internal_nlba;
 	struct gendisk *disk;
 	int err;
 
@@ -148,6 +261,9 @@ static int nd_blk_probe(struct device *dev)
 		return -ENOMEM;
 
 	blk_dev->disk_size	= disk_size;
+	blk_dev->sector_size = ((nsblk->lbasize >= 4096) ? 4096 : 512);
+	blk_dev->internal_lbasize = roundup(nsblk->lbasize,
+						INT_LBASIZE_ALIGNMENT);
 
 	blk_dev->queue = blk_alloc_queue(GFP_KERNEL);
 	if (!blk_dev->queue) {
@@ -158,7 +274,7 @@ static int nd_blk_probe(struct device *dev)
 	blk_queue_make_request(blk_dev->queue, nd_blk_make_request);
 	blk_queue_max_hw_sectors(blk_dev->queue, 1024);
 	blk_queue_bounce_limit(blk_dev->queue, BLK_BOUNCE_ANY);
-	blk_queue_logical_block_size(blk_dev->queue, nsblk->lbasize);
+	blk_queue_logical_block_size(blk_dev->queue, blk_dev->sector_size);
 
 	disk = blk_dev->disk = alloc_disk(0);
 	if (!disk) {
@@ -177,13 +293,21 @@ static int nd_blk_probe(struct device *dev)
 	disk->queue		= blk_dev->queue;
 	disk->flags		= GENHD_FL_EXT_DEVT;
 	sprintf(disk->disk_name, "ndblk%d.%d", nd_region->id, nsblk->id);
-	set_capacity(disk, disk_size >> SECTOR_SHIFT);
+	set_capacity(disk, 0);
 
+	internal_nlba = div_u64(disk_size, blk_dev->internal_lbasize);
+	available_disk_size = internal_nlba * blk_dev->sector_size;
 	dev_set_drvdata(dev, blk_dev);
-	nvdimm_bus_add_disk(disk);
+	err = nvdimm_bus_add_integrity_disk(disk, nd_blk_meta_size(blk_dev),
+			available_disk_size >> SECTOR_SHIFT);
+	if (err)
+		goto err_add_disk;
 
 	return 0;
 
+ err_add_disk:
+	del_gendisk(disk);
+	put_disk(disk);
  err_alloc_disk:
 	blk_cleanup_queue(blk_dev->queue);
  err_alloc_queue:
diff --git a/drivers/nvdimm/btt.h b/drivers/nvdimm/btt.h
index 2caa0ef7e67a..75b0d80a6bd9 100644
--- a/drivers/nvdimm/btt.h
+++ b/drivers/nvdimm/btt.h
@@ -31,7 +31,6 @@
 #define ARENA_MAX_SIZE (1ULL << 39)	/* 512 GB */
 #define RTT_VALID (1UL << 31)
 #define RTT_INVALID 0
-#define INT_LBASIZE_ALIGNMENT 64
 #define BTT_PG_SIZE 4096
 #define BTT_DEFAULT_NFREE ND_MAX_LANES
 #define LOG_SEQ_INIT 1
diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index 3c14fee5aff4..d4fbc48f5643 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -243,17 +243,13 @@ int __nd_driver_register(struct nd_device_driver *nd_drv, struct module *owner,
 }
 EXPORT_SYMBOL(__nd_driver_register);
 
-/**
- * nvdimm_bus_add_disk() - attach and run actions on an nvdimm block device
- * @disk: disk device being registered
- *
- * Note, that @disk must be a descendant of an nvdimm_bus
- */
-int nvdimm_bus_add_disk(struct gendisk *disk)
+int nvdimm_bus_add_integrity_disk(struct gendisk *disk, u32 lbasize,
+		sector_t size)
 {
 	const struct block_device_operations *ops = disk->fops;
 	struct device *dev = disk->driverfs_dev;
 	struct nvdimm_bus *nvdimm_bus;
+	int rc;
 
 	nvdimm_bus = walk_to_nvdimm_bus(dev);
 	if (!nvdimm_bus || !ops->rw_bytes)
@@ -266,10 +262,25 @@ int nvdimm_bus_add_disk(struct gendisk *disk)
 	 */
 	nvdimm_bus_lock(&nvdimm_bus->dev);
 	add_disk(disk);
+	rc = nd_integrity_init(disk, lbasize);
+	if (size)
+		set_capacity(disk, size);
 	nd_btt_add_disk(nvdimm_bus, disk);
 	nvdimm_bus_unlock(&nvdimm_bus->dev);
 
-	return 0;
+	return rc;
+}
+EXPORT_SYMBOL(nvdimm_bus_add_integrity_disk);
+
+/**
+ * nvdimm_bus_add_disk() - attach and run actions on an nvdimm block device
+ * @disk: disk device being registered
+ *
+ * Note, that @disk must be a descendant of an nvdimm_bus
+ */
+int nvdimm_bus_add_disk(struct gendisk *disk)
+{
+	return nvdimm_bus_add_integrity_disk(disk, 0, 0);
 }
 EXPORT_SYMBOL(nvdimm_bus_add_disk);
 
@@ -292,6 +303,7 @@ void nvdimm_bus_remove_disk(struct gendisk *disk)
 	 */
 	nd_synchronize();
 
+	blk_integrity_unregister(disk);
 	del_gendisk(disk);
 	put_disk(disk);
 }
diff --git a/drivers/nvdimm/core.c b/drivers/nvdimm/core.c
index 36f112995c0c..8f466c384b30 100644
--- a/drivers/nvdimm/core.c
+++ b/drivers/nvdimm/core.c
@@ -399,6 +399,9 @@ int nd_integrity_init(struct gendisk *disk, unsigned long meta_size)
 	};
 	int ret;
 
+	if (meta_size == 0)
+		return 0;
+
 	ret = blk_integrity_register(disk, &integrity);
 	if (ret)
 		return ret;
diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index 68780d768e7b..0fe541a1df49 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -1013,7 +1013,8 @@ static ssize_t resource_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(resource);
 
-static const unsigned long ns_lbasize_supported[] = { 512, 0 };
+static const unsigned long ns_lbasize_supported[] = { 512, 520, 528,
+	4096, 4104, 4160, 4224, 0 };
 
 static ssize_t sector_size_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index 1cea3f191a83..9b5fdb2215b1 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -29,6 +29,7 @@ enum {
 	 */
 	ND_MAX_LANES = 256,
 	SECTOR_SHIFT = 9,
+	INT_LBASIZE_ALIGNMENT = 64,
 };
 
 struct nvdimm_drvdata {
@@ -159,6 +160,8 @@ void nvdimm_bus_lock(struct device *dev);
 void nvdimm_bus_unlock(struct device *dev);
 bool is_nvdimm_bus_locked(struct device *dev);
 int nvdimm_bus_add_disk(struct gendisk *disk);
+int nvdimm_bus_add_integrity_disk(struct gendisk *disk, u32 lbasize,
+		sector_t size);
 void nvdimm_bus_remove_disk(struct gendisk *disk);
 void nvdimm_drvdata_release(struct kref *kref);
 void put_ndd(struct nvdimm_drvdata *ndd);


^ permalink raw reply related	[flat|nested] 164+ messages in thread

* [PATCH 10/15] libnvdimm: fix up max_hw_sectors
  2015-06-17 23:54 ` Dan Williams
@ 2015-06-17 23:55   ` Dan Williams
  -1 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-17 23:55 UTC (permalink / raw)
  To: axboe, linux-nvdimm
  Cc: boaz, toshi.kani, Vishal Verma, linux-kernel, hch, linux-acpi,
	linux-fsdevel, mingo

There is no hardware limit to enforce on the size of the i/o that can be
passed to an nd block device, so set max_hw_sectors to UINT_MAX.  Do this
centrally for all nd block devices in nd_blk_queue_init().
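
To put the change in perspective (this is an illustration, not part of
the patch): max_hw_sectors is expressed in 512-byte units, so the old
value of 1024 capped a single request at 512 KiB. The helper name
max_request_bytes below is hypothetical, and other queue limits may
still constrain the effective request size.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

#define SECTOR_SHIFT 9

/* Largest request, in bytes, that a given max_hw_sectors value allows. */
static uint64_t max_request_bytes(unsigned int max_hw_sectors)
{
	/* Widen before shifting so UINT_MAX does not overflow. */
	return (uint64_t)max_hw_sectors << SECTOR_SHIFT;
}
```

With this helper, max_request_bytes(1024) is 524288 bytes (512 KiB),
while max_request_bytes(UINT_MAX) is UINT_MAX * 512 bytes, just under
2 TiB where unsigned int is 32 bits wide.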

Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/nvdimm/blk.c  |    3 +--
 drivers/nvdimm/btt.c  |    3 +--
 drivers/nvdimm/core.c |    7 +++++++
 drivers/nvdimm/nd.h   |    1 +
 drivers/nvdimm/pmem.c |    3 +--
 5 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
index feddad325f97..8a6345797a71 100644
--- a/drivers/nvdimm/blk.c
+++ b/drivers/nvdimm/blk.c
@@ -272,9 +272,8 @@ static int nd_blk_probe(struct device *dev)
 	}
 
 	blk_queue_make_request(blk_dev->queue, nd_blk_make_request);
-	blk_queue_max_hw_sectors(blk_dev->queue, 1024);
-	blk_queue_bounce_limit(blk_dev->queue, BLK_BOUNCE_ANY);
 	blk_queue_logical_block_size(blk_dev->queue, blk_dev->sector_size);
+	nd_blk_queue_init(blk_dev->queue);
 
 	disk = blk_dev->disk = alloc_disk(0);
 	if (!disk) {
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index c337b7abfb43..380e01cedd24 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -1285,9 +1285,8 @@ static int btt_blk_init(struct btt *btt)
 	btt->btt_disk->flags = GENHD_FL_EXT_DEVT;
 
 	blk_queue_make_request(btt->btt_queue, btt_make_request);
-	blk_queue_max_hw_sectors(btt->btt_queue, 1024);
-	blk_queue_bounce_limit(btt->btt_queue, BLK_BOUNCE_ANY);
 	blk_queue_logical_block_size(btt->btt_queue, btt->sector_size);
+	nd_blk_queue_init(btt->btt_queue);
 	btt->btt_queue->queuedata = btt;
 
 	set_capacity(btt->btt_disk, 0);
diff --git a/drivers/nvdimm/core.c b/drivers/nvdimm/core.c
index 8f466c384b30..d27b13357873 100644
--- a/drivers/nvdimm/core.c
+++ b/drivers/nvdimm/core.c
@@ -214,6 +214,13 @@ ssize_t nd_sector_size_store(struct device *dev, const char *buf,
 	}
 }
 
+void nd_blk_queue_init(struct request_queue *q)
+{
+	blk_queue_max_hw_sectors(q, UINT_MAX);
+	blk_queue_bounce_limit(q, BLK_BOUNCE_ANY);
+}
+EXPORT_SYMBOL(nd_blk_queue_init);
+
 static ssize_t commands_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index 9b5fdb2215b1..2f20d5dca028 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -171,5 +171,6 @@ struct resource *nvdimm_allocate_dpa(struct nvdimm_drvdata *ndd,
 		struct nd_label_id *label_id, resource_size_t start,
 		resource_size_t n);
 int nd_blk_region_init(struct nd_region *nd_region);
+void nd_blk_queue_init(struct request_queue *q);
 resource_size_t nd_namespace_blk_validate(struct nd_namespace_blk *nsblk);
 #endif /* __ND_H__ */
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 1f4767150975..b825a2201aa8 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -172,8 +172,7 @@ static struct pmem_device *pmem_alloc(struct device *dev,
 		goto out_unmap;
 
 	blk_queue_make_request(pmem->pmem_queue, pmem_make_request);
-	blk_queue_max_hw_sectors(pmem->pmem_queue, 1024);
-	blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY);
+	nd_blk_queue_init(pmem->pmem_queue);
 
 	disk = alloc_disk(0);
 	if (!disk)



* [PATCH 11/15] libnvdimm: pmem, blk, and btt make_request cleanups
  2015-06-17 23:54 ` Dan Williams
@ 2015-06-17 23:55   ` Dan Williams
  -1 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-17 23:55 UTC (permalink / raw)
  To: axboe, linux-nvdimm
  Cc: boaz, toshi.kani, Vishal Verma, linux-kernel, hch, linux-acpi,
	linux-fsdevel, mingo

Various cleanups:

1/ kill the BUG_ONs since we've already told the block layer we don't
   support DISCARD on all these drivers.

2/ Fix up use of 'rw'.  There is no need to cache it in the pmem driver,
   and for btt, using bio_data_dir() saves a check for READA.

3/ Kill the local 'sector' variables.  bio_for_each_segment() is already
   advancing the iterator's sector number by the bio_vec length.
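
As a rough userspace illustration of point 3/, the sketch below walks a
list of segments while tracking the sector two ways: through an iterator
that is advanced by each segment's length, and through a hand-maintained
local like the one this patch deletes.  The toy_* types and helpers are
hypothetical stand-ins invented for this example, not the kernel's
bio_vec / bvec_iter API; the only behavior borrowed from the block layer
is that advancing the iterator adds the segment length in 512-byte
sectors to its sector field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors */

/* Hypothetical stand-ins for struct bio_vec and struct bvec_iter. */
struct toy_bvec { unsigned int len; };		/* segment length in bytes */
struct toy_iter { uint64_t sector; size_t idx; };

/* Step past one segment, bumping the sector by the segment length. */
static void toy_iter_advance(struct toy_iter *it, const struct toy_bvec *bv)
{
	/* assumption: segments are whole multiples of the sector size */
	assert(bv->len % (1u << SECTOR_SHIFT) == 0);
	it->sector += bv->len >> SECTOR_SHIFT;
	it->idx++;
}

/*
 * Walk n segments starting at 'start' and count how many times the
 * iterator's sector agrees with a manually maintained local sector.
 */
static size_t toy_count_matching(const struct toy_bvec *vec, size_t n,
		uint64_t start)
{
	struct toy_iter it = { .sector = start, .idx = 0 };
	uint64_t sector = start;	/* the redundant local */
	size_t matches = 0;

	while (it.idx < n) {
		const struct toy_bvec *bv = &vec[it.idx];

		if (it.sector == sector)
			matches++;
		sector += bv->len >> SECTOR_SHIFT;	/* manual bookkeeping */
		toy_iter_advance(&it, bv);		/* iterator bookkeeping */
	}
	return matches;
}
```

If the two counters agree on every segment, toy_count_matching() returns
n, which is the property that lets the drivers pass iter.bi_sector
directly instead of carrying their own copy.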

Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/nvdimm/blk.c  |   14 ++++----------
 drivers/nvdimm/btt.c  |   18 +++++-------------
 drivers/nvdimm/pmem.c |   20 ++++++--------------
 3 files changed, 15 insertions(+), 37 deletions(-)

diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
index 8a6345797a71..9d609ef95266 100644
--- a/drivers/nvdimm/blk.c
+++ b/drivers/nvdimm/blk.c
@@ -170,18 +170,12 @@ static void nd_blk_make_request(struct request_queue *q, struct bio *bio)
 	struct bvec_iter iter;
 	struct bio_vec bvec;
 	int err = 0, rw;
-	sector_t sector;
 
-	sector = bio->bi_iter.bi_sector;
-	if (bio_end_sector(bio) > get_capacity(disk)) {
+	if (unlikely(bio_end_sector(bio) > get_capacity(disk))) {
 		err = -EIO;
 		goto out;
 	}
 
-	BUG_ON(bio->bi_rw & REQ_DISCARD);
-
-	rw = bio_data_dir(bio);
-
 	/*
 	 * bio_integrity_enabled also checks if the bio already has an
 	 * integrity payload attached. If it does, we *don't* do a
@@ -196,20 +190,20 @@ static void nd_blk_make_request(struct request_queue *q, struct bio *bio)
 	bip = bio_integrity(bio);
 	blk_dev = disk->private_data;
 
+	rw = bio_data_dir(bio);
 	bio_for_each_segment(bvec, bio, iter) {
 		unsigned int len = bvec.bv_len;
 
 		BUG_ON(len > PAGE_SIZE);
 		err = nd_blk_do_bvec(blk_dev, bip, bvec.bv_page, len,
-					bvec.bv_offset, rw, sector);
+					bvec.bv_offset, rw, iter.bi_sector);
 		if (err) {
 			dev_info(&blk_dev->nsblk->dev,
 					"io error in %s sector %lld, len %d,\n",
 					(rw == READ) ? "READ" : "WRITE",
-					(unsigned long long) sector, len);
+					(unsigned long long) iter.bi_sector, len);
 			goto out;
 		}
-		sector += len >> SECTOR_SHIFT;
 	}
 
  out:
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 380e01cedd24..83b798dd2e68 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -1177,23 +1177,16 @@ static void btt_make_request(struct request_queue *q, struct bio *bio)
 	struct bio_integrity_payload *bip = bio_integrity(bio);
 	struct block_device *bdev = bio->bi_bdev;
 	struct btt *btt = q->queuedata;
-	int rw;
-	struct bio_vec bvec;
-	sector_t sector;
 	struct bvec_iter iter;
-	int err = 0;
+	struct bio_vec bvec;
+	int err = 0, rw;
 
-	sector = bio->bi_iter.bi_sector;
 	if (bio_end_sector(bio) > get_capacity(bdev->bd_disk)) {
 		err = -EIO;
 		goto out;
 	}
 
-	BUG_ON(bio->bi_rw & REQ_DISCARD);
-
-	rw = bio_rw(bio);
-	if (rw == READA)
-		rw = READ;
+	rw = bio_data_dir(bio);
 
 	/*
 	 * bio_integrity_enabled also checks if the bio already has an
@@ -1216,15 +1209,14 @@ static void btt_make_request(struct request_queue *q, struct bio *bio)
 		BUG_ON(len % btt->sector_size);
 
 		err = btt_do_bvec(btt, bip, bvec.bv_page, len, bvec.bv_offset,
-				rw, sector);
+				rw, iter.bi_sector);
 		if (err) {
 			dev_info(&btt->nd_btt->dev,
 					"io error in %s sector %lld, len %d,\n",
 					(rw == READ) ? "READ" : "WRITE",
-					(unsigned long long) sector, len);
+					(unsigned long long) iter.bi_sector, len);
 			goto out;
 		}
-		sector += len >> SECTOR_SHIFT;
 	}
 
 out:
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index b825a2201aa8..0337b00f5409 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -58,28 +58,20 @@ static void pmem_do_bvec(struct pmem_device *pmem, struct page *page,
 
 static void pmem_make_request(struct request_queue *q, struct bio *bio)
 {
-	struct block_device *bdev = bio->bi_bdev;
-	struct pmem_device *pmem = bdev->bd_disk->private_data;
-	int rw;
+	int err = 0;
 	struct bio_vec bvec;
-	sector_t sector;
 	struct bvec_iter iter;
-	int err = 0;
+	struct block_device *bdev = bio->bi_bdev;
+	struct pmem_device *pmem = bdev->bd_disk->private_data;
 
-	if (bio_end_sector(bio) > get_capacity(bdev->bd_disk)) {
+	if (unlikely(bio_end_sector(bio) > get_capacity(bdev->bd_disk))) {
 		err = -EIO;
 		goto out;
 	}
 
-	BUG_ON(bio->bi_rw & REQ_DISCARD);
-
-	rw = bio_data_dir(bio);
-	sector = bio->bi_iter.bi_sector;
-	bio_for_each_segment(bvec, bio, iter) {
+	bio_for_each_segment(bvec, bio, iter)
 		pmem_do_bvec(pmem, bvec.bv_page, bvec.bv_len, bvec.bv_offset,
-			     rw, sector);
-		sector += bvec.bv_len >> 9;
-	}
+				bio_data_dir(bio), iter.bi_sector);
 
 out:
 	bio_endio(bio, err);



* [PATCH 12/15] libnvdimm: enable iostat
  2015-06-17 23:54 ` Dan Williams
@ 2015-06-17 23:55   ` Dan Williams
  -1 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-17 23:55 UTC (permalink / raw)
  To: axboe, linux-nvdimm
  Cc: boaz, toshi.kani, Vishal Verma, linux-kernel, hch, linux-acpi,
	linux-fsdevel, mingo

I/O statistics gathering is disabled by default because its overhead is
prohibitive relative to the latency of persistent memory i/o, but if the
user turns it on we'll oblige.
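
The gating pattern can be sketched in userspace as follows.  This is a
simplified model under stated assumptions, not the driver code: the
toy_* structures are hypothetical, timestamps are passed in by the
caller instead of being read from jiffies, and the per-cpu partition
statistics are collapsed into plain counters.  What it keeps from the
patch is the shape of the fast path: one flag test up front, and no
accounting work at all when the flag is clear.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the request queue flag and disk stats. */
struct toy_queue { bool io_stat; };
struct toy_stats {
	unsigned long ios;	/* completed-or-started request count */
	unsigned long sectors;	/* sectors transferred */
	unsigned long ticks;	/* accumulated duration */
	int in_flight;		/* requests currently being serviced */
};
struct toy_disk { struct toy_queue queue; struct toy_stats stats; };

/* Returns false, touching nothing, when statistics are switched off. */
static bool toy_iostat_start(struct toy_disk *d, unsigned long nsect,
		unsigned long now, unsigned long *start)
{
	if (!d->queue.io_stat)
		return false;

	*start = now;
	d->stats.ios++;
	d->stats.sectors += nsect;
	d->stats.in_flight++;
	return true;
}

static void toy_iostat_end(struct toy_disk *d, unsigned long now,
		unsigned long start)
{
	d->stats.ticks += now - start;
	d->stats.in_flight--;
	assert(d->stats.in_flight >= 0);
}

/* Mirrors the do_acct pattern used in the make_request functions. */
static void toy_make_request(struct toy_disk *d, unsigned long nsect,
		unsigned long t_submit, unsigned long t_done)
{
	unsigned long start = 0;
	bool do_acct = toy_iostat_start(d, nsect, t_submit, &start);

	/* ... the actual data transfer would happen here ... */

	if (do_acct)
		toy_iostat_end(d, t_done, start);
}
```

Remembering the start decision in do_acct, rather than re-testing the
flag at the end, keeps the in-flight count balanced even if the flag is
flipped while a request is being serviced.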

Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/nvdimm/Kconfig |   14 ++++++++++++++
 drivers/nvdimm/blk.c   |    7 ++++++-
 drivers/nvdimm/btt.c   |    7 ++++++-
 drivers/nvdimm/core.c  |   31 +++++++++++++++++++++++++++++++
 drivers/nvdimm/nd.h    |   13 +++++++++++++
 drivers/nvdimm/pmem.c  |    5 +++++
 6 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index 912cb36b8435..9d72085a67c9 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -73,4 +73,18 @@ config ND_MAX_REGIONS
 
 	  Leave the default of 64 if you are unsure.
 
+config ND_IOSTAT
+	bool "Enable iostat by default"
+	default n
+	---help---
+	  Persistent memory i/o has very low latency to the point
+	  where the overhead to measure statistics can dramatically
+	  impact the relative performance of the driver.  Say y here
+	  to trade off performance for statistics gathering that is
+	  enabled by default.  These statistics can always be
+	  enabled/disabled at run time via the 'iostat' attribute of
+	  the block device's queue in sysfs.
+
+	  If unsure, say N
+
 endif
diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
index 9d609ef95266..8a65e5a500d8 100644
--- a/drivers/nvdimm/blk.c
+++ b/drivers/nvdimm/blk.c
@@ -168,8 +168,10 @@ static void nd_blk_make_request(struct request_queue *q, struct bio *bio)
 	struct bio_integrity_payload *bip;
 	struct nd_blk_device *blk_dev;
 	struct bvec_iter iter;
+	unsigned long start;
 	struct bio_vec bvec;
 	int err = 0, rw;
+	bool do_acct;
 
 	if (unlikely(bio_end_sector(bio) > get_capacity(disk))) {
 		err = -EIO;
@@ -191,6 +193,7 @@ static void nd_blk_make_request(struct request_queue *q, struct bio *bio)
 	blk_dev = disk->private_data;
 
 	rw = bio_data_dir(bio);
+	do_acct = nd_iostat_start(bio, &start);
 	bio_for_each_segment(bvec, bio, iter) {
 		unsigned int len = bvec.bv_len;
 
@@ -202,9 +205,11 @@ static void nd_blk_make_request(struct request_queue *q, struct bio *bio)
 					"io error in %s sector %lld, len %d,\n",
 					(rw == READ) ? "READ" : "WRITE",
 					(unsigned long long) iter.bi_sector, len);
-			goto out;
+			break;
 		}
 	}
+	if (do_acct)
+		nd_iostat_end(bio, start);
 
  out:
 	bio_endio(bio, err);
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 83b798dd2e68..67484633c322 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -1178,8 +1178,10 @@ static void btt_make_request(struct request_queue *q, struct bio *bio)
 	struct block_device *bdev = bio->bi_bdev;
 	struct btt *btt = q->queuedata;
 	struct bvec_iter iter;
+	unsigned long start;
 	struct bio_vec bvec;
 	int err = 0, rw;
+	bool do_acct;
 
 	if (bio_end_sector(bio) > get_capacity(bdev->bd_disk)) {
 		err = -EIO;
@@ -1199,6 +1201,7 @@ static void btt_make_request(struct request_queue *q, struct bio *bio)
 		goto out;
 	}
 
+	do_acct = nd_iostat_start(bio, &start);
 	bio_for_each_segment(bvec, bio, iter) {
 		unsigned int len = bvec.bv_len;
 
@@ -1215,9 +1218,11 @@ static void btt_make_request(struct request_queue *q, struct bio *bio)
 					"io error in %s sector %lld, len %d,\n",
 					(rw == READ) ? "READ" : "WRITE",
 					(unsigned long long) iter.bi_sector, len);
-			goto out;
+			break;
 		}
 	}
+	if (do_acct)
+		nd_iostat_end(bio, start);
 
 out:
 	bio_endio(bio, err);
diff --git a/drivers/nvdimm/core.c b/drivers/nvdimm/core.c
index d27b13357873..99cf95af5f24 100644
--- a/drivers/nvdimm/core.c
+++ b/drivers/nvdimm/core.c
@@ -218,9 +218,40 @@ void nd_blk_queue_init(struct request_queue *q)
 {
 	blk_queue_max_hw_sectors(q, UINT_MAX);
 	blk_queue_bounce_limit(q, BLK_BOUNCE_ANY);
+	if (IS_ENABLED(CONFIG_ND_IOSTAT))
+		queue_flag_set_unlocked(QUEUE_FLAG_IO_STAT, q);
 }
 EXPORT_SYMBOL(nd_blk_queue_init);
 
+void __nd_iostat_start(struct bio *bio, unsigned long *start)
+{
+	struct gendisk *disk = bio->bi_bdev->bd_disk;
+	const int rw = bio_data_dir(bio);
+	int cpu = part_stat_lock();
+
+	*start = jiffies;
+	part_round_stats(cpu, &disk->part0);
+	part_stat_inc(cpu, &disk->part0, ios[rw]);
+	part_stat_add(cpu, &disk->part0, sectors[rw], bio_sectors(bio));
+	part_inc_in_flight(&disk->part0, rw);
+	part_stat_unlock();
+}
+EXPORT_SYMBOL(__nd_iostat_start);
+
+void nd_iostat_end(struct bio *bio, unsigned long start)
+{
+	struct gendisk *disk = bio->bi_bdev->bd_disk;
+	unsigned long duration = jiffies - start;
+	const int rw = bio_data_dir(bio);
+	int cpu = part_stat_lock();
+
+	part_stat_add(cpu, &disk->part0, ticks[rw], duration);
+	part_round_stats(cpu, &disk->part0);
+	part_dec_in_flight(&disk->part0, rw);
+	part_stat_unlock();
+}
+EXPORT_SYMBOL(nd_iostat_end);
+
 static ssize_t commands_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index 2f20d5dca028..3c4c8b6c64ec 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -13,6 +13,7 @@
 #ifndef __ND_H__
 #define __ND_H__
 #include <linux/libnvdimm.h>
+#include <linux/blkdev.h>
 #include <linux/device.h>
 #include <linux/genhd.h>
 #include <linux/mutex.h>
@@ -172,5 +173,17 @@ struct resource *nvdimm_allocate_dpa(struct nvdimm_drvdata *ndd,
 		resource_size_t n);
 int nd_blk_region_init(struct nd_region *nd_region);
 void nd_blk_queue_init(struct request_queue *q);
+void __nd_iostat_start(struct bio *bio, unsigned long *start);
+static inline bool nd_iostat_start(struct bio *bio, unsigned long *start)
+{
+	struct gendisk *disk = bio->bi_bdev->bd_disk;
+
+	if (!blk_queue_io_stat(disk->queue))
+		return false;
+
+	__nd_iostat_start(bio, start);
+	return true;
+}
+void nd_iostat_end(struct bio *bio, unsigned long start);
 resource_size_t nd_namespace_blk_validate(struct nd_namespace_blk *nsblk);
 #endif /* __ND_H__ */
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 0337b00f5409..3fd854a78f09 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -59,6 +59,8 @@ static void pmem_do_bvec(struct pmem_device *pmem, struct page *page,
 static void pmem_make_request(struct request_queue *q, struct bio *bio)
 {
 	int err = 0;
+	bool do_acct;
+	unsigned long start;
 	struct bio_vec bvec;
 	struct bvec_iter iter;
 	struct block_device *bdev = bio->bi_bdev;
@@ -69,9 +71,12 @@ static void pmem_make_request(struct request_queue *q, struct bio *bio)
 		goto out;
 	}
 
+	do_acct = nd_iostat_start(bio, &start);
 	bio_for_each_segment(bvec, bio, iter)
 		pmem_do_bvec(pmem, bvec.bv_page, bvec.bv_len, bvec.bv_offset,
 				bio_data_dir(bio), iter.bi_sector);
+	if (do_acct)
+		nd_iostat_end(bio, start);
 
 out:
 	bio_endio(bio, err);



* [PATCH 13/15] libnvdimm: flag libnvdimm block devices as non-rotational
  2015-06-17 23:54 ` Dan Williams
@ 2015-06-17 23:55   ` Dan Williams
  -1 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-17 23:55 UTC (permalink / raw)
  To: axboe, linux-nvdimm
  Cc: boaz, toshi.kani, Vishal Verma, linux-kernel, hch, linux-acpi,
	linux-fsdevel, mingo

...since they are effectively SSDs as far as userspace is concerned.

Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/nvdimm/core.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/nvdimm/core.c b/drivers/nvdimm/core.c
index 99cf95af5f24..fa82f215990d 100644
--- a/drivers/nvdimm/core.c
+++ b/drivers/nvdimm/core.c
@@ -220,6 +220,7 @@ void nd_blk_queue_init(struct request_queue *q)
 	blk_queue_bounce_limit(q, BLK_BOUNCE_ANY);
 	if (IS_ENABLED(CONFIG_ND_IOSTAT))
 		queue_flag_set_unlocked(QUEUE_FLAG_IO_STAT, q);
+	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, q);
 }
 EXPORT_SYMBOL(nd_blk_queue_init);
 



* [PATCH 14/15] libnvdimm: support read-only btt backing devices
  2015-06-17 23:54 ` Dan Williams
@ 2015-06-17 23:56   ` Dan Williams
  -1 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-17 23:56 UTC (permalink / raw)
  To: axboe, linux-nvdimm
  Cc: boaz, toshi.kani, linux-kernel, hch, linux-acpi, linux-fsdevel, mingo

Upon detection of a read-only backing device, arrange for the btt
device to be read-only.  Implement a catch for the BLKROSET ioctl and
only allow a btt-instance to become read-write when the backing-device
becomes read-write.  Conversely, if a backing-device becomes read-only
arrange for its parent btt to be marked read-only.  Synchronize these
changes under the bus lock.
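
One way to picture the policy is the small userspace model below.  It is
a sketch under assumptions, not the implementation in this patch: the
toy_* types and functions are hypothetical, locking is omitted, and the
-EACCES return value is chosen for illustration only (the errno the
driver actually reports is not stated in this description).  It reads
"only allow a btt-instance to become read-write when the backing-device
becomes read-write" as "refuse a read-write request on the btt while its
backing device is read-only".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the backing block device and the btt. */
struct toy_bdev { bool read_only; };
struct toy_btt { struct toy_bdev *backing; bool read_only; };

/* At setup/revalidation: inherit read-only from the backing device. */
static void toy_btt_revalidate(struct toy_btt *btt)
{
	assert(btt && btt->backing);
	if (btt->backing->read_only)
		btt->read_only = true;
}

/* Models a BLKROSET request aimed at the btt itself. */
static int toy_btt_set_ro(struct toy_btt *btt, bool ro)
{
	/* a btt may not be more writable than the device beneath it */
	if (!ro && btt->backing->read_only)
		return -EACCES;	/* illustrative errno, an assumption */
	btt->read_only = ro;
	return 0;
}

/* Models a BLKROSET request aimed at the backing device. */
static void toy_backing_set_ro(struct toy_btt *btt, bool ro)
{
	btt->backing->read_only = ro;
	if (ro)
		btt->read_only = true;	/* propagate upward to the parent btt */
}
```

In this model, making the backing device read-write does not by itself
flip the btt; it only lifts the restriction so that a later request on
the btt can succeed.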

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/nvdimm/blk.c      |    4 +++
 drivers/nvdimm/btt.c      |   34 ++++++++++++++++++++++++++--
 drivers/nvdimm/btt_devs.c |   42 ++++++++++++++++++++++++++++++++++
 drivers/nvdimm/bus.c      |   55 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/nvdimm/nd-core.h  |   14 +++++++++++
 drivers/nvdimm/nd.h       |    4 +++
 drivers/nvdimm/pmem.c     |    4 +++
 7 files changed, 154 insertions(+), 3 deletions(-)
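The read-only rules described above can be summarized with a small userspace model.  This is only a sketch: struct ro_model and the model_* helpers are hypothetical names, and the model deliberately omits the bus lock, the unbound-driver case, and the "already in the requested state" shortcut that the real ioctl handler takes.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model: one backing namespace and its btt. */
struct ro_model {
	bool backing_ro;	/* read-only state of the backing device */
	bool btt_ro;		/* read-only state of the btt device */
};

/* Mirror of the BLKROSET argument check: only 0 and 1 are accepted. */
int model_check_arg(int ro)
{
	return (ro == 0 || ro == 1) ? 0 : -EINVAL;
}

/* BLKROSET on the btt: rw is only allowed while the backing device is rw. */
int model_set_btt_ro(struct ro_model *m, int ro)
{
	int rc = model_check_arg(ro);

	if (rc)
		return rc;
	if (m->backing_ro && !ro)
		return -EBUSY;
	m->btt_ro = ro;
	return 0;
}

/* BLKROSET on the backing device: ro propagates to the btt, rw does not. */
int model_set_backing_ro(struct ro_model *m, int ro)
{
	int rc = model_check_arg(ro);

	if (rc)
		return rc;
	m->backing_ro = ro;
	if (ro)
		m->btt_ro = true;
	return 0;
}
```

Note the asymmetry: marking the backing device read-write leaves the btt read-only until it is explicitly switched back.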

diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
index 8a65e5a500d8..adacc27f04f1 100644
--- a/drivers/nvdimm/blk.c
+++ b/drivers/nvdimm/blk.c
@@ -239,6 +239,10 @@ static int nd_blk_rw_bytes(struct gendisk *disk, resource_size_t offset,
 static const struct block_device_operations nd_blk_fops = {
 	.owner = THIS_MODULE,
 	.rw_bytes = nd_blk_rw_bytes,
+	.ioctl = nvdimm_bdev_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl = nvdimm_bdev_compat_ioctl,
+#endif
 };
 
 static int nd_blk_probe(struct device *dev)
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 67484633c322..57d3b271e451 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -1248,10 +1248,29 @@ static int btt_getgeo(struct block_device *bd, struct hd_geometry *geo)
 	return 0;
 }
 
+static int btt_revalidate_disk(struct gendisk *disk)
+{
+	struct btt *btt = disk->private_data;
+	struct nd_btt *nd_btt = btt->nd_btt;
+	struct block_device *bdev = nd_btt->backing_dev;
+	char name[BDEVNAME_SIZE];
+
+	dev_dbg(&nd_btt->dev, "backing dev: %s read-%s\n", bdevname(bdev, name),
+			bdev_read_only(bdev) ? "only" : "write");
+	if (bdev_read_only(bdev))
+		set_disk_ro(disk, 1);
+	return 0;
+}
+
 static const struct block_device_operations btt_fops = {
 	.owner =		THIS_MODULE,
 	.rw_page =		btt_rw_page,
 	.getgeo =		btt_getgeo,
+	.ioctl =		nvdimm_bdev_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl =		nvdimm_bdev_compat_ioctl,
+#endif
+	.revalidate_disk =	btt_revalidate_disk,
 };
 
 static int btt_blk_init(struct btt *btt)
@@ -1296,6 +1315,7 @@ static int btt_blk_init(struct btt *btt)
 	}
 
 	set_capacity(btt->btt_disk, btt->nlba * btt->sector_size >> 9);
+	revalidate_disk(btt->btt_disk);
 
 	return 0;
 
@@ -1335,6 +1355,7 @@ static struct btt *btt_init(struct nd_btt *nd_btt, unsigned long long rawsize,
 	int ret;
 	struct btt *btt;
 	struct device *dev = &nd_btt->dev;
+	struct block_device *bdev = nd_btt->backing_dev;
 
 	btt = kzalloc(sizeof(struct btt), GFP_KERNEL);
 	if (!btt)
@@ -1354,7 +1375,13 @@ static struct btt *btt_init(struct nd_btt *nd_btt, unsigned long long rawsize,
 		goto out_free;
 	}
 
-	if (btt->init_state != INIT_READY) {
+	if (btt->init_state != INIT_READY && bdev_read_only(bdev)) {
+		char name[BDEVNAME_SIZE];
+
+		dev_info(dev, "%s is read-only, unable to init btt metadata\n",
+				bdevname(bdev, name));
+		goto out_free;
+	} else if (btt->init_state != INIT_READY) {
 		btt->num_arenas = (rawsize / ARENA_MAX_SIZE) +
 			((rawsize % ARENA_MAX_SIZE) ? 1 : 0);
 		dev_dbg(dev, "init: %d arenas for %llu rawsize\n",
@@ -1369,7 +1396,7 @@ static struct btt *btt_init(struct nd_btt *nd_btt, unsigned long long rawsize,
 		ret = btt_meta_init(btt);
 		if (ret) {
 			dev_err(dev, "init: error in meta_init: %d\n", ret);
-			return NULL;
+			goto out_free;
 		}
 	}
 
@@ -1481,7 +1508,10 @@ static int nd_btt_remove(struct device *dev)
 	struct nd_btt *nd_btt = to_nd_btt(dev);
 	struct btt *btt = dev_get_drvdata(dev);
 
+	nvdimm_bus_lock(dev);
 	btt_fini(btt);
+	nvdimm_bus_unlock(dev);
+
 	unlink_btt(nd_btt);
 
 	return 0;
diff --git a/drivers/nvdimm/btt_devs.c b/drivers/nvdimm/btt_devs.c
index 02e125b91e77..bcf77dca1532 100644
--- a/drivers/nvdimm/btt_devs.c
+++ b/drivers/nvdimm/btt_devs.c
@@ -122,7 +122,7 @@ static ssize_t backing_dev_show(struct device *dev,
 		return sprintf(buf, "\n");
 }
 
-static const fmode_t nd_btt_devs_mode = FMODE_READ | FMODE_WRITE | FMODE_EXCL;
+static const fmode_t nd_btt_devs_mode = FMODE_READ | FMODE_EXCL;
 
 static void nd_btt_remove_bdev(struct nd_btt *nd_btt, const char *caller)
 {
@@ -363,6 +363,46 @@ u64 nd_btt_sb_checksum(struct btt_sb *btt_sb)
 }
 EXPORT_SYMBOL(nd_btt_sb_checksum);
 
+int set_btt_ro(struct block_device *bdev, struct device *dev, int ro)
+{
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+
+	if (!dev->driver)
+		return 0;
+
+	/* we can only mark a btt device rw if its backing device is rw */
+	if (bdev_read_only(nd_btt->backing_dev) && !ro)
+		return -EBUSY;
+
+	set_device_ro(bdev, ro);
+	return 0;
+}
+
+int set_btt_disk_ro(struct device *dev, void *data)
+{
+	struct block_device *bdev = data;
+	struct nd_btt *nd_btt;
+	struct btt *btt;
+
+	if (!is_nd_btt(dev))
+		return 0;
+
+	nd_btt = to_nd_btt(dev);
+	if (nd_btt->backing_dev != bdev)
+		return 0;
+
+	/*
+	 * We have the lock at this point and have flushed probing.  We
+	 * are guaranteed that the btt driver is unbound, or has
+	 * completed setup operations and is blocked from initiating
+	 * disk teardown until we are done walking these pointers.
+	 */
+	btt = dev_get_drvdata(dev);
+	if (btt && btt->btt_disk)
+		set_disk_ro(btt->btt_disk, 1);
+	return 0;
+}
+
 static struct nd_btt *__nd_btt_autodetect(struct nvdimm_bus *nvdimm_bus,
 		struct block_device *bdev, struct btt_sb *btt_sb)
 {
diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index d4fbc48f5643..47260ca573e0 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -309,6 +309,61 @@ void nvdimm_bus_remove_disk(struct gendisk *disk)
 }
 EXPORT_SYMBOL(nvdimm_bus_remove_disk);
 
+static int set_namespace_ro(struct block_device *bdev,
+		struct nvdimm_bus *nvdimm_bus, int ro)
+{
+	set_device_ro(bdev, ro);
+
+	/*
+	 * It's possible to mark the backing device rw while leaving the
+	 * btt device read-only.  However, marking a backing device
+	 * read-only always marks the parent btt read-only.
+	 */
+	if (!ro)
+		return 0;
+	return device_for_each_child(&nvdimm_bus->dev, bdev, set_btt_disk_ro);
+}
+
+int nvdimm_bdev_ioctl(struct block_device *bdev, fmode_t mode,
+		unsigned int cmd, unsigned long arg)
+{
+	int rc, ro;
+	struct gendisk *disk = bdev->bd_disk;
+	struct device *dev = disk->driverfs_dev;
+	struct nvdimm_bus *nvdimm_bus = walk_to_nvdimm_bus(dev);
+
+	if (cmd != BLKROSET)
+		return -ENOTTY;
+
+	if (get_user(ro, (int __user *)(arg)))
+		return -EFAULT;
+
+	if (ro == 0 || ro == 1)
+		/* pass */;
+	else
+		return -EINVAL;
+
+	nvdimm_bus_lock(&nvdimm_bus->dev);
+	wait_nvdimm_bus_probe_idle(&nvdimm_bus->dev);
+	if (bdev_read_only(bdev) == ro)
+		rc = 0;
+	else if (is_nd_btt(dev))
+		rc = set_btt_ro(bdev, dev, ro);
+	else
+		rc = set_namespace_ro(bdev, nvdimm_bus, ro);
+	nvdimm_bus_unlock(&nvdimm_bus->dev);
+
+	return rc;
+}
+EXPORT_SYMBOL(nvdimm_bdev_ioctl);
+
+int nvdimm_bdev_compat_ioctl(struct block_device *bdev, fmode_t mode,
+		unsigned int cmd, unsigned long arg)
+{
+	return nvdimm_bdev_ioctl(bdev, mode, cmd, arg);
+}
+EXPORT_SYMBOL(nvdimm_bdev_compat_ioctl);
+
 static ssize_t modalias_show(struct device *dev, struct device_attribute *attr,
 		char *buf)
 {
diff --git a/drivers/nvdimm/nd-core.h b/drivers/nvdimm/nd-core.h
index 9a90915e6fd2..ba548d248b4e 100644
--- a/drivers/nvdimm/nd-core.h
+++ b/drivers/nvdimm/nd-core.h
@@ -49,11 +49,14 @@ bool is_nvdimm(struct device *dev);
 bool is_nd_pmem(struct device *dev);
 bool is_nd_blk(struct device *dev);
 struct gendisk;
+struct block_device;
 #if IS_ENABLED(CONFIG_ND_BTT_DEVS)
 bool is_nd_btt(struct device *dev);
 struct nd_btt *nd_btt_create(struct nvdimm_bus *nvdimm_bus);
 void nd_btt_add_disk(struct nvdimm_bus *nvdimm_bus, struct gendisk *disk);
 void nd_btt_remove_disk(struct nvdimm_bus *nvdimm_bus, struct gendisk *disk);
+int set_btt_ro(struct block_device *bdev, struct device *dev, int ro);
+int set_btt_disk_ro(struct device *dev, void *data);
 #else
 static inline bool is_nd_btt(struct device *dev)
 {
@@ -74,6 +77,17 @@ static inline void nd_btt_remove_disk(struct nvdimm_bus *nvdimm_bus,
 		struct gendisk *disk)
 {
 }
+
+static inline int set_btt_ro(struct block_device *bdev, struct device *dev,
+		int ro)
+{
+	return 0;
+}
+
+static inline int set_btt_disk_ro(struct device *dev, void *data)
+{
+	return 0;
+}
 #endif
 struct nvdimm_bus *walk_to_nvdimm_bus(struct device *nd_dev);
 int __init nvdimm_bus_init(void);
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index 3c4c8b6c64ec..2786eb8456ec 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -164,6 +164,10 @@ int nvdimm_bus_add_disk(struct gendisk *disk);
 int nvdimm_bus_add_integrity_disk(struct gendisk *disk, u32 lbasize,
 		sector_t size);
 void nvdimm_bus_remove_disk(struct gendisk *disk);
+int nvdimm_bdev_ioctl(struct block_device *bdev, fmode_t mode,
+		unsigned int cmd, unsigned long arg);
+int nvdimm_bdev_compat_ioctl(struct block_device *bdev, fmode_t mode,
+		unsigned int cmd, unsigned long arg);
 void nvdimm_drvdata_release(struct kref *kref);
 void put_ndd(struct nvdimm_drvdata *ndd);
 int nd_label_reserve_dpa(struct nvdimm_drvdata *ndd);
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 3fd854a78f09..96964419b72d 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -131,6 +131,10 @@ static const struct block_device_operations pmem_fops = {
 	.rw_page =		pmem_rw_page,
 	.rw_bytes =		pmem_rw_bytes,
 	.direct_access =	pmem_direct_access,
+	.ioctl =		nvdimm_bdev_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl =		nvdimm_bdev_ioctl,
+#endif
 };
 
 static struct pmem_device *pmem_alloc(struct device *dev,
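The BLKROSET handling in this patch can be exercised from userspace with the standard ioctl.  A minimal sketch, assuming a Linux system with <linux/fs.h> available; the helper name set_blockdev_ro() is hypothetical and any device path passed to it is the caller's choice.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>	/* BLKROSET */

/*
 * Set (ro = 1) or clear (ro = 0) the read-only flag of a block device.
 * Returns 0 on success, or -1 with errno set by open() or ioctl().
 */
int set_blockdev_ro(const char *path, int ro)
{
	int fd, rc, saved;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	rc = ioctl(fd, BLKROSET, &ro);
	saved = errno;		/* close() may clobber errno */
	close(fd);
	errno = saved;

	return rc < 0 ? -1 : 0;
}
```

Per the changelog, clearing the flag on a btt whose backing device is still read-only would be expected to fail with errno set to EBUSY.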


^ permalink raw reply related	[flat|nested] 164+ messages in thread

* [PATCH 15/15] libnvdimm, nfit: handle acpi_nfit_memory_map flags
  2015-06-17 23:54 ` Dan Williams
@ 2015-06-17 23:56   ` Dan Williams
  -1 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-17 23:56 UTC (permalink / raw)
  To: axboe, linux-nvdimm
  Cc: boaz, toshi.kani, linux-kernel, hch, linux-acpi, linux-fsdevel, mingo

The flags in this NFIT sub-structure indicate the state of the data on
the nvdimm relative to its energy source or last "flush to persistence".
For the most part there is nothing the driver can do but advertise the
state of these flags in sysfs and emit a message if firmware indicates
that the contents of the device may be corrupted.  However, when
ACPI_NFIT_MEM_ARMED is set, i.e. the nvdimm failed to arm, the driver can
arrange for the block devices incorporating that nvdimm to be marked
read-only.  This is a safe default: the data is still available and new
writes are held off until the administrator either forces read-write
mode or the energy source becomes armed.

A module parameter "force_rw" is added to allow the default to be
overridden.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/acpi/nfit.c              |   35 +++++++++++++++++++++++++++++++++++
 drivers/acpi/nfit.h              |    3 +++
 drivers/nvdimm/blk.c             |    1 +
 drivers/nvdimm/bus.c             |   27 +++++++++++++++++++++++++++
 drivers/nvdimm/nd-core.h         |    1 +
 drivers/nvdimm/nd.h              |    1 +
 drivers/nvdimm/pmem.c            |    1 +
 drivers/nvdimm/region_devs.c     |   13 +++++++++++++
 include/linux/libnvdimm.h        |    2 ++
 tools/testing/nvdimm/test/nfit.c |    3 +++
 10 files changed, 87 insertions(+)
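The flag handling can be sketched as a userspace model.  The MODEL_* bit positions below are an assumption based on the ACPI 6.0 NFIT memory-device state flags and are not shown in this patch; the model_* helper names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed bit layout of the memory-device state flags (ACPI 6.0 NFIT). */
#define MODEL_MEM_SAVE_FAILED		(1 << 0)
#define MODEL_MEM_RESTORE_FAILED	(1 << 1)
#define MODEL_MEM_FLUSH_FAILED		(1 << 2)
#define MODEL_MEM_ARMED			(1 << 3) /* set when the DIMM failed to arm */
#define MODEL_MEM_HEALTH_OBSERVED	(1 << 4)

/* Format the flags in the same order as the new sysfs "flags" attribute. */
int model_flags_show(uint16_t flags, char *buf, size_t len)
{
	return snprintf(buf, len, "%s%s%s%s%s\n",
			flags & MODEL_MEM_SAVE_FAILED ? "save " : "",
			flags & MODEL_MEM_RESTORE_FAILED ? "restore " : "",
			flags & MODEL_MEM_FLUSH_FAILED ? "flush " : "",
			flags & MODEL_MEM_ARMED ? "arm " : "",
			flags & MODEL_MEM_HEALTH_OBSERVED ? "smart " : "");
}

/* A DIMM is treated as unarmed unless force_rw overrides the default. */
bool model_dimm_unarmed(uint16_t flags, bool force_rw)
{
	return (flags & MODEL_MEM_ARMED) && !force_rw;
}

/* A disk goes read-only if any DIMM mapped into its region is unarmed. */
bool model_disk_ro(const uint16_t *dimm_flags, size_t n, bool force_rw)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (model_dimm_unarmed(dimm_flags[i], force_rw))
			return true;
	return false;
}
```

In the patch itself the force_rw decision is taken once, when the DIMM is registered, and the per-disk check happens at revalidate time.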

diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index 9363e4b0e6a7..5f645823d7d7 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -27,6 +27,10 @@ static bool force_enable_dimms;
 module_param(force_enable_dimms, bool, S_IRUGO|S_IWUSR);
 MODULE_PARM_DESC(force_enable_dimms, "Ignore _STA (ACPI DIMM device) status");
 
+static bool force_rw;
+module_param(force_rw, bool, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(force_rw, "Enable writes to DIMMs that failed to arm");
+
 static u8 nfit_uuid[NFIT_UUID_MAX][16];
 
 const u8 *to_nfit_uuid(enum nfit_uuids id)
@@ -664,6 +668,20 @@ static ssize_t serial_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(serial);
 
+static ssize_t flags_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	u16 flags = to_nfit_memdev(dev)->flags;
+
+	return sprintf(buf, "%s%s%s%s%s\n",
+			flags & ACPI_NFIT_MEM_SAVE_FAILED ? "save " : "",
+			flags & ACPI_NFIT_MEM_RESTORE_FAILED ? "restore " : "",
+			flags & ACPI_NFIT_MEM_FLUSH_FAILED ? "flush " : "",
+			flags & ACPI_NFIT_MEM_ARMED ? "arm " : "",
+			flags & ACPI_NFIT_MEM_HEALTH_OBSERVED ? "smart " : "");
+}
+static DEVICE_ATTR_RO(flags);
+
 static struct attribute *acpi_nfit_dimm_attributes[] = {
 	&dev_attr_handle.attr,
 	&dev_attr_phys_id.attr,
@@ -672,6 +690,7 @@ static struct attribute *acpi_nfit_dimm_attributes[] = {
 	&dev_attr_format.attr,
 	&dev_attr_serial.attr,
 	&dev_attr_rev_id.attr,
+	&dev_attr_flags.attr,
 	NULL,
 };
 
@@ -764,6 +783,7 @@ static int acpi_nfit_register_dimms(struct acpi_nfit_desc *acpi_desc)
 		struct nvdimm *nvdimm;
 		unsigned long flags = 0;
 		u32 device_handle;
+		u16 mem_flags;
 		int rc;
 
 		device_handle = __to_nfit_memdev(nfit_mem)->device_handle;
@@ -781,6 +801,10 @@ static int acpi_nfit_register_dimms(struct acpi_nfit_desc *acpi_desc)
 		if (nfit_mem->bdw && nfit_mem->memdev_pmem)
 			flags |= NDD_ALIASING;
 
+		mem_flags = __to_nfit_memdev(nfit_mem)->flags;
+		if ((mem_flags & ACPI_NFIT_MEM_ARMED) && !force_rw)
+			flags |= NDD_UNARMED;
+
 		rc = acpi_nfit_add_dimm(acpi_desc, nfit_mem, device_handle);
 		if (rc)
 			continue;
@@ -793,6 +817,17 @@ static int acpi_nfit_register_dimms(struct acpi_nfit_desc *acpi_desc)
 
 		nfit_mem->nvdimm = nvdimm;
 		dimm_count++;
+
+		if ((mem_flags & ACPI_NFIT_MEM_FAILED_MASK) == 0)
+			continue;
+
+		dev_info(acpi_desc->dev, "%s: failed: %s%s%s%s\n",
+				nvdimm_name(nvdimm),
+			mem_flags & ACPI_NFIT_MEM_SAVE_FAILED ? "save " : "",
+			mem_flags & ACPI_NFIT_MEM_RESTORE_FAILED ? "restore " : "",
+			mem_flags & ACPI_NFIT_MEM_FLUSH_FAILED ? "flush " : "",
+			mem_flags & ACPI_NFIT_MEM_ARMED ? "arm " : "");
+
 	}
 
 	return nvdimm_bus_check_dimm_count(acpi_desc->nvdimm_bus, dimm_count);
diff --git a/drivers/acpi/nfit.h b/drivers/acpi/nfit.h
index c62fffea8423..81f2e8c5a79c 100644
--- a/drivers/acpi/nfit.h
+++ b/drivers/acpi/nfit.h
@@ -22,6 +22,9 @@
 
 #define UUID_NFIT_BUS "2f10e7a4-9e91-11e4-89d3-123b93f75cba"
 #define UUID_NFIT_DIMM "4309ac30-0d11-11e4-9191-0800200c9a66"
+#define ACPI_NFIT_MEM_FAILED_MASK (ACPI_NFIT_MEM_SAVE_FAILED \
+		| ACPI_NFIT_MEM_RESTORE_FAILED | ACPI_NFIT_MEM_FLUSH_FAILED \
+		| ACPI_NFIT_MEM_ARMED)
 
 enum nfit_uuids {
 	NFIT_SPA_VOLATILE,
diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
index adacc27f04f1..0b359cb8aa9f 100644
--- a/drivers/nvdimm/blk.c
+++ b/drivers/nvdimm/blk.c
@@ -243,6 +243,7 @@ static const struct block_device_operations nd_blk_fops = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl = nvdimm_bdev_compat_ioctl,
 #endif
+	.revalidate_disk = nvdimm_revalidate_disk,
 };
 
 static int nd_blk_probe(struct device *dev)
diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index 47260ca573e0..67525f995859 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -243,6 +243,32 @@ int __nd_driver_register(struct nd_device_driver *nd_drv, struct module *owner,
 }
 EXPORT_SYMBOL(__nd_driver_register);
 
+int nvdimm_revalidate_disk(struct gendisk *disk)
+{
+	int i;
+	struct device *dev = disk->driverfs_dev;
+	struct nd_region *nd_region = walk_to_nd_region(dev);
+
+	if (!nd_region)
+		return 0;
+
+	for (i = 0; i < nd_region->ndr_mappings; i++) {
+		struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+		struct nvdimm *nvdimm = nd_mapping->nvdimm;
+
+		if ((nvdimm->flags & NDD_UNARMED) && !get_disk_ro(disk)) {
+			dev_dbg(dev, "%s: unarmed, marking disk %s ro\n",
+					dev_name(&nvdimm->dev),
+					dev_name(disk_to_dev(disk)));
+			set_disk_ro(disk, 1);
+			break;
+		}
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(nvdimm_revalidate_disk);
+
 int nvdimm_bus_add_integrity_disk(struct gendisk *disk, u32 lbasize,
 		sector_t size)
 {
@@ -265,6 +291,7 @@ int nvdimm_bus_add_integrity_disk(struct gendisk *disk, u32 lbasize,
 	rc = nd_integrity_init(disk, lbasize);
 	if (size)
 		set_capacity(disk, size);
+	revalidate_disk(disk);
 	nd_btt_add_disk(nvdimm_bus, disk);
 	nvdimm_bus_unlock(&nvdimm_bus->dev);
 
diff --git a/drivers/nvdimm/nd-core.h b/drivers/nvdimm/nd-core.h
index ba548d248b4e..0cac05aca8f5 100644
--- a/drivers/nvdimm/nd-core.h
+++ b/drivers/nvdimm/nd-core.h
@@ -110,6 +110,7 @@ bool nd_is_uuid_unique(struct device *dev, u8 *uuid);
 struct nd_region;
 struct nvdimm_drvdata;
 struct nd_mapping;
+struct nd_region *walk_to_nd_region(struct device *nd_dev);
 resource_size_t nd_pmem_available_dpa(struct nd_region *nd_region,
 		struct nd_mapping *nd_mapping, resource_size_t *overlap);
 resource_size_t nd_blk_available_dpa(struct nd_mapping *nd_mapping);
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index 2786eb8456ec..011d7c51b5da 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -168,6 +168,7 @@ int nvdimm_bdev_ioctl(struct block_device *bdev, fmode_t mode,
 		unsigned int cmd, unsigned long arg);
 int nvdimm_bdev_compat_ioctl(struct block_device *bdev, fmode_t mode,
 		unsigned int cmd, unsigned long arg);
+int nvdimm_revalidate_disk(struct gendisk *disk);
 void nvdimm_drvdata_release(struct kref *kref);
 void put_ndd(struct nvdimm_drvdata *ndd);
 int nd_label_reserve_dpa(struct nvdimm_drvdata *ndd);
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 96964419b72d..b69278424dff 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -135,6 +135,7 @@ static const struct block_device_operations pmem_fops = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl =		nvdimm_bdev_ioctl,
 #endif
+	.revalidate_disk =	nvdimm_revalidate_disk,
 };
 
 static struct pmem_device *pmem_alloc(struct device *dev,
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index b16ec20dbba2..bb9f329c3b9f 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -83,6 +83,19 @@ struct nd_blk_region *to_nd_blk_region(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(to_nd_blk_region);
 
+struct nd_region *walk_to_nd_region(struct device *nd_dev)
+{
+	struct device *dev;
+
+	for (dev = nd_dev; dev; dev = dev->parent)
+		if (dev->type->release == nd_region_release)
+			break;
+	dev_WARN_ONCE(nd_dev, !dev, "invalid dev, not an nd_region descendant\n");
+	if (dev)
+		return to_nd_region(dev);
+	return NULL;
+}
+
 void *nd_region_provider_data(struct nd_region *nd_region)
 {
 	return nd_region->provider_data;
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index 7fc1b25bdb5d..dc799a29ed1a 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -21,6 +21,8 @@
 enum {
 	/* when a dimm supports both PMEM and BLK access a label is required */
 	NDD_ALIASING = 1 << 0,
+	/* unarmed memory devices may not persist writes */
+	NDD_UNARMED = 1 << 1,
 
 	/* need to set a limit somewhere, but yes, this is likely overkill */
 	ND_IOCTL_MAX_BUFLEN = SZ_4M,
diff --git a/tools/testing/nvdimm/test/nfit.c b/tools/testing/nvdimm/test/nfit.c
index 416f8fbf9881..c57ecb4e421d 100644
--- a/tools/testing/nvdimm/test/nfit.c
+++ b/tools/testing/nvdimm/test/nfit.c
@@ -873,6 +873,9 @@ static void nfit_test1_setup(struct nfit_test *t)
 	memdev->address = 0;
 	memdev->interleave_index = 0;
 	memdev->interleave_ways = 1;
+	memdev->flags = ACPI_NFIT_MEM_SAVE_FAILED | ACPI_NFIT_MEM_RESTORE_FAILED
+		| ACPI_NFIT_MEM_FLUSH_FAILED | ACPI_NFIT_MEM_HEALTH_OBSERVED
+		| ACPI_NFIT_MEM_ARMED;
 
 	offset += sizeof(*memdev);
 	/* dcr-descriptor0 */


^ permalink raw reply related	[flat|nested] 164+ messages in thread

* [PATCH 15/15] libnvdimm, nfit: handle acpi_nfit_memory_map flags
@ 2015-06-17 23:56   ` Dan Williams
  0 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-17 23:56 UTC (permalink / raw)
  To: axboe, linux-nvdimm
  Cc: boaz, toshi.kani, linux-kernel, hch, linux-acpi, linux-fsdevel, mingo

The flags in this NFIT sub-structure indicate the state of the data on
the nvdimm relative to its energy source or last "flush to persistence".
For the most part there is nothing the driver can do but advertise the
state of these flags in sysfs and emit a message if firmware indicates
that the contents of the device may be corrupted.  However, when
ACPI_NFIT_MEM_ARMED is set, i.e. the nvdimm failed to arm, the driver can
arrange for the block devices incorporating that nvdimm to be marked
read-only.  This is a safe default: the data is still available and new
writes are held off until the administrator either forces read-write
mode or the energy source becomes armed.

A module parameter "force_rw" is added to allow the default to be
overridden.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/acpi/nfit.c              |   35 +++++++++++++++++++++++++++++++++++
 drivers/acpi/nfit.h              |    3 +++
 drivers/nvdimm/blk.c             |    1 +
 drivers/nvdimm/bus.c             |   27 +++++++++++++++++++++++++++
 drivers/nvdimm/nd-core.h         |    1 +
 drivers/nvdimm/nd.h              |    1 +
 drivers/nvdimm/pmem.c            |    1 +
 drivers/nvdimm/region_devs.c     |   13 +++++++++++++
 include/linux/libnvdimm.h        |    2 ++
 tools/testing/nvdimm/test/nfit.c |    3 +++
 10 files changed, 87 insertions(+)

diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index 9363e4b0e6a7..5f645823d7d7 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -27,6 +27,10 @@ static bool force_enable_dimms;
 module_param(force_enable_dimms, bool, S_IRUGO|S_IWUSR);
 MODULE_PARM_DESC(force_enable_dimms, "Ignore _STA (ACPI DIMM device) status");
 
+static bool force_rw;
+module_param(force_rw, bool, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(force_rw, "Enable writes to DIMMs that failed to arm");
+
 static u8 nfit_uuid[NFIT_UUID_MAX][16];
 
 const u8 *to_nfit_uuid(enum nfit_uuids id)
@@ -664,6 +668,20 @@ static ssize_t serial_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(serial);
 
+static ssize_t flags_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	u16 flags = to_nfit_memdev(dev)->flags;
+
+	return sprintf(buf, "%s%s%s%s%s\n",
+			flags & ACPI_NFIT_MEM_SAVE_FAILED ? "save " : "",
+			flags & ACPI_NFIT_MEM_RESTORE_FAILED ? "restore " : "",
+			flags & ACPI_NFIT_MEM_FLUSH_FAILED ? "flush " : "",
+			flags & ACPI_NFIT_MEM_ARMED ? "arm " : "",
+			flags & ACPI_NFIT_MEM_HEALTH_OBSERVED ? "smart " : "");
+}
+static DEVICE_ATTR_RO(flags);
+
 static struct attribute *acpi_nfit_dimm_attributes[] = {
 	&dev_attr_handle.attr,
 	&dev_attr_phys_id.attr,
@@ -672,6 +690,7 @@ static struct attribute *acpi_nfit_dimm_attributes[] = {
 	&dev_attr_format.attr,
 	&dev_attr_serial.attr,
 	&dev_attr_rev_id.attr,
+	&dev_attr_flags.attr,
 	NULL,
 };
 
@@ -764,6 +783,7 @@ static int acpi_nfit_register_dimms(struct acpi_nfit_desc *acpi_desc)
 		struct nvdimm *nvdimm;
 		unsigned long flags = 0;
 		u32 device_handle;
+		u16 mem_flags;
 		int rc;
 
 		device_handle = __to_nfit_memdev(nfit_mem)->device_handle;
@@ -781,6 +801,10 @@ static int acpi_nfit_register_dimms(struct acpi_nfit_desc *acpi_desc)
 		if (nfit_mem->bdw && nfit_mem->memdev_pmem)
 			flags |= NDD_ALIASING;
 
+		mem_flags = __to_nfit_memdev(nfit_mem)->flags;
+		if ((mem_flags & ACPI_NFIT_MEM_ARMED) && !force_rw)
+			flags |= NDD_UNARMED;
+
 		rc = acpi_nfit_add_dimm(acpi_desc, nfit_mem, device_handle);
 		if (rc)
 			continue;
@@ -793,6 +817,17 @@ static int acpi_nfit_register_dimms(struct acpi_nfit_desc *acpi_desc)
 
 		nfit_mem->nvdimm = nvdimm;
 		dimm_count++;
+
+		if ((mem_flags & ACPI_NFIT_MEM_FAILED_MASK) == 0)
+			continue;
+
+		dev_info(acpi_desc->dev, "%s: failed: %s%s%s%s\n",
+				nvdimm_name(nvdimm),
+			mem_flags & ACPI_NFIT_MEM_SAVE_FAILED ? "save " : "",
+			mem_flags & ACPI_NFIT_MEM_RESTORE_FAILED ? "restore " : "",
+			mem_flags & ACPI_NFIT_MEM_FLUSH_FAILED ? "flush " : "",
+			mem_flags & ACPI_NFIT_MEM_ARMED ? "arm " : "");
+
 	}
 
 	return nvdimm_bus_check_dimm_count(acpi_desc->nvdimm_bus, dimm_count);
diff --git a/drivers/acpi/nfit.h b/drivers/acpi/nfit.h
index c62fffea8423..81f2e8c5a79c 100644
--- a/drivers/acpi/nfit.h
+++ b/drivers/acpi/nfit.h
@@ -22,6 +22,9 @@
 
 #define UUID_NFIT_BUS "2f10e7a4-9e91-11e4-89d3-123b93f75cba"
 #define UUID_NFIT_DIMM "4309ac30-0d11-11e4-9191-0800200c9a66"
+#define ACPI_NFIT_MEM_FAILED_MASK (ACPI_NFIT_MEM_SAVE_FAILED \
+		| ACPI_NFIT_MEM_RESTORE_FAILED | ACPI_NFIT_MEM_FLUSH_FAILED \
+		| ACPI_NFIT_MEM_ARMED)
 
 enum nfit_uuids {
 	NFIT_SPA_VOLATILE,
diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
index adacc27f04f1..0b359cb8aa9f 100644
--- a/drivers/nvdimm/blk.c
+++ b/drivers/nvdimm/blk.c
@@ -243,6 +243,7 @@ static const struct block_device_operations nd_blk_fops = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl = nvdimm_bdev_compat_ioctl,
 #endif
+	.revalidate_disk = nvdimm_revalidate_disk,
 };
 
 static int nd_blk_probe(struct device *dev)
diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index 47260ca573e0..67525f995859 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -243,6 +243,32 @@ int __nd_driver_register(struct nd_device_driver *nd_drv, struct module *owner,
 }
 EXPORT_SYMBOL(__nd_driver_register);
 
+int nvdimm_revalidate_disk(struct gendisk *disk)
+{
+	int i;
+	struct device *dev = disk->driverfs_dev;
+	struct nd_region *nd_region = walk_to_nd_region(dev);
+
+	if (!nd_region)
+		return 0;
+
+	for (i = 0; i < nd_region->ndr_mappings; i++) {
+		struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+		struct nvdimm *nvdimm = nd_mapping->nvdimm;
+
+		if ((nvdimm->flags & NDD_UNARMED) && !get_disk_ro(disk)) {
+			dev_dbg(dev, "%s: unarmed, marking disk %s ro\n",
+					dev_name(&nvdimm->dev),
+					dev_name(disk_to_dev(disk)));
+			set_disk_ro(disk, 1);
+			break;
+		}
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(nvdimm_revalidate_disk);
+
 int nvdimm_bus_add_integrity_disk(struct gendisk *disk, u32 lbasize,
 		sector_t size)
 {
@@ -265,6 +291,7 @@ int nvdimm_bus_add_integrity_disk(struct gendisk *disk, u32 lbasize,
 	rc = nd_integrity_init(disk, lbasize);
 	if (size)
 		set_capacity(disk, size);
+	revalidate_disk(disk);
 	nd_btt_add_disk(nvdimm_bus, disk);
 	nvdimm_bus_unlock(&nvdimm_bus->dev);
 
diff --git a/drivers/nvdimm/nd-core.h b/drivers/nvdimm/nd-core.h
index ba548d248b4e..0cac05aca8f5 100644
--- a/drivers/nvdimm/nd-core.h
+++ b/drivers/nvdimm/nd-core.h
@@ -110,6 +110,7 @@ bool nd_is_uuid_unique(struct device *dev, u8 *uuid);
 struct nd_region;
 struct nvdimm_drvdata;
 struct nd_mapping;
+struct nd_region *walk_to_nd_region(struct device *nd_dev);
 resource_size_t nd_pmem_available_dpa(struct nd_region *nd_region,
 		struct nd_mapping *nd_mapping, resource_size_t *overlap);
 resource_size_t nd_blk_available_dpa(struct nd_mapping *nd_mapping);
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index 2786eb8456ec..011d7c51b5da 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -168,6 +168,7 @@ int nvdimm_bdev_ioctl(struct block_device *bdev, fmode_t mode,
 		unsigned int cmd, unsigned long arg);
 int nvdimm_bdev_compat_ioctl(struct block_device *bdev, fmode_t mode,
 		unsigned int cmd, unsigned long arg);
+int nvdimm_revalidate_disk(struct gendisk *disk);
 void nvdimm_drvdata_release(struct kref *kref);
 void put_ndd(struct nvdimm_drvdata *ndd);
 int nd_label_reserve_dpa(struct nvdimm_drvdata *ndd);
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 96964419b72d..b69278424dff 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -135,6 +135,7 @@ static const struct block_device_operations pmem_fops = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl =		nvdimm_bdev_ioctl,
 #endif
+	.revalidate_disk =	nvdimm_revalidate_disk,
 };
 
 static struct pmem_device *pmem_alloc(struct device *dev,
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index b16ec20dbba2..bb9f329c3b9f 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -83,6 +83,19 @@ struct nd_blk_region *to_nd_blk_region(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(to_nd_blk_region);
 
+struct nd_region *walk_to_nd_region(struct device *nd_dev)
+{
+	struct device *dev;
+
+	for (dev = nd_dev; dev; dev = dev->parent)
+		if (dev->type->release == nd_region_release)
+			break;
+	dev_WARN_ONCE(nd_dev, !dev, "invalid dev, not an nd_region descendant\n");
+	if (dev)
+		return to_nd_region(dev);
+	return NULL;
+}
+
 void *nd_region_provider_data(struct nd_region *nd_region)
 {
 	return nd_region->provider_data;
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index 7fc1b25bdb5d..dc799a29ed1a 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -21,6 +21,8 @@
 enum {
 	/* when a dimm supports both PMEM and BLK access a label is required */
 	NDD_ALIASING = 1 << 0,
+	/* unarmed memory devices may not persist writes */
+	NDD_UNARMED = 1 << 1,
 
 	/* need to set a limit somewhere, but yes, this is likely overkill */
 	ND_IOCTL_MAX_BUFLEN = SZ_4M,
diff --git a/tools/testing/nvdimm/test/nfit.c b/tools/testing/nvdimm/test/nfit.c
index 416f8fbf9881..c57ecb4e421d 100644
--- a/tools/testing/nvdimm/test/nfit.c
+++ b/tools/testing/nvdimm/test/nfit.c
@@ -873,6 +873,9 @@ static void nfit_test1_setup(struct nfit_test *t)
 	memdev->address = 0;
 	memdev->interleave_index = 0;
 	memdev->interleave_ways = 1;
+	memdev->flags = ACPI_NFIT_MEM_SAVE_FAILED | ACPI_NFIT_MEM_RESTORE_FAILED
+		| ACPI_NFIT_MEM_FLUSH_FAILED | ACPI_NFIT_MEM_HEALTH_OBSERVED
+		| ACPI_NFIT_MEM_ARMED;
 
 	offset += sizeof(*memdev);
 	/* dcr-descriptor0 */
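
For readers following the policy rather than the diff, the flag decoding
and the read-only decision can be modelled in user space.  This is a
minimal sketch, not the driver code: the DEMO_* names are hypothetical
stand-ins for the ACPI_NFIT_MEM_* bits, their numeric values are assumed,
and the registration-time force_rw check and the later revalidate pass
are folded into a single function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the ACPI_NFIT_MEM_* bits; values are assumed. */
enum {
	DEMO_MEM_SAVE_FAILED     = 1 << 0,
	DEMO_MEM_RESTORE_FAILED  = 1 << 1,
	DEMO_MEM_FLUSH_FAILED    = 1 << 2,
	DEMO_MEM_ARMED           = 1 << 3, /* set when the dimm failed to arm */
	DEMO_MEM_HEALTH_OBSERVED = 1 << 4,
};

/*
 * Build the same space-separated keyword list that flags_show() emits,
 * minus the trailing newline.  Returns the snprintf() result.
 */
static int demo_flags_str(unsigned int flags, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%s%s%s%s%s",
			flags & DEMO_MEM_SAVE_FAILED ? "save " : "",
			flags & DEMO_MEM_RESTORE_FAILED ? "restore " : "",
			flags & DEMO_MEM_FLUSH_FAILED ? "flush " : "",
			flags & DEMO_MEM_ARMED ? "arm " : "",
			flags & DEMO_MEM_HEALTH_OBSERVED ? "smart " : "");
}

/*
 * A disk is marked read-only when any dimm backing its region reports
 * the "failed to arm" bit, unless the administrator set force_rw.
 */
static bool demo_region_read_only(const unsigned int *dimm_flags, int n,
				  bool force_rw)
{
	int i;

	if (force_rw)
		return false;
	for (i = 0; i < n; i++)
		if (dimm_flags[i] & DEMO_MEM_ARMED)
			return true;
	return false;
}
```

Calling demo_region_read_only() with the per-dimm flags of a region and
force_rw set to false corresponds to the default policy; passing true
corresponds to loading the module with force_rw enabled.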


^ permalink raw reply related	[flat|nested] 164+ messages in thread

* Re: [PATCH 01/15] block: introduce an ->rw_bytes() block device operation
  2015-06-17 23:54   ` Dan Williams
@ 2015-06-18 19:25     ` Dan Williams
  -1 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-18 19:25 UTC (permalink / raw)
  To: Jens Axboe, linux-nvdimm
  Cc: Ingo Molnar, linux-kernel, Andy Lutomirski, Jens Axboe,
	Linux ACPI, H. Peter Anvin, linux-fsdevel, Christoph Hellwig

On Wed, Jun 17, 2015 at 4:54 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> Why do we need a new method in block_device_operations?
>
> The capacities of persistent memory make it too large to map as RAM (no
> struct page coverage by default), so Linux arranges for it to be managed
> as a block device. The bio interface to a block device enforces sector
> at a time i/o and has infrastructure for asynchronous completion of i/o.
> The ->rw_page() interface is closely tied to the page cache and also
> carries asynchronous i/o completion assumptions.  NVDIMM devices are
> fast enough to complete i/o's synchronously (memcpy) and some in kernel
> applications can take advantage of the byte-aligned (as opposed to
> sector-aligned) nature of the media.  The ->rw_bytes() operation is
> added to fill this role that does not fit into any existing access
> method.
>
> It could be argued that a ->rw_bytes() method makes a struct
> block_device not a *block* device.  However, the applications for
> persistent memory as storage devices make them more "block" devices
> than "character" devices.
>
> The first consumer of the ->rw_bytes() capability is a stacked
> block_device driver (BTT - block translation table) that adds atomic
> sector update semantics on top of an nvdimm storage device.
>
> Why enable drivers like BTT on top of a new globally visible
> block_device_operations op rather than an internal detail of nvdimm
> drivers?
>
> 1/ We want ->rw_bytes() consumers to be enabled on either a per-disk or
> per-partition basis.  Consider the case of enabling DAX+XFS on a single
> persistent memory disk whereby the metadata needs atomic sector update
> guarantees, but the data would like to be DAX capable.  Solution is to
> create two partitions and enable BTT on the "metadata/XFS-logdev"
> partition.
>
> 2/ We want this configuration topology to be visible to the sysfs device
> model, and not an internal detail of nvdimm drivers requiring special
> tooling.  For example if you ever wanted to "fsck" BTT metadata that
> could be carried out on the raw nvdimm device directly rather than
> require custom tooling / mechanisms to access the raw media.
>
> 3/ It becomes trivial to add new BTT-like drivers without touching the
> nvdimm drivers to add is_btt_mode(), is_foo_mode(), etc... checks in the
> fast path.
>

Acked-by: Jens Axboe <axboe@fb.com> ...off list.

Christoph, I assume that satisfies your primary concern with the BTT
infrastructure and implementation?
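
To make the quoted rationale concrete, here is a rough user-space model
of what a byte-granular, synchronous access method looks like next to a
sector-based one.  It is a sketch under stated assumptions: struct
demo_pmem, demo_rw_bytes() and the argument order are hypothetical and
are not the signature introduced by the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

enum demo_rw { DEMO_READ, DEMO_WRITE };

/* Hypothetical model of a byte-addressable device; not the kernel's types. */
struct demo_pmem {
	unsigned char *media;	/* stands in for the mapped persistent memory */
	size_t size;		/* capacity in bytes */
};

/*
 * Byte-granular, synchronous transfer: no sector alignment is required
 * and the copy has completed by the time the call returns.
 */
static int demo_rw_bytes(struct demo_pmem *dev, void *buf, size_t n,
			 size_t offset, enum demo_rw rw)
{
	assert(dev != NULL && buf != NULL);

	/* Reject any range that runs past the end of the media. */
	if (offset > dev->size || n > dev->size - offset)
		return -EIO;

	if (rw == DEMO_READ)
		memcpy(buf, dev->media + offset, n);
	else
		memcpy(dev->media + offset, buf, n);
	return 0;
}
```

A stacked consumer in the style of BTT would call such a hook to update a
few bytes of metadata at an arbitrary offset, which neither a bio nor
->rw_page() expresses directly.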

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 14/15] libnvdimm: support read-only btt backing devices
  2015-06-17 23:56   ` Dan Williams
@ 2015-06-18 22:55     ` Vishal Verma
  -1 siblings, 0 replies; 164+ messages in thread
From: Vishal Verma @ 2015-06-18 22:55 UTC (permalink / raw)
  To: Dan Williams
  Cc: axboe, linux-nvdimm, linux-kernel, mingo, linux-acpi, linux-fsdevel, hch

On Wed, 2015-06-17 at 19:56 -0400, Dan Williams wrote:
> Upon detection of a read-only backing device arrange for the btt
> device to be read-only.  Implement a catch for the BLKROSET ioctl and
> only allow a btt-instance to become read-write when the backing-device
> becomes read-write.  Conversely, if a backing-device becomes read-only
> arrange for its parent btt to be marked read-only.  Synchronize these
> changes under the bus lock.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/nvdimm/blk.c      |    4 +++
>  drivers/nvdimm/btt.c      |   34 ++++++++++++++++++++++++++--
>  drivers/nvdimm/btt_devs.c |   42 ++++++++++++++++++++++++++++++++++
>  drivers/nvdimm/bus.c      |   55 +++++++++++++++++++++++++++++++++++++++++++++
>  drivers/nvdimm/nd-core.h  |   14 +++++++++++
>  drivers/nvdimm/nd.h       |    4 +++
>  drivers/nvdimm/pmem.c     |    4 +++
>  7 files changed, 154 insertions(+), 3 deletions(-)

Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>



^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 12/15] libnvdimm: enable iostat
  2015-06-17 23:55   ` Dan Williams
@ 2015-06-19  8:34     ` Christoph Hellwig
  -1 siblings, 0 replies; 164+ messages in thread
From: Christoph Hellwig @ 2015-06-19  8:34 UTC (permalink / raw)
  To: Dan Williams
  Cc: axboe, linux-nvdimm, boaz, toshi.kani, Vishal Verma,
	linux-kernel, hch, linux-acpi, linux-fsdevel, mingo

On Wed, Jun 17, 2015 at 07:55:51PM -0400, Dan Williams wrote:
> This is disabled by default as the overhead is prohibitive, but if the
> user takes the action to turn it on we'll oblige.

If you care about users a compile time selection doesn't make sense,
why not always build it but require an opt-in?

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 12/15] libnvdimm: enable iostat
  2015-06-19  8:34     ` Christoph Hellwig
@ 2015-06-19  9:02       ` Dan Williams
  -1 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-19  9:02 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-nvdimm, Boaz Harrosh, Kani, Toshimitsu,
	Vishal Verma, linux-kernel, Linux ACPI, linux-fsdevel,
	Ingo Molnar

On Fri, Jun 19, 2015 at 1:34 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Wed, Jun 17, 2015 at 07:55:51PM -0400, Dan Williams wrote:
>> This is disabled by default as the overhead is prohibitive, but if the
>> user takes the action to turn it on we'll oblige.
>
> If you care about users a compile time selection doesn't make sense,
> why not always build it but require an opt-in?

It's always built; the Kconfig just selects the default initial state
of QUEUE_FLAG_IO_STAT.  They can always turn it on/off via block queue
sysfs.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 03/15] nd_btt: atomic sector updates
  2015-06-17 23:55   ` Dan Williams
  (?)
@ 2015-06-21 10:03     ` Christoph Hellwig
  -1 siblings, 0 replies; 164+ messages in thread
From: Christoph Hellwig @ 2015-06-21 10:03 UTC (permalink / raw)
  To: Dan Williams
  Cc: axboe, linux-nvdimm, boaz, toshi.kani, Vishal Verma, Neil Brown,
	Greg KH, Dave Chinner, linux-kernel, Andy Lutomirski, Jens Axboe,
	linux-acpi, Jeff Moyer, H. Peter Anvin, linux-fsdevel, hch,
	mingo

> +config ND_MAX_REGIONS
> +	int "Maximum number of regions supported by the sub-system"
> +	default 64
> +	---help---
> +	  A 'region' corresponds to an individual DIMM or an interleave
> +	  set of DIMMs.  A typical maximally configured system may have
> +	  up to 32 DIMMs.
> +
> +	  Leave the default of 64 if you are unsure.

Having static limits in Kconfig is a bad idea.  What prevents you
from handling any (reasonable) number dynamically?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 04/15] libnvdimm, nfit, nd_blk: driver for BLK-mode access persistent memory
  2015-06-17 23:55   ` Dan Williams
  (?)
@ 2015-06-21 10:05     ` Christoph Hellwig
  -1 siblings, 0 replies; 164+ messages in thread
From: Christoph Hellwig @ 2015-06-21 10:05 UTC (permalink / raw)
  To: Dan Williams
  Cc: axboe, linux-nvdimm, boaz, toshi.kani, Rafael J. Wysocki,
	linux-kernel, Andy Lutomirski, Jens Axboe, linux-acpi,
	H. Peter Anvin, linux-fsdevel, Ross Zwisler, hch, mingo

> +#include <asm-generic/io-64-nonatomic-hi-lo.h>

As mentioned last time only arch asm/ headers may include asm-generic
headers.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 10/15] libnvdimm: fix up max_hw_sectors
  2015-06-17 23:55   ` Dan Williams
  (?)
@ 2015-06-21 10:08     ` Christoph Hellwig
  -1 siblings, 0 replies; 164+ messages in thread
From: Christoph Hellwig @ 2015-06-21 10:08 UTC (permalink / raw)
  To: Dan Williams
  Cc: axboe, linux-nvdimm, boaz, toshi.kani, Vishal Verma,
	linux-kernel, hch, linux-acpi, linux-fsdevel, mingo

> +void nd_blk_queue_init(struct request_queue *q)
> +{
> +	blk_queue_max_hw_sectors(q, UINT_MAX);
> +	blk_queue_bounce_limit(q, BLK_BOUNCE_ANY);
> +}

Please just add the calls to the drivers instead of this helper which
hides the intent.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 11/15] libnvdimm: pmem, blk, and btt make_request cleanups
  2015-06-17 23:55   ` Dan Williams
@ 2015-06-21 10:10     ` Christoph Hellwig
  -1 siblings, 0 replies; 164+ messages in thread
From: Christoph Hellwig @ 2015-06-21 10:10 UTC (permalink / raw)
  To: Dan Williams
  Cc: axboe, linux-nvdimm, boaz, toshi.kani, Vishal Verma,
	linux-kernel, hch, linux-acpi, linux-fsdevel, mingo

One patch per driver please.

> diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
> index 8a6345797a71..9d609ef95266 100644
> --- a/drivers/nvdimm/blk.c
> +++ b/drivers/nvdimm/blk.c
> @@ -170,18 +170,12 @@ static void nd_blk_make_request(struct request_queue *q, struct bio *bio)
>  	struct bvec_iter iter;
>  	struct bio_vec bvec;
>  	int err = 0, rw;
> -	sector_t sector;
>  
> -	sector = bio->bi_iter.bi_sector;
> -	if (bio_end_sector(bio) > get_capacity(disk)) {
> +	if (unlikely(bio_end_sector(bio) > get_capacity(disk))) {
>  		err = -EIO;
>  		goto out;
>  	}
>  
> -	BUG_ON(bio->bi_rw & REQ_DISCARD);

If you remove the DISCARD check you can kill the max sectors one
as well, given that generic_make_request_checks() takes care of it.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 12/15] libnvdimm: enable iostat
  2015-06-19  9:02       ` Dan Williams
@ 2015-06-21 10:11         ` Christoph Hellwig
  -1 siblings, 0 replies; 164+ messages in thread
From: Christoph Hellwig @ 2015-06-21 10:11 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Jens Axboe, linux-nvdimm, Boaz Harrosh, Kani,
	Toshimitsu, Vishal Verma, linux-kernel, Linux ACPI,
	linux-fsdevel, Ingo Molnar

On Fri, Jun 19, 2015 at 02:02:39AM -0700, Dan Williams wrote:
> On Fri, Jun 19, 2015 at 1:34 AM, Christoph Hellwig <hch@lst.de> wrote:
> > On Wed, Jun 17, 2015 at 07:55:51PM -0400, Dan Williams wrote:
> >> This is disabled by default as the overhead is prohibitive, but if the
> >> user takes the action to turn it on we'll oblige.
> >
> > If you care about users a compile time selection doesn't make sense,
> > why not always build it but require an opt-in?
> 
> It's always built; the Kconfig just selects the default initial state
> of QUEUE_FLAG_IO_STAT.  They can always turn it on/off via block queue
> sysfs.

Oh, missed that.  Just drop the Kconfig option in that case.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 14/15] libnvdimm: support read-only btt backing devices
  2015-06-17 23:56   ` Dan Williams
@ 2015-06-21 10:13     ` Christoph Hellwig
  -1 siblings, 0 replies; 164+ messages in thread
From: Christoph Hellwig @ 2015-06-21 10:13 UTC (permalink / raw)
  To: Dan Williams
  Cc: axboe, linux-nvdimm, boaz, toshi.kani, linux-kernel, hch,
	linux-acpi, linux-fsdevel, mingo

On Wed, Jun 17, 2015 at 07:56:02PM -0400, Dan Williams wrote:
> Upon detection of a read-only backing device arrange for the btt
> device to be read only.  Implement a catch for the BLKROSET ioctl and
> only allow a btt-instance to become read-write when the backing-device
> becomes read-write.  Conversely, if a backing-device becomes read-only
> arrange for its parent btt to be marked read-only.  Synchronize these
> changes under the bus lock.

Eww.  I have to say the deeper I look into this code the more I hate
the stacking nature of btt.  It seems more and more we should never
attach pmem if we want to use a device with btt. 

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 14/15] libnvdimm: support read-only btt backing devices
  2015-06-21 10:13     ` Christoph Hellwig
  (?)
@ 2015-06-21 13:21       ` Dan Williams
  -1 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-21 13:21 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-nvdimm, Boaz Harrosh, Kani, Toshimitsu,
	linux-kernel, Linux ACPI, linux-fsdevel, Ingo Molnar

On Sun, Jun 21, 2015 at 3:13 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Wed, Jun 17, 2015 at 07:56:02PM -0400, Dan Williams wrote:
>> Upon detection of a read-only backing device arrange for the btt
>> device to be read only.  Implement a catch for the BLKROSET ioctl and
>> only allow a btt-instance to become read-write when the backing-device
>> becomes read-write.  Conversely, if a backing-device becomes read-only
>> arrange for its parent btt to be marked read-only.  Synchronize these
>> changes under the bus lock.
>
> Eww.  I have to say the deeper I look into this code the more I hate
> the stacking nature of btt.  It seems more and more we should never
> attach pmem if we want to use a device with btt.

This question has come up before.  Making btt an internal property of
a device makes some things cleaner and others more messy.  We lose the
ability to place a btt instance on top of a partition, rather than a
whole disk.  If we ever need to access the raw device we no longer
have a direct block device to reference.  Linux has been doing stacked
configurations to change the personality of block devices since
forever (md, dm, bcache...), why invent something new to handle the
btt-personality of ->rw_bytes() devices?

BTT precludes DAX.  If you want both modes on one pmem disk, placing BTT
on a partition of the disk for fs metadata and DAX-capable data on the
rest is our proposed solution.  We chose this architecture after a
conversation with Dave Chinner about XFS's need to have atomic sector
guarantees for its metadata and wanting to simultaneously enable
XFS-DAX.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 12/15] libnvdimm: enable iostat
  2015-06-21 10:11         ` Christoph Hellwig
  (?)
@ 2015-06-21 13:22           ` Dan Williams
  -1 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-21 13:22 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-nvdimm, Boaz Harrosh, Kani, Toshimitsu,
	Vishal Verma, linux-kernel, Linux ACPI, linux-fsdevel,
	Ingo Molnar

On Sun, Jun 21, 2015 at 3:11 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Fri, Jun 19, 2015 at 02:02:39AM -0700, Dan Williams wrote:
>> On Fri, Jun 19, 2015 at 1:34 AM, Christoph Hellwig <hch@lst.de> wrote:
>> > On Wed, Jun 17, 2015 at 07:55:51PM -0400, Dan Williams wrote:
>> >> This is disabled by default as the overhead is prohibitive, but if the
>> >> user takes the action to turn it on we'll oblige.
>> >
>> > If you care about users a compile time selection doesn't make sense,
>> > why not always build it but require an opt-in?
>>
>> It's always built; the Kconfig just selects the default initial state
>> of QUEUE_FLAG_IO_STAT.  They can always turn it on/off via block queue
>> sysfs.
>
> Oh, missed that.  Just drop the Kconfig option in that case.

Ok.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 11/15] libnvdimm: pmem, blk, and btt make_request cleanups
  2015-06-21 10:10     ` Christoph Hellwig
  (?)
@ 2015-06-21 13:26       ` Dan Williams
  -1 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-21 13:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-nvdimm, Boaz Harrosh, Kani, Toshimitsu,
	Vishal Verma, linux-kernel, Linux ACPI, linux-fsdevel,
	Ingo Molnar

On Sun, Jun 21, 2015 at 3:10 AM, Christoph Hellwig <hch@lst.de> wrote:
> One patch per driver please.
>
>> diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
>> index 8a6345797a71..9d609ef95266 100644
>> --- a/drivers/nvdimm/blk.c
>> +++ b/drivers/nvdimm/blk.c
>> @@ -170,18 +170,12 @@ static void nd_blk_make_request(struct request_queue *q, struct bio *bio)
>>       struct bvec_iter iter;
>>       struct bio_vec bvec;
>>       int err = 0, rw;
>> -     sector_t sector;
>>
>> -     sector = bio->bi_iter.bi_sector;
>> -     if (bio_end_sector(bio) > get_capacity(disk)) {
>> +     if (unlikely(bio_end_sector(bio) > get_capacity(disk))) {
>>               err = -EIO;
>>               goto out;
>>       }
>>
>> -     BUG_ON(bio->bi_rw & REQ_DISCARD);
>
> If you remove the DISCARD check you can kill the max sectors one
> as well, given that generic_make_request_checks() takes care of it.

Ah, true, will add that with the split.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 10/15] libnvdimm: fix up max_hw_sectors
  2015-06-21 10:08     ` Christoph Hellwig
@ 2015-06-21 13:28       ` Dan Williams
  -1 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-21 13:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-nvdimm, Boaz Harrosh, Kani, Toshimitsu,
	Vishal Verma, linux-kernel, Linux ACPI, linux-fsdevel,
	Ingo Molnar

On Sun, Jun 21, 2015 at 3:08 AM, Christoph Hellwig <hch@lst.de> wrote:
>> +void nd_blk_queue_init(struct request_queue *q)
>> +{
>> +     blk_queue_max_hw_sectors(q, UINT_MAX);
>> +     blk_queue_bounce_limit(q, BLK_BOUNCE_ANY);
>> +}
>
> Please just add the calls to the drivers instead of this helper which
> hides the intent.

I thought it made it clearer what properties are shared between block
devices on an nvdimm bus, but if your initial reaction is that it
hides intent I'll kill the helper.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 04/15] libnvdimm, nfit, nd_blk: driver for BLK-mode access persistent memory
  2015-06-21 10:05     ` Christoph Hellwig
@ 2015-06-21 13:31       ` Dan Williams
  -1 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-21 13:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-nvdimm, Boaz Harrosh, Kani, Toshimitsu,
	Rafael J. Wysocki, linux-kernel, Andy Lutomirski, Jens Axboe,
	Linux ACPI, H. Peter Anvin, linux-fsdevel, Ross Zwisler,
	Ingo Molnar

On Sun, Jun 21, 2015 at 3:05 AM, Christoph Hellwig <hch@lst.de> wrote:
>> +#include <asm-generic/io-64-nonatomic-hi-lo.h>
>
> As mentioned last time only arch asm/ headers may include asm-generic
> headers.

No, not in this case, there's no other way to define readq()/writeq()
on 32-bit builds.  See:

drivers/block/nvme-core.c:43:#include <asm-generic/io-64-nonatomic-lo-hi.h>
drivers/edac/i3200_edac.c:18:#include <asm-generic/io-64-nonatomic-lo-hi.h>
drivers/edac/ie31200_edac.c:42:#include <asm-generic/io-64-nonatomic-lo-hi.h>
drivers/edac/x38_edac.c:18:#include <asm-generic/io-64-nonatomic-lo-hi.h>
drivers/i2c/busses/i2c-ismt.c:70:#include <asm-generic/io-64-nonatomic-lo-hi.h>
drivers/net/ethernet/intel/i40e/i40e_osdep.h:38:#include <asm-generic/io-64-nonatomic-lo-hi.h>
drivers/net/ethernet/intel/i40evf/i40e_osdep.h:37:#include <asm-generic/io-64-nonatomic-lo-hi.h>
drivers/net/ethernet/rocker/rocker.c:39:#include <asm-generic/io-64-nonatomic-lo-hi.h>
drivers/platform/x86/ibm_rtl.c:36:#include <asm-generic/io-64-nonatomic-lo-hi.h>
drivers/platform/x86/intel_ips.c:81:#include <asm-generic/io-64-nonatomic-lo-hi.h>
drivers/scsi/qla4xxx/ql4_nx.c:15:#include <asm-generic/io-64-nonatomic-lo-hi.h>

The only other option is to open code multiple readl() + writel() calls.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 14/15] libnvdimm: support read-only btt backing devices
  2015-06-21 13:21       ` Dan Williams
  (?)
@ 2015-06-21 13:54         ` Christoph Hellwig
  -1 siblings, 0 replies; 164+ messages in thread
From: Christoph Hellwig @ 2015-06-21 13:54 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Jens Axboe, linux-nvdimm, Boaz Harrosh, Kani,
	Toshimitsu, linux-kernel, Linux ACPI, linux-fsdevel, Ingo Molnar

On Sun, Jun 21, 2015 at 06:21:50AM -0700, Dan Williams wrote:
> This question has come up before.  Making btt an internal property of
> a device makes some things cleaner and others more messy.  We lose the
> ability to place a btt instance on top of a partition, rather than a
> whole disk.

I thought the addition of nfit labels avoids the need for a partition
table now?

> If we ever need to access the raw device we no longer
> have a direct block device to reference.  Linux has been doing stacked
> configurations to change the personality of block devices since
> forever (md, dm, bcache...), why invent something new to handle the
> btt-personality of ->rw_bytes() devices?

Because the underlying abstraction really isn't a block device
anymore, it's a byte addressable device.  This is more similar to
for example how the mtd subsystem is structured.

> BTT precludes DAX.  If you want both modes on one pmem disk, placing BTT
> on a partition of the disk for fs metadata and DAX-capable data on the
> rest is our proposed solution.  We chose this architecture after a
> conversation with Dave Chinner about XFS's need to have atomic sector
> guarantees for its metadata and wanting to simultaneously enable
> XFS-DAX.

I can't see why a v5 XFS filesystem with CRCs on all metadata would need
sector atomic updates any more.  But even in a case where it would it
seems like whatever label you use for partitioning should sit above the
block layer.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 04/15] libnvdimm, nfit, nd_blk: driver for BLK-mode access persistent memory
  2015-06-21 13:31       ` Dan Williams
  (?)
@ 2015-06-21 13:56         ` Christoph Hellwig
  -1 siblings, 0 replies; 164+ messages in thread
From: Christoph Hellwig @ 2015-06-21 13:56 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Jens Axboe, linux-nvdimm, Boaz Harrosh, Kani,
	Toshimitsu, Rafael J. Wysocki, linux-kernel, Andy Lutomirski,
	Jens Axboe, Linux ACPI, H. Peter Anvin, linux-fsdevel,
	Ross Zwisler, Ingo Molnar

On Sun, Jun 21, 2015 at 06:31:38AM -0700, Dan Williams wrote:
> > As mentioned last time only arch asm/ headers may include asm-generic
> > headers.
> 
> No, not in this case, there's no other way to define readq()/writeq()
> on 32-bit builds.  See:

Oh my god.  I think we're both right: no driver should use asm-generic,
but because someone totally messed this abstraction up you have no choice.

We really should have a linux/*.h header for these that just does the
right thing.

Btw, what's the reason you're using the hi-lo ordering unlike everyone
else?  IMHO that should be an arch and not a driver choice.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 04/15] libnvdimm, nfit, nd_blk: driver for BLK-mode access persistent memory
  2015-06-21 13:56         ` Christoph Hellwig
  (?)
@ 2015-06-21 14:39           ` Dan Williams
  -1 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-21 14:39 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-nvdimm, Boaz Harrosh, Kani, Toshimitsu,
	Rafael J. Wysocki, linux-kernel, Andy Lutomirski, Jens Axboe,
	Linux ACPI, H. Peter Anvin, linux-fsdevel, Ross Zwisler,
	Ingo Molnar

On Sun, Jun 21, 2015 at 6:56 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Sun, Jun 21, 2015 at 06:31:38AM -0700, Dan Williams wrote:
>> > As mentioned last time only arch asm/ headers may include asm-generic
>> > headers.
>>
>> No, not in this case, there's no other way to define readq()/writeq()
>> on 32-bit builds.  See:
>
> Oh my god.  I think we're both right: no driver should use asm-generic,
> but because someone totally messed this abstraction up you have no choice.
>
> We really should have a linux/*.h header for these that just does the
> right thing.
>
> Btw, what's the reason you're using the hi-lo ordering, unlike everyone
> else?  IMHO that should be an arch and not a driver choice.

If the hardware latches the register on writing the hi or lo bits
first then it matters, otherwise it's arbitrary like it is in this
case.  It's hard to have an arch default because different devices
care about different orderings, so it must be a driver choice afaics.
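
To illustrate the latching point, here is a minimal userspace sketch.  It
assumes a hypothetical device that acts on the full 64-bit value as soon as
the low half is stored; struct fake_dev and the fake_write_*() helpers are
invented for this example and are not the kernel's readq()/writeq()
helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical device model: a store to the low half latches the register. */
struct fake_dev {
	uint32_t hi_stage;	/* last value stored to the high half */
	uint64_t latched;	/* 64-bit value the device acted on */
};

static void fake_write_hi(struct fake_dev *d, uint32_t v)
{
	assert(d);
	d->hi_stage = v;
}

static void fake_write_lo(struct fake_dev *d, uint32_t v)
{
	assert(d);
	/* Assumption: the low-half store commits hi_stage:v as one value. */
	d->latched = ((uint64_t)d->hi_stage << 32) | v;
}

/* Split a 64-bit store into two 32-bit stores, high half first. */
static void write64_hi_lo(struct fake_dev *d, uint64_t val)
{
	fake_write_hi(d, (uint32_t)(val >> 32));
	fake_write_lo(d, (uint32_t)val);
}

/* Same split, low half first. */
static void write64_lo_hi(struct fake_dev *d, uint64_t val)
{
	fake_write_lo(d, (uint32_t)val);
	fake_write_hi(d, (uint32_t)(val >> 32));
}
```

On this made-up device only the hi-lo order latches the intended value; a
device that latches on the high half would need the opposite order, and a
device that does not latch at all would not care.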

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 14/15] libnvdimm: support read-only btt backing devices
  2015-06-21 13:54         ` Christoph Hellwig
@ 2015-06-21 15:11           ` Dan Williams
  -1 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-21 15:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-nvdimm, Boaz Harrosh, Kani, Toshimitsu,
	linux-kernel, Linux ACPI, linux-fsdevel, Ingo Molnar

On Sun, Jun 21, 2015 at 6:54 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Sun, Jun 21, 2015 at 06:21:50AM -0700, Dan Williams wrote:
>> This question has come up before.  Making btt an internal property of
>> a device makes some things cleaner and others more messy.  We lose the
>> ability to place a btt instance on top of a partition, rather than a
>> whole disk.
>
> I thought the addition of nfit labels avoids the need for a partition
> table now?
>

The labels only allow allocation of persistent media between pmem and
blk.  For a given dimm you may access in either mode and the label
records the decision.  We can have a btt on either the pmem or
blk-mode disk type, or partition thereof.

>> If we ever need to access the raw device we no longer
>> have a direct block device to reference.  Linux has been doing stacked
>> configurations to change the personality of block devices since
>> forever (md, dm, bcache...), why invent something new to handle the
>> btt-personality of ->rw_bytes() devices?
>
> Because the underlying abstraction really isn't a block device
> anymore, it's a byte addressable device.  This is more similar to
> for example how the mtd subsystem is structured.

Yes, it's this hybrid thing that mostly fits into the existing block
device model save for two new block_device_operations
->direct_access() and ->rw_bytes().  We then use the property of a
block_device that allows it to be claimed for exclusive ownership by a
filesystem or another block_device to layer storage semantics on top
be it files+directories, raid, caching, or atomic sectors.  NVDIMM
devices don't present the same complexity as MTD devices.  The only
complexity they present is byte-address-ability, not erase-block-size,
wear-leveling, etc...

>> BTT precludes DAX, if you want both modes on one pmem disk placing BTT
>> on a partition of the disk for fs metadata and DAX-capable data on the
>> rest is our proposed solution.  We chose this architecture after a
>> conversation with Dave Chinner about XFS's need to have atomic sector
>> guarantees for its metadata and wanting to simultaneously enable
>> XFS-DAX.
>
> I can't see why a v5 XFS filesystem with CRCs on all metadata would need
> sector atomic updates any more.  But even in a case where it would, it
> seems like whatever label you use for partitioning should sit above the
> block layer.

Yes, we use standard partition labels to sub-divide a namespace.  A
namespace boundary is either set by a label internal to the dimm or
the NFIT directly (for dimms that do not support internal labeling).
Good to hear that we don't need BTT for XFS v5, can we make the
guarantee for all filesystems that may want to support DAX?  I still
think stacking is a natural fit for this problem.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 03/15] nd_btt: atomic sector updates
  2015-06-21 10:03     ` Christoph Hellwig
@ 2015-06-21 16:31       ` Dan Williams
  -1 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-21 16:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-nvdimm, Boaz Harrosh, Kani, Toshimitsu,
	Vishal Verma, Neil Brown, Greg KH, Dave Chinner, linux-kernel,
	Andy Lutomirski, Jens Axboe, Linux ACPI, Jeff Moyer,
	H. Peter Anvin, linux-fsdevel, Ingo Molnar

On Sun, Jun 21, 2015 at 3:03 AM, Christoph Hellwig <hch@lst.de> wrote:
>> +config ND_MAX_REGIONS
>> +     int "Maximum number of regions supported by the sub-system"
>> +     default 64
>> +     ---help---
>> +       A 'region' corresponds to an individual DIMM or an interleave
>> +       set of DIMMs.  A typical maximally configured system may have
>> +       up to 32 DIMMs.
>> +
>> +       Leave the default of 64 if you are unsure.
>
> Having static limits in Kconfig is a bad idea.  What prevents you
> from handling any (reasonable) number dynamically?

Hmm, yes, this was a bad holdover from before we were using percpu
definitions for the lane locks.  Now that it's converted we can kill
the static definition of nd_percpu_lane and just do an alloc_percpu()
for each region dynamically.  Fixed in v2 and passing the test suite.
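
A rough userspace sketch of the idea: size the lane state when a region is
created instead of capping the number of regions at build time.  calloc()
stands in for alloc_percpu() here, and struct region, struct lane and the
region_*() helpers are hypothetical names for this example, not the
driver's actual definitions.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-lane state; a lock would live here in the kernel. */
struct lane {
	int lock_count;
};

struct region {
	unsigned int nr_cpus;	/* number of lane slots for this region */
	struct lane *lanes;	/* allocated per region, not statically */
};

/* Allocate a region and its zeroed lane slots; returns NULL on failure. */
static struct region *region_create(unsigned int nr_cpus)
{
	struct region *r;

	assert(nr_cpus > 0);
	r = malloc(sizeof(*r));
	if (!r)
		return NULL;
	r->lanes = calloc(nr_cpus, sizeof(*r->lanes));
	if (!r->lanes) {
		free(r);
		return NULL;
	}
	r->nr_cpus = nr_cpus;
	return r;
}

static void region_destroy(struct region *r)
{
	if (!r)
		return;
	free(r->lanes);
	free(r);
}

/* Map a cpu number onto one of the region's lane slots. */
static struct lane *region_lane(struct region *r, unsigned int cpu)
{
	return &r->lanes[cpu % r->nr_cpus];
}
```

With this shape the number of regions is bounded only by available memory,
so a Kconfig limit such as ND_MAX_REGIONS is no longer needed.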

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 14/15] libnvdimm: support read-only btt backing devices
  2015-06-21 15:11           ` Dan Williams
  (?)
@ 2015-06-22  6:30             ` Christoph Hellwig
  -1 siblings, 0 replies; 164+ messages in thread
From: Christoph Hellwig @ 2015-06-22  6:30 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jens Axboe, linux-nvdimm, Boaz Harrosh, Kani, Toshimitsu,
	linux-kernel, Linux ACPI, linux-fsdevel, Ingo Molnar

On Sun, Jun 21, 2015 at 08:11:25AM -0700, Dan Williams wrote:
> The labels only allow allocation of persistent media between pmem and
> blk.  For a given dimm you may access in either mode and the label
> records the decision.  We can have a btt on either the pmem or
> blk-mode disk type, or partition thereof.

Sounds like the spec should allow a btt type as well instead of
requiring the OS to work around it, as that seems to be one of the few
useful things to do with a run-time label.

Either way, partitions are trivial things and we could add them to the
nvdimm layer.

> Yes, it's this hybrid thing that mostly fits into the existing block
> device model save for two new block_device_operations
> ->direct_access() and ->rw_bytes().  We then use the property of a
> block_device that allows it to be claimed for exclusive ownership by a
> filesystem or another block_device to layer storage semantics on top
> be it files+directories, raid, caching, or atomic sectors.  NVDIMM
> devices don't present the same complexity as MTD devices.  The only
> complexity they present is byte-address-ability, not erase-block-size,
> wear-leveling, etc...

I didn't say they show the same complexities, but the same layering.

> Good to hear that we don't need BTT for XFS v5, can we make the
> guarantee for all filesystems that may want to support DAX?  I still
> think stacking is a natural fit for this problem.

I can't make any guarantees, especially not without verification.  But
if correctly implemented, any filesystem that does out-of-place metadata
writes (and that includes a traditional log) and uses checksums to ensure
the integrity of these updates should be fine.  You'd still have
the issue of sector atomicity of file I/O though.
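
As a sketch of why a checksum covers the torn-write case: assume a
metadata block that spans two 512-byte sectors and stores a CRC of its
payload.  The layout, the names (struct meta_block, meta_seal(),
meta_valid(), torn_write()) and the bitwise CRC-32 are all illustrative
assumptions, not XFS's on-disk format or checksum code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE	512
#define BLOCK_SIZE	(2 * SECTOR_SIZE)

/* Hypothetical metadata block spanning two sectors. */
struct meta_block {
	uint32_t crc;					/* checksum of payload */
	uint8_t payload[BLOCK_SIZE - sizeof(uint32_t)];
};

/* Plain bitwise CRC-32 (reflected polynomial 0xEDB88320). */
static uint32_t crc32_bitwise(const uint8_t *buf, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;

	for (size_t i = 0; i < len; i++) {
		crc ^= buf[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
	}
	return ~crc;
}

/* Record the payload checksum before the block is written out. */
static void meta_seal(struct meta_block *b)
{
	b->crc = crc32_bitwise(b->payload, sizeof(b->payload));
}

/* A block is trusted only if its stored checksum matches its payload. */
static int meta_valid(const struct meta_block *b)
{
	return b->crc == crc32_bitwise(b->payload, sizeof(b->payload));
}

/* Simulate a crash after only the first sector of src reached the media. */
static void torn_write(struct meta_block *media, const struct meta_block *src)
{
	assert(media && src);
	memcpy(media, src, SECTOR_SIZE);
}
```

A torn block fails meta_valid(), and because the write was out of place
the previous valid copy is still available to fall back on.  None of this
helps file data, which carries no such checksum.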

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 14/15] libnvdimm: support read-only btt backing devices
  2015-06-22  6:30             ` Christoph Hellwig
  (?)
@ 2015-06-22  7:17               ` Dan Williams
  -1 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-22  7:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-nvdimm, Boaz Harrosh, Kani, Toshimitsu,
	linux-kernel, Linux ACPI, linux-fsdevel, Ingo Molnar

On Sun, Jun 21, 2015 at 11:30 PM, Christoph Hellwig <hch@lst.de> wrote:
> On Sun, Jun 21, 2015 at 08:11:25AM -0700, Dan Williams wrote:
>> The labels only allow allocation of persistent media between pmem and
>> blk.  For a given dimm you may access in either mode and the label
>> records the decision.  We can have a btt on either the pmem or
>> blk-mode disk type, or partition thereof.
>
> Sounds like the spec should allow a btt type as well instead of
> requiring the OS to work around it, as that seems to be one of the few
> useful things to do with a run-time label.

To be fair the namespace was initially envisioned to be btt enabled or
not, and hide the raw media device.  It was only when we added the
"XFS needs BTT so we need BTT support on partitions" constraint that I
pushed stacked BTT as the most flexible way to handle all these
configurations.  It also simplified the namespace to only be a
partition of access modes and leave sub-dividing pmem to standard
partitions.

> Either way, partitions are trivial things and we could add them to the
> nvdimm layer.
>
>> Yes, it's this hybrid thing that mostly fits into the existing block
>> device model save for two new block_device_operations
>> ->direct_access() and ->rw_bytes().  We then use the property of a
>> block_device that allows it to be claimed for exclusive ownership by a
>> filesystem or another block_device to layer storage semantics on top
>> be it files+directories, raid, caching, or atomic sectors.  NVDIMM
>> devices don't present the same complexity as MTD devices.  The only
>> complexity they present is byte-address-ability, not erase-block-size,
>> wear-leveling, etc...
>
> I didn't say they show the same complexities, but the same layering.
>
>> Good to hear that we don't need BTT for XFS v5, can we make the
>> guarantee for all filesystems that may want to support DAX?  I still
>> think stacking is a natural fit for this problem.
>
> I can't make any guarantees, especially not without verification.  But
> if correctly implemented, any filesystem that does out-of-place metadata
> writes (and that includes a traditional log) and uses checksums to ensure
> the integrity of these updates should be fine.  You'd still have
> the issue of sector atomicity of file I/O though.

If someone needs sector atomicity of file I/O then by definition they
can't have DAX enabled.

There's no guarantee that these drivers are only ever paired with
XFSv5.  Drivers tend to be backported more freely than filesystems.  I
don't think the need for BTT on partitions will go away, but if you're
not convinced we could try the wait and see approach and move BTT to
only be enabled at namespace boundaries.  That's a fairly invasive
change to the configuration model, I'd hate to come back in a few
months to re-add BTT on partition support alongside the namespace only
mode.  Not trying to throw FUD, I'm willing to admit there are
downsides to the stacking model, they're just not clear to me
presently.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 14/15] libnvdimm: support read-only btt backing devices
@ 2015-06-22  7:17               ` Dan Williams
  0 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-22  7:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-nvdimm, Boaz Harrosh, Kani, Toshimitsu,
	linux-kernel, Linux ACPI, linux-fsdevel, Ingo Molnar

On Sun, Jun 21, 2015 at 11:30 PM, Christoph Hellwig <hch@lst.de> wrote:
> On Sun, Jun 21, 2015 at 08:11:25AM -0700, Dan Williams wrote:
>> The labels only allow allocation of persistent media between pmem and
>> blk.  For a given dimm you may access in either mode and the label
>> records the decision.  We can have a btt on either the pmem or
>> blk-mode disk type, or partition thereof.
>
> Sounds like the spec should allow a btt type as well insteaad of
> requiring the OS to work around it, as that seems to be one of the few
> useful things to do with a run-time label.

To be fair the namespace was initially envisioned to be btt enabled or
not, and hide the raw media device.  It was only when we added the
"XFS needs BTT so we need BTT support on partitions" constraint did I
push stacked BTT as the most flexible way to handle all these
configurations.  It also simplified the namespace to only be a
partition of access modes and leave sub-dividing pmem to standard
partitions.

> Either way, partitions are trivial things and we could add them to the
> nvdimm layer.
>
>> Yes, it's this hybrid thing that mostly fits into the existing block
>> device model save for two new block_device_operations
>> ->direct_access() and ->rw_bytes().  We then use property of a
>> block_device that allows it to be claimed for exclusive ownership by a
>> filesystem or another block_device to layer storage semantics on top
>> be it files+directories, raid, caching, or atomic sectors.  NVDIMM
>> devices don't present the same complexity as MTD devices.  The only
>> complexity they present is byte-address-ability, not erase-block-size,
>> wear-leveling, etc...
>
> I didn't say they show the same complexities, but the same layering.
>
>> Good to hear that we don't need BTT for XFS v5, can we make the
>> guarantee for all filesystems that may want to support DAX?  I still
>> think stacking is a natural fit for this problem.
>
> I can't make any guarantees, especially not without verification.  But
> if correctly implemented any filesystems that does out of place metadata
> writes (and that includes a traditional log) and uses checksum to ensure
> the integrity of these updates it should be fine.  You'd still have
> the issue of sector atomicy of file I/O though.

If someone needs sector atomicity of file I/O then by definition they
can't have DAX enabled.

There's no guarantee that these drivers are only ever paired with
XFSv5.  Drivers tend to be backported more freely than filesystems.  I
don't think the need for BTT on partitions will go away, but if you're
not convinced we could try the wait and see approach and move BTT to
only be enabled at namespace boundaries.  That's a fairly invasive
change to the configuration model, I'd hate to come back in a few
months to re-add BTT on partition support alongside the namespace only
mode.  Not trying to throw FUD, I'm willing to admit there are
downsides to the stacking model, they're just not clear to me
presently.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 14/15] libnvdimm: support read-only btt backing devices
@ 2015-06-22  7:17               ` Dan Williams
  0 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-22  7:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-nvdimm@lists.01.org, Boaz Harrosh, Kani,
	Toshimitsu, linux-kernel, Linux ACPI, linux-fsdevel, Ingo Molnar

On Sun, Jun 21, 2015 at 11:30 PM, Christoph Hellwig <hch@lst.de> wrote:
> On Sun, Jun 21, 2015 at 08:11:25AM -0700, Dan Williams wrote:
>> The labels only allow allocation of persistent media between pmem and
>> blk.  For a given dimm you may access in either mode and the label
>> records the decision.  We can have a btt on either the pmem or
>> blk-mode disk type, or partition thereof.
>
> Sounds like the spec should allow a btt type as well insteaad of
> requiring the OS to work around it, as that seems to be one of the few
> useful things to do with a run-time label.

To be fair the namespace was initially envisioned to be btt enabled or
not, and hide the raw media device.  It was only when we added the
"XFS needs BTT so we need BTT support on partitions" constraint did I
push stacked BTT as the most flexible way to handle all these
configurations.  It also simplified the namespace to only be a
partition of access modes and leave sub-dividing pmem to standard
partitions.

> Either way, partitions are trivial things and we could add them to the
> nvdimm layer.
>
>> Yes, it's this hybrid thing that mostly fits into the existing block
>> device model save for two new block_device_operations
>> ->direct_access() and ->rw_bytes().  We then use property of a
>> block_device that allows it to be claimed for exclusive ownership by a
>> filesystem or another block_device to layer storage semantics on top
>> be it files+directories, raid, caching, or atomic sectors.  NVDIMM
>> devices don't present the same complexity as MTD devices.  The only
>> complexity they present is byte-address-ability, not erase-block-size,
>> wear-leveling, etc...
>
> I didn't say they show the same complexities, but the same layering.
>
>> Good to hear that we don't need BTT for XFS v5, can we make the
>> guarantee for all filesystems that may want to support DAX?  I still
>> think stacking is a natural fit for this problem.
>
> I can't make any guarantees, especially not without verification.  But
> if correctly implemented, any filesystem that does out-of-place metadata
> writes (and that includes a traditional log) and uses checksums to ensure
> the integrity of these updates should be fine.  You'd still have
> the issue of sector atomicity of file I/O though.

If someone needs sector atomicity of file I/O then by definition they
can't have DAX enabled.
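
As an illustration of the criteria quoted above (out-of-place metadata
writes guarded by a checksum), here is a minimal user-space sketch in C.
It is not XFS code: the two-slot layout, the names struct slot,
slot_write() and slot_read(), the torn_after parameter that simulates
power loss, and the Adler-32-style checksum are all assumptions made for
the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAYLOAD 48

/* One on-media copy of a metadata record (hypothetical layout). */
struct slot {
	uint64_t seq;            /* higher sequence number wins */
	uint8_t  data[PAYLOAD];
	uint32_t csum;           /* covers seq and data */
};

/* Adler-32-style checksum; a real filesystem would likely use CRC32c. */
uint32_t slot_csum(const struct slot *s)
{
	const uint8_t *p = (const uint8_t *)s;
	uint32_t a = 1, b = 0;

	for (size_t i = 0; i < offsetof(struct slot, csum); i++) {
		a = (a + p[i]) % 65521;
		b = (b + a) % 65521;
	}
	return (b << 16) | a;
}

/* Return the valid slot with the highest sequence number, or NULL. */
const struct slot *slot_read(const struct slot media[2])
{
	const struct slot *best = NULL;

	for (int i = 0; i < 2; i++)
		if (media[i].csum == slot_csum(&media[i]) &&
		    (!best || media[i].seq > best->seq))
			best = &media[i];
	return best;
}

/*
 * Out-of-place update: never overwrite the current valid copy.
 * Only the first torn_after bytes reach the media, which simulates
 * power loss in the middle of the write.
 */
void slot_write(struct slot media[2], const char *text, size_t torn_after)
{
	const struct slot *cur = slot_read(media);
	struct slot *target = (cur == &media[0]) ? &media[1] : &media[0];
	struct slot rec;

	assert(text != NULL);
	memset(&rec, 0, sizeof(rec));
	rec.seq = cur ? cur->seq + 1 : 1;
	strncpy((char *)rec.data, text, PAYLOAD - 1);
	rec.csum = slot_csum(&rec);

	if (torn_after > sizeof(rec))
		torn_after = sizeof(rec);
	memcpy(target, &rec, torn_after);
}
```

Starting from zeroed media, a complete slot_write(media, "old",
sizeof(struct slot)) followed by a torn slot_write(media, "new", 24)
leaves slot_read() returning the "old" record: the half-written copy
fails its checksum and is ignored.  This protects metadata only; file
data written in place gets no such protection, which is the remaining
issue raised in the quoted text.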

There's no guarantee that these drivers are only ever paired with
XFSv5.  Drivers tend to be backported more freely than filesystems.  I
don't think the need for BTT on partitions will go away, but if you're
not convinced we could try the wait-and-see approach and move BTT to
only be enabled at namespace boundaries.  That's a fairly invasive
change to the configuration model, and I'd hate to come back in a few
months to re-add BTT on partition support alongside the namespace only
mode.  I'm not trying to throw FUD, and I'm willing to admit there are
downsides to the stacking model, they're just not clear to me
presently.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 14/15] libnvdimm: support read-only btt backing devices
  2015-06-22  7:17               ` Dan Williams
  (?)
@ 2015-06-22  7:28                 ` Christoph Hellwig
  -1 siblings, 0 replies; 164+ messages in thread
From: Christoph Hellwig @ 2015-06-22  7:28 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jens Axboe, linux-nvdimm, Boaz Harrosh, Kani, Toshimitsu,
	linux-kernel, Linux ACPI, linux-fsdevel, Ingo Molnar

On Mon, Jun 22, 2015 at 12:17:29AM -0700, Dan Williams wrote:
> To be fair the namespace was initially envisioned to be btt enabled or
> not, and hide the raw media device.

What's the fascination with hiding one access mode just because
another one is available?

> There's no guarantee that these drivers are only ever paired with
> XFSv5.

There's no guarantee for anything.  Note that anything not following
my criteria earlier would need some form of atomic sector updates,
which is a lot more.  But then again for most of those setups you
wouldn't take advantage of pmem anyway.

Sounds like we simply shouldn't merge btt at all for now and wait for
a real use case, which would simplify the whole issue a lot.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 14/15] libnvdimm: support read-only btt backing devices
  2015-06-22  7:28                 ` Christoph Hellwig
  (?)
@ 2015-06-22  7:39                   ` Dan Williams
  -1 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-22  7:39 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-nvdimm, Boaz Harrosh, Kani, Toshimitsu,
	linux-kernel, Linux ACPI, linux-fsdevel, Ingo Molnar

On Mon, Jun 22, 2015 at 12:28 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Mon, Jun 22, 2015 at 12:17:29AM -0700, Dan Williams wrote:
>> To be fair the namespace was initially envisioned to be btt enabled or
>> not, and hide the raw media device.
>
> What's the fascination with hiding one access mode just because
> another one is available?
>

Now I'm confused, you *don't* want the raw device to be hidden *and*
you want to kill the stacking?  Something got crossed.  The current
implementation hides nothing, you get to see the entire stacked
composition.  I'd much prefer to avoid hiding anything.

>> There's no guarantee that these drivers are only ever paired with
>> XFSv5.
>
> There's no guarantee for anything.  Note that anything not following
> my criteria earlier would need some form of atomic sector updates,
> which is a lot more.  But then again for most of those setups you
> wouldn't take advantage of pmem anyway.
>
> Sounds like we simply shouldn't merge btt at all for now and wait for
> a real use case, which would simplify the whole issue a lot.

The sinister aspect of sector tearing is that most applications don't
know they have this dependency.  At least today's disks rarely ever
tear sectors and if they do you almost certainly get a CRC error on
access.  NVDIMMs will always tear and always silently.  I think not
merging BTT at all to see what happens is simply wrong.
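
To make the tearing problem concrete, here is a small user-space model
in C.  It is a sketch under stated assumptions, not the nd_btt
implementation or its on-media format: struct toy_btt, the single spare
block, and the "stores" parameter (the number of 8-byte stores that
reach the media before power is lost) are all hypothetical.  It
contrasts an in-place sector write with a write that goes to a free
block and is then published by flipping one map entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTOR  512
#define NLBA    4              /* logical sectors exposed to the user */
#define NBLOCKS (NLBA + 1)     /* one spare physical block */

struct toy_btt {
	uint8_t  blocks[NBLOCKS][SECTOR];
	uint32_t map[NLBA];    /* logical sector -> physical block */
	uint32_t free_block;   /* the block not referenced by map[] */
};

void toy_init(struct toy_btt *b, uint8_t fill)
{
	memset(b->blocks, fill, sizeof(b->blocks));
	for (uint32_t i = 0; i < NLBA; i++)
		b->map[i] = i;
	b->free_block = NLBA;
}

/* Bytes that reached the media after the given number of 8-byte stores. */
static size_t landed(size_t stores)
{
	size_t n = stores * 8;

	return n > SECTOR ? SECTOR : n;
}

/* In-place write: an interruption leaves a silent mix of old and new. */
void write_in_place(struct toy_btt *b, uint32_t lba, const uint8_t *buf,
		    size_t stores)
{
	assert(lba < NLBA);
	memcpy(b->blocks[b->map[lba]], buf, landed(stores));
}

/*
 * Indirected write: fill the free block first, then publish it with a
 * single aligned 4-byte map store.  If power is lost before the flip,
 * readers still see the complete old sector.  (A real implementation
 * also has to log the flip so the free block can be recovered; that is
 * left out of this sketch.)
 */
void write_indirect(struct toy_btt *b, uint32_t lba, const uint8_t *buf,
		    size_t stores)
{
	uint32_t old;

	assert(lba < NLBA);
	memcpy(b->blocks[b->free_block], buf, landed(stores));
	if (landed(stores) < SECTOR)
		return;                /* power lost before the map flip */
	old = b->map[lba];
	b->map[lba] = b->free_block;
	b->free_block = old;
}

const uint8_t *read_sector(const struct toy_btt *b, uint32_t lba)
{
	assert(lba < NLBA);
	return b->blocks[b->map[lba]];
}

/* 0: entirely old bytes, 1: entirely new bytes, -1: torn mix. */
int sector_state(const uint8_t *s, uint8_t oldv, uint8_t newv)
{
	size_t n_old = 0, n_new = 0;

	for (size_t i = 0; i < SECTOR; i++) {
		n_old += (s[i] == oldv);
		n_new += (s[i] == newv);
	}
	if (n_new == SECTOR)
		return 1;
	if (n_old == SECTOR)
		return 0;
	return -1;
}
```

With the media filled with 'A' and a buffer of 'B', write_in_place()
interrupted after 10 stores leaves a sector for which sector_state()
reports -1, and nothing signals an error.  The same interruption in
write_indirect() leaves the logical sector entirely 'A'.  The price is
the extra indirection on every access, which is the overhead being
debated in this thread.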

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 14/15] libnvdimm: support read-only btt backing devices
  2015-06-22  7:39                   ` Dan Williams
  (?)
@ 2015-06-22 15:02                     ` Jeff Moyer
  -1 siblings, 0 replies; 164+ messages in thread
From: Jeff Moyer @ 2015-06-22 15:02 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Jens Axboe, linux-nvdimm@lists.01.org,
	linux-kernel, Linux ACPI, linux-fsdevel, Ingo Molnar

Dan Williams <dan.j.williams@intel.com> writes:

>> Sounds like we simply shouldn't merge btt at all for now and wait for
>> a real use case, which would simplify the whole issue a lot.
>
> The sinister aspect of sector tearing is that most applications don't
> know they have this dependency.  At least today's disks rarely ever
> tear sectors and if they do you almost certainly get a CRC error on
> access.  NVDIMMs will always tear and always silently.  I think not
> merging BTT at all to see what happens is simply wrong.

Agreed, we can't audit all code, and springing this potential data
corruptor on people seems irresponsible.

-Jeff

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 14/15] libnvdimm: support read-only btt backing devices
  2015-06-22  7:39                   ` Dan Williams
  (?)
@ 2015-06-22 15:40                     ` Christoph Hellwig
  -1 siblings, 0 replies; 164+ messages in thread
From: Christoph Hellwig @ 2015-06-22 15:40 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Jens Axboe, linux-nvdimm, Boaz Harrosh, Kani,
	Toshimitsu, linux-kernel, Linux ACPI, linux-fsdevel, Ingo Molnar

On Mon, Jun 22, 2015 at 12:39:34AM -0700, Dan Williams wrote:
> Now I'm confused, you *don't* want the raw device to be hidden *and*
> you want to kill the stacking?  Something got crossed.  The current
> implementation hides nothing, you get to see the entire stacked
> composition.  I'd much prefer to avoid hiding anything.

You see it, but you can't actually use it.  The proper way to expose it
would be to make the devices visible in the low-level bus sysfs
enumeration logic but have only one ULD attach to it.

> The sinister aspect of sector tearing is that most applications don't
> know they have this dependency.  At least today's disks rarely ever
> tear sectors and if they do you almost certainly get a CRC error on
> access.  NVDIMMs will always tear and always silently.  I think not
> merging BTT at all to see what happens is simply wrong.

So now you leave your users with a choice between a rock and a hard
place: either use BTT, which introduces non-trivial overhead and does
not support DAX, or just use the device as-is.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 14/15] libnvdimm: support read-only btt backing devices
  2015-06-22 15:02                     ` Jeff Moyer
  (?)
@ 2015-06-22 15:41                       ` Christoph Hellwig
  -1 siblings, 0 replies; 164+ messages in thread
From: Christoph Hellwig @ 2015-06-22 15:41 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Dan Williams, Jens Axboe, linux-nvdimm@lists.01.org,
	linux-kernel, Linux ACPI, linux-fsdevel, Ingo Molnar

On Mon, Jun 22, 2015 at 11:02:24AM -0400, Jeff Moyer wrote:
> Agreed, we can't audit all code, and springing this potential data
> corruptor on people seems irresponsible.

How do "the people" know they'd have to use btt in the current setup
without auditing their stack first?

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 14/15] libnvdimm: support read-only btt backing devices
  2015-06-22 15:41                       ` Christoph Hellwig
  (?)
@ 2015-06-22 16:00                         ` Jeff Moyer
  -1 siblings, 0 replies; 164+ messages in thread
From: Jeff Moyer @ 2015-06-22 16:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dan Williams, Jens Axboe, linux-nvdimm@lists.01.org,
	linux-kernel, Linux ACPI, linux-fsdevel, Ingo Molnar

Christoph Hellwig <hch@lst.de> writes:

> On Mon, Jun 22, 2015 at 11:02:24AM -0400, Jeff Moyer wrote:
>> Agreed, we can't audit all code, and springing this potential data
>> corruptor on people seems irresponsible.
>
> How do "the people" know they'd have to use btt in the current setup
> without auditing their stack first?

Right now, the guidance should be to always use btt since there are no
applications that are directly taking advantage of persistent memory
(that I know of).  I expect documentation would take care of that.  I also
expect that, as applications add support, they would note the
requirement for dax mountpoints in their documentation.

So, "the people" find out the same way they always have.  :)

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 14/15] libnvdimm: support read-only btt backing devices
  2015-06-22 16:00                         ` Jeff Moyer
  (?)
@ 2015-06-22 16:32                           ` Christoph Hellwig
  -1 siblings, 0 replies; 164+ messages in thread
From: Christoph Hellwig @ 2015-06-22 16:32 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Christoph Hellwig, Dan Williams, Jens Axboe,
	linux-nvdimm@lists.01.org, linux-kernel, Linux ACPI,
	linux-fsdevel, Ingo Molnar

On Mon, Jun 22, 2015 at 12:00:54PM -0400, Jeff Moyer wrote:
> Right now, the guidance should be to always use btt since there are no
> applications that are directly taking advantage of persistent memory
> (that I know of).  I expect documentation would take care of that.  I also
> expect that, as applications add support, they would note the
> requirement for dax mountpoints in their documentation.

It's not just DAX.  Avoiding the overhead for anything else is
another good reason.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 02/15] libnvdimm: infrastructure for btt devices
  2015-06-17 23:54   ` Dan Williams
@ 2015-06-22 16:34     ` Christoph Hellwig
  -1 siblings, 0 replies; 164+ messages in thread
From: Christoph Hellwig @ 2015-06-22 16:34 UTC (permalink / raw)
  To: Dan Williams
  Cc: axboe, linux-nvdimm, boaz, toshi.kani, Neil Brown, Greg KH,
	linux-kernel, mingo, linux-acpi, linux-fsdevel, hch

FYI, the mess that calls into libnvdimm for each registered pmem or blk
device, which then walks partitions to find your btt metadata, is an
absolute no-go and suggests that the layering is completely fucked up.

Please go and revisit it for a sensible model, where the different
drivers attach to a nvdimm bus instead of stacking up with a little
detour through the block layer.  From all that it's pretty clear
pmem, blk and btt should be peers in the hierarchy.
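
A toy model of the "peers on a bus" layering, in user-space C, may help
pin down what is being asked for.  It is only a sketch: enum nd_kind,
struct nd_device, struct nd_driver, nd_bus_attach() and nd_bus_probe()
are hypothetical names invented for the example and are not the
libnvdimm or driver-core API.  Each device is enumerated once on the
bus, every driver is offered the device directly, and at most one
upper-level driver may bind.

```c
#include <assert.h>
#include <stddef.h>

/* Access mode a device on the (hypothetical) bus is configured for. */
enum nd_kind { ND_PMEM, ND_BLK, ND_BTT };

struct nd_driver {
	const char  *name;
	enum nd_kind kind;     /* the kind of device this driver handles */
};

struct nd_device {
	const char             *name;
	enum nd_kind            kind;
	const struct nd_driver *driver;  /* NULL until a driver binds */
};

/* Bind drv to dev.  Returns 0 on success, -1 if refused. */
int nd_bus_attach(struct nd_device *dev, const struct nd_driver *drv)
{
	assert(dev != NULL && drv != NULL);
	if (dev->driver)
		return -1;     /* only one upper-level driver per device */
	if (dev->kind != drv->kind)
		return -1;     /* driver does not handle this kind */
	dev->driver = drv;
	return 0;
}

/*
 * Offer the device to each registered driver in turn.  The drivers are
 * peers: none of them is reached by stacking on top of another.
 */
const struct nd_driver *nd_bus_probe(struct nd_device *dev,
				     const struct nd_driver *drivers,
				     size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (nd_bus_attach(dev, &drivers[i]) == 0)
			return &drivers[i];
	return NULL;
}
```

With three drivers registered for ND_PMEM, ND_BLK and ND_BTT, probing a
device whose kind is ND_BTT binds the btt driver directly, and a second
attach attempt is refused.  How a device would come to be marked as btt
in the first place is exactly the open design question in this thread;
the model simply assumes that the decision has been made.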

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 14/15] libnvdimm: support read-only btt backing devices
  2015-06-22 15:40                     ` Christoph Hellwig
  (?)
@ 2015-06-22 16:36                       ` Dan Williams
  -1 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-22 16:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-nvdimm, Boaz Harrosh, Kani, Toshimitsu,
	linux-kernel, Linux ACPI, linux-fsdevel, Ingo Molnar

On Mon, Jun 22, 2015 at 8:40 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Mon, Jun 22, 2015 at 12:39:34AM -0700, Dan Williams wrote:
>> Now I'm confused, you *don't* want the raw device to be hidden *and*
>> you want to kill the stacking?  Something got crossed.  The current
>> implementation hides nothing, you get to see the entire stacked
>> composition.  I'd much prefer to avoid hiding anything.
>
> You see it, but you can't actually use it.  The proper way to expose it
> would be to make the devices visible in the low-level bus sysfs
> enumeration logic but have only one ULD attach to it.
>

In that case "don't stack" is too coarse of a hammer.  I see this as a
request to hide the subordinate ULD, which is a new capability that DM
and MD might benefit from as well.  We already have the case in MD
where it internally holds a reference to a bdev that has been hot
removed; it seems not much of a stretch to have stacking drivers be
able to hide device nodes for bdevs that they are holding.

>> The sinister aspect of sector tearing is that most applications don't
>> know they have this dependency.  At least today's disks rarely ever
>> tear sectors, and if they do you almost certainly get a CRC error on
>> access.  NVDIMMs will always tear and always silently.  I think not
>> merging BTT at all to see what happens is simply wrong.
>
> So now you leave your users with a choice between a rock and a hard
> place, that is using BTT to introduce non-significant overhead and not
> supporting DAX or just use it as-is.

Yes, if they want to use DAX they should do it consciously and audit
their application to be sure it is safe to abandon atomic sector
guarantees.  With the current flexibility to do BTT on a partition
they can do this conversion piecemeal and, for example, keep metadata
on BTT and data on DAX.
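
To make the torn-sector concern concrete, here is a small user-space
simulation in Python.  It is only a sketch, not kernel code and not the
BTT on-media format: the class names, the byte-by-byte copy, and the
fail_after parameter are all made up for illustration, and the map update
is simply assumed to be atomic.  The real BTT keeps additional metadata
that this ignores.

```python
SECTOR = 512


class PowerLoss(Exception):
    """Models losing power in the middle of a sector write."""


def copy_with_power_loss(media, offset, data, fail_after):
    """Copy data into media one byte at a time, stopping after fail_after bytes."""
    for i, byte in enumerate(data):
        if fail_after is not None and i == fail_after:
            raise PowerLoss()
        media[offset + i] = byte


class RawDevice:
    """Sectors are overwritten in place, as on a raw byte-addressable device."""

    def __init__(self, nsectors):
        self.media = bytearray(nsectors * SECTOR)

    def write(self, lba, data, fail_after=None):
        copy_with_power_loss(self.media, lba * SECTOR, data, fail_after)

    def read(self, lba):
        return bytes(self.media[lba * SECTOR:(lba + 1) * SECTOR])


class IndirectDevice:
    """New data goes to a spare block; one map update then publishes it."""

    def __init__(self, nsectors):
        self.media = bytearray((nsectors + 1) * SECTOR)
        self.map = list(range(nsectors))  # logical block -> physical block
        self.free = nsectors              # the single spare physical block

    def write(self, lba, data, fail_after=None):
        copy_with_power_loss(self.media, self.free * SECTOR, data, fail_after)
        # Assumption: this swap is atomic with respect to power loss.
        self.map[lba], self.free = self.free, self.map[lba]

    def read(self, lba):
        pba = self.map[lba]
        return bytes(self.media[pba * SECTOR:(pba + 1) * SECTOR])


old, new = b"A" * SECTOR, b"B" * SECTOR

raw = RawDevice(4)
raw.write(0, old)
try:
    raw.write(0, new, fail_after=100)
except PowerLoss:
    pass
torn = raw.read(0)      # mix of new and old bytes, with no error reported

ind = IndirectDevice(4)
ind.write(0, old)
try:
    ind.write(0, new, fail_after=100)
except PowerLoss:
    pass
intact = ind.read(0)    # still the complete old sector
```

With in-place writes the reader gets a sector that is part new and part
old and nothing flags it.  With the indirection the interrupted write only
touched the spare block, so the reader still sees the old sector in full.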

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 14/15] libnvdimm: support read-only btt backing devices
  2015-06-22 16:32                           ` Christoph Hellwig
  (?)
@ 2015-06-22 16:42                             ` Jeff Moyer
  -1 siblings, 0 replies; 164+ messages in thread
From: Jeff Moyer @ 2015-06-22 16:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dan Williams, Jens Axboe, linux-nvdimm@lists.01.org,
	linux-kernel, Linux ACPI, linux-fsdevel, Ingo Molnar

Christoph Hellwig <hch@lst.de> writes:

> On Mon, Jun 22, 2015 at 12:00:54PM -0400, Jeff Moyer wrote:
>> Right now, the guidance should be to always use btt since there are no
>> applications that are directly taking advantage of persistent memory
>> (that I know).  I expect documentation would take care of that.  I also
>> expect that, as applications add support, they would note the
>> requirement for dax mountpoints in their documentation.
>
> It's not just DAX.  Avoiding the overhead for anything else is
> another good reason.

OK, add torn sector detection/recovery to that statement, then.  More
importantly, do you agree with the sentiment or not?

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 14/15] libnvdimm: support read-only btt backing devices
  2015-06-22 16:36                       ` Dan Williams
  (?)
@ 2015-06-22 16:45                         ` Christoph Hellwig
  -1 siblings, 0 replies; 164+ messages in thread
From: Christoph Hellwig @ 2015-06-22 16:45 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jens Axboe, linux-nvdimm, Boaz Harrosh, Kani, Toshimitsu,
	linux-kernel, Linux ACPI, linux-fsdevel, Ingo Molnar

On Mon, Jun 22, 2015 at 09:36:50AM -0700, Dan Williams wrote:
> In that case "don't stack" is too coarse of a hammer.  I see this as a
> request to hide the subordinate ULD which is a new capability that DM
> and MD might benefit from as well.  We already have the case in MD
> where it internally holds a reference to bdev that has been hot
> removed, it seems not much of a stretch to have stacking drivers be
> able to hide device nodes for bdevs that they are holding.

I don't see why you're comparing with MD and DM here.  MD and DM
sit cleanly on top of any block device.  If btt was independent of
libnvdimm and just used ->rw_bytes we could see it that way.

But it's all a giant entangled mess, where btt for example is probed
by libnvdimm.  At the same time pmem.c isn't really a true block
driver, it's really just a trivial shim between the block API
and pmem-style memcpy.  Especially with the proper pmem API btt
would become cleaner just calling that directly.

> Yes, if they want to use DAX they should do it consciously and audit
> their application to be sure it is safe to abandon atomic sector
> guarantees.  With the current flexibility to do BTT on a partition
> they can do this conversion piecemeal and, for example, keep metadata
> on BTT and data on DAX.

By that logic you'd want to attach BTT by default and allow opt-out
at some level.  This could be a libnvdimm-level partitioning scheme,
which would also allow persistently storing a bit that records whether
BTT is used.  Or it could be on fine-grained boundaries, which might be
more useful.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 02/15] libnvdimm: infrastructure for btt devices
  2015-06-22 16:34     ` Christoph Hellwig
@ 2015-06-22 16:48       ` Dan Williams
  -1 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-22 16:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-nvdimm, Boaz Harrosh, Kani, Toshimitsu,
	Neil Brown, Greg KH, linux-kernel, Ingo Molnar, Linux ACPI,
	linux-fsdevel

On Mon, Jun 22, 2015 at 9:34 AM, Christoph Hellwig <hch@lst.de> wrote:
> FYI, the mess that calls into libnvdimm for each registered pmem or blk
> device, which then walks partitions to find your btt metadata, is an
> absolute no-go and suggests that the layering is completely fucked up.

Only if you abandon BTT on partitions, which at this point it seems
you're boldly committed to doing.  It's unacceptable to drop BTT on
the floor so I'll take a look at making BTT per-disk only for 4.2.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 14/15] libnvdimm: support read-only btt backing devices
  2015-06-22 16:42                             ` Jeff Moyer
  (?)
@ 2015-06-22 16:48                               ` Christoph Hellwig
  -1 siblings, 0 replies; 164+ messages in thread
From: Christoph Hellwig @ 2015-06-22 16:48 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Dan Williams, Jens Axboe, linux-nvdimm@lists.01.org,
	linux-kernel, Linux ACPI, linux-fsdevel, Ingo Molnar

On Mon, Jun 22, 2015 at 12:42:44PM -0400, Jeff Moyer wrote:
> OK, add torn sector detection/recovery to that statement, then.  More
> importantly, do you agree with the sentiment or not?

I think we're getting on a very slippery slope if we think about
applications here.  Buffered I/O applications must deal with torn
writes at any granularity anyway, e.g. fsync + rename is the
only thing they can rely on right now (I actually have software O_ATOMIC
code to avoid this, but that's another story).
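
For readers who have not seen the fsync + rename idiom, a minimal Python
sketch follows.  The helper name atomic_replace is hypothetical; the
calls it uses (tempfile.mkstemp, os.fsync, os.replace) are standard
library.  It assumes a POSIX filesystem where rename within one directory
is atomic and where a directory can be opened and fsynced.

```python
import os
import tempfile


def atomic_replace(path, data):
    """Replace the contents of path so readers see the old or the new bytes."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory, so the rename below
    # never crosses a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # data is on stable storage before rename
        os.replace(tmp_path, path)   # atomically swap in the new file
    except BaseException:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Sync the directory so the rename itself survives a crash.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "state")
atomic_replace(target, b"version 1\n")
atomic_replace(target, b"version 2\n")
```

The application never overwrites live data in place, which is why this
pattern does not depend on how large a write the underlying device can
complete atomically.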

Applications using direct I/O can make assumptions if they know the sector
size, and we must have a way for them to be able to see our new
"subsector sector size".  Those applications are few and far between but
also important, so needing special cases for them is fine.  Although those
are the most likely ones to take advantage of byte addressing anyway.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 02/15] libnvdimm: infrastructure for btt devices
  2015-06-22 16:48       ` Dan Williams
  (?)
@ 2015-06-22 16:48         ` Christoph Hellwig
  -1 siblings, 0 replies; 164+ messages in thread
From: Christoph Hellwig @ 2015-06-22 16:48 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Jens Axboe, linux-nvdimm, Boaz Harrosh, Kani,
	Toshimitsu, Neil Brown, Greg KH, linux-kernel, Ingo Molnar,
	Linux ACPI, linux-fsdevel

On Mon, Jun 22, 2015 at 09:48:03AM -0700, Dan Williams wrote:
> Only if you abandon BTT on partitions, which at this point it seems
> you're boldly committed to doing.  It's unacceptable to drop BTT on
> the floor so I'll take a look at making BTT per-disk only for 4.2.

If by partitions you mean block layer partitions: yes.  If by partitions
you mean subdivision of nvdimms: no.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 14/15] libnvdimm: support read-only btt backing devices
  2015-06-22 16:45                         ` Christoph Hellwig
  (?)
@ 2015-06-22 16:54                           ` Dan Williams
  -1 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-22 16:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-nvdimm, Boaz Harrosh, Kani, Toshimitsu,
	linux-kernel, Linux ACPI, linux-fsdevel, Ingo Molnar

On Mon, Jun 22, 2015 at 9:45 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Mon, Jun 22, 2015 at 09:36:50AM -0700, Dan Williams wrote:
>> In that case "don't stack" is too coarse of a hammer.  I see this as a
>> request to hide the subordinate ULD which is a new capability that DM
>> and MD might benefit from as well.  We already have the case in MD
>> where it internally holds a reference to bdev that has been hot
>> removed, it seems not much of a stretch to have stacking drivers be
>> able to hide device nodes for bdevs that they are holding.
>
> I don't see why you're comparing with MD and DM here.  MD and DM
> sit cleanly on top of any block device.  If btt was independent of
> libnvdimm and just used ->rw_bytes we could see it that way.
>
> But it's all a giant entangled mess, where btt for example is probed
> by libnvdimm.  At the same time pmem.c isn't really a true block
> driver, it's really just a trivial shim between the block API
> and pmem-style memcpy.  Especially with the proper pmem API btt
> would become cleaner just calling that directly.

The pmem API does nothing to fix torn sectors; there are no extra
atomicity guarantees that come from those instructions.

>> Yes, if they want to use DAX they should do it consciously and audit
>> their application to be sure it is safe to abandon atomic sector
>> guarantees.  With the current flexibility to do BTT on a partition
>> they can do this conversion piecemeal and, for example, keep metadata
>> on BTT and data on DAX.
>
> By that logic you'd want to attach BTT by default and allow opt-out
> at some level.  This could be a libnvdimm-level partitioning scheme,
> which would also allow persistently storing a bit that records whether
> BTT is used.  Or it could be on fine-grained boundaries, which might be
> more useful.

Well, let's start with per-disk btt and see where that gets us; we can
always ramp up complexity later.  I'd just as soon make the default
opt-in/out a Kconfig toggle with a sysfs override.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 14/15] libnvdimm: support read-only btt backing devices
  2015-06-22 16:54                           ` Dan Williams
@ 2015-06-22 16:57                             ` Christoph Hellwig
  -1 siblings, 0 replies; 164+ messages in thread
From: Christoph Hellwig @ 2015-06-22 16:57 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Jens Axboe, linux-nvdimm, Boaz Harrosh, Kani,
	Toshimitsu, linux-kernel, Linux ACPI, linux-fsdevel, Ingo Molnar

On Mon, Jun 22, 2015 at 09:54:51AM -0700, Dan Williams wrote:
> > I don't see why you're comparing with MD and DM here.  MD and DM
> > sit cleanly on top of any block device.  If btt was independent of
> > libnvdimm and just used ->rw_bytes we could see it as this.
> >
> > But it's all a giant entangled mess, where btt for example is probed
> > by libnvdimm.  At the same time pmem.c isn't really a true block
> > driver, it's really just a trivial shim between the block API
> > and pmem-style memcpy.  Especially with the proper pmem API btt
> > would become cleaner just calling that directly.
> 
> The pmem api does nothing to fix torn sectors, there are no extra
> atomicity guarantees that come from those instructions.

Of course not.  And neither does pmem.c help you in any way.

That's the point:  btt should be a peer to pmem.c, not on top of it
as there's no value add in pmem.c for it, and they are logically peers.

> Well, let's start with per-disk btt and see where that gets us, we can
> always ramp up complexity later.  I'd just as soon make the default
> opt-in/out a Kconfig toggle with a sysfs override.

Kconfig or sysfs are both utterly horrible choices.  It's a disk format
choice so it needs to be persisted.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 14/15] libnvdimm: support read-only btt backing devices
  2015-06-22 16:57                             ` Christoph Hellwig
  (?)
@ 2015-06-22 16:59                               ` Dan Williams
  -1 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-22 16:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-nvdimm, Boaz Harrosh, Kani, Toshimitsu,
	linux-kernel, Linux ACPI, linux-fsdevel, Ingo Molnar

On Mon, Jun 22, 2015 at 9:57 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Mon, Jun 22, 2015 at 09:54:51AM -0700, Dan Williams wrote:
>> > I don't see why you're comparing with MD and DM here.  MD and DM
>> > sit cleanly on top of any block device.  If btt was independent of
>> > libnvdimm and just used ->rw_bytes we could see it as this.
>> >
>> > But it's all a giant entangled mess, where btt for example is probed
>> > by libnvdimm.  At the same time pmem.c isn't really a true block
>> > driver, it's really just a trivial shim between the block API
>> > and pmem-style memcpy.  Especially with the proper pmem API btt
>> > would become cleaner just calling that directly.
>>
>> The pmem api does nothing to fix torn sectors, there are no extra
>> atomicity guarantees that come from those instructions.
>
> Of course not.  And neither does pmem.c help you in any way.
>
> That's the point:  btt should be a peer to pmem.c, not on top of it
> as there's no value add in pmem.c for it, and they are logically peers.
>
>> Well, let's start with per-disk btt and see where that gets us, we can
>> always ramp up complexity later.  I'd just as soon make the default
>> opt-in/out a Kconfig toggle with a sysfs override.
>
> Kconfig or sysfs are both utterly horrible choices.  It's a disk format
> choice so it needs to be persisted.

Of course it will be persisted with an on-disk BTT superblock.
Establishing that superblock by default and deleting it on demand are
handled via Kconfig and sysfs.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 02/15] libnvdimm: infrastructure for btt devices
  2015-06-22 16:48         ` Christoph Hellwig
  (?)
@ 2015-06-22 18:32           ` Jeff Moyer
  -1 siblings, 0 replies; 164+ messages in thread
From: Jeff Moyer @ 2015-06-22 18:32 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dan Williams, Jens Axboe, linux-nvdimm@lists.01.org, Neil Brown,
	Greg KH, linux-kernel, Ingo Molnar, Linux ACPI, linux-fsdevel

Christoph Hellwig <hch@lst.de> writes:

> On Mon, Jun 22, 2015 at 09:48:03AM -0700, Dan Williams wrote:
>> Only if you abandon BTT on partitions, which at this point it seems
>> you're boldly committed to doing.  It's unacceptable to drop BTT on
>> the floor so I'll take a look at making BTT per-disk only for 4.2.
>
> If by partitions you mean block layer partitions: yes.  If by partitions
> you mean subdivision of nvdimms: no.

How will this subdivision be recorded?  Not all NVDIMMs support the
label specification.

Sysadmins are already familiar with partitions;  I'm not sure why we'd
deviate from that here.  What am I missing?

-Jeff

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 14/15] libnvdimm: support read-only btt backing devices
  2015-06-22 16:48                               ` Christoph Hellwig
  (?)
@ 2015-06-22 18:48                                 ` Jeff Moyer
  -1 siblings, 0 replies; 164+ messages in thread
From: Jeff Moyer @ 2015-06-22 18:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dan Williams, Jens Axboe, linux-nvdimm@lists.01.org,
	linux-kernel, Linux ACPI, linux-fsdevel, Ingo Molnar

Christoph Hellwig <hch@lst.de> writes:

> On Mon, Jun 22, 2015 at 12:42:44PM -0400, Jeff Moyer wrote:
>> OK, add torn sector detection/recovery to that statement, then.  More
>> importantly, do you agree with the sentiment or not?
>
> I think we're getting on a very slippery slope if we think about
> applications here.  Buffered I/O applications must deal with torn
> writes at any granularity anyway, e.g. fsync + rename is the
> only thing they can rely on right now (I actually have software O_ATOMIC
> code to avoid this, but that's another story).

OK, so you think applications using buffered I/O will Just Work(TM)?  My
guess is that things will start to break that hadn't broken in the
past.  Sure, the application isn't designed properly, and that should be
fixed, but we shouldn't foist this on users as the default.

> Applications using direct I/O can make assumptions if they know the sector
> size, and we must have a way for them to be able to see our new
> "subsector sector size".

You need to let them determine that when NOT using the btt, yes.  Right
now, I don't think there's a way to determine what the underlying atomic
write unit is.  That's something the NFIT spec probably should have
defined.

> And those applications are few and far between but also important, so needing
> special cases for them is fine.  Although those are the most likely
> ones to take advantage of byte addressing anyway.

Agreed.

-Jeff

^ permalink raw reply	[flat|nested] 164+ messages in thread
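
For readers unfamiliar with the "fsync + rename" idiom mentioned in the
message above, here is a minimal userspace sketch in C.  It assumes a POSIX
system and that the temporary file and the final file live in the same
directory.  The function name replace_file_atomically() and its parameters
are made up for illustration; they do not come from the thread or from the
patch series.  The idea is that the new contents are written and flushed
under a temporary name and only then renamed over the old file, so a crash
should leave either the old file or the new one rather than a torn mix.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): replace final_path with the
 * len bytes at buf.  tmp_path and final_path must both live in dir_path.
 * Returns 0 on success, -1 on failure.
 */
int replace_file_atomically(const char *dir_path, const char *tmp_path,
                            const char *final_path,
                            const void *buf, size_t len)
{
    const char *p = buf;
    int fd, dfd, rc;

    assert(dir_path && tmp_path && final_path && buf);

    fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* Write everything; write() may return short counts. */
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0)
            goto fail;
        p += n;
        len -= (size_t)n;
    }

    /* Flush the new data before it becomes visible under the final name. */
    if (fsync(fd) != 0)
        goto fail;
    if (close(fd) != 0) {
        unlink(tmp_path);
        return -1;
    }

    /* rename() swaps the name in one step within a filesystem. */
    if (rename(tmp_path, final_path) != 0) {
        unlink(tmp_path);
        return -1;
    }

    /* Flush the directory so the rename itself survives a crash. */
    dfd = open(dir_path, O_RDONLY | O_DIRECTORY);
    if (dfd < 0)
        return -1;
    rc = fsync(dfd);
    close(dfd);
    return rc == 0 ? 0 : -1;

fail:
    close(fd);
    unlink(tmp_path);
    return -1;
}
```

A caller would use it as, for example,
replace_file_atomically("/tmp", "/tmp/state.tmp", "/tmp/state", data, size).
Note that this protects whole-file replacement only; it says nothing about
in-place updates, which is where the sector atomicity discussion applies.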

* Re: [PATCH 02/15] libnvdimm: infrastructure for btt devices
  2015-06-22 18:32           ` Jeff Moyer
  (?)
@ 2015-06-22 19:02             ` Dan Williams
  -1 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-22 19:02 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Christoph Hellwig, Jens Axboe, linux-nvdimm@lists.01.org,
	Neil Brown, Greg KH, linux-kernel, Ingo Molnar, Linux ACPI,
	linux-fsdevel

On Mon, Jun 22, 2015 at 11:32 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Christoph Hellwig <hch@lst.de> writes:
>
>> On Mon, Jun 22, 2015 at 09:48:03AM -0700, Dan Williams wrote:
>>> Only if you abandon BTT on partitions, which at this point it seems
>>> you're boldly committed to doing.  It's unacceptable to drop BTT on
>>> the floor so I'll take a look at making BTT per-disk only for 4.2.
>>
>> If by partitions you mean block layer partitions: yes.  If by partitions
>> you mean subdivision of nvdimms: no.
>
> How will this subdivision be recorded?  Not all NVDIMMs support the
> label specification.

...and the ones that do only use labels for resolving aliasing, not
partitioning.

> Sysadmins are already familiar with partitions;  I'm not sure why we'd
> deviate from that here.  What am I missing?

I don't see the need to re-invent partitioning which is the path this
requested rework is putting us on...

However, when the need arises for smaller granularity BTT we can have
the partition fight then.  To be clear, I believe that need is already
here today, but I'm not in a position to push that agenda at this late
date.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 14/15] libnvdimm: support read-only btt backing devices
  2015-06-22 18:48                                 ` Jeff Moyer
@ 2015-06-22 19:04                                   ` Dan Williams
  -1 siblings, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-22 19:04 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Christoph Hellwig, Jens Axboe, linux-nvdimm@lists.01.org,
	linux-kernel, Linux ACPI, linux-fsdevel, Ingo Molnar

On Mon, Jun 22, 2015 at 11:48 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Christoph Hellwig <hch@lst.de> writes:
>
>> On Mon, Jun 22, 2015 at 12:42:44PM -0400, Jeff Moyer wrote:
>>> OK, add torn sector detection/recovery to that statement, then.  More
>>> importantly, do you agree with the sentiment or not?
>>
>> I think we're getting on a very slippery slope if we think about
>> applications here.  Buffered I/O applications must deal with torn
>> writes at any granularity anyway, e.g. fsync + rename is the
>> only thing they can rely on right now (I actually have software O_ATOMIC
>> code to avoid this, but that's another story).
>
> OK, so you think applications using buffered I/O will Just Work(TM)?  My
> guess is that things will start to break that hadn't broken in the
> past.  Sure, the application isn't designed properly, and that should be
> fixed, but we shouldn't foist this on users as the default.
>
>> Applications using direct I/O can make assumptions if they know the sector
>> size, and we must have a way for them to be able to see our new
>> "subsector sector size".
>
> You need to let them determine that when NOT using the btt, yes.  Right
> now, I don't think there's a way to determine what the underlying atomic
> write unit is.  That's something the NFIT spec probably should have
> defined.

There are no atomic write units for NFIT to advertise beyond CPU register width.

^ permalink raw reply	[flat|nested] 164+ messages in thread
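
As a rough illustration of why a raw pmem copy can tear a sector while an
indirection layer in the style of BTT does not, the toy model below
simulates a power loss partway through a 512-byte write.  This is a sketch
under stated assumptions, not the nd_btt implementation: struct toy_btt,
toy_btt_write(), toy_btt_read() and raw_write() are hypothetical names,
everything lives in ordinary memory, and the logging that a real
implementation needs to recover its free-block bookkeeping after a crash is
left out.  The point it tries to show is that new data goes to a spare
block first, and only a single word-sized map update makes it visible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_NBLOCKS 4    /* logical blocks exposed to the user */
#define TOY_SECTOR  512  /* bytes per block */

/* Toy model only: one spare physical block beyond the logical ones. */
struct toy_btt {
    uint32_t map[TOY_NBLOCKS];   /* logical block -> physical block */
    uint32_t free_block;         /* physical block not currently mapped */
    unsigned char data[TOY_NBLOCKS + 1][TOY_SECTOR];
};

void toy_btt_init(struct toy_btt *b)
{
    memset(b, 0, sizeof(*b));
    for (uint32_t i = 0; i < TOY_NBLOCKS; i++)
        b->map[i] = i;
    b->free_block = TOY_NBLOCKS;
}

/*
 * Write one sector.  crash_after is the number of bytes copied before a
 * simulated power loss; pass TOY_SECTOR for a write that completes.
 * Returns 1 if the write was committed, 0 if the "crash" hit first.
 */
int toy_btt_write(struct toy_btt *b, uint32_t lba,
                  const unsigned char *src, size_t crash_after)
{
    size_t n = crash_after < TOY_SECTOR ? crash_after : TOY_SECTOR;
    uint32_t target, old;

    assert(lba < TOY_NBLOCKS);

    /* The copy may be torn, but it lands in a block no reader can see. */
    target = b->free_block;
    memcpy(b->data[target], src, n);
    if (n < TOY_SECTOR)
        return 0;               /* power lost before the commit point */

    /* Commit point: a single 32-bit store switches the mapping. */
    old = b->map[lba];
    b->map[lba] = target;
    b->free_block = old;
    return 1;
}

const unsigned char *toy_btt_read(const struct toy_btt *b, uint32_t lba)
{
    assert(lba < TOY_NBLOCKS);
    return b->data[b->map[lba]];
}

/* For contrast: an in-place copy that can be interrupted halfway. */
void raw_write(unsigned char *sector, const unsigned char *src,
               size_t crash_after)
{
    size_t n = crash_after < TOY_SECTOR ? crash_after : TOY_SECTOR;

    memcpy(sector, src, n);
}
```

With this model, interrupting toy_btt_write() after 100 bytes leaves the
previously committed sector readable in full, whereas interrupting
raw_write() at the same point leaves 100 new bytes followed by 412 old
ones, which is the torn sector case discussed in this thread.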

* Re: [PATCH 02/15] libnvdimm: infrastructure for btt devices
  2015-06-22 19:02             ` Dan Williams
  (?)
@ 2015-06-22 19:09               ` Jeff Moyer
  -1 siblings, 0 replies; 164+ messages in thread
From: Jeff Moyer @ 2015-06-22 19:09 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Jens Axboe, linux-nvdimm@lists.01.org,
	Neil Brown, Greg KH, linux-kernel, Ingo Molnar, Linux ACPI,
	linux-fsdevel

Dan Williams <dan.j.williams@intel.com> writes:

> On Mon, Jun 22, 2015 at 11:32 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>> Christoph Hellwig <hch@lst.de> writes:
>>
>>> On Mon, Jun 22, 2015 at 09:48:03AM -0700, Dan Williams wrote:
>>>> Only if you abandon BTT on partitions, which at this point it seems
>>>> you're boldly committed to doing.  It's unacceptable to drop BTT on
>>>> the floor so I'll take a look at making BTT per-disk only for 4.2.
>>>
>>> If by partitions you mean block layer partitions: yes.  If by partitions
>>> you mean subdivision of nvdimms: no.
>>
>> How will this subdivision be recorded?  Not all NVDIMMs support the
>> label specification.
>
> ...and the ones that do only use labels for resolving aliasing, not
> partitioning.
>
>> Sysadmins are already familiar with partitions;  I'm not sure why we'd
>> deviate from that here.  What am I missing?
>
> I don't see the need to re-invent partitioning which is the path this
> requested rework is putting us on...
>
> However, when the need arises for smaller granularity BTT we can have
> the partition fight then.  To be clear, I believe that need is already
> here today, but I'm not in a position to push that agenda at this late
> date.

The xfs example is enough to convince me that we need to support btt on
a partition right now.  Otherwise, for RHEL at least, dax on xfs simply
won't be supported.

-Jeff

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 14/15] libnvdimm: support read-only btt backing devices
  2015-06-22 19:04                                   ` Dan Williams
  (?)
@ 2015-06-22 19:11                                     ` Jeff Moyer
  -1 siblings, 0 replies; 164+ messages in thread
From: Jeff Moyer @ 2015-06-22 19:11 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Jens Axboe, linux-nvdimm@lists.01.org,
	linux-kernel, Linux ACPI, linux-fsdevel, Ingo Molnar

Dan Williams <dan.j.williams@intel.com> writes:

>>> Direct I/O using application can make assumption if they know the sector
>>> size, and we must have a way for them to be able to see our new
>>> "subsector sector size".
>>
>> You need to let them determine that when NOT using the btt, yes.  Right
>> now, I don't think there's a way to determine what the underlying atomic
>> write unit is.  That's something the NFIT spec probably should have
>> defined.
>
> There are no atomic write units for NFIT to advertise beyond cpu register width.

That would be useful information for the platform to provide, instead of
requiring the o/s or applications to infer it.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 164+ messages in thread
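
The exchange above turns on the point that, without a BTT, the only write
unit the platform guarantees is a CPU-register-wide store. A minimal sketch
of what that means for a 512-byte sector, assuming an 8-byte atomic unit
(the names torn_sector_write() and classify_sector() are hypothetical and
exist only for this illustration; nothing here is kernel or NFIT API):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512u
#define ATOMIC_UNIT 8u   /* assumption: one aligned 64-bit store */

enum sector_state { SECTOR_OLD, SECTOR_NEW, SECTOR_TORN };

/*
 * Model a sector write interrupted by power loss: only the first
 * units_done aligned 8-byte stores reach the media, and the rest of
 * the sector keeps whatever it held before.
 */
static void torn_sector_write(uint8_t *media, const uint8_t *new_data,
                              size_t units_done)
{
    size_t max_units = SECTOR_SIZE / ATOMIC_UNIT;

    assert(media != NULL && new_data != NULL);
    if (units_done > max_units)
        units_done = max_units;
    memcpy(media, new_data, units_done * ATOMIC_UNIT);
}

/* Report whether the media holds the old sector, the new one, or a mix. */
static enum sector_state classify_sector(const uint8_t *media,
                                         const uint8_t *old_data,
                                         const uint8_t *new_data)
{
    if (memcmp(media, new_data, SECTOR_SIZE) == 0)
        return SECTOR_NEW;
    if (memcmp(media, old_data, SECTOR_SIZE) == 0)
        return SECTOR_OLD;
    return SECTOR_TORN;
}
```

Any interruption between the first and the last 8-byte store leaves the
sector in the SECTOR_TORN state, which is the case an application assuming
sector atomicity never expects to see.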

* Re: [PATCH 02/15] libnvdimm: infrastructure for btt devices
  2015-06-22 18:32           ` Jeff Moyer
                             ` (2 preceding siblings ...)
  (?)
@ 2015-06-23 10:10           ` Christoph Hellwig
  -1 siblings, 0 replies; 164+ messages in thread
From: Christoph Hellwig @ 2015-06-23 10:10 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Christoph Hellwig, Dan Williams, Jens Axboe,
	linux-nvdimm@lists.01.org, Neil Brown, Greg KH, linux-kernel,
	Ingo Molnar, Linux ACPI, linux-fsdevel

On Mon, Jun 22, 2015 at 02:32:53PM -0400, Jeff Moyer wrote:
> How will this subdivision be recorded?  Not all NVDIMMs support the
> label specification.

Labels would be preferable; it's a pity the spec is so vague.

> Sysadmins are already familiar with partitions;  I'm not sure why we'd
> deviate from that here.  What am I missing?

The layering architecture I guess?

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 14/15] libnvdimm: support read-only btt backing devices
  2015-06-22 18:48                                 ` Jeff Moyer
@ 2015-06-23 10:10                                   ` Christoph Hellwig
  -1 siblings, 0 replies; 164+ messages in thread
From: Christoph Hellwig @ 2015-06-23 10:10 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Christoph Hellwig, Dan Williams, Jens Axboe,
	linux-nvdimm@lists.01.org, linux-kernel, Linux ACPI,
	linux-fsdevel, Ingo Molnar

On Mon, Jun 22, 2015 at 02:48:48PM -0400, Jeff Moyer wrote:
> OK, so you think applications using buffered I/O will Just Work(TM)?  My
> guess is that things will start to break that hadn't broken in the
> past.  Sure, the application isn't designed properly, and that should be
> fixed, but we shouldn't foist this on users as the default.

They will work or break the same way as before.  We have all kinds of
cases of non-sector updates for buffered I/O anyway: inline data, tail
merging, etc.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 02/15] libnvdimm: infrastructure for btt devices
  2015-06-22 19:02             ` Dan Williams
                               ` (2 preceding siblings ...)
  (?)
@ 2015-06-23 10:19             ` Christoph Hellwig
  2015-06-23 15:19               ` Dan Williams
  -1 siblings, 1 reply; 164+ messages in thread
From: Christoph Hellwig @ 2015-06-23 10:19 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jeff Moyer, Christoph Hellwig, Jens Axboe,
	linux-nvdimm@lists.01.org, Neil Brown, Greg KH, linux-kernel,
	Ingo Molnar, Linux ACPI, linux-fsdevel

On Mon, Jun 22, 2015 at 12:02:54PM -0700, Dan Williams wrote:
> I don't see the need to re-invent partitioning which is the path this
> requested rework is putting us on...
> 
> However, when the need arises for smaller granularity BTT we can have
> the partition fight then.  To be clear, I believe that need is already
> here today, but I'm not in a position to push that agenda at this late
> date.


Instead of all this complaining and moaning let's figure out what
architecture you'd actually want.  The one I had in mind is:

+------------------------------+
|  block layer (& partitions)  |
+---------------+--------------+--------------------+
|  pmem driver  |  btt driver  |  other consumers   |
+---------------+--------------+--------------------+
|        pmem API through libnvdimm                 |
+---------------------------------------------------+

If you really want btt to stack on top of pmem it really
needs to be moved entirely out of libnvdimm and be a
generic block driver just using ->rw_bytes, e.g.:


+------------------------------+
|  btt driver                  |
+------------------------------+
|  block layer (& partitions)  |
+------------------------------+--------------------+
|  pmem driver                 | other consumers    |
+------------------------------+--------------------+
|        pmem API through libnvdimm                 |
+---------------------------------------------------+

Not the current mess where btt pretends to be a stacking block
driver but still ties into libnvdimm.

Add blk mode access to all the schemes, but it's really just
another driver next to the pmem driver each time.  In fact while
looking over the code a bit more I start to wonder why
we need the blk driver at all - just hook into the nfit
do_io routines instead of the low-level API based on what
libnvdimm provides, and don't offer DAX for it.  It mostly
seems duplicate code.

^ permalink raw reply	[flat|nested] 164+ messages in thread
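
The second diagram above has btt as a generic driver whose only contact with
the device below it is a byte-granular read/write hook. A user-space sketch
of that layering follows. It assumes a hypothetical struct byte_dev whose
rw_bytes member merely stands in for the ->rw_bytes() operation proposed in
patch 01 (the signature is invented for this example), and the toy map is
kept in RAM, whereas a real BTT persists its map and log on the media:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTOR   512u
#define NSECT    4u            /* logical sectors exposed upward */
#define NBLOCKS  (NSECT + 1u)  /* one spare block for out-of-place writes */

enum rw_dir { RW_READ, RW_WRITE };

/* Hypothetical lower device: the only entry point is rw_bytes(). */
struct byte_dev {
    int (*rw_bytes)(struct byte_dev *dev, void *buf, size_t len,
                    size_t off, enum rw_dir dir);
    uint8_t *mem;
    size_t size;
};

/* Memory-backed implementation, playing the role of the pmem driver. */
static int mem_rw_bytes(struct byte_dev *dev, void *buf, size_t len,
                        size_t off, enum rw_dir dir)
{
    if (off > dev->size || len > dev->size - off)
        return -1;
    if (dir == RW_WRITE)
        memcpy(dev->mem + off, buf, len);
    else
        memcpy(buf, dev->mem + off, len);
    return 0;
}

/* Toy translation layer stacked on top of any byte_dev. */
struct toy_btt {
    struct byte_dev *lower;
    uint32_t map[NSECT];   /* logical sector -> physical block */
    uint32_t free_block;   /* the block map[] does not reference */
};

static void toy_btt_init(struct toy_btt *btt, struct byte_dev *lower)
{
    assert(lower->size >= (size_t)NBLOCKS * SECTOR);
    btt->lower = lower;
    for (uint32_t i = 0; i < NSECT; i++)
        btt->map[i] = i;
    btt->free_block = NSECT;
}

static int toy_btt_read(struct toy_btt *btt, uint32_t lba, void *buf)
{
    if (lba >= NSECT)
        return -1;
    return btt->lower->rw_bytes(btt->lower, buf, SECTOR,
                                (size_t)btt->map[lba] * SECTOR, RW_READ);
}

static int toy_btt_write(struct toy_btt *btt, uint32_t lba, const void *buf)
{
    uint32_t target = btt->free_block;

    if (lba >= NSECT)
        return -1;
    /* Step 1: write the new data out of place, into the spare block. */
    if (btt->lower->rw_bytes(btt->lower, (void *)buf, SECTOR,
                             (size_t)target * SECTOR, RW_WRITE))
        return -1;
    /* Step 2: one small store flips the mapping; the old block is freed. */
    btt->free_block = btt->map[lba];
    btt->map[lba] = target;
    return 0;
}
```

Because the old block stays untouched until the map entry changes, a reader
sees either the complete old sector or the complete new one. The upper layer
needs nothing from the lower device beyond the byte-granular hook, which is
the property the stacked variant of the diagram depends on.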

* Re: [PATCH 02/15] libnvdimm: infrastructure for btt devices
  2015-06-22 19:09               ` Jeff Moyer
  (?)
  (?)
@ 2015-06-23 10:20               ` Christoph Hellwig
  -1 siblings, 0 replies; 164+ messages in thread
From: Christoph Hellwig @ 2015-06-23 10:20 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Dan Williams, Christoph Hellwig, Jens Axboe,
	linux-nvdimm@lists.01.org, Neil Brown, Greg KH, linux-kernel,
	Ingo Molnar, Linux ACPI, linux-fsdevel

On Mon, Jun 22, 2015 at 03:09:02PM -0400, Jeff Moyer wrote:
> The xfs example is enough to convince me that we need to support btt on
> a partition right now.  Otherwise, for RHEL at least, dax on xfs simply
> won't be supported.

Seems like an odd stance to support some new code coming out of the
blue over something developed by RH for years and already shipping
as the default filesystem at the major competitor.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 02/15] libnvdimm: infrastructure for btt devices
  2015-06-23 10:19             ` Christoph Hellwig
@ 2015-06-23 15:19               ` Dan Williams
  2015-06-23 20:33                 ` Dan Williams
  2015-06-24 12:10                 ` Christoph Hellwig
  0 siblings, 2 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-23 15:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jeff Moyer, Jens Axboe, linux-nvdimm@lists.01.org, Neil Brown,
	Greg KH, linux-kernel, Ingo Molnar, Linux ACPI, linux-fsdevel

On Tue, Jun 23, 2015 at 3:19 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Mon, Jun 22, 2015 at 12:02:54PM -0700, Dan Williams wrote:
>> I don't see the need to re-invent partitioning which is the path this
>> requested rework is putting us on...
>>
>> However, when the need arises for smaller granularity BTT we can have
>> the partition fight then.  To be clear, I believe that need is already
>> here today, but I'm not in a position to push that agenda at this late
>> date.
>
>
> Instead of all this complaining and moaning let's figure out what
> architecture you'd actually want.  The one I had in mind is:
>
> +------------------------------+
> |  block layer (& partitions)  |
> +---------------+--------------+--------------------+
> |  pmem driver  |  btt driver  |  other consumers   |
> +---------------+--------------+--------------------+
> |        pmem API through libnvdimm                 |
> +---------------------------------------------------+
>

I've got this mostly coded up.  The nice property is that BTTs now
become another flavor of the same namespace.

> If you really want btt to stack on top of pmem it really
> needs to be moved out entirely of libnvdimm and be a
> generic block driver just using ->rw_bytes, e.g.:
>
>
> +------------------------------+
> |  btt driver                  |
> +------------------------------+
> |  block layer (& partitions)  |
> +------------------------------+--------------------+
> |  pmem driver                 | other consumers    |
> +------------------------------+--------------------+
> |        pmem API through libnvdimm                 |
> +---------------------------------------------------+
>
> Not the current mess where btt pretends to be a stacking block
> driver but still ties into libnvdimm.

That tie was only to enable autodetect so that we don't need to run a
BTT assembly step from an initramfs just to get an NVDIMM up and
running.  It was a convenience, not a requirement.

> Add blk mode access to all the schemes, but it's really just
> another next to the pmem driver each time.  In fact while
> looking over the code a bit more I start to wonder why
> we need the blk driver at all - just hook into the nfit
> do_io routines instead of the low-level API based on what
> libnvdimm provides, and don't offer DAX for it.  It mostly
> seems duplicate code.

Mostly.  It does handle discontiguous dimm-physical-address ranges,
but you're right, we might be able to unify it in the coming cycle.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 14/15] libnvdimm: support read-only btt backing devices
  2015-06-22  6:30             ` Christoph Hellwig
@ 2015-06-23 19:30               ` Matthew Wilcox
  -1 siblings, 0 replies; 164+ messages in thread
From: Matthew Wilcox @ 2015-06-23 19:30 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dan Williams, Jens Axboe, linux-nvdimm, linux-kernel, Linux ACPI,
	linux-fsdevel, Ingo Molnar

On Mon, Jun 22, 2015 at 08:30:28AM +0200, Christoph Hellwig wrote:
> > Good to hear that we don't need BTT for XFS v5, can we make the
> > guarantee for all filesystems that may want to support DAX?  I still
> > think stacking is a natural fit for this problem.
> 
> > I can't make any guarantees, especially not without verification.  But
> > if correctly implemented, any filesystem that does out-of-place metadata
> > writes (and that includes a traditional log) and uses checksums to ensure
> > the integrity of these updates should be fine.  You'd still have
> > the issue of sector atomicity of file I/O though.

Is ext4 one of the filesystems that copes with torn updates to the log?
I see there's a checksum in the tail of at least some blocks, but I'd
like someone who understands ext4 to reassure me that it also doesn't
need the ability to put its log on a BTT.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 02/15] libnvdimm: infrastructure for btt devices
  2015-06-23 15:19               ` Dan Williams
@ 2015-06-23 20:33                 ` Dan Williams
  2015-06-24 12:10                 ` Christoph Hellwig
  1 sibling, 0 replies; 164+ messages in thread
From: Dan Williams @ 2015-06-23 20:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jeff Moyer, Jens Axboe, linux-nvdimm@lists.01.org, Neil Brown,
	Greg KH, linux-kernel, Ingo Molnar, Linux ACPI, linux-fsdevel

On Tue, Jun 23, 2015 at 8:19 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Tue, Jun 23, 2015 at 3:19 AM, Christoph Hellwig <hch@lst.de> wrote:
>> On Mon, Jun 22, 2015 at 12:02:54PM -0700, Dan Williams wrote:
>>> I don't see the need to re-invent partitioning which is the path this
>>> requested rework is putting us on...
>>>
>>> However, when the need arises for smaller granularity BTT we can have
>>> the partition fight then.  To be clear, I believe that need is already
>>> here today, but I'm not in a position to push that agenda at this late
>>> date.
>>
>>
>> Instead of all this complaining and moaning let's figure out what
>> architecture you'd actually want.  The one I had in mind is:
>>
>> +------------------------------+
>> |  block layer (& partitions)  |
>> +---------------+--------------+--------------------+
>> |  pmem driver  |  btt driver  |  other consumers   |
>> +---------------+--------------+--------------------+
>> |        pmem API through libnvdimm                 |
>> +---------------------------------------------------+
>>
>
> I've got this mostly coded up.  The nice property is that BTTs now
> become another flavor of the same namespace.

This approach has grown on me since yesterday.  I neglected to realize
that we can carve out a BLK-mode namespace to be a BTT enabled log
device if the need arises to satisfy what BTT on a partition was doing
previously.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 02/15] libnvdimm: infrastructure for btt devices
  2015-06-23 15:19               ` Dan Williams
  2015-06-23 20:33                 ` Dan Williams
@ 2015-06-24 12:10                 ` Christoph Hellwig
  1 sibling, 0 replies; 164+ messages in thread
From: Christoph Hellwig @ 2015-06-24 12:10 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Jeff Moyer, Jens Axboe,
	linux-nvdimm@lists.01.org, Neil Brown, Greg KH, linux-kernel,
	Ingo Molnar, Linux ACPI, linux-fsdevel

On Tue, Jun 23, 2015 at 08:19:57AM -0700, Dan Williams wrote:
> That tie was only to enable autodetect so that we don't need to run a
> BTT assembly step from an initramfs just to get an NVDIMM up and
> running.  It was a convenience, not a requirement.

It's a very deep tying for that feature.  Look at what the raid code
does just for scanning.  Although even that is officially deprecated
because people don't like in-kernel scanning (which I tend to disagree with).

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 14/15] libnvdimm: support read-only btt backing devices
  2015-06-23 19:30               ` Matthew Wilcox
@ 2015-06-24 12:11                 ` Christoph Hellwig
  -1 siblings, 0 replies; 164+ messages in thread
From: Christoph Hellwig @ 2015-06-24 12:11 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Dan Williams, Jens Axboe, linux-nvdimm,
	linux-kernel, Linux ACPI, linux-fsdevel, Ingo Molnar

On Tue, Jun 23, 2015 at 03:30:43PM -0400, Matthew Wilcox wrote:
> > I can't make any guarantees, especially not without verification.  But
> > if correctly implemented, any filesystem that does out-of-place metadata
> > writes (and that includes a traditional log) and uses checksums to ensure
> > the integrity of these updates should be fine.  You'd still have
> > the issue of sector atomicity of file I/O though.
> 
> Is ext4 one of the filesystems that copes with torn updates to the log?
> I see there's a checksum in the tail of at least some blocks, but I'd
> like someone who understands ext4 to reassure me that it also doesn't
> need the ability to put its log on a BTT.

In theory it should if the log checksums are enabled, but I wouldn't rely
on it without confirmation and validation from the ext4 folks.

^ permalink raw reply	[flat|nested] 164+ messages in thread
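
The argument in this sub-thread is that a log whose records carry a checksum
can cope with torn updates, because recovery discards any record whose
checksum does not match. A minimal sketch of that check, assuming a
hypothetical 64-byte record layout (this is not the XFS or ext4/jbd2
on-disk format) and a plain bitwise CRC-32:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LOG_PAYLOAD 56u

/* Hypothetical on-media log record, 64 bytes in total. */
struct log_record {
    uint32_t seq;                  /* transaction sequence number */
    uint32_t crc;                  /* CRC-32 over seq and payload */
    uint8_t payload[LOG_PAYLOAD];
};

/* Bitwise CRC-32 (reflected form, polynomial 0xEDB88320). */
static uint32_t crc32_update(uint32_t crc, const void *buf, size_t len)
{
    const uint8_t *p = buf;

    crc = ~crc;
    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Checksum everything in the record except the crc field itself. */
static uint32_t record_crc(const struct log_record *rec)
{
    uint32_t crc = crc32_update(0, &rec->seq, sizeof(rec->seq));

    return crc32_update(crc, rec->payload, sizeof(rec->payload));
}

/* Fill in a record and stamp it with its checksum before it is written. */
static void record_seal(struct log_record *rec, uint32_t seq,
                        const void *data, size_t len)
{
    assert(len <= LOG_PAYLOAD);
    memset(rec, 0, sizeof(*rec));
    rec->seq = seq;
    memcpy(rec->payload, data, len);
    rec->crc = record_crc(rec);
}

/* Recovery-time check: a torn record fails this and is discarded. */
static int record_is_valid(const struct log_record *rec)
{
    return rec->crc == record_crc(rec);
}
```

This only covers metadata that goes through such a log. As noted in the
thread, it says nothing about sector atomicity of file data I/O, and whether
a given filesystem actually behaves this way needs confirmation from its
maintainers.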
