* [PATCH v3 00/21] libnd: non-volatile memory device support
From: Dan Williams @ 2015-05-20 20:56 UTC
  To: axboe
  Cc: Boaz Harrosh, neilb, Dave Chinner, Lv Zheng, H. Peter Anvin, hch,
	linux-nvdimm, Rafael J. Wysocki, Robert Moore, mingo, linux-acpi,
	jmoyer, Nicholas Moulin, Matthew Wilcox, Ross Zwisler,
	Vishal Verma, Jens Axboe, Borislav Petkov, Thomas Gleixner,
	gregkh, linux-kernel, Andy Lutomirski, Andrew Morton,
	Linus Torvalds

Changes since v2 [1]:

1/ Rebase on the ACPICA enabling for the NFIT data structures.  The
   ACPICA project owns the definition of ACPI data structures in
   include/acpi/.  This release incorporates the NFIT and UUID definitions
   from ACPICA release R05_15_15 [2]. (Rafael, Bob)

2/ Move the ACPI NFIT driver to drivers/acpi/ (Rafael)

3/ Include documentation of the overall subsystem (Rafael)

4/ Arrange for stable block device names in the case where the platform
   configuration has not changed (Toshi and Robert)

5/ Move test infrastructure to the end of the series (Jeff)

6/ Fix up the Kconfig text for CONFIG_ND_BLK to be more descriptive
   (Andy)

7/ Report and continue upon detecting unknown NFIT tables rather than
   failing (Jeff)

8/ Rename the namespace 'type' attribute to 'nstype' so that lsblk does
   not mistake libnd block devices for scsi disks. (Robert and Christoph)

9/ Convert nd_region_{acquire|release}_lane() to use percpu variable
   infrastructure (Ross)

Thanks for all of the review!

Note, there are incremental changes to address caching, persistent
flushing, queue flags, and expanded sector size support that are
deferred until this base support is cleared to merge.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-April/000574.html
[2]: https://github.com/acpica/acpica/tree/R05_15_15

Here is the diffstat relative to v2:

 Documentation/blockdev/libnd.txt            | 804 ++++++++++++++++++++++++++++
 MAINTAINERS                                 |  39 +-
 arch/ia64/kernel/efi.c                      |   2 +-
 arch/x86/kernel/e820.c                      |   4 +-
 drivers/acpi/Kconfig                        |  27 +
 drivers/acpi/Makefile                       |   1 +
 drivers/{block/nd/acpi.c => acpi/nfit.c}    | 581 ++++++++++----------
 drivers/acpi/nfit.h                         | 160 ++++++
 drivers/block/Kconfig                       |   8 +
 drivers/block/Makefile                      |   1 +
 drivers/block/{nd/e820.c => e820_pmem.c}    |  32 +-
 drivers/block/nd/Kconfig                    |  72 +--
 drivers/block/nd/Makefile                   |  12 -
 drivers/block/nd/acpi_nfit.h                | 321 -----------
 drivers/block/nd/blk.c                      |  20 +-
 drivers/block/nd/btt.c                      |  59 +-
 drivers/block/nd/btt.h                      |   7 +-
 drivers/block/nd/btt_devs.c                 |   2 +-
 drivers/block/nd/bus.c                      |   2 +-
 drivers/block/nd/core.c                     |   9 +-
 drivers/block/nd/dimm_devs.c                |   9 +
 drivers/block/nd/label.c                    |  23 +-
 drivers/block/nd/namespace_devs.c           |  14 +-
 drivers/block/nd/nd-private.h               |   5 +-
 drivers/block/nd/nd.h                       |  18 +-
 drivers/block/nd/pmem.c                     |  25 +-
 drivers/block/nd/region.c                   |  64 ++-
 drivers/block/nd/region_devs.c              |  54 +-
 drivers/block/nd/test/nfit.c                | 794 ++++++++++++++-------------
 drivers/block/nd/test/nfit_test.h           |   2 +
 include/acpi/actbl1.h                       | 154 ++++++
 include/acpi/acuuid.h                       |  89 +++
 {drivers/block/nd => include/linux}/libnd.h |  21 +-
 33 files changed, 2202 insertions(+), 1233 deletions(-)
 create mode 100644 Documentation/blockdev/libnd.txt
 rename drivers/{block/nd/acpi.c => acpi/nfit.c} (69%)
 create mode 100644 drivers/acpi/nfit.h
 rename drivers/block/{nd/e820.c => e820_pmem.c} (69%)
 delete mode 100644 drivers/block/nd/acpi_nfit.h
 create mode 100644 include/acpi/acuuid.h
 rename {drivers/block/nd => include/linux}/libnd.h (81%)

The libndctl changes for these updates are available in ndctl.git:
https://github.com/pmem/ndctl

For this set to move forward it needs acks from ACPI and BLOCK layer
developers.  I am assuming this will ultimately go upstream via the
block tree.  A branch in nvdimm.git will be prepared at the end of the
week to give the pending acks some time to land.  Additional feedback
welcome, and hopefully it can be addressed incrementally from this
baseline going forward, i.e. aiming for inclusion in -next and no more
rebases before the 4.2 merge window opens.

---

Dan Williams (18):
      e820, efi: add ACPI 6.0 persistent memory types
      libnd, nfit: initial libnd infrastructure and NFIT support
      libnd: control character device and libnd bus sysfs attributes
      libnd, nfit: dimm/memory-devices
      libnd: control (ioctl) messages for libnd bus and dimm devices
      libnd, nd_dimm: dimm driver and base libnd device-driver infrastructure
      libnd, nfit: regions (block-data-window, persistent memory, volatile memory)
      libnd: support for legacy (non-aliasing) nvdimms
      libnd, nd_pmem: add libnd support to the pmem driver
      libnd, nfit: add interleave-set state-tracking infrastructure
      libnd: namespace indices: read and validate
      libnd: pmem label sets and namespace instantiation.
      libnd: blk labels and namespace instantiation
      libnd: write pmem label set
      libnd: write blk label set
      libnd: infrastructure for btt devices
      nfit-test: manufactured NFITs for interface development
      libnd: Non-Volatile Devices

Ross Zwisler (2):
      pmem: Dynamically allocate partition numbers
      libnd, nfit, nd_blk: driver for BLK-mode access persistent memory

Vishal Verma (1):
      nd_btt: atomic sector updates


 Documentation/blockdev/btt.txt    |  273 ++++++
 Documentation/blockdev/libnd.txt  |  804 +++++++++++++++++
 MAINTAINERS                       |   39 +
 arch/arm64/kernel/efi.c           |    1 
 arch/ia64/kernel/efi.c            |    4 
 arch/x86/boot/compressed/eboot.c  |    4 
 arch/x86/include/uapi/asm/e820.h  |    1 
 arch/x86/kernel/e820.c            |   28 +
 arch/x86/kernel/pmem.c            |    2 
 arch/x86/platform/efi/efi.c       |    3 
 drivers/acpi/Kconfig              |   27 +
 drivers/acpi/Makefile             |    1 
 drivers/acpi/nfit.c               | 1474 ++++++++++++++++++++++++++++++++
 drivers/acpi/nfit.h               |  160 +++
 drivers/block/Kconfig             |   21 
 drivers/block/Makefile            |    3 
 drivers/block/e820_pmem.c         |  100 ++
 drivers/block/nd/Kconfig          |   91 ++
 drivers/block/nd/Makefile         |   29 +
 drivers/block/nd/blk.c            |  252 +++++
 drivers/block/nd/btt.c            | 1438 +++++++++++++++++++++++++++++++
 drivers/block/nd/btt.h            |  186 ++++
 drivers/block/nd/btt_devs.c       |  443 ++++++++++
 drivers/block/nd/bus.c            |  770 +++++++++++++++++
 drivers/block/nd/core.c           |  472 ++++++++++
 drivers/block/nd/dimm.c           |  115 +++
 drivers/block/nd/dimm_devs.c      |  516 +++++++++++
 drivers/block/nd/label.c          |  922 ++++++++++++++++++++
 drivers/block/nd/label.h          |  143 +++
 drivers/block/nd/namespace_devs.c | 1701 +++++++++++++++++++++++++++++++++++++
 drivers/block/nd/nd-private.h     |  111 ++
 drivers/block/nd/nd.h             |  257 ++++++
 drivers/block/nd/pmem.c           |  107 ++
 drivers/block/nd/region.c         |  189 ++++
 drivers/block/nd/region_devs.c    |  667 +++++++++++++++
 drivers/block/nd/test/Makefile    |    5 
 drivers/block/nd/test/iomap.c     |  151 +++
 drivers/block/nd/test/nfit.c      | 1171 +++++++++++++++++++++++++
 drivers/block/nd/test/nfit_test.h |   28 +
 include/acpi/actbl1.h             |  154 +++
 include/acpi/acuuid.h             |   89 ++
 include/linux/efi.h               |    3 
 include/linux/libnd.h             |  129 +++
 include/linux/nd.h                |   98 ++
 include/uapi/linux/Kbuild         |    1 
 include/uapi/linux/ndctl.h        |  199 ++++
 46 files changed, 13324 insertions(+), 58 deletions(-)
 create mode 100644 Documentation/blockdev/btt.txt
 create mode 100644 Documentation/blockdev/libnd.txt
 create mode 100644 drivers/acpi/nfit.c
 create mode 100644 drivers/acpi/nfit.h
 create mode 100644 drivers/block/e820_pmem.c
 create mode 100644 drivers/block/nd/Kconfig
 create mode 100644 drivers/block/nd/Makefile
 create mode 100644 drivers/block/nd/blk.c
 create mode 100644 drivers/block/nd/btt.c
 create mode 100644 drivers/block/nd/btt.h
 create mode 100644 drivers/block/nd/btt_devs.c
 create mode 100644 drivers/block/nd/bus.c
 create mode 100644 drivers/block/nd/core.c
 create mode 100644 drivers/block/nd/dimm.c
 create mode 100644 drivers/block/nd/dimm_devs.c
 create mode 100644 drivers/block/nd/label.c
 create mode 100644 drivers/block/nd/label.h
 create mode 100644 drivers/block/nd/namespace_devs.c
 create mode 100644 drivers/block/nd/nd-private.h
 create mode 100644 drivers/block/nd/nd.h
 rename drivers/block/{pmem.c => nd/pmem.c} (70%)
 create mode 100644 drivers/block/nd/region.c
 create mode 100644 drivers/block/nd/region_devs.c
 create mode 100644 drivers/block/nd/test/Makefile
 create mode 100644 drivers/block/nd/test/iomap.c
 create mode 100644 drivers/block/nd/test/nfit.c
 create mode 100644 drivers/block/nd/test/nfit_test.h
 create mode 100644 include/acpi/acuuid.h
 create mode 100644 include/linux/libnd.h
 create mode 100644 include/linux/nd.h
 create mode 100644 include/uapi/linux/ndctl.h

* [PATCH v3 01/21] e820, efi: add ACPI 6.0 persistent memory types
From: Dan Williams @ 2015-05-20 20:56 UTC
  To: axboe
  Cc: mingo, Boaz Harrosh, Andrew Morton, linux-nvdimm, neilb, gregkh,
	linux-kernel, hch, Jens Axboe, linux-acpi, jmoyer,
	Borislav Petkov, H. Peter Anvin, Matthew Wilcox, Thomas Gleixner,
	Andy Lutomirski, Linus Torvalds, Ross Zwisler

ACPI 6.0 formalizes e820-type-7 and efi-type-14 as persistent memory.
Mark it "reserved" and allow it to be claimed by a persistent memory
device driver.

This definition is in addition to the Linux kernel's existing type-12
definition that was recently added in support of shipping platforms with
NVDIMM support that predate ACPI 6.0 (which now classifies type-12 as
OEM reserved).
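
As a rough illustration of the policy described above, here is a
standalone userspace sketch in C.  It mirrors the do_mark_busy() helper
that this patch adds to arch/x86/kernel/e820.c, but the sketch_ names,
the plain start-address parameter, and the self-check function are
assumptions made for illustration only.  The type values 7 and 12 come
from the description above; 2 is the usual E820_RESERVED value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* e820 type values used by the sketch (names are hypothetical) */
enum {
	SKETCH_E820_RAM      = 1,
	SKETCH_E820_RESERVED = 2,
	SKETCH_E820_PMEM     = 7,	/* ACPI 6.0 persistent memory */
	SKETCH_E820_PRAM     = 12,	/* pre-ACPI-6.0 Linux convention */
};

/*
 * Return true when a range should be flagged busy up front, false when
 * it should stay claimable by a driver.  Takes only the start address
 * rather than a struct resource to keep the sketch self-contained.
 */
bool sketch_mark_busy(uint32_t type, uint64_t start)
{
	/* everything below 1MB is the legacy rom-shadow + mmio region */
	if (start < (1ULL << 20))
		return true;

	switch (type) {
	case SKETCH_E820_RESERVED:
	case SKETCH_E820_PRAM:
	case SKETCH_E820_PMEM:
		return false;	/* left for a driver to claim */
	default:
		return true;
	}
}

/* Example calls; a caller could run this once at startup. */
void sketch_mark_busy_self_check(void)
{
	assert(sketch_mark_busy(SKETCH_E820_PMEM, 0xa0000));
	assert(!sketch_mark_busy(SKETCH_E820_PMEM, 0x100000000ULL));
	assert(!sketch_mark_busy(SKETCH_E820_PRAM, 0x100000000ULL));
	assert(sketch_mark_busy(SKETCH_E820_RAM, 0x100000000ULL));
}
```

Both persistent memory types take the same path as E820_RESERVED, so
the only remaining difference between them is the name reported to
userspace.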

Note, /proc/iomem can be consulted for differentiating legacy
"Persistent Memory (legacy)" E820_PRAM vs standard "Persistent Memory"
E820_PMEM.
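
A userspace tool could tell the two apart along the following lines.
This is only a sketch: the sketch_ names are hypothetical, and it
assumes the usual "start-end : name" layout of /proc/iomem lines and
the two resource names quoted above.

```c
#include <stdio.h>
#include <string.h>

enum sketch_pmem_kind {
	SKETCH_NOT_PMEM,
	SKETCH_PMEM_LEGACY,	/* "Persistent Memory (legacy)", E820_PRAM */
	SKETCH_PMEM_STANDARD,	/* "Persistent Memory", E820_PMEM */
};

/* Exact comparison of a length-delimited name against a label */
static int sketch_name_is(const char *name, size_t len, const char *label)
{
	return len == strlen(label) && memcmp(name, label, len) == 0;
}

/* Classify one /proc/iomem line by its resource name */
enum sketch_pmem_kind sketch_classify_iomem_line(const char *line)
{
	const char *sep = strstr(line, " : ");
	const char *name;
	size_t len;

	if (!sep)
		return SKETCH_NOT_PMEM;
	name = sep + 3;
	len = strcspn(name, "\n");	/* ignore the trailing newline */

	if (sketch_name_is(name, len, "Persistent Memory (legacy)"))
		return SKETCH_PMEM_LEGACY;
	if (sketch_name_is(name, len, "Persistent Memory"))
		return SKETCH_PMEM_STANDARD;
	return SKETCH_NOT_PMEM;
}

/* Count legacy and standard ranges in an already-opened stream */
void sketch_count_pmem(FILE *f, unsigned int *legacy, unsigned int *standard)
{
	char line[256];

	*legacy = 0;
	*standard = 0;
	while (fgets(line, sizeof(line), f)) {
		switch (sketch_classify_iomem_line(line)) {
		case SKETCH_PMEM_LEGACY:
			(*legacy)++;
			break;
		case SKETCH_PMEM_STANDARD:
			(*standard)++;
			break;
		default:
			break;
		}
	}
}
```

A caller would pass the result of fopen("/proc/iomem", "r") to
sketch_count_pmem() and check for a NULL stream first.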

Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/arm64/kernel/efi.c          |    1 +
 arch/ia64/kernel/efi.c           |    4 ++++
 arch/x86/boot/compressed/eboot.c |    4 ++++
 arch/x86/include/uapi/asm/e820.h |    1 +
 arch/x86/kernel/e820.c           |   28 ++++++++++++++++++++++++----
 arch/x86/platform/efi/efi.c      |    3 +++
 include/linux/efi.h              |    3 ++-
 7 files changed, 39 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c
index ab21e0d58278..9d4aa18f2a82 100644
--- a/arch/arm64/kernel/efi.c
+++ b/arch/arm64/kernel/efi.c
@@ -158,6 +158,7 @@ static __init int is_reserve_region(efi_memory_desc_t *md)
 	case EFI_BOOT_SERVICES_CODE:
 	case EFI_BOOT_SERVICES_DATA:
 	case EFI_CONVENTIONAL_MEMORY:
+	case EFI_PERSISTENT_MEMORY:
 		return 0;
 	default:
 		break;
diff --git a/arch/ia64/kernel/efi.c b/arch/ia64/kernel/efi.c
index c52d7540dc05..5f6be9dd6968 100644
--- a/arch/ia64/kernel/efi.c
+++ b/arch/ia64/kernel/efi.c
@@ -1223,6 +1223,10 @@ efi_initialize_iomem_resources(struct resource *code_resource,
 				flags |= IORESOURCE_DISABLED;
 				break;
 
+			case EFI_PERSISTENT_MEMORY:
+				name = "Persistent Memory";
+				break;
+
 			case EFI_RESERVED_TYPE:
 			case EFI_RUNTIME_SERVICES_CODE:
 			case EFI_RUNTIME_SERVICES_DATA:
diff --git a/arch/x86/boot/compressed/eboot.c b/arch/x86/boot/compressed/eboot.c
index ef17683484e9..dde5bf7726f4 100644
--- a/arch/x86/boot/compressed/eboot.c
+++ b/arch/x86/boot/compressed/eboot.c
@@ -1222,6 +1222,10 @@ static efi_status_t setup_e820(struct boot_params *params,
 			e820_type = E820_NVS;
 			break;
 
+		case EFI_PERSISTENT_MEMORY:
+			e820_type = E820_PMEM;
+			break;
+
 		default:
 			continue;
 		}
diff --git a/arch/x86/include/uapi/asm/e820.h b/arch/x86/include/uapi/asm/e820.h
index 960a8a9dc4ab..0f457e6eab18 100644
--- a/arch/x86/include/uapi/asm/e820.h
+++ b/arch/x86/include/uapi/asm/e820.h
@@ -32,6 +32,7 @@
 #define E820_ACPI	3
 #define E820_NVS	4
 #define E820_UNUSABLE	5
+#define E820_PMEM	7
 
 /*
  * This is a non-standardized way to represent ADR or NVDIMM regions that
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 11cc7d54ec3f..0abe20da743a 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -149,6 +149,7 @@ static void __init e820_print_type(u32 type)
 	case E820_UNUSABLE:
 		printk(KERN_CONT "unusable");
 		break;
+	case E820_PMEM:
 	case E820_PRAM:
 		printk(KERN_CONT "persistent (type %u)", type);
 		break;
@@ -918,11 +919,32 @@ static inline const char *e820_type_to_string(int e820_type)
 	case E820_ACPI:	return "ACPI Tables";
 	case E820_NVS:	return "ACPI Non-volatile Storage";
 	case E820_UNUSABLE:	return "Unusable memory";
-	case E820_PRAM: return "Persistent RAM";
+	case E820_PRAM: return "Persistent Memory (legacy)";
+	case E820_PMEM: return "Persistent Memory";
 	default:	return "reserved";
 	}
 }
 
+static bool do_mark_busy(u32 type, struct resource *res)
+{
+	/* this is the legacy bios/dos rom-shadow + mmio region */
+	if (res->start < (1ULL<<20))
+		return true;
+
+	/*
+	 * Treat persistent memory like device memory, i.e. reserve it
+	 * for exclusive use of a driver
+	 */
+	switch (type) {
+	case E820_RESERVED:
+	case E820_PRAM:
+	case E820_PMEM:
+		return false;
+	default:
+		return true;
+	}
+}
+
 /*
  * Mark e820 reserved areas as busy for the resource manager.
  */
@@ -952,9 +974,7 @@ void __init e820_reserve_resources(void)
 		 * pci device BAR resource and insert them later in
 		 * pcibios_resource_survey()
 		 */
-		if (((e820.map[i].type != E820_RESERVED) &&
-		     (e820.map[i].type != E820_PRAM)) ||
-		     res->start < (1ULL<<20)) {
+		if (do_mark_busy(e820.map[i].type, res)) {
 			res->flags |= IORESOURCE_BUSY;
 			insert_resource(&iomem_resource, res);
 		}
diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index dbc8627a5cdf..a116e236ac3f 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -145,6 +145,9 @@ static void __init do_add_efi_memmap(void)
 		case EFI_UNUSABLE_MEMORY:
 			e820_type = E820_UNUSABLE;
 			break;
+		case EFI_PERSISTENT_MEMORY:
+			e820_type = E820_PMEM;
+			break;
 		default:
 			/*
 			 * EFI_RESERVED_TYPE EFI_RUNTIME_SERVICES_CODE
diff --git a/include/linux/efi.h b/include/linux/efi.h
index cf7e431cbc73..28868504aa17 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -85,7 +85,8 @@ typedef	struct {
 #define EFI_MEMORY_MAPPED_IO		11
 #define EFI_MEMORY_MAPPED_IO_PORT_SPACE	12
 #define EFI_PAL_CODE			13
-#define EFI_MAX_MEMORY_TYPE		14
+#define EFI_PERSISTENT_MEMORY		14
+#define EFI_MAX_MEMORY_TYPE		15
 
 /* Attribute values: */
 #define EFI_MEMORY_UC		((u64)0x0000000000000001ULL)	/* uncached */


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [PATCH v3 01/21] e820, efi: add ACPI 6.0 persistent memory types
@ 2015-05-20 20:56   ` Dan Williams
  0 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2015-05-20 20:56 UTC (permalink / raw)
  To: axboe
  Cc: mingo, Boaz Harrosh, Andrew Morton, linux-nvdimm, neilb, gregkh,
	linux-kernel, hch, Jens Axboe, linux-acpi, jmoyer,
	Borislav Petkov, H. Peter Anvin, Matthew Wilcox, Thomas Gleixner,
	Andy Lutomirski, Linus Torvalds, Ross Zwisler

ACPI 6.0 formalizes e820-type-7 and efi-type-14 as persistent memory.
Mark it "reserved" and allow it to be claimed by a persistent memory
device driver.

This definition is in addition to the Linux kernel's existing type-12
definition that was recently added in support of shipping platforms with
NVDIMM support that predate ACPI 6.0 (which now classifies type-12 as
OEM reserved).

Note, /proc/iomem can be consulted for differentiating legacy
"Persistent Memory (legacy)" E820_PRAM vs standard "Persistent Memory"
E820_PMEM.

Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/arm64/kernel/efi.c          |    1 +
 arch/ia64/kernel/efi.c           |    4 ++++
 arch/x86/boot/compressed/eboot.c |    4 ++++
 arch/x86/include/uapi/asm/e820.h |    1 +
 arch/x86/kernel/e820.c           |   28 ++++++++++++++++++++++++----
 arch/x86/platform/efi/efi.c      |    3 +++
 include/linux/efi.h              |    3 ++-
 7 files changed, 39 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c
index ab21e0d58278..9d4aa18f2a82 100644
--- a/arch/arm64/kernel/efi.c
+++ b/arch/arm64/kernel/efi.c
@@ -158,6 +158,7 @@ static __init int is_reserve_region(efi_memory_desc_t *md)
 	case EFI_BOOT_SERVICES_CODE:
 	case EFI_BOOT_SERVICES_DATA:
 	case EFI_CONVENTIONAL_MEMORY:
+	case EFI_PERSISTENT_MEMORY:
 		return 0;
 	default:
 		break;
diff --git a/arch/ia64/kernel/efi.c b/arch/ia64/kernel/efi.c
index c52d7540dc05..5f6be9dd6968 100644
--- a/arch/ia64/kernel/efi.c
+++ b/arch/ia64/kernel/efi.c
@@ -1223,6 +1223,10 @@ efi_initialize_iomem_resources(struct resource *code_resource,
 				flags |= IORESOURCE_DISABLED;
 				break;
 
+			case EFI_PERSISTENT_MEMORY:
+				name = "Persistent Memory";
+				break;
+
 			case EFI_RESERVED_TYPE:
 			case EFI_RUNTIME_SERVICES_CODE:
 			case EFI_RUNTIME_SERVICES_DATA:
diff --git a/arch/x86/boot/compressed/eboot.c b/arch/x86/boot/compressed/eboot.c
index ef17683484e9..dde5bf7726f4 100644
--- a/arch/x86/boot/compressed/eboot.c
+++ b/arch/x86/boot/compressed/eboot.c
@@ -1222,6 +1222,10 @@ static efi_status_t setup_e820(struct boot_params *params,
 			e820_type = E820_NVS;
 			break;
 
+		case EFI_PERSISTENT_MEMORY:
+			e820_type = E820_PMEM;
+			break;
+
 		default:
 			continue;
 		}
diff --git a/arch/x86/include/uapi/asm/e820.h b/arch/x86/include/uapi/asm/e820.h
index 960a8a9dc4ab..0f457e6eab18 100644
--- a/arch/x86/include/uapi/asm/e820.h
+++ b/arch/x86/include/uapi/asm/e820.h
@@ -32,6 +32,7 @@
 #define E820_ACPI	3
 #define E820_NVS	4
 #define E820_UNUSABLE	5
+#define E820_PMEM	7
 
 /*
  * This is a non-standardized way to represent ADR or NVDIMM regions that
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 11cc7d54ec3f..0abe20da743a 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -149,6 +149,7 @@ static void __init e820_print_type(u32 type)
 	case E820_UNUSABLE:
 		printk(KERN_CONT "unusable");
 		break;
+	case E820_PMEM:
 	case E820_PRAM:
 		printk(KERN_CONT "persistent (type %u)", type);
 		break;
@@ -918,11 +919,32 @@ static inline const char *e820_type_to_string(int e820_type)
 	case E820_ACPI:	return "ACPI Tables";
 	case E820_NVS:	return "ACPI Non-volatile Storage";
 	case E820_UNUSABLE:	return "Unusable memory";
-	case E820_PRAM: return "Persistent RAM";
+	case E820_PRAM: return "Persistent Memory (legacy)";
+	case E820_PMEM: return "Persistent Memory";
 	default:	return "reserved";
 	}
 }
 
+static bool do_mark_busy(u32 type, struct resource *res)
+{
+	/* this is the legacy bios/dos rom-shadow + mmio region */
+	if (res->start < (1ULL<<20))
+		return true;
+
+	/*
+	 * Treat persistent memory like device memory, i.e. reserve it
+	 * for exclusive use of a driver
+	 */
+	switch (type) {
+	case E820_RESERVED:
+	case E820_PRAM:
+	case E820_PMEM:
+		return false;
+	default:
+		return true;
+	}
+}
+
 /*
  * Mark e820 reserved areas as busy for the resource manager.
  */
@@ -952,9 +974,7 @@ void __init e820_reserve_resources(void)
 		 * pci device BAR resource and insert them later in
 		 * pcibios_resource_survey()
 		 */
-		if (((e820.map[i].type != E820_RESERVED) &&
-		     (e820.map[i].type != E820_PRAM)) ||
-		     res->start < (1ULL<<20)) {
+		if (do_mark_busy(e820.map[i].type, res)) {
 			res->flags |= IORESOURCE_BUSY;
 			insert_resource(&iomem_resource, res);
 		}
diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index dbc8627a5cdf..a116e236ac3f 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -145,6 +145,9 @@ static void __init do_add_efi_memmap(void)
 		case EFI_UNUSABLE_MEMORY:
 			e820_type = E820_UNUSABLE;
 			break;
+		case EFI_PERSISTENT_MEMORY:
+			e820_type = E820_PMEM;
+			break;
 		default:
 			/*
 			 * EFI_RESERVED_TYPE EFI_RUNTIME_SERVICES_CODE
diff --git a/include/linux/efi.h b/include/linux/efi.h
index cf7e431cbc73..28868504aa17 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -85,7 +85,8 @@ typedef	struct {
 #define EFI_MEMORY_MAPPED_IO		11
 #define EFI_MEMORY_MAPPED_IO_PORT_SPACE	12
 #define EFI_PAL_CODE			13
-#define EFI_MAX_MEMORY_TYPE		14
+#define EFI_PERSISTENT_MEMORY		14
+#define EFI_MAX_MEMORY_TYPE		15
 
 /* Attribute values: */
 #define EFI_MEMORY_UC		((u64)0x0000000000000001ULL)	/* uncached */



* [PATCH v3 02/21] libnd, nfit: initial libnd infrastructure and NFIT support
  2015-05-20 20:56 ` Dan Williams
@ 2015-05-20 20:56   ` Dan Williams
  -1 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2015-05-20 20:56 UTC (permalink / raw)
  To: axboe
  Cc: mingo, linux-nvdimm, neilb, gregkh, Rafael J. Wysocki,
	linux-kernel, Robert Moore, linux-acpi, jmoyer, Lv Zheng, hch

A libnd bus is the anchor device for registering nvdimm resources and
interfaces, for example, a character control device, nvdimm devices,
and I/O region devices.  The ACPI NFIT (NVDIMM Firmware Interface Table)
is one possible platform description for such non-volatile memory
resources in a system.  The nfit.ko driver attaches to the "ACPI0012"
device that indicates the presence of the NFIT and parses the table to
register a libnd bus instance.

Cc: <linux-acpi@vger.kernel.org>
Cc: Lv Zheng <lv.zheng@intel.com>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
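For readers unfamiliar with the NFIT layout, the table walk that
add_table() performs can be sketched in userspace as below.  Each
subtable begins with a 4-byte header (u16 type, u16 length) and the
next subtable starts 'length' bytes later.  This is only an
illustration under stated assumptions: 'struct nfit_hdr',
walk_subtables(), the counts[] array, and the length sanity checks
are hypothetical names and additions of the sketch, not code from
this patch, and it assumes a little-endian host so the header can be
copied out directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of struct acpi_nfit_header: type, then length. */
struct nfit_hdr {
	uint16_t type;
	uint16_t length;
};

/*
 * Walk the subtables in [buf, buf + size) and count how many of each
 * known type (0..6) were seen.  Returns the number of subtables
 * visited, or -1 if a header is truncated, claims a length shorter
 * than the header itself, or runs past the end of the buffer.
 */
static int walk_subtables(const uint8_t *buf, size_t size, int counts[7])
{
	size_t off = 0;
	int n = 0;

	assert(buf && counts);
	memset(counts, 0, 7 * sizeof(counts[0]));

	while (off < size) {
		struct nfit_hdr hdr;

		if (size - off < sizeof(hdr))
			return -1;
		/* copy out rather than cast, to avoid unaligned access */
		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.length < sizeof(hdr) || hdr.length > size - off)
			return -1;
		if (hdr.type < 7)
			counts[hdr.type]++;
		/* unknown types are skipped, not treated as fatal */
		off += hdr.length;
		n++;
	}
	return n;
}
```

The sketch rejects a length smaller than the header so that a
malformed table cannot stall the loop; that check belongs to the
sketch only.  Skipping unknown types mirrors the "report and
continue" behaviour described in the cover letter.
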
 drivers/acpi/Kconfig          |   15 +
 drivers/acpi/Makefile         |    1 
 drivers/acpi/nfit.c           |  444 +++++++++++++++++++++++++++++++++++++++++
 drivers/acpi/nfit.h           |   89 ++++++++
 drivers/block/Kconfig         |    2 
 drivers/block/Makefile        |    1 
 drivers/block/nd/Kconfig      |   20 ++
 drivers/block/nd/Makefile     |    3 
 drivers/block/nd/core.c       |   67 ++++++
 drivers/block/nd/nd-private.h |   23 ++
 include/acpi/actbl1.h         |  154 ++++++++++++++
 include/acpi/acuuid.h         |   89 ++++++++
 include/linux/libnd.h         |   34 +++
 13 files changed, 942 insertions(+)
 create mode 100644 drivers/acpi/nfit.c
 create mode 100644 drivers/acpi/nfit.h
 create mode 100644 drivers/block/nd/Kconfig
 create mode 100644 drivers/block/nd/Makefile
 create mode 100644 drivers/block/nd/core.c
 create mode 100644 drivers/block/nd/nd-private.h
 create mode 100644 include/acpi/acuuid.h
 create mode 100644 include/linux/libnd.h

diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
index e6c3ddd92665..84d046d4ed17 100644
--- a/drivers/acpi/Kconfig
+++ b/drivers/acpi/Kconfig
@@ -375,6 +375,21 @@ config ACPI_REDUCED_HARDWARE_ONLY
 
 	  If you are unsure what to do, do not enable this option.
 
+config ACPI_NFIT
+	tristate "ACPI NVDIMM Firmware Interface Table (NFIT)"
+	depends on PHYS_ADDR_T_64BIT
+	depends on BLK_DEV
+	select ND_DEVICES
+	select LIBND
+	help
+	  Infrastructure to probe ACPI 6 compliant platforms for
+	  NVDIMMs (NFIT) and register a libnd device tree.  In
+	  addition to storage devices this also enables libnd to pass
+	  ACPI._DSM messages for platform/dimm configuration.
+
+	  To compile this driver as a module, choose M here:
+	  the module will be called nfit.
+
 source "drivers/acpi/apei/Kconfig"
 
 config ACPI_EXTLOG
diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile
index 623b117ad1a2..cd91093b7acf 100644
--- a/drivers/acpi/Makefile
+++ b/drivers/acpi/Makefile
@@ -70,6 +70,7 @@ obj-$(CONFIG_ACPI_PCI_SLOT)	+= pci_slot.o
 obj-$(CONFIG_ACPI_PROCESSOR)	+= processor.o
 obj-y				+= container.o
 obj-$(CONFIG_ACPI_THERMAL)	+= thermal.o
+obj-$(CONFIG_ACPI_NFIT)		+= nfit.o
 obj-y				+= acpi_memhotplug.o
 obj-$(CONFIG_ACPI_HOTPLUG_IOAPIC) += ioapic.o
 obj-$(CONFIG_ACPI_BATTERY)	+= battery.o
diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
new file mode 100644
index 000000000000..13132a16901c
--- /dev/null
+++ b/drivers/acpi/nfit.c
@@ -0,0 +1,444 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/list_sort.h>
+#include <linux/module.h>
+#include <linux/libnd.h>
+#include <linux/list.h>
+#include <linux/acpi.h>
+#include "nfit.h"
+
+static u8 nfit_uuid[NFIT_UUID_MAX][16];
+
+static const u8 *to_nfit_uuid(enum nfit_uuids id)
+{
+	return nfit_uuid[id];
+}
+
+static int acpi_nfit_ctl(struct nd_bus_descriptor *nd_desc,
+		struct nd_dimm *nd_dimm, unsigned int cmd, void *buf,
+		unsigned int buf_len)
+{
+	return -ENOTTY;
+}
+
+static const char *spa_type_name(u16 type)
+{
+	switch (type) {
+	case NFIT_SPA_VOLATILE: return "volatile";
+	case NFIT_SPA_PM: return "pmem";
+	case NFIT_SPA_DCR: return "dimm-control-region";
+	case NFIT_SPA_BDW: return "block-data-window";
+	default: return "unknown";
+	}
+}
+
+static int nfit_spa_type(struct acpi_nfit_system_address *spa)
+{
+	int i;
+
+	for (i = 0; i < NFIT_UUID_MAX; i++)
+		if (memcmp(to_nfit_uuid(i), spa->range_guid, 16) == 0)
+			return i;
+	return -1;
+}
+
+static void *add_table(struct acpi_nfit_desc *acpi_desc, void *table, const void *end)
+{
+	struct device *dev = acpi_desc->dev;
+	struct acpi_nfit_header *hdr;
+	void *err = ERR_PTR(-ENOMEM);
+
+	if (table >= end)
+		return NULL;
+
+	hdr = (struct acpi_nfit_header *) table;
+	switch (hdr->type) {
+	case ACPI_NFIT_TYPE_SYSTEM_ADDRESS: {
+		struct nfit_spa *nfit_spa = devm_kzalloc(dev, sizeof(*nfit_spa),
+				GFP_KERNEL);
+		struct acpi_nfit_system_address *spa = table;
+
+		if (!nfit_spa)
+			return err;
+		INIT_LIST_HEAD(&nfit_spa->list);
+		nfit_spa->spa = spa;
+		list_add_tail(&nfit_spa->list, &acpi_desc->spas);
+		dev_dbg(dev, "%s: spa index: %d type: %s\n", __func__,
+				spa->range_index,
+				spa_type_name(nfit_spa_type(spa)));
+		break;
+	}
+	case ACPI_NFIT_TYPE_MEMORY_MAP: {
+		struct nfit_memdev *nfit_memdev = devm_kzalloc(dev,
+				sizeof(*nfit_memdev), GFP_KERNEL);
+		struct acpi_nfit_memory_map *memdev = table;
+
+		if (!nfit_memdev)
+			return err;
+		INIT_LIST_HEAD(&nfit_memdev->list);
+		nfit_memdev->memdev = memdev;
+		list_add_tail(&nfit_memdev->list, &acpi_desc->memdevs);
+		dev_dbg(dev, "%s: memdev handle: %#x spa: %d dcr: %d\n",
+				__func__, memdev->device_handle, memdev->range_index,
+				memdev->region_index);
+		break;
+	}
+	case ACPI_NFIT_TYPE_CONTROL_REGION: {
+		struct nfit_dcr *nfit_dcr = devm_kzalloc(dev, sizeof(*nfit_dcr),
+				GFP_KERNEL);
+		struct acpi_nfit_control_region *dcr = table;
+
+		if (!nfit_dcr)
+			return err;
+		INIT_LIST_HEAD(&nfit_dcr->list);
+		nfit_dcr->dcr = dcr;
+		list_add_tail(&nfit_dcr->list, &acpi_desc->dcrs);
+		dev_dbg(dev, "%s: dcr index: %d windows: %d\n", __func__,
+				dcr->region_index, dcr->windows);
+		break;
+	}
+	case ACPI_NFIT_TYPE_DATA_REGION: {
+		struct nfit_bdw *nfit_bdw = devm_kzalloc(dev, sizeof(*nfit_bdw),
+				GFP_KERNEL);
+		struct acpi_nfit_data_region *bdw = table;
+
+		if (!nfit_bdw)
+			return err;
+		INIT_LIST_HEAD(&nfit_bdw->list);
+		nfit_bdw->bdw = bdw;
+		list_add_tail(&nfit_bdw->list, &acpi_desc->bdws);
+		dev_dbg(dev, "%s: bdw dcr: %d windows: %d\n", __func__,
+				bdw->region_index, bdw->windows);
+		break;
+	}
+	/* TODO */
+	case ACPI_NFIT_TYPE_INTERLEAVE:
+		dev_dbg(dev, "%s: idt\n", __func__);
+		break;
+	case ACPI_NFIT_TYPE_FLUSH_ADDRESS:
+		dev_dbg(dev, "%s: flush\n", __func__);
+		break;
+	case ACPI_NFIT_TYPE_SMBIOS:
+		dev_dbg(dev, "%s: smbios\n", __func__);
+		break;
+	default:
+		dev_err(dev, "unknown table '%d' parsing nfit\n", hdr->type);
+		break;
+	}
+
+	return table + hdr->length;
+}
+
+static void nfit_mem_find_spa_bdw(struct acpi_nfit_desc *acpi_desc,
+		struct nfit_mem *nfit_mem)
+{
+	u32 device_handle = __to_nfit_memdev(nfit_mem)->device_handle;
+	u16 dcr_index = nfit_mem->dcr->region_index;
+	struct nfit_spa *nfit_spa;
+
+	list_for_each_entry(nfit_spa, &acpi_desc->spas, list) {
+		u16 range_index = nfit_spa->spa->range_index;
+		int type = nfit_spa_type(nfit_spa->spa);
+		struct nfit_memdev *nfit_memdev;
+
+		if (type != NFIT_SPA_BDW)
+			continue;
+
+		list_for_each_entry(nfit_memdev, &acpi_desc->memdevs, list) {
+			if (nfit_memdev->memdev->range_index != range_index)
+				continue;
+			if (nfit_memdev->memdev->device_handle != device_handle)
+				continue;
+			if (nfit_memdev->memdev->region_index != dcr_index)
+				continue;
+
+			nfit_mem->spa_bdw = nfit_spa->spa;
+			return;
+		}
+	}
+
+	dev_dbg(acpi_desc->dev, "SPA-BDW not found for SPA-DCR %d\n",
+			nfit_mem->spa_dcr->range_index);
+	nfit_mem->bdw = NULL;
+}
+
+static int nfit_mem_add(struct acpi_nfit_desc *acpi_desc,
+		struct nfit_mem *nfit_mem, struct acpi_nfit_system_address *spa)
+{
+	u16 dcr_index = __to_nfit_memdev(nfit_mem)->region_index;
+	struct nfit_dcr *nfit_dcr;
+	struct nfit_bdw *nfit_bdw;
+
+	list_for_each_entry(nfit_dcr, &acpi_desc->dcrs, list) {
+		if (nfit_dcr->dcr->region_index != dcr_index)
+			continue;
+		nfit_mem->dcr = nfit_dcr->dcr;
+		break;
+	}
+
+	if (!nfit_mem->dcr) {
+		dev_dbg(acpi_desc->dev, "SPA %d missing:%s%s\n", spa->range_index,
+				__to_nfit_memdev(nfit_mem) ? "" : " MEMDEV",
+				nfit_mem->dcr ? "" : " DCR");
+		return -ENODEV;
+	}
+
+	/*
+	 * We've found enough to create an nd_dimm, optionally
+	 * find an associated BDW
+	 */
+	list_add(&nfit_mem->list, &acpi_desc->dimms);
+
+	list_for_each_entry(nfit_bdw, &acpi_desc->bdws, list) {
+		if (nfit_bdw->bdw->region_index != dcr_index)
+			continue;
+		nfit_mem->bdw = nfit_bdw->bdw;
+		break;
+	}
+
+	if (!nfit_mem->bdw)
+		return 0;
+
+	nfit_mem_find_spa_bdw(acpi_desc, nfit_mem);
+	return 0;
+}
+
+static int nfit_mem_dcr_init(struct acpi_nfit_desc *acpi_desc,
+		struct acpi_nfit_system_address *spa)
+{
+	struct nfit_mem *nfit_mem, *found;
+	struct nfit_memdev *nfit_memdev;
+	int type = nfit_spa_type(spa);
+	u16 dcr_index;
+
+	switch (type) {
+	case NFIT_SPA_DCR:
+	case NFIT_SPA_PM:
+		break;
+	default:
+		return 0;
+	}
+
+	list_for_each_entry(nfit_memdev, &acpi_desc->memdevs, list) {
+		int rc;
+
+		if (nfit_memdev->memdev->range_index != spa->range_index)
+			continue;
+		found = NULL;
+		dcr_index = nfit_memdev->memdev->region_index;
+		list_for_each_entry(nfit_mem, &acpi_desc->dimms, list)
+			if (__to_nfit_memdev(nfit_mem)->region_index == dcr_index) {
+				found = nfit_mem;
+				break;
+			}
+
+		if (found)
+			nfit_mem = found;
+		else {
+			nfit_mem = devm_kzalloc(acpi_desc->dev,
+					sizeof(*nfit_mem), GFP_KERNEL);
+			if (!nfit_mem)
+				return -ENOMEM;
+			INIT_LIST_HEAD(&nfit_mem->list);
+		}
+
+		if (type == NFIT_SPA_DCR) {
+			/* multiple dimms may share a SPA when interleaved */
+			nfit_mem->spa_dcr = spa;
+			nfit_mem->memdev_dcr = nfit_memdev->memdev;
+		} else {
+			/*
+			 * A single dimm may belong to multiple SPA-PM
+			 * ranges, record at least one in addition to
+			 * any SPA-DCR range.
+			 */
+			nfit_mem->memdev_pmem = nfit_memdev->memdev;
+		}
+
+		if (found)
+			continue;
+
+		rc = nfit_mem_add(acpi_desc, nfit_mem, spa);
+		if (rc)
+			return rc;
+	}
+
+	return 0;
+}
+
+static int nfit_mem_cmp(void *priv, struct list_head *__a, struct list_head *__b)
+{
+	struct nfit_mem *a = container_of(__a, typeof(*a), list);
+	struct nfit_mem *b = container_of(__b, typeof(*b), list);
+	u32 handleA, handleB;
+
+	handleA = __to_nfit_memdev(a)->device_handle;
+	handleB = __to_nfit_memdev(b)->device_handle;
+	if (handleA < handleB)
+		return -1;
+	else if (handleA > handleB)
+		return 1;
+	return 0;
+}
+
+static int nfit_mem_init(struct acpi_nfit_desc *acpi_desc)
+{
+	struct nfit_spa *nfit_spa;
+
+	/*
+	 * For each SPA-DCR or SPA-PMEM address range find its
+	 * corresponding MEMDEV(s).  From each MEMDEV find the
+	 * corresponding DCR.  Then, if we're operating on a SPA-DCR,
+	 * try to find a SPA-BDW and a corresponding BDW that references
+	 * the DCR.  Throw it all into an nfit_mem object.  Note that
+	 * BDWs are optional.
+	 */
+	list_for_each_entry(nfit_spa, &acpi_desc->spas, list) {
+		int rc;
+
+		rc = nfit_mem_dcr_init(acpi_desc, nfit_spa->spa);
+		if (rc)
+			return rc;
+	}
+
+	list_sort(NULL, &acpi_desc->dimms, nfit_mem_cmp);
+
+	return 0;
+}
+
+static int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
+{
+	struct device *dev = acpi_desc->dev;
+	const void *end;
+	u8 *data;
+
+	INIT_LIST_HEAD(&acpi_desc->spas);
+	INIT_LIST_HEAD(&acpi_desc->dcrs);
+	INIT_LIST_HEAD(&acpi_desc->bdws);
+	INIT_LIST_HEAD(&acpi_desc->memdevs);
+	INIT_LIST_HEAD(&acpi_desc->dimms);
+
+	data = (u8 *) acpi_desc->nfit;
+	end = data + sz;
+	data += sizeof(struct acpi_table_nfit);
+	while (!IS_ERR_OR_NULL(data))
+		data = add_table(acpi_desc, data, end);
+
+	if (IS_ERR(data)) {
+		dev_dbg(dev, "%s: nfit table parsing error: %ld\n", __func__,
+				PTR_ERR(data));
+		return PTR_ERR(data);
+	}
+
+	if (nfit_mem_init(acpi_desc) != 0)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int acpi_nfit_add(struct acpi_device *adev)
+{
+	struct nd_bus_descriptor *nd_desc;
+	struct acpi_nfit_desc *acpi_desc;
+	struct device *dev = &adev->dev;
+	struct acpi_table_header *tbl;
+	acpi_status status = AE_OK;
+	acpi_size sz;
+	int rc;
+
+	status = acpi_get_table_with_size("NFIT", 0, &tbl, &sz);
+	if (ACPI_FAILURE(status)) {
+		dev_err(dev, "failed to find NFIT\n");
+		return -ENXIO;
+	}
+
+	acpi_desc = devm_kzalloc(dev, sizeof(*acpi_desc), GFP_KERNEL);
+	if (!acpi_desc)
+		return -ENOMEM;
+
+	dev_set_drvdata(dev, acpi_desc);
+	acpi_desc->dev = dev;
+	acpi_desc->nfit = (struct acpi_table_nfit *) tbl;
+	nd_desc = &acpi_desc->nd_desc;
+	nd_desc->provider_name = "ACPI.NFIT";
+	nd_desc->ndctl = acpi_nfit_ctl;
+
+	acpi_desc->nd_bus = nd_bus_register(dev, nd_desc);
+	if (!acpi_desc->nd_bus)
+		return -ENXIO;
+
+	rc = acpi_nfit_init(acpi_desc, sz);
+	if (rc) {
+		nd_bus_unregister(acpi_desc->nd_bus);
+		return rc;
+	}
+	return 0;
+}
+
+static int acpi_nfit_remove(struct acpi_device *adev)
+{
+	struct acpi_nfit_desc *acpi_desc = dev_get_drvdata(&adev->dev);
+
+	nd_bus_unregister(acpi_desc->nd_bus);
+	return 0;
+}
+
+static const struct acpi_device_id acpi_nfit_ids[] = {
+	{ "ACPI0012", 0 },
+	{ "", 0 },
+};
+MODULE_DEVICE_TABLE(acpi, acpi_nfit_ids);
+
+static struct acpi_driver acpi_nfit_driver = {
+	.name = KBUILD_MODNAME,
+	.ids = acpi_nfit_ids,
+	.flags = ACPI_DRIVER_ALL_NOTIFY_EVENTS,
+	.ops = {
+		.add = acpi_nfit_add,
+		.remove = acpi_nfit_remove,
+	},
+};
+
+static __init int nfit_init(void)
+{
+	BUILD_BUG_ON(sizeof(struct acpi_table_nfit) != 40);
+	BUILD_BUG_ON(sizeof(struct acpi_nfit_system_address) != 56);
+	BUILD_BUG_ON(sizeof(struct acpi_nfit_memory_map) != 48);
+	BUILD_BUG_ON(sizeof(struct acpi_nfit_interleave) != 20);
+	BUILD_BUG_ON(sizeof(struct acpi_nfit_smbios) != 9);
+	BUILD_BUG_ON(sizeof(struct acpi_nfit_control_region) != 80);
+	BUILD_BUG_ON(sizeof(struct acpi_nfit_data_region) != 40);
+
+	acpi_str_to_uuid(UUID_VOLATILE_MEMORY, nfit_uuid[NFIT_SPA_VOLATILE]);
+	acpi_str_to_uuid(UUID_PERSISTENT_MEMORY, nfit_uuid[NFIT_SPA_PM]);
+	acpi_str_to_uuid(UUID_CONTROL_REGION, nfit_uuid[NFIT_SPA_DCR]);
+	acpi_str_to_uuid(UUID_DATA_REGION, nfit_uuid[NFIT_SPA_BDW]);
+	acpi_str_to_uuid(UUID_VOLATILE_VIRTUAL_DISK, nfit_uuid[NFIT_SPA_VDISK]);
+	acpi_str_to_uuid(UUID_VOLATILE_VIRTUAL_CD, nfit_uuid[NFIT_SPA_VCD]);
+	acpi_str_to_uuid(UUID_PERSISTENT_VIRTUAL_DISK, nfit_uuid[NFIT_SPA_PDISK]);
+	acpi_str_to_uuid(UUID_PERSISTENT_VIRTUAL_CD, nfit_uuid[NFIT_SPA_PCD]);
+	acpi_str_to_uuid(UUID_NFIT_BUS, nfit_uuid[NFIT_DEV_BUS]);
+	acpi_str_to_uuid(UUID_NFIT_DIMM, nfit_uuid[NFIT_DEV_DIMM]);
+
+	return acpi_bus_register_driver(&acpi_nfit_driver);
+}
+
+static __exit void nfit_exit(void)
+{
+	acpi_bus_unregister_driver(&acpi_nfit_driver);
+}
+
+module_init(nfit_init);
+module_exit(nfit_exit);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Intel Corporation");
diff --git a/drivers/acpi/nfit.h b/drivers/acpi/nfit.h
new file mode 100644
index 000000000000..ff72da9c9694
--- /dev/null
+++ b/drivers/acpi/nfit.h
@@ -0,0 +1,89 @@
+/*
+ * NVDIMM Firmware Interface Table - NFIT
+ *
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#ifndef __NFIT_H__
+#define __NFIT_H__
+#include <linux/types.h>
+#include <linux/libnd.h>
+#include <linux/uuid.h>
+#include <linux/acpi.h>
+#include <acpi/acuuid.h>
+
+#define UUID_NFIT_BUS "2f10e7a4-9e91-11e4-89d3-123b93f75cba"
+#define UUID_NFIT_DIMM "4309ac30-0d11-11e4-9191-0800200c9a66"
+
+enum nfit_uuids {
+	NFIT_SPA_VOLATILE,
+	NFIT_SPA_PM,
+	NFIT_SPA_DCR,
+	NFIT_SPA_BDW,
+	NFIT_SPA_VDISK,
+	NFIT_SPA_VCD,
+	NFIT_SPA_PDISK,
+	NFIT_SPA_PCD,
+	NFIT_DEV_BUS,
+	NFIT_DEV_DIMM,
+	NFIT_UUID_MAX,
+};
+
+struct nfit_spa {
+	struct acpi_nfit_system_address *spa;
+	struct list_head list;
+};
+
+struct nfit_dcr {
+	struct acpi_nfit_control_region *dcr;
+	struct list_head list;
+};
+
+struct nfit_bdw {
+	struct acpi_nfit_data_region *bdw;
+	struct list_head list;
+};
+
+struct nfit_memdev {
+	struct acpi_nfit_memory_map *memdev;
+	struct list_head list;
+};
+
+/* assembled tables for a given dimm/memory-device */
+struct nfit_mem {
+	struct acpi_nfit_memory_map *memdev_dcr;
+	struct acpi_nfit_memory_map *memdev_pmem;
+	struct acpi_nfit_control_region *dcr;
+	struct acpi_nfit_data_region *bdw;
+	struct acpi_nfit_system_address *spa_dcr;
+	struct acpi_nfit_system_address *spa_bdw;
+	struct list_head list;
+};
+
+struct acpi_nfit_desc {
+	struct nd_bus_descriptor nd_desc;
+	struct acpi_table_nfit *nfit;
+	struct list_head memdevs;
+	struct list_head dimms;
+	struct list_head spas;
+	struct list_head dcrs;
+	struct list_head bdws;
+	struct nd_bus *nd_bus;
+	struct device *dev;
+};
+
+static inline struct acpi_nfit_memory_map *__to_nfit_memdev(struct nfit_mem *nfit_mem)
+{
+	if (nfit_mem->memdev_dcr)
+		return nfit_mem->memdev_dcr;
+	return nfit_mem->memdev_pmem;
+}
+#endif /* __NFIT_H__ */
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index eb1fed5bd516..dfe40e5ca9bd 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -321,6 +321,8 @@ config BLK_DEV_NVME
 	  To compile this driver as a module, choose M here: the
 	  module will be called nvme.
 
+source "drivers/block/nd/Kconfig"
+
 config BLK_DEV_SKD
 	tristate "STEC S1120 Block Driver"
 	depends on PCI
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 9cc6c18a1c7e..07a6acecf4d8 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -24,6 +24,7 @@ obj-$(CONFIG_CDROM_PKTCDVD)	+= pktcdvd.o
 obj-$(CONFIG_MG_DISK)		+= mg_disk.o
 obj-$(CONFIG_SUNVDC)		+= sunvdc.o
 obj-$(CONFIG_BLK_DEV_NVME)	+= nvme.o
+obj-$(CONFIG_ND_DEVICES)	+= nd/
 obj-$(CONFIG_BLK_DEV_SKD)	+= skd.o
 obj-$(CONFIG_BLK_DEV_OSD)	+= osdblk.o
 
diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
new file mode 100644
index 000000000000..9b909c21afa1
--- /dev/null
+++ b/drivers/block/nd/Kconfig
@@ -0,0 +1,20 @@
+menuconfig ND_DEVICES
+	bool "NVDIMM (Non-Volatile Memory Device) Support"
+	help
+	  Generic support for non-volatile memory devices including
+	  ACPI-6-NFIT defined resources.  On platforms that define an
+	  NFIT, or otherwise can discover NVDIMM resources, a libnd
+	  bus is registered to advertise PMEM (persistent memory)
+	  namespaces (/dev/pmemX) and BLK (sliding mmio window(s))
+	  namespaces (/dev/ndX). A PMEM namespace refers to a memory
+	  resource that may span multiple DIMMs and support DAX (see
+	  CONFIG_DAX).  A BLK namespace refers to an NVDIMM control
+	  region which exposes an mmio register set for windowed
+	  access mode to non-volatile memory.
+
+if ND_DEVICES
+
+config LIBND
+	tristate
+
+endif
diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
new file mode 100644
index 000000000000..a647ff6cf557
--- /dev/null
+++ b/drivers/block/nd/Makefile
@@ -0,0 +1,3 @@
+obj-$(CONFIG_LIBND) += libnd.o
+
+libnd-y := core.o
diff --git a/drivers/block/nd/core.c b/drivers/block/nd/core.c
new file mode 100644
index 000000000000..15b89ce1a9af
--- /dev/null
+++ b/drivers/block/nd/core.c
@@ -0,0 +1,67 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/export.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/libnd.h>
+#include <linux/slab.h>
+#include "nd-private.h"
+
+static DEFINE_IDA(nd_ida);
+
+static void nd_bus_release(struct device *dev)
+{
+	struct nd_bus *nd_bus = container_of(dev, struct nd_bus, dev);
+
+	ida_simple_remove(&nd_ida, nd_bus->id);
+	kfree(nd_bus);
+}
+
+struct nd_bus *nd_bus_register(struct device *parent,
+		struct nd_bus_descriptor *nd_desc)
+{
+	struct nd_bus *nd_bus = kzalloc(sizeof(*nd_bus), GFP_KERNEL);
+	int rc;
+
+	if (!nd_bus)
+		return NULL;
+	nd_bus->id = ida_simple_get(&nd_ida, 0, 0, GFP_KERNEL);
+	if (nd_bus->id < 0) {
+		kfree(nd_bus);
+		return NULL;
+	}
+	nd_bus->nd_desc = nd_desc;
+	nd_bus->dev.parent = parent;
+	nd_bus->dev.release = nd_bus_release;
+	dev_set_name(&nd_bus->dev, "ndbus%d", nd_bus->id);
+	rc = device_register(&nd_bus->dev);
+	if (rc) {
+		dev_dbg(&nd_bus->dev, "device registration failed: %d\n", rc);
+		put_device(&nd_bus->dev);
+		return NULL;
+	}
+
+	return nd_bus;
+}
+EXPORT_SYMBOL_GPL(nd_bus_register);
+
+void nd_bus_unregister(struct nd_bus *nd_bus)
+{
+	if (!nd_bus)
+		return;
+	device_unregister(&nd_bus->dev);
+}
+EXPORT_SYMBOL_GPL(nd_bus_unregister);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Intel Corporation");
diff --git a/drivers/block/nd/nd-private.h b/drivers/block/nd/nd-private.h
new file mode 100644
index 000000000000..a107a19ffa9c
--- /dev/null
+++ b/drivers/block/nd/nd-private.h
@@ -0,0 +1,23 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#ifndef __ND_PRIVATE_H__
+#define __ND_PRIVATE_H__
+#include <linux/device.h>
+#include <linux/libnd.h>
+
+struct nd_bus {
+	struct nd_bus_descriptor *nd_desc;
+	struct device dev;
+	int id;
+};
+#endif /* __ND_PRIVATE_H__ */
diff --git a/include/acpi/actbl1.h b/include/acpi/actbl1.h
index b80b0e6dabc5..b0e431dbfd9e 100644
--- a/include/acpi/actbl1.h
+++ b/include/acpi/actbl1.h
@@ -71,6 +71,7 @@
 #define ACPI_SIG_SBST           "SBST"	/* Smart Battery Specification Table */
 #define ACPI_SIG_SLIT           "SLIT"	/* System Locality Distance Information Table */
 #define ACPI_SIG_SRAT           "SRAT"	/* System Resource Affinity Table */
+#define ACPI_SIG_NFIT           "NFIT"	/* NVDIMM Firmware Interface Table */
 
 /*
  * All tables must be byte-packed to match the ACPI specification, since
@@ -908,6 +909,159 @@ struct acpi_msct_proximity {
 
 /*******************************************************************************
  *
+ * NFIT - NVDIMM Firmware Interface Table (ACPI 6.0)
+ *        Version 1
+ *
+ ******************************************************************************/
+
+struct acpi_table_nfit {
+	struct acpi_table_header header;	/* Common ACPI table header */
+	u32 reserved;		/* Reserved, must be zero */
+};
+
+/* Subtable header for NFIT */
+
+struct acpi_nfit_header {
+	u16 type;
+	u16 length;
+};
+
+/* Values for subtable type in struct acpi_nfit_header */
+
+enum acpi_nfit_type {
+	ACPI_NFIT_TYPE_SYSTEM_ADDRESS = 0,
+	ACPI_NFIT_TYPE_MEMORY_MAP = 1,
+	ACPI_NFIT_TYPE_INTERLEAVE = 2,
+	ACPI_NFIT_TYPE_SMBIOS = 3,
+	ACPI_NFIT_TYPE_CONTROL_REGION = 4,
+	ACPI_NFIT_TYPE_DATA_REGION = 5,
+	ACPI_NFIT_TYPE_FLUSH_ADDRESS = 6,
+	ACPI_NFIT_TYPE_RESERVED = 7	/* 7 and greater are reserved */
+};
+
+/*
+ * NFIT Subtables
+ */
+
+/* 0: System Physical Address Range Structure */
+
+struct acpi_nfit_system_address {
+	struct acpi_nfit_header header;
+	u16 range_index;
+	u16 flags;
+	u32 reserved;		/* Reserved, must be zero */
+	u32 proximity_domain;
+	u8 range_guid[16];
+	u64 address;
+	u64 length;
+	u64 memory_mapping;
+};
+
+/* Flags */
+
+#define ACPI_NFIT_ADD_ONLINE_ONLY       (1)	/* 00: Add/Online Operation Only */
+#define ACPI_NFIT_PROXIMITY_VALID       (1<<1)	/* 01: Proximity Domain Valid */
+
+/* Range Type GUIDs appear in the include/acuuid.h file */
+
+/* 1: Memory Device to System Address Range Map Structure */
+
+struct acpi_nfit_memory_map {
+	struct acpi_nfit_header header;
+	u32 device_handle;
+	u16 physical_id;
+	u16 region_id;
+	u16 range_index;
+	u16 region_index;
+	u64 region_size;
+	u64 region_offset;
+	u64 address;
+	u16 interleave_index;
+	u16 interleave_ways;
+	u16 flags;
+	u16 reserved;		/* Reserved, must be zero */
+};
+
+/* Flags */
+
+#define ACPI_NFIT_MEM_SAVE_FAILED       (1)	/* 00: Last SAVE to Memory Device failed */
+#define ACPI_NFIT_MEM_RESTORE_FAILED    (1<<1)	/* 01: Last RESTORE from Memory Device failed */
+#define ACPI_NFIT_MEM_FLUSH_FAILED      (1<<2)	/* 02: Platform flush failed */
+#define ACPI_NFIT_MEM_ARMED             (1<<3)	/* 03: Memory Device observed to be not armed */
+#define ACPI_NFIT_MEM_HEALTH_OBSERVED   (1<<4)	/* 04: Memory Device observed SMART/health events */
+#define ACPI_NFIT_MEM_HEALTH_ENABLED    (1<<5)	/* 05: SMART/health events enabled */
+
+/* 2: Interleave Structure */
+
+struct acpi_nfit_interleave {
+	struct acpi_nfit_header header;
+	u16 interleave_index;
+	u16 reserved;		/* Reserved, must be zero */
+	u32 line_count;
+	u32 line_size;
+	u32 line_offset[1];	/* Variable length */
+};
+
+/* 3: SMBIOS Management Information Structure */
+
+struct acpi_nfit_smbios {
+	struct acpi_nfit_header header;
+	u32 reserved;		/* Reserved, must be zero */
+	u8 data[1];		/* Variable length */
+};
+
+/* 4: NVDIMM Control Region Structure */
+
+struct acpi_nfit_control_region {
+	struct acpi_nfit_header header;
+	u16 region_index;
+	u16 vendor_id;
+	u16 device_id;
+	u16 revision_id;
+	u16 subsystem_vendor_id;
+	u16 subsystem_device_id;
+	u16 subsystem_revision_id;
+	u8 reserved[6];		/* Reserved, must be zero */
+	u32 serial_number;
+	u16 code;
+	u16 windows;
+	u64 window_size;
+	u64 command_offset;
+	u64 command_size;
+	u64 status_offset;
+	u64 status_size;
+	u16 flags;
+	u8 reserved1[6];	/* Reserved, must be zero */
+};
+
+/* Flags */
+
+#define ACPI_NFIT_CONTROL_BUFFERED      (1)	/* Block Data Windows implementation is buffered */
+
+/* 5: NVDIMM Block Data Window Region Structure */
+
+struct acpi_nfit_data_region {
+	struct acpi_nfit_header header;
+	u16 region_index;
+	u16 windows;
+	u64 offset;
+	u64 size;
+	u64 capacity;
+	u64 start_address;
+};
+
+/* 6: Flush Hint Address Structure */
+
+struct acpi_nfit_flush_address {
+	struct acpi_nfit_header header;
+	u32 device_handle;
+	u16 hint_count;
+	u8 reserved[6];		/* Reserved, must be zero */
+	u64 hint_address[1];	/* Variable length */
+};
+
+/*******************************************************************************
+ *
  * SBST - Smart Battery Specification Table
  *        Version 1
  *
diff --git a/include/acpi/acuuid.h b/include/acpi/acuuid.h
new file mode 100644
index 000000000000..7c6cbb028ffc
--- /dev/null
+++ b/include/acpi/acuuid.h
@@ -0,0 +1,89 @@
+/******************************************************************************
+ *
+ * Name: acuuid.h - ACPI-related UUID/GUID definitions
+ *
+ *****************************************************************************/
+
+/*
+ * Copyright (C) 2000 - 2015, Intel Corp.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions, and the following disclaimer,
+ *    without modification.
+ * 2. Redistributions in binary form must reproduce at minimum a disclaimer
+ *    substantially similar to the "NO WARRANTY" disclaimer below
+ *    ("Disclaimer") and any redistribution must be conditioned upon
+ *    including a substantially similar Disclaimer requirement for further
+ *    binary redistribution.
+ * 3. Neither the names of the above-listed copyright holders nor the names
+ *    of any contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ *
+ * Alternatively, this software may be distributed under the terms of the
+ * GNU General Public License ("GPL") version 2 as published by the Free
+ * Software Foundation.
+ *
+ * NO WARRANTY
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * HOLDERS OR CONTRIBUTORS BE LIABLE FOR SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+ * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
+ * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGES.
+ */
+
+#ifndef __ACUUID_H__
+#define __ACUUID_H__
+
+/*
+ * Note1: UUIDs and GUIDs are defined to be identical in ACPI.
+ *
+ * Note2: This file is standalone and should remain that way.
+ */
+
+/* Controllers */
+
+#define UUID_GPIO_CONTROLLER            "4f248f40-d5e2-499f-834c-27758ea1cd3f"
+#define UUID_USB_CONTROLLER             "ce2ee385-00e6-48cb-9f05-2edb927c4899"
+#define UUID_SATA_CONTROLLER            "e4db149b-fcfe-425b-a6d8-92357d78fc7f"
+
+/* Devices */
+
+#define UUID_PCI_HOST_BRIDGE            "33db4d5b-1ff7-401c-9657-7441c03dd766"
+#define UUID_I2C_DEVICE                 "3cdff6f7-4267-4555-ad05-b30a3d8938de"
+#define UUID_POWER_BUTTON               "dfbcf3c5-e7a5-44e6-9c1f-29c76f6e059c"
+
+/* Interfaces */
+
+#define UUID_DEVICE_LABELING            "e5c937d0-3553-4d7a-9117-ea4d19c3434d"
+#define UUID_PHYSICAL_PRESENCE          "3dddfaa6-361b-4eb4-a424-8d10089d1653"
+
+/* NVDIMM - NFIT table */
+
+#define UUID_VOLATILE_MEMORY            "4f940573-dafd-e344-b16c-3f22d252e5d0"
+#define UUID_PERSISTENT_MEMORY          "79d3f066-f3b4-7440-ac43-0d3318b78cdb"
+#define UUID_CONTROL_REGION             "f601f792-b413-5d40-910b-299367e8234c"
+#define UUID_DATA_REGION                "3005af91-865d-0e47-a6b0-0a2db9408249"
+#define UUID_VOLATILE_VIRTUAL_DISK      "5a53ab77-fc45-4b62-5560-f7b281d1f96e"
+#define UUID_VOLATILE_VIRTUAL_CD        "30bd5a3d-7541-ce87-6d64-d2ade523c4bb"
+#define UUID_PERSISTENT_VIRTUAL_DISK    "c902ea5c-074d-69d3-269f-4496fbe096f9"
+#define UUID_PERSISTENT_VIRTUAL_CD      "88810108-cd42-48bb-100f-5387d53ded3d"
+
+/* Miscellaneous */
+
+#define UUID_PLATFORM_CAPABILITIES      "0811b06e-4a27-44f9-8d60-3cbbc22e7b48"
+#define UUID_DYNAMIC_ENUMERATION        "d8c1a3a6-be9b-4c9b-91bf-c3cb81fc5daf"
+#define UUID_BATTERY_THERMAL_LIMIT      "4c2067e3-887d-475c-9720-4af1d3ed602e"
+#define UUID_THERMAL_EXTENSIONS         "14d399cd-7a27-4b18-8fb4-7cb7b9f4e500"
+#define UUID_DEVICE_PROPERTIES          "daffd814-6eba-4d8c-8a91-bc9bbf4aa301"
+
+#endif				/* __ACUUID_H__ */
diff --git a/include/linux/libnd.h b/include/linux/libnd.h
new file mode 100644
index 000000000000..8e4441002868
--- /dev/null
+++ b/include/linux/libnd.h
@@ -0,0 +1,34 @@
+/*
+ * libnd - Non-volatile-memory Devices Subsystem
+ *
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#ifndef __LIBND_H__
+#define __LIBND_H__
+struct nd_dimm;
+struct nd_bus_descriptor;
+typedef int (*ndctl_fn)(struct nd_bus_descriptor *nd_desc,
+		struct nd_dimm *nd_dimm, unsigned int cmd, void *buf,
+		unsigned int buf_len);
+
+struct nd_bus_descriptor {
+	unsigned long dsm_mask;
+	char *provider_name;
+	ndctl_fn ndctl;
+};
+
+struct nd_bus;
+struct device;
+struct nd_bus *nd_bus_register(struct device *parent,
+		struct nd_bus_descriptor *nfit_desc);
+void nd_bus_unregister(struct nd_bus *nd_bus);
+#endif /* __LIBND_H__ */
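
For readers skimming the header above: a bus provider fills in a struct
nd_bus_descriptor and hands it to nd_bus_register().  The following is a
minimal userspace sketch of that pattern, not kernel code.  The
declarations are stand-in copies of the ones in libnd.h, and everything
prefixed demo_ is hypothetical and exists only for illustration.

```c
#include <errno.h>
#include <stddef.h>

/* Userspace stand-ins for the declarations in include/linux/libnd.h. */
struct nd_dimm;
struct nd_bus_descriptor;
typedef int (*ndctl_fn)(struct nd_bus_descriptor *nd_desc,
		struct nd_dimm *nd_dimm, unsigned int cmd, void *buf,
		unsigned int buf_len);

struct nd_bus_descriptor {
	unsigned long dsm_mask;
	char *provider_name;
	ndctl_fn ndctl;
};

/* Hypothetical provider: embeds the descriptor in its own state. */
struct demo_provider {
	struct nd_bus_descriptor nd_desc;
	unsigned int ctl_calls;
};

/* Recover the provider from the descriptor pointer (container_of idiom). */
static struct demo_provider *to_demo_provider(struct nd_bus_descriptor *nd_desc)
{
	return (struct demo_provider *)((char *)nd_desc -
			offsetof(struct demo_provider, nd_desc));
}

/* Control hook: counts calls and rejects every command. */
static int demo_ctl(struct nd_bus_descriptor *nd_desc,
		struct nd_dimm *nd_dimm, unsigned int cmd, void *buf,
		unsigned int buf_len)
{
	(void)nd_dimm;
	(void)cmd;
	(void)buf;
	(void)buf_len;
	to_demo_provider(nd_desc)->ctl_calls++;
	return -ENOTTY;
}

/* Fill in the descriptor the way a provider would before registering. */
static void demo_provider_init(struct demo_provider *p)
{
	p->ctl_calls = 0;
	p->nd_desc.dsm_mask = 0;
	p->nd_desc.provider_name = "demo";
	p->nd_desc.ndctl = demo_ctl;
}
```

Embedding the descriptor lets the control hook get back to provider
state from the nd_desc pointer alone, which appears to be why the hook
takes the descriptor rather than an opaque cookie.  In the kernel the
next step would be nd_bus_register(parent, &p->nd_desc).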



* [PATCH v3 02/21] libnd, nfit: initial libnd infrastructure and NFIT support
@ 2015-05-20 20:56   ` Dan Williams
From: Dan Williams @ 2015-05-20 20:56 UTC
  To: axboe
  Cc: mingo, linux-nvdimm, neilb, gregkh, Rafael J. Wysocki,
	linux-kernel, Robert Moore, linux-acpi, jmoyer, Lv Zheng, hch

A libnd bus is the anchor device for registering nvdimm resources and
interfaces, for example, a character control device, nvdimm devices,
and I/O region devices.  The ACPI NFIT (NVDIMM Firmware Interface Table)
is one possible platform description for such non-volatile memory
resources in a system.  The nfit.ko driver attaches to the "ACPI0012"
device that indicates the presence of the NFIT and parses the table to
register a libnd bus instance.

Cc: <linux-acpi@vger.kernel.org>
Cc: Lv Zheng <lv.zheng@intel.com>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/acpi/Kconfig          |   15 +
 drivers/acpi/Makefile         |    1 
 drivers/acpi/nfit.c           |  444 +++++++++++++++++++++++++++++++++++++++++
 drivers/acpi/nfit.h           |   89 ++++++++
 drivers/block/Kconfig         |    2 
 drivers/block/Makefile        |    1 
 drivers/block/nd/Kconfig      |   20 ++
 drivers/block/nd/Makefile     |    3 
 drivers/block/nd/core.c       |   67 ++++++
 drivers/block/nd/nd-private.h |   23 ++
 include/acpi/actbl1.h         |  154 ++++++++++++++
 include/acpi/acuuid.h         |   89 ++++++++
 include/linux/libnd.h         |   34 +++
 13 files changed, 942 insertions(+)
 create mode 100644 drivers/acpi/nfit.c
 create mode 100644 drivers/acpi/nfit.h
 create mode 100644 drivers/block/nd/Kconfig
 create mode 100644 drivers/block/nd/Makefile
 create mode 100644 drivers/block/nd/core.c
 create mode 100644 drivers/block/nd/nd-private.h
 create mode 100644 include/acpi/acuuid.h
 create mode 100644 include/linux/libnd.h
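
The table parsing described in the changelog boils down to walking a
sequence of { u16 type; u16 length; } sub-table headers until the end of
the NFIT.  A minimal userspace sketch of that walk follows.  It assumes
the caller has already skipped the fixed table header, and every demo_
name is hypothetical.  The length checks are an addition for the sketch;
add_table() in this patch does not perform them.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_NFIT_TYPES 7	/* sub-table types 0..6 are defined so far */

/* Read a little-endian u16 without relying on alignment or host order. */
static unsigned int demo_le16(const uint8_t *p)
{
	return p[0] | (unsigned int)p[1] << 8;
}

/*
 * Walk the sub-tables in buf[0..len) and count them per type.  Returns
 * the total number of sub-tables, or -1 if a header is truncated, a
 * length is shorter than the header itself, or a sub-table overruns the
 * buffer.  Unknown types are included in the total and skipped, in the
 * spirit of the "report and continue" behavior of add_table().
 */
static int demo_walk_nfit(const uint8_t *buf, size_t len,
		unsigned int counts[DEMO_NFIT_TYPES])
{
	size_t off = 0;
	int total = 0;

	while (off < len) {
		unsigned int type, sub_len;

		if (len - off < 4)
			return -1;	/* truncated header */
		type = demo_le16(buf + off);
		sub_len = demo_le16(buf + off + 2);
		if (sub_len < 4 || sub_len > len - off)
			return -1;	/* would never advance, or overruns */
		if (type < DEMO_NFIT_TYPES)
			counts[type]++;
		total++;
		off += sub_len;
	}
	return total;
}
```

Rejecting a length smaller than the header keeps a malformed table from
stalling the walk at one offset, a case that may be worth handling in
add_table() as well.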

diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
index e6c3ddd92665..84d046d4ed17 100644
--- a/drivers/acpi/Kconfig
+++ b/drivers/acpi/Kconfig
@@ -375,6 +375,21 @@ config ACPI_REDUCED_HARDWARE_ONLY
 
 	  If you are unsure what to do, do not enable this option.
 
+config ACPI_NFIT
+	tristate "ACPI NVDIMM Firmware Interface Table (NFIT)"
+	depends on PHYS_ADDR_T_64BIT
+	depends on BLK_DEV
+	select ND_DEVICES
+	select LIBND
+	help
+	  Infrastructure to probe ACPI 6 compliant platforms for
+	  NVDIMMs (NFIT) and register a libnd device tree.  In
+	  addition to storage devices, this also enables libnd to pass
+	  ACPI._DSM messages for platform/dimm configuration.
+
+	  To compile this driver as a module, choose M here:
+	  the module will be called nfit.
+
 source "drivers/acpi/apei/Kconfig"
 
 config ACPI_EXTLOG
diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile
index 623b117ad1a2..cd91093b7acf 100644
--- a/drivers/acpi/Makefile
+++ b/drivers/acpi/Makefile
@@ -70,6 +70,7 @@ obj-$(CONFIG_ACPI_PCI_SLOT)	+= pci_slot.o
 obj-$(CONFIG_ACPI_PROCESSOR)	+= processor.o
 obj-y				+= container.o
 obj-$(CONFIG_ACPI_THERMAL)	+= thermal.o
+obj-$(CONFIG_ACPI_NFIT)		+= nfit.o
 obj-y				+= acpi_memhotplug.o
 obj-$(CONFIG_ACPI_HOTPLUG_IOAPIC) += ioapic.o
 obj-$(CONFIG_ACPI_BATTERY)	+= battery.o
diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
new file mode 100644
index 000000000000..13132a16901c
--- /dev/null
+++ b/drivers/acpi/nfit.c
@@ -0,0 +1,444 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/list_sort.h>
+#include <linux/module.h>
+#include <linux/libnd.h>
+#include <linux/list.h>
+#include <linux/acpi.h>
+#include "nfit.h"
+
+static u8 nfit_uuid[NFIT_UUID_MAX][16];
+
+static const u8 *to_nfit_uuid(enum nfit_uuids id)
+{
+	return nfit_uuid[id];
+}
+
+static int acpi_nfit_ctl(struct nd_bus_descriptor *nd_desc,
+		struct nd_dimm *nd_dimm, unsigned int cmd, void *buf,
+		unsigned int buf_len)
+{
+	return -ENOTTY;
+}
+
+static const char *spa_type_name(u16 type)
+{
+	switch (type) {
+	case NFIT_SPA_VOLATILE: return "volatile";
+	case NFIT_SPA_PM: return "pmem";
+	case NFIT_SPA_DCR: return "dimm-control-region";
+	case NFIT_SPA_BDW: return "block-data-window";
+	default: return "unknown";
+	}
+}
+
+static int nfit_spa_type(struct acpi_nfit_system_address *spa)
+{
+	int i;
+
+	for (i = 0; i < NFIT_UUID_MAX; i++)
+		if (memcmp(to_nfit_uuid(i), spa->range_guid, 16) == 0)
+			return i;
+	return -1;
+}
+
+static void *add_table(struct acpi_nfit_desc *acpi_desc, void *table, const void *end)
+{
+	struct device *dev = acpi_desc->dev;
+	struct acpi_nfit_header *hdr;
+	void *err = ERR_PTR(-ENOMEM);
+
+	if (table >= end)
+		return NULL;
+
+	hdr = (struct acpi_nfit_header *) table;
+	switch (hdr->type) {
+	case ACPI_NFIT_TYPE_SYSTEM_ADDRESS: {
+		struct nfit_spa *nfit_spa = devm_kzalloc(dev, sizeof(*nfit_spa),
+				GFP_KERNEL);
+		struct acpi_nfit_system_address *spa = table;
+
+		if (!nfit_spa)
+			return err;
+		INIT_LIST_HEAD(&nfit_spa->list);
+		nfit_spa->spa = spa;
+		list_add_tail(&nfit_spa->list, &acpi_desc->spas);
+		dev_dbg(dev, "%s: spa index: %d type: %s\n", __func__,
+				spa->range_index,
+				spa_type_name(nfit_spa_type(spa)));
+		break;
+	}
+	case ACPI_NFIT_TYPE_MEMORY_MAP: {
+		struct nfit_memdev *nfit_memdev = devm_kzalloc(dev,
+				sizeof(*nfit_memdev), GFP_KERNEL);
+		struct acpi_nfit_memory_map *memdev = table;
+
+		if (!nfit_memdev)
+			return err;
+		INIT_LIST_HEAD(&nfit_memdev->list);
+		nfit_memdev->memdev = memdev;
+		list_add_tail(&nfit_memdev->list, &acpi_desc->memdevs);
+		dev_dbg(dev, "%s: memdev handle: %#x spa: %d dcr: %d\n",
+				__func__, memdev->device_handle, memdev->range_index,
+				memdev->region_index);
+		break;
+	}
+	case ACPI_NFIT_TYPE_CONTROL_REGION: {
+		struct nfit_dcr *nfit_dcr = devm_kzalloc(dev, sizeof(*nfit_dcr),
+				GFP_KERNEL);
+		struct acpi_nfit_control_region *dcr = table;
+
+		if (!nfit_dcr)
+			return err;
+		INIT_LIST_HEAD(&nfit_dcr->list);
+		nfit_dcr->dcr = dcr;
+		list_add_tail(&nfit_dcr->list, &acpi_desc->dcrs);
+		dev_dbg(dev, "%s: dcr index: %d windows: %d\n", __func__,
+				dcr->region_index, dcr->windows);
+		break;
+	}
+	case ACPI_NFIT_TYPE_DATA_REGION: {
+		struct nfit_bdw *nfit_bdw = devm_kzalloc(dev, sizeof(*nfit_bdw),
+				GFP_KERNEL);
+		struct acpi_nfit_data_region *bdw = table;
+
+		if (!nfit_bdw)
+			return err;
+		INIT_LIST_HEAD(&nfit_bdw->list);
+		nfit_bdw->bdw = bdw;
+		list_add_tail(&nfit_bdw->list, &acpi_desc->bdws);
+		dev_dbg(dev, "%s: bdw dcr: %d windows: %d\n", __func__,
+				bdw->region_index, bdw->windows);
+		break;
+	}
+	/* TODO */
+	case ACPI_NFIT_TYPE_INTERLEAVE:
+		dev_dbg(dev, "%s: idt\n", __func__);
+		break;
+	case ACPI_NFIT_TYPE_FLUSH_ADDRESS:
+		dev_dbg(dev, "%s: flush\n", __func__);
+		break;
+	case ACPI_NFIT_TYPE_SMBIOS:
+		dev_dbg(dev, "%s: smbios\n", __func__);
+		break;
+	default:
+		dev_err(dev, "unknown table '%d' parsing nfit\n", hdr->type);
+		break;
+	}
+
+	return table + hdr->length;
+}
+
+static void nfit_mem_find_spa_bdw(struct acpi_nfit_desc *acpi_desc,
+		struct nfit_mem *nfit_mem)
+{
+	u32 device_handle = __to_nfit_memdev(nfit_mem)->device_handle;
+	u16 dcr_index = nfit_mem->dcr->region_index;
+	struct nfit_spa *nfit_spa;
+
+	list_for_each_entry(nfit_spa, &acpi_desc->spas, list) {
+		u16 range_index = nfit_spa->spa->range_index;
+		int type = nfit_spa_type(nfit_spa->spa);
+		struct nfit_memdev *nfit_memdev;
+
+		if (type != NFIT_SPA_BDW)
+			continue;
+
+		list_for_each_entry(nfit_memdev, &acpi_desc->memdevs, list) {
+			if (nfit_memdev->memdev->range_index != range_index)
+				continue;
+			if (nfit_memdev->memdev->device_handle != device_handle)
+				continue;
+			if (nfit_memdev->memdev->region_index != dcr_index)
+				continue;
+
+			nfit_mem->spa_bdw = nfit_spa->spa;
+			return;
+		}
+	}
+
+	dev_dbg(acpi_desc->dev, "SPA-BDW not found for SPA-DCR %d\n",
+			nfit_mem->spa_dcr->range_index);
+	nfit_mem->bdw = NULL;
+}
+
+static int nfit_mem_add(struct acpi_nfit_desc *acpi_desc,
+		struct nfit_mem *nfit_mem, struct acpi_nfit_system_address *spa)
+{
+	u16 dcr_index = __to_nfit_memdev(nfit_mem)->region_index;
+	struct nfit_dcr *nfit_dcr;
+	struct nfit_bdw *nfit_bdw;
+
+	list_for_each_entry(nfit_dcr, &acpi_desc->dcrs, list) {
+		if (nfit_dcr->dcr->region_index != dcr_index)
+			continue;
+		nfit_mem->dcr = nfit_dcr->dcr;
+		break;
+	}
+
+	if (!nfit_mem->dcr) {
+		dev_dbg(acpi_desc->dev, "SPA %d missing:%s%s\n", spa->range_index,
+				__to_nfit_memdev(nfit_mem) ? "" : " MEMDEV",
+				nfit_mem->dcr ? "" : " DCR");
+		return -ENODEV;
+	}
+
+	/*
+	 * We've found enough to create an nd_dimm; optionally
+	 * find an associated BDW.
+	 */
+	list_add(&nfit_mem->list, &acpi_desc->dimms);
+
+	list_for_each_entry(nfit_bdw, &acpi_desc->bdws, list) {
+		if (nfit_bdw->bdw->region_index != dcr_index)
+			continue;
+		nfit_mem->bdw = nfit_bdw->bdw;
+		break;
+	}
+
+	if (!nfit_mem->bdw)
+		return 0;
+
+	nfit_mem_find_spa_bdw(acpi_desc, nfit_mem);
+	return 0;
+}
+
+static int nfit_mem_dcr_init(struct acpi_nfit_desc *acpi_desc,
+		struct acpi_nfit_system_address *spa)
+{
+	struct nfit_mem *nfit_mem, *found;
+	struct nfit_memdev *nfit_memdev;
+	int type = nfit_spa_type(spa);
+	u16 dcr_index;
+
+	switch (type) {
+	case NFIT_SPA_DCR:
+	case NFIT_SPA_PM:
+		break;
+	default:
+		return 0;
+	}
+
+	list_for_each_entry(nfit_memdev, &acpi_desc->memdevs, list) {
+		int rc;
+
+		if (nfit_memdev->memdev->range_index != spa->range_index)
+			continue;
+		found = NULL;
+		dcr_index = nfit_memdev->memdev->region_index;
+		list_for_each_entry(nfit_mem, &acpi_desc->dimms, list)
+			if (__to_nfit_memdev(nfit_mem)->region_index == dcr_index) {
+				found = nfit_mem;
+				break;
+			}
+
+		if (found)
+			nfit_mem = found;
+		else {
+			nfit_mem = devm_kzalloc(acpi_desc->dev,
+					sizeof(*nfit_mem), GFP_KERNEL);
+			if (!nfit_mem)
+				return -ENOMEM;
+			INIT_LIST_HEAD(&nfit_mem->list);
+		}
+
+		if (type == NFIT_SPA_DCR) {
+			/* multiple dimms may share a SPA when interleaved */
+			nfit_mem->spa_dcr = spa;
+			nfit_mem->memdev_dcr = nfit_memdev->memdev;
+		} else {
+			/*
+			 * A single dimm may belong to multiple SPA-PM
+			 * ranges, record at least one in addition to
+			 * any SPA-DCR range.
+			 */
+			nfit_mem->memdev_pmem = nfit_memdev->memdev;
+		}
+
+		if (found)
+			continue;
+
+		rc = nfit_mem_add(acpi_desc, nfit_mem, spa);
+		if (rc)
+			return rc;
+	}
+
+	return 0;
+}
+
+static int nfit_mem_cmp(void *priv, struct list_head *__a, struct list_head *__b)
+{
+	struct nfit_mem *a = container_of(__a, typeof(*a), list);
+	struct nfit_mem *b = container_of(__b, typeof(*b), list);
+	u32 handleA, handleB;
+
+	handleA = __to_nfit_memdev(a)->device_handle;
+	handleB = __to_nfit_memdev(b)->device_handle;
+	if (handleA < handleB)
+		return -1;
+	else if (handleA > handleB)
+		return 1;
+	return 0;
+}
+
+static int nfit_mem_init(struct acpi_nfit_desc *acpi_desc)
+{
+	struct nfit_spa *nfit_spa;
+
+	/*
+	 * For each SPA-DCR or SPA-PMEM address range find its
+	 * corresponding MEMDEV(s).  From each MEMDEV find the
+	 * corresponding DCR.  Then, if we're operating on a SPA-DCR,
+	 * try to find a SPA-BDW and a corresponding BDW that references
+	 * the DCR.  Throw it all into an nfit_mem object.  Note that
+	 * BDWs are optional.
+	 */
+	list_for_each_entry(nfit_spa, &acpi_desc->spas, list) {
+		int rc;
+
+		rc = nfit_mem_dcr_init(acpi_desc, nfit_spa->spa);
+		if (rc)
+			return rc;
+	}
+
+	list_sort(NULL, &acpi_desc->dimms, nfit_mem_cmp);
+
+	return 0;
+}
+
+static int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
+{
+	struct device *dev = acpi_desc->dev;
+	const void *end;
+	u8 *data;
+
+	INIT_LIST_HEAD(&acpi_desc->spas);
+	INIT_LIST_HEAD(&acpi_desc->dcrs);
+	INIT_LIST_HEAD(&acpi_desc->bdws);
+	INIT_LIST_HEAD(&acpi_desc->memdevs);
+	INIT_LIST_HEAD(&acpi_desc->dimms);
+
+	data = (u8 *) acpi_desc->nfit;
+	end = data + sz;
+	data += sizeof(struct acpi_table_nfit);
+	while (!IS_ERR_OR_NULL(data))
+		data = add_table(acpi_desc, data, end);
+
+	if (IS_ERR(data)) {
+		dev_dbg(dev, "%s: nfit table parsing error: %ld\n", __func__,
+				PTR_ERR(data));
+		return PTR_ERR(data);
+	}
+
+	if (nfit_mem_init(acpi_desc) != 0)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int acpi_nfit_add(struct acpi_device *adev)
+{
+	struct nd_bus_descriptor *nd_desc;
+	struct acpi_nfit_desc *acpi_desc;
+	struct device *dev = &adev->dev;
+	struct acpi_table_header *tbl;
+	acpi_status status = AE_OK;
+	acpi_size sz;
+	int rc;
+
+	status = acpi_get_table_with_size("NFIT", 0, &tbl, &sz);
+	if (ACPI_FAILURE(status)) {
+		dev_err(dev, "failed to find NFIT\n");
+		return -ENXIO;
+	}
+
+	acpi_desc = devm_kzalloc(dev, sizeof(*acpi_desc), GFP_KERNEL);
+	if (!acpi_desc)
+		return -ENOMEM;
+
+	dev_set_drvdata(dev, acpi_desc);
+	acpi_desc->dev = dev;
+	acpi_desc->nfit = (struct acpi_table_nfit *) tbl;
+	nd_desc = &acpi_desc->nd_desc;
+	nd_desc->provider_name = "ACPI.NFIT";
+	nd_desc->ndctl = acpi_nfit_ctl;
+
+	acpi_desc->nd_bus = nd_bus_register(dev, nd_desc);
+	if (!acpi_desc->nd_bus)
+		return -ENXIO;
+
+	rc = acpi_nfit_init(acpi_desc, sz);
+	if (rc) {
+		nd_bus_unregister(acpi_desc->nd_bus);
+		return rc;
+	}
+	return 0;
+}
+
+static int acpi_nfit_remove(struct acpi_device *adev)
+{
+	struct acpi_nfit_desc *acpi_desc = dev_get_drvdata(&adev->dev);
+
+	nd_bus_unregister(acpi_desc->nd_bus);
+	return 0;
+}
+
+static const struct acpi_device_id acpi_nfit_ids[] = {
+	{ "ACPI0012", 0 },
+	{ "", 0 },
+};
+MODULE_DEVICE_TABLE(acpi, acpi_nfit_ids);
+
+static struct acpi_driver acpi_nfit_driver = {
+	.name = KBUILD_MODNAME,
+	.ids = acpi_nfit_ids,
+	.flags = ACPI_DRIVER_ALL_NOTIFY_EVENTS,
+	.ops = {
+		.add = acpi_nfit_add,
+		.remove = acpi_nfit_remove,
+	},
+};
+
+static __init int nfit_init(void)
+{
+	BUILD_BUG_ON(sizeof(struct acpi_table_nfit) != 40);
+	BUILD_BUG_ON(sizeof(struct acpi_nfit_system_address) != 56);
+	BUILD_BUG_ON(sizeof(struct acpi_nfit_memory_map) != 48);
+	BUILD_BUG_ON(sizeof(struct acpi_nfit_interleave) != 20);
+	BUILD_BUG_ON(sizeof(struct acpi_nfit_smbios) != 9);
+	BUILD_BUG_ON(sizeof(struct acpi_nfit_control_region) != 80);
+	BUILD_BUG_ON(sizeof(struct acpi_nfit_data_region) != 40);
+
+	acpi_str_to_uuid(UUID_VOLATILE_MEMORY, nfit_uuid[NFIT_SPA_VOLATILE]);
+	acpi_str_to_uuid(UUID_PERSISTENT_MEMORY, nfit_uuid[NFIT_SPA_PM]);
+	acpi_str_to_uuid(UUID_CONTROL_REGION, nfit_uuid[NFIT_SPA_DCR]);
+	acpi_str_to_uuid(UUID_DATA_REGION, nfit_uuid[NFIT_SPA_BDW]);
+	acpi_str_to_uuid(UUID_VOLATILE_VIRTUAL_DISK, nfit_uuid[NFIT_SPA_VDISK]);
+	acpi_str_to_uuid(UUID_VOLATILE_VIRTUAL_CD, nfit_uuid[NFIT_SPA_VCD]);
+	acpi_str_to_uuid(UUID_PERSISTENT_VIRTUAL_DISK, nfit_uuid[NFIT_SPA_PDISK]);
+	acpi_str_to_uuid(UUID_PERSISTENT_VIRTUAL_CD, nfit_uuid[NFIT_SPA_PCD]);
+	acpi_str_to_uuid(UUID_NFIT_BUS, nfit_uuid[NFIT_DEV_BUS]);
+	acpi_str_to_uuid(UUID_NFIT_DIMM, nfit_uuid[NFIT_DEV_DIMM]);
+
+	return acpi_bus_register_driver(&acpi_nfit_driver);
+}
+
+static __exit void nfit_exit(void)
+{
+	acpi_bus_unregister_driver(&acpi_nfit_driver);
+}
+
+module_init(nfit_init);
+module_exit(nfit_exit);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Intel Corporation");
diff --git a/drivers/acpi/nfit.h b/drivers/acpi/nfit.h
new file mode 100644
index 000000000000..ff72da9c9694
--- /dev/null
+++ b/drivers/acpi/nfit.h
@@ -0,0 +1,89 @@
+/*
+ * NVDIMM Firmware Interface Table - NFIT
+ *
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#ifndef __NFIT_H__
+#define __NFIT_H__
+#include <linux/types.h>
+#include <linux/libnd.h>
+#include <linux/uuid.h>
+#include <linux/acpi.h>
+#include <acpi/acuuid.h>
+
+#define UUID_NFIT_BUS "2f10e7a4-9e91-11e4-89d3-123b93f75cba"
+#define UUID_NFIT_DIMM "4309ac30-0d11-11e4-9191-0800200c9a66"
+
+enum nfit_uuids {
+	NFIT_SPA_VOLATILE,
+	NFIT_SPA_PM,
+	NFIT_SPA_DCR,
+	NFIT_SPA_BDW,
+	NFIT_SPA_VDISK,
+	NFIT_SPA_VCD,
+	NFIT_SPA_PDISK,
+	NFIT_SPA_PCD,
+	NFIT_DEV_BUS,
+	NFIT_DEV_DIMM,
+	NFIT_UUID_MAX,
+};
+
+struct nfit_spa {
+	struct acpi_nfit_system_address *spa;
+	struct list_head list;
+};
+
+struct nfit_dcr {
+	struct acpi_nfit_control_region *dcr;
+	struct list_head list;
+};
+
+struct nfit_bdw {
+	struct acpi_nfit_data_region *bdw;
+	struct list_head list;
+};
+
+struct nfit_memdev {
+	struct acpi_nfit_memory_map *memdev;
+	struct list_head list;
+};
+
+/* assembled tables for a given dimm/memory-device */
+struct nfit_mem {
+	struct acpi_nfit_memory_map *memdev_dcr;
+	struct acpi_nfit_memory_map *memdev_pmem;
+	struct acpi_nfit_control_region *dcr;
+	struct acpi_nfit_data_region *bdw;
+	struct acpi_nfit_system_address *spa_dcr;
+	struct acpi_nfit_system_address *spa_bdw;
+	struct list_head list;
+};
+
+struct acpi_nfit_desc {
+	struct nd_bus_descriptor nd_desc;
+	struct acpi_table_nfit *nfit;
+	struct list_head memdevs;
+	struct list_head dimms;
+	struct list_head spas;
+	struct list_head dcrs;
+	struct list_head bdws;
+	struct nd_bus *nd_bus;
+	struct device *dev;
+};
+
+static inline struct acpi_nfit_memory_map *__to_nfit_memdev(struct nfit_mem *nfit_mem)
+{
+	if (nfit_mem->memdev_dcr)
+		return nfit_mem->memdev_dcr;
+	return nfit_mem->memdev_pmem;
+}
+#endif /* __NFIT_H__ */
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index eb1fed5bd516..dfe40e5ca9bd 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -321,6 +321,8 @@ config BLK_DEV_NVME
 	  To compile this driver as a module, choose M here: the
 	  module will be called nvme.
 
+source "drivers/block/nd/Kconfig"
+
 config BLK_DEV_SKD
 	tristate "STEC S1120 Block Driver"
 	depends on PCI
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 9cc6c18a1c7e..07a6acecf4d8 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -24,6 +24,7 @@ obj-$(CONFIG_CDROM_PKTCDVD)	+= pktcdvd.o
 obj-$(CONFIG_MG_DISK)		+= mg_disk.o
 obj-$(CONFIG_SUNVDC)		+= sunvdc.o
 obj-$(CONFIG_BLK_DEV_NVME)	+= nvme.o
+obj-$(CONFIG_ND_DEVICES)	+= nd/
 obj-$(CONFIG_BLK_DEV_SKD)	+= skd.o
 obj-$(CONFIG_BLK_DEV_OSD)	+= osdblk.o
 
diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
new file mode 100644
index 000000000000..9b909c21afa1
--- /dev/null
+++ b/drivers/block/nd/Kconfig
@@ -0,0 +1,20 @@
+menuconfig ND_DEVICES
+	bool "NVDIMM (Non-Volatile Memory Device) Support"
+	help
+	  Generic support for non-volatile memory devices including
+	  ACPI-6-NFIT defined resources.  On platforms that define an
+	  NFIT, or otherwise can discover NVDIMM resources, a libnd
+	  bus is registered to advertise PMEM (persistent memory)
+	  namespaces (/dev/pmemX) and BLK (sliding mmio window(s))
+	  namespaces (/dev/ndX). A PMEM namespace refers to a memory
+	  resource that may span multiple DIMMs and support DAX (see
+	  CONFIG_DAX).  A BLK namespace refers to an NVDIMM control
+	  region which exposes an mmio register set for windowed
+	  access mode to non-volatile memory.
+
+if ND_DEVICES
+
+config LIBND
+	tristate
+
+endif
diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
new file mode 100644
index 000000000000..a647ff6cf557
--- /dev/null
+++ b/drivers/block/nd/Makefile
@@ -0,0 +1,3 @@
+obj-$(CONFIG_LIBND) += libnd.o
+
+libnd-y := core.o
diff --git a/drivers/block/nd/core.c b/drivers/block/nd/core.c
new file mode 100644
index 000000000000..15b89ce1a9af
--- /dev/null
+++ b/drivers/block/nd/core.c
@@ -0,0 +1,67 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/export.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/libnd.h>
+#include <linux/slab.h>
+#include "nd-private.h"
+
+static DEFINE_IDA(nd_ida);
+
+static void nd_bus_release(struct device *dev)
+{
+	struct nd_bus *nd_bus = container_of(dev, struct nd_bus, dev);
+
+	ida_simple_remove(&nd_ida, nd_bus->id);
+	kfree(nd_bus);
+}
+
+struct nd_bus *nd_bus_register(struct device *parent,
+		struct nd_bus_descriptor *nd_desc)
+{
+	struct nd_bus *nd_bus = kzalloc(sizeof(*nd_bus), GFP_KERNEL);
+	int rc;
+
+	if (!nd_bus)
+		return NULL;
+	nd_bus->id = ida_simple_get(&nd_ida, 0, 0, GFP_KERNEL);
+	if (nd_bus->id < 0) {
+		kfree(nd_bus);
+		return NULL;
+	}
+	nd_bus->nd_desc = nd_desc;
+	nd_bus->dev.parent = parent;
+	nd_bus->dev.release = nd_bus_release;
+	dev_set_name(&nd_bus->dev, "ndbus%d", nd_bus->id);
+	rc = device_register(&nd_bus->dev);
+	if (rc) {
+		dev_dbg(&nd_bus->dev, "device registration failed: %d\n", rc);
+		put_device(&nd_bus->dev);
+		return NULL;
+	}
+
+	return nd_bus;
+}
+EXPORT_SYMBOL_GPL(nd_bus_register);
+
+void nd_bus_unregister(struct nd_bus *nd_bus)
+{
+	if (!nd_bus)
+		return;
+	device_unregister(&nd_bus->dev);
+}
+EXPORT_SYMBOL_GPL(nd_bus_unregister);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Intel Corporation");
diff --git a/drivers/block/nd/nd-private.h b/drivers/block/nd/nd-private.h
new file mode 100644
index 000000000000..a107a19ffa9c
--- /dev/null
+++ b/drivers/block/nd/nd-private.h
@@ -0,0 +1,23 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#ifndef __ND_PRIVATE_H__
+#define __ND_PRIVATE_H__
+#include <linux/device.h>
+#include <linux/libnd.h>
+
+struct nd_bus {
+	struct nd_bus_descriptor *nd_desc;
+	struct device dev;
+	int id;
+};
+#endif /* __ND_PRIVATE_H__ */
diff --git a/include/acpi/actbl1.h b/include/acpi/actbl1.h
index b80b0e6dabc5..b0e431dbfd9e 100644
--- a/include/acpi/actbl1.h
+++ b/include/acpi/actbl1.h
@@ -71,6 +71,7 @@
 #define ACPI_SIG_SBST           "SBST"	/* Smart Battery Specification Table */
 #define ACPI_SIG_SLIT           "SLIT"	/* System Locality Distance Information Table */
 #define ACPI_SIG_SRAT           "SRAT"	/* System Resource Affinity Table */
+#define ACPI_SIG_NFIT           "NFIT"	/* NVDIMM Firmware Interface Table */
 
 /*
  * All tables must be byte-packed to match the ACPI specification, since
@@ -908,6 +909,159 @@ struct acpi_msct_proximity {
 
 /*******************************************************************************
  *
+ * NFIT - NVDIMM Firmware Interface Table (ACPI 6.0)
+ *        Version 1
+ *
+ ******************************************************************************/
+
+struct acpi_table_nfit {
+	struct acpi_table_header header;	/* Common ACPI table header */
+	u32 reserved;		/* Reserved, must be zero */
+};
+
+/* Subtable header for NFIT */
+
+struct acpi_nfit_header {
+	u16 type;
+	u16 length;
+};
+
+/* Values for subtable type in struct acpi_nfit_header */
+
+enum acpi_nfit_type {
+	ACPI_NFIT_TYPE_SYSTEM_ADDRESS = 0,
+	ACPI_NFIT_TYPE_MEMORY_MAP = 1,
+	ACPI_NFIT_TYPE_INTERLEAVE = 2,
+	ACPI_NFIT_TYPE_SMBIOS = 3,
+	ACPI_NFIT_TYPE_CONTROL_REGION = 4,
+	ACPI_NFIT_TYPE_DATA_REGION = 5,
+	ACPI_NFIT_TYPE_FLUSH_ADDRESS = 6,
+	ACPI_NFIT_TYPE_RESERVED = 7	/* 7 and greater are reserved */
+};
+
+/*
+ * NFIT Subtables
+ */
+
+/* 0: System Physical Address Range Structure */
+
+struct acpi_nfit_system_address {
+	struct acpi_nfit_header header;
+	u16 range_index;
+	u16 flags;
+	u32 reserved;		/* Reserved, must be zero */
+	u32 proximity_domain;
+	u8 range_guid[16];
+	u64 address;
+	u64 length;
+	u64 memory_mapping;
+};
+
+/* Flags */
+
+#define ACPI_NFIT_ADD_ONLINE_ONLY       (1)	/* 00: Add/Online Operation Only */
+#define ACPI_NFIT_PROXIMITY_VALID       (1<<1)	/* 01: Proximity Domain Valid */
+
+/* Range Type GUIDs appear in the include/acuuid.h file */
+
+/* 1: Memory Device to System Address Range Map Structure */
+
+struct acpi_nfit_memory_map {
+	struct acpi_nfit_header header;
+	u32 device_handle;
+	u16 physical_id;
+	u16 region_id;
+	u16 range_index;
+	u16 region_index;
+	u64 region_size;
+	u64 region_offset;
+	u64 address;
+	u16 interleave_index;
+	u16 interleave_ways;
+	u16 flags;
+	u16 reserved;		/* Reserved, must be zero */
+};
+
+/* Flags */
+
+#define ACPI_NFIT_MEM_SAVE_FAILED       (1)	/* 00: Last SAVE to Memory Device failed */
+#define ACPI_NFIT_MEM_RESTORE_FAILED    (1<<1)	/* 01: Last RESTORE from Memory Device failed */
+#define ACPI_NFIT_MEM_FLUSH_FAILED      (1<<2)	/* 02: Platform flush failed */
+#define ACPI_NFIT_MEM_ARMED             (1<<3)	/* 03: Memory Device observed to be not armed */
+#define ACPI_NFIT_MEM_HEALTH_OBSERVED   (1<<4)	/* 04: Memory Device observed SMART/health events */
+#define ACPI_NFIT_MEM_HEALTH_ENABLED    (1<<5)	/* 05: SMART/health events enabled */
+
+/* 2: Interleave Structure */
+
+struct acpi_nfit_interleave {
+	struct acpi_nfit_header header;
+	u16 interleave_index;
+	u16 reserved;		/* Reserved, must be zero */
+	u32 line_count;
+	u32 line_size;
+	u32 line_offset[1];	/* Variable length */
+};
+
+/* 3: SMBIOS Management Information Structure */
+
+struct acpi_nfit_smbios {
+	struct acpi_nfit_header header;
+	u32 reserved;		/* Reserved, must be zero */
+	u8 data[1];		/* Variable length */
+};
+
+/* 4: NVDIMM Control Region Structure */
+
+struct acpi_nfit_control_region {
+	struct acpi_nfit_header header;
+	u16 region_index;
+	u16 vendor_id;
+	u16 device_id;
+	u16 revision_id;
+	u16 subsystem_vendor_id;
+	u16 subsystem_device_id;
+	u16 subsystem_revision_id;
+	u8 reserved[6];		/* Reserved, must be zero */
+	u32 serial_number;
+	u16 code;
+	u16 windows;
+	u64 window_size;
+	u64 command_offset;
+	u64 command_size;
+	u64 status_offset;
+	u64 status_size;
+	u16 flags;
+	u8 reserved1[6];	/* Reserved, must be zero */
+};
+
+/* Flags */
+
+#define ACPI_NFIT_CONTROL_BUFFERED      (1)	/* Block Data Windows implementation is buffered */
+
+/* 5: NVDIMM Block Data Window Region Structure */
+
+struct acpi_nfit_data_region {
+	struct acpi_nfit_header header;
+	u16 region_index;
+	u16 windows;
+	u64 offset;
+	u64 size;
+	u64 capacity;
+	u64 start_address;
+};
+
+/* 6: Flush Hint Address Structure */
+
+struct acpi_nfit_flush_address {
+	struct acpi_nfit_header header;
+	u32 device_handle;
+	u16 hint_count;
+	u8 reserved[6];		/* Reserved, must be zero */
+	u64 hint_address[1];	/* Variable length */
+};
+
+/*******************************************************************************
+ *
  * SBST - Smart Battery Specification Table
  *        Version 1
  *
diff --git a/include/acpi/acuuid.h b/include/acpi/acuuid.h
new file mode 100644
index 000000000000..7c6cbb028ffc
--- /dev/null
+++ b/include/acpi/acuuid.h
@@ -0,0 +1,89 @@
+/******************************************************************************
+ *
+ * Name: acuuid.h - ACPI-related UUID/GUID definitions
+ *
+ *****************************************************************************/
+
+/*
+ * Copyright (C) 2000 - 2015, Intel Corp.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions, and the following disclaimer,
+ *    without modification.
+ * 2. Redistributions in binary form must reproduce at minimum a disclaimer
+ *    substantially similar to the "NO WARRANTY" disclaimer below
+ *    ("Disclaimer") and any redistribution must be conditioned upon
+ *    including a substantially similar Disclaimer requirement for further
+ *    binary redistribution.
+ * 3. Neither the names of the above-listed copyright holders nor the names
+ *    of any contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ *
+ * Alternatively, this software may be distributed under the terms of the
+ * GNU General Public License ("GPL") version 2 as published by the Free
+ * Software Foundation.
+ *
+ * NO WARRANTY
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * HOLDERS OR CONTRIBUTORS BE LIABLE FOR SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+ * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
+ * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGES.
+ */
+
+#ifndef __ACUUID_H__
+#define __ACUUID_H__
+
+/*
+ * Note1: UUIDs and GUIDs are defined to be identical in ACPI.
+ *
+ * Note2: This file is standalone and should remain that way.
+ */
+
+/* Controllers */
+
+#define UUID_GPIO_CONTROLLER            "4f248f40-d5e2-499f-834c-27758ea1cd3f"
+#define UUID_USB_CONTROLLER             "ce2ee385-00e6-48cb-9f05-2edb927c4899"
+#define UUID_SATA_CONTROLLER            "e4db149b-fcfe-425b-a6d8-92357d78fc7f"
+
+/* Devices */
+
+#define UUID_PCI_HOST_BRIDGE            "33db4d5b-1ff7-401c-9657-7441c03dd766"
+#define UUID_I2C_DEVICE                 "3cdff6f7-4267-4555-ad05-b30a3d8938de"
+#define UUID_POWER_BUTTON               "dfbcf3c5-e7a5-44e6-9c1f-29c76f6e059c"
+
+/* Interfaces */
+
+#define UUID_DEVICE_LABELING            "e5c937d0-3553-4d7a-9117-ea4d19c3434d"
+#define UUID_PHYSICAL_PRESENCE          "3dddfaa6-361b-4eb4-a424-8d10089d1653"
+
+/* NVDIMM - NFIT table */
+
+#define UUID_VOLATILE_MEMORY            "4f940573-dafd-e344-b16c-3f22d252e5d0"
+#define UUID_PERSISTENT_MEMORY          "79d3f066-f3b4-7440-ac43-0d3318b78cdb"
+#define UUID_CONTROL_REGION             "f601f792-b413-5d40-910b-299367e8234c"
+#define UUID_DATA_REGION                "3005af91-865d-0e47-a6b0-0a2db9408249"
+#define UUID_VOLATILE_VIRTUAL_DISK      "5a53ab77-fc45-4b62-5560-f7b281d1f96e"
+#define UUID_VOLATILE_VIRTUAL_CD        "30bd5a3d-7541-ce87-6d64-d2ade523c4bb"
+#define UUID_PERSISTENT_VIRTUAL_DISK    "c902ea5c-074d-69d3-269f-4496fbe096f9"
+#define UUID_PERSISTENT_VIRTUAL_CD      "88810108-cd42-48bb-100f-5387d53ded3d"
+
+/* Miscellaneous */
+
+#define UUID_PLATFORM_CAPABILITIES      "0811b06e-4a27-44f9-8d60-3cbbc22e7b48"
+#define UUID_DYNAMIC_ENUMERATION        "d8c1a3a6-be9b-4c9b-91bf-c3cb81fc5daf"
+#define UUID_BATTERY_THERMAL_LIMIT      "4c2067e3-887d-475c-9720-4af1d3ed602e"
+#define UUID_THERMAL_EXTENSIONS         "14d399cd-7a27-4b18-8fb4-7cb7b9f4e500"
+#define UUID_DEVICE_PROPERTIES          "daffd814-6eba-4d8c-8a91-bc9bbf4aa301"
+
+#endif				/* __ACUUID_H__ */
diff --git a/include/linux/libnd.h b/include/linux/libnd.h
new file mode 100644
index 000000000000..8e4441002868
--- /dev/null
+++ b/include/linux/libnd.h
@@ -0,0 +1,34 @@
+/*
+ * libnd - Non-volatile-memory Devices Subsystem
+ *
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#ifndef __LIBND_H__
+#define __LIBND_H__
+struct nd_dimm;
+struct nd_bus_descriptor;
+typedef int (*ndctl_fn)(struct nd_bus_descriptor *nd_desc,
+		struct nd_dimm *nd_dimm, unsigned int cmd, void *buf,
+		unsigned int buf_len);
+
+struct nd_bus_descriptor {
+	unsigned long dsm_mask;
+	char *provider_name;
+	ndctl_fn ndctl;
+};
+
+struct nd_bus;
+struct device;
+struct nd_bus *nd_bus_register(struct device *parent,
+		struct nd_bus_descriptor *nfit_desc);
+void nd_bus_unregister(struct nd_bus *nd_bus);
+#endif /* __LIBND_H__ */


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [PATCH v3 03/21] libnd: control character device and libnd bus sysfs attributes
  2015-05-20 20:56 ` Dan Williams
@ 2015-05-20 20:56   ` Dan Williams
  -1 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2015-05-20 20:56 UTC (permalink / raw)
  To: axboe
  Cc: mingo, linux-nvdimm, neilb, gregkh, Rafael J. Wysocki,
	Robert Moore, linux-kernel, linux-acpi, jmoyer, hch

The control device for a libnd bus is registered as an "nd" class
device.  The expectation is that there will usually only be one "nd" bus
registered under /sys/class/nd.  However, we allow for the possibility
of multiple buses and they will be listed in discovery order as
ndctl0...ndctlN.  This character device hosts the ioctl for passing
control messages.  The initial command set has a 1:1 correlation with
the commands listed in the "NFIT DSM Example" document [1], but
this scheme is extensible to future command sets.

Note, nd_ioctl() and the backing ->ndctl() implementation are defined in
a subsequent patch.  This is simply the initial registrations and sysfs
attributes.

[1]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
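
For illustration only, here is a userspace sketch of how the attributes
registered by this patch might be read back.  It assumes that the ndctlN
class device exposes the usual "device" link to its parent ndbusN, so that
"provider" would appear at /sys/class/nd/ndctlN/device/provider.  The helper
names nd_attr_path() and nd_read_attr() are hypothetical and are not part of
this series.

```c
#include <stdio.h>
#include <string.h>

/*
 * Build the sysfs path of a bus attribute for control device ndctl<id>.
 * Assumption: the class device carries a "device" link to its parent
 * ndbus<id>, which is where this patch attaches the attribute groups.
 * Returns 0 on success, -1 if the path does not fit in buf.
 */
int nd_attr_path(char *buf, size_t len, int id, const char *attr)
{
	int n = snprintf(buf, len, "/sys/class/nd/ndctl%d/device/%s", id, attr);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Read the first line of a sysfs attribute into out, stripping the
 * trailing newline.  Returns 0 on success, -1 on any failure.
 */
int nd_read_attr(const char *path, char *out, size_t len)
{
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(out, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	out[strcspn(out, "\n")] = '\0';
	return 0;
}
```

Under the path assumption above, reading "provider" on a system where NFIT
is the provider would be expected to yield "ACPI.NFIT", and the revision
would be found under the named group as "nfit/revision".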

Cc: Neil Brown <neilb@suse.de>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: <linux-acpi@vger.kernel.org>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/acpi/nfit.c           |   29 ++++++++++++++
 drivers/acpi/nfit.h           |    5 ++
 drivers/block/nd/Makefile     |    1 
 drivers/block/nd/bus.c        |   83 +++++++++++++++++++++++++++++++++++++++
 drivers/block/nd/core.c       |   87 ++++++++++++++++++++++++++++++++++++++++-
 drivers/block/nd/nd-private.h |    6 +++
 include/linux/libnd.h         |    5 ++
 7 files changed, 214 insertions(+), 2 deletions(-)
 create mode 100644 drivers/block/nd/bus.c

diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index 13132a16901c..d31a0fffafcc 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -316,6 +316,34 @@ static int nfit_mem_init(struct acpi_nfit_desc *acpi_desc)
 	return 0;
 }
 
+static ssize_t revision_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_bus *nd_bus = to_nd_bus(dev);
+	struct nd_bus_descriptor *nd_desc = to_nd_desc(nd_bus);
+	struct acpi_nfit_desc *acpi_desc = to_acpi_desc(nd_desc);
+
+	return sprintf(buf, "%d\n", acpi_desc->nfit->header.revision);
+}
+static DEVICE_ATTR_RO(revision);
+
+static struct attribute *acpi_nfit_attributes[] = {
+	&dev_attr_revision.attr,
+	NULL,
+};
+
+static struct attribute_group acpi_nfit_attribute_group = {
+	.name = "nfit",
+	.attrs = acpi_nfit_attributes,
+};
+
+const struct attribute_group *acpi_nfit_attribute_groups[] = {
+	&nd_bus_attribute_group,
+	&acpi_nfit_attribute_group,
+	NULL,
+};
+EXPORT_SYMBOL_GPL(acpi_nfit_attribute_groups);
+
 static int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
 {
 	struct device *dev = acpi_desc->dev;
@@ -372,6 +400,7 @@ static int acpi_nfit_add(struct acpi_device *adev)
 	nd_desc = &acpi_desc->nd_desc;
 	nd_desc->provider_name = "ACPI.NFIT";
 	nd_desc->ndctl = acpi_nfit_ctl;
+	nd_desc->attr_groups = acpi_nfit_attribute_groups;
 
 	acpi_desc->nd_bus = nd_bus_register(dev, nd_desc);
 	if (!acpi_desc->nd_bus)
diff --git a/drivers/acpi/nfit.h b/drivers/acpi/nfit.h
index ff72da9c9694..b6c85d773ca1 100644
--- a/drivers/acpi/nfit.h
+++ b/drivers/acpi/nfit.h
@@ -86,4 +86,9 @@ static inline struct acpi_nfit_memory_map *__to_nfit_memdev(struct nfit_mem *nfi
 		return nfit_mem->memdev_dcr;
 	return nfit_mem->memdev_pmem;
 }
+
+static inline struct acpi_nfit_desc *to_acpi_desc(struct nd_bus_descriptor *nd_desc)
+{
+	return container_of(nd_desc, struct acpi_nfit_desc, nd_desc);
+}
 #endif /* __NFIT_H__ */
diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index a647ff6cf557..34d1b58b3258 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -1,3 +1,4 @@
 obj-$(CONFIG_LIBND) += libnd.o
 
 libnd-y := core.o
+libnd-y += bus.o
diff --git a/drivers/block/nd/bus.c b/drivers/block/nd/bus.c
new file mode 100644
index 000000000000..635f2e926426
--- /dev/null
+++ b/drivers/block/nd/bus.c
@@ -0,0 +1,83 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/uaccess.h>
+#include <linux/fcntl.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/io.h>
+#include "nd-private.h"
+
+static int nd_bus_major;
+static struct class *nd_class;
+
+int nd_bus_create_ndctl(struct nd_bus *nd_bus)
+{
+	dev_t devt = MKDEV(nd_bus_major, nd_bus->id);
+	struct device *dev;
+
+	dev = device_create(nd_class, &nd_bus->dev, devt, nd_bus, "ndctl%d",
+			nd_bus->id);
+
+	if (IS_ERR(dev)) {
+		dev_dbg(&nd_bus->dev, "failed to register ndctl%d: %ld\n",
+				nd_bus->id, PTR_ERR(dev));
+		return PTR_ERR(dev);
+	}
+	return 0;
+}
+
+void nd_bus_destroy_ndctl(struct nd_bus *nd_bus)
+{
+	device_destroy(nd_class, MKDEV(nd_bus_major, nd_bus->id));
+}
+
+static long nd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+	return -ENXIO;
+}
+
+static const struct file_operations nd_bus_fops = {
+	.owner = THIS_MODULE,
+	.open = nonseekable_open,
+	.unlocked_ioctl = nd_ioctl,
+	.compat_ioctl = nd_ioctl,
+	.llseek = noop_llseek,
+};
+
+int __init nd_bus_init(void)
+{
+	int rc;
+
+	rc = register_chrdev(0, "ndctl", &nd_bus_fops);
+	if (rc < 0)
+		return rc;
+	nd_bus_major = rc;
+
+	nd_class = class_create(THIS_MODULE, "nd");
+	if (IS_ERR(nd_class))
+		goto err_class;
+
+	return 0;
+
+ err_class:
+	unregister_chrdev(nd_bus_major, "ndctl");
+
+	return rc;
+}
+
+void __exit nd_bus_exit(void)
+{
+	class_destroy(nd_class);
+	unregister_chrdev(nd_bus_major, "ndctl");
+}
diff --git a/drivers/block/nd/core.c b/drivers/block/nd/core.c
index 15b89ce1a9af..49b7ac8f7606 100644
--- a/drivers/block/nd/core.c
+++ b/drivers/block/nd/core.c
@@ -14,9 +14,12 @@
 #include <linux/module.h>
 #include <linux/device.h>
 #include <linux/libnd.h>
+#include <linux/mutex.h>
 #include <linux/slab.h>
 #include "nd-private.h"
 
+LIST_HEAD(nd_bus_list);
+DEFINE_MUTEX(nd_bus_list_mutex);
 static DEFINE_IDA(nd_ida);
 
 static void nd_bus_release(struct device *dev)
@@ -27,6 +30,54 @@ static void nd_bus_release(struct device *dev)
 	kfree(nd_bus);
 }
 
+struct nd_bus *to_nd_bus(struct device *dev)
+{
+	struct nd_bus *nd_bus = container_of(dev, struct nd_bus, dev);
+
+	WARN_ON(nd_bus->dev.release != nd_bus_release);
+	return nd_bus;
+}
+EXPORT_SYMBOL_GPL(to_nd_bus);
+
+struct nd_bus_descriptor *to_nd_desc(struct nd_bus *nd_bus)
+{
+	/* struct nd_bus definition is private to libnd */
+	return nd_bus->nd_desc;
+}
+EXPORT_SYMBOL_GPL(to_nd_desc);
+
+static const char *nd_bus_provider(struct nd_bus *nd_bus)
+{
+	struct nd_bus_descriptor *nd_desc = nd_bus->nd_desc;
+	struct device *parent = nd_bus->dev.parent;
+
+	if (nd_desc->provider_name)
+		return nd_desc->provider_name;
+	else if (parent)
+		return dev_name(parent);
+	else
+		return "unknown";
+}
+
+static ssize_t provider_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_bus *nd_bus = to_nd_bus(dev);
+
+	return sprintf(buf, "%s\n", nd_bus_provider(nd_bus));
+}
+static DEVICE_ATTR_RO(provider);
+
+static struct attribute *nd_bus_attributes[] = {
+	&dev_attr_provider.attr,
+	NULL,
+};
+
+struct attribute_group nd_bus_attribute_group = {
+	.attrs = nd_bus_attributes,
+};
+EXPORT_SYMBOL_GPL(nd_bus_attribute_group);
+
 struct nd_bus *nd_bus_register(struct device *parent,
 		struct nd_bus_descriptor *nd_desc)
 {
@@ -35,6 +86,7 @@ struct nd_bus *nd_bus_register(struct device *parent,
 
 	if (!nd_bus)
 		return NULL;
+	INIT_LIST_HEAD(&nd_bus->list);
 	nd_bus->id = ida_simple_get(&nd_ida, 0, 0, GFP_KERNEL);
 	if (nd_bus->id < 0) {
 		kfree(nd_bus);
@@ -43,15 +95,26 @@ struct nd_bus *nd_bus_register(struct device *parent,
 	nd_bus->nd_desc = nd_desc;
 	nd_bus->dev.parent = parent;
 	nd_bus->dev.release = nd_bus_release;
+	nd_bus->dev.groups = nd_desc->attr_groups;
 	dev_set_name(&nd_bus->dev, "ndbus%d", nd_bus->id);
 	rc = device_register(&nd_bus->dev);
 	if (rc) {
 		dev_dbg(&nd_bus->dev, "device registration failed: %d\n", rc);
-		put_device(&nd_bus->dev);
-		return NULL;
+		goto err;
 	}
 
+	rc = nd_bus_create_ndctl(nd_bus);
+	if (rc)
+		goto err;
+
+	mutex_lock(&nd_bus_list_mutex);
+	list_add_tail(&nd_bus->list, &nd_bus_list);
+	mutex_unlock(&nd_bus_list_mutex);
+
 	return nd_bus;
+ err:
+	put_device(&nd_bus->dev);
+	return NULL;
 }
 EXPORT_SYMBOL_GPL(nd_bus_register);
 
@@ -59,9 +122,29 @@ void nd_bus_unregister(struct nd_bus *nd_bus)
 {
 	if (!nd_bus)
 		return;
+
+	mutex_lock(&nd_bus_list_mutex);
+	list_del_init(&nd_bus->list);
+	mutex_unlock(&nd_bus_list_mutex);
+
+	nd_bus_destroy_ndctl(nd_bus);
+
 	device_unregister(&nd_bus->dev);
 }
 EXPORT_SYMBOL_GPL(nd_bus_unregister);
 
+static __init int libnd_init(void)
+{
+	return nd_bus_init();
+}
+
+static __exit void libnd_exit(void)
+{
+	WARN_ON(!list_empty(&nd_bus_list));
+	nd_bus_exit();
+}
+
 MODULE_LICENSE("GPL v2");
 MODULE_AUTHOR("Intel Corporation");
+module_init(libnd_init);
+module_exit(libnd_exit);
diff --git a/drivers/block/nd/nd-private.h b/drivers/block/nd/nd-private.h
index a107a19ffa9c..884601f65a15 100644
--- a/drivers/block/nd/nd-private.h
+++ b/drivers/block/nd/nd-private.h
@@ -17,7 +17,13 @@
 
 struct nd_bus {
 	struct nd_bus_descriptor *nd_desc;
+	struct list_head list;
 	struct device dev;
 	int id;
 };
+
+int __init nd_bus_init(void);
+void __exit nd_bus_exit(void);
+int nd_bus_create_ndctl(struct nd_bus *nd_bus);
+void nd_bus_destroy_ndctl(struct nd_bus *nd_bus);
 #endif /* __ND_PRIVATE_H__ */
diff --git a/include/linux/libnd.h b/include/linux/libnd.h
index 8e4441002868..04a97653d56c 100644
--- a/include/linux/libnd.h
+++ b/include/linux/libnd.h
@@ -14,6 +14,8 @@
  */
 #ifndef __LIBND_H__
 #define __LIBND_H__
+extern struct attribute_group nd_bus_attribute_group;
+
 struct nd_dimm;
 struct nd_bus_descriptor;
 typedef int (*ndctl_fn)(struct nd_bus_descriptor *nd_desc,
@@ -21,6 +23,7 @@ typedef int (*ndctl_fn)(struct nd_bus_descriptor *nd_desc,
 		unsigned int buf_len);
 
 struct nd_bus_descriptor {
+	const struct attribute_group **attr_groups;
 	unsigned long dsm_mask;
 	char *provider_name;
 	ndctl_fn ndctl;
@@ -31,4 +34,6 @@ struct device;
 struct nd_bus *nd_bus_register(struct device *parent,
 		struct nd_bus_descriptor *nfit_desc);
 void nd_bus_unregister(struct nd_bus *nd_bus);
+struct nd_bus *to_nd_bus(struct device *dev);
+struct nd_bus_descriptor *to_nd_desc(struct nd_bus *nd_bus);
 #endif /* __LIBND_H__ */


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [PATCH v3 04/21] libnd, nfit: dimm/memory-devices
  2015-05-20 20:56 ` Dan Williams
@ 2015-05-20 20:56   ` Dan Williams
  -1 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2015-05-20 20:56 UTC (permalink / raw)
  To: axboe
  Cc: mingo, linux-nvdimm, neilb, gregkh, Rafael J. Wysocki,
	Robert Moore, linux-kernel, linux-acpi, jmoyer, hch

Enable dimm devices to be registered on a libnd bus.  The kernel-assigned
device id for dimms is dynamic.  If userspace needs a stable identifier
it should consult a provider-specific attribute.  In
the case where NFIT is the provider, the 'nmemX/nfit/handle' or
'nmemX/nfit/serial' attributes may be used for this purpose.
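
As a hedged illustration of the paragraph above, userspace could turn the
text of 'nmemX/nfit/handle' or 'nmemX/nfit/serial' back into a number.  The
attributes in this patch are printed with "%#x\n", so parsing with base 0
accepts the optional "0x" prefix.  The helper name nfit_parse_attr() is
hypothetical and is not part of this series.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Parse the text of an nfit dimm attribute (e.g. "0x10\n") into a value.
 * Base 0 lets strtoul() accept the "0x" prefix produced by "%#x".
 * Returns 0 on success, -1 if the text is empty, malformed, or out of range.
 */
int nfit_parse_attr(const char *text, unsigned long *val)
{
	char *end;

	errno = 0;
	*val = strtoul(text, &end, 0);
	if (errno || end == text)
		return -1;
	/* allow a single trailing newline, as sysfs attributes emit one */
	if (*end == '\n')
		end++;
	return *end == '\0' ? 0 : -1;
}
```

A tool that needs to recognize the same dimm again later would compare the
parsed handle or serial rather than the X in nmemX, since the kernel-assigned
id is dynamic.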

Cc: Neil Brown <neilb@suse.de>
Cc: <linux-acpi@vger.kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/acpi/nfit.c           |  160 +++++++++++++++++++++++++++++++++++++++++
 drivers/acpi/nfit.h           |    1 
 drivers/block/nd/Makefile     |    1 
 drivers/block/nd/bus.c        |   14 +++-
 drivers/block/nd/core.c       |   29 +++++++
 drivers/block/nd/dimm_devs.c  |   92 ++++++++++++++++++++++++
 drivers/block/nd/nd-private.h |   12 +++
 include/linux/libnd.h         |   11 +++
 8 files changed, 318 insertions(+), 2 deletions(-)
 create mode 100644 drivers/block/nd/dimm_devs.c

diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index d31a0fffafcc..b26e1a4a59e3 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -344,6 +344,164 @@ const struct attribute_group *acpi_nfit_attribute_groups[] = {
 };
 EXPORT_SYMBOL_GPL(acpi_nfit_attribute_groups);
 
+static struct acpi_nfit_memory_map *to_nfit_memdev(struct device *dev)
+{
+	struct nd_dimm *nd_dimm = to_nd_dimm(dev);
+	struct nfit_mem *nfit_mem = nd_dimm_provider_data(nd_dimm);
+
+	return __to_nfit_memdev(nfit_mem);
+}
+
+static struct acpi_nfit_control_region *to_nfit_dcr(struct device *dev)
+{
+	struct nd_dimm *nd_dimm = to_nd_dimm(dev);
+	struct nfit_mem *nfit_mem = nd_dimm_provider_data(nd_dimm);
+
+	return nfit_mem->dcr;
+}
+
+static ssize_t handle_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct acpi_nfit_memory_map *memdev = to_nfit_memdev(dev);
+
+	return sprintf(buf, "%#x\n", memdev->device_handle);
+}
+static DEVICE_ATTR_RO(handle);
+
+static ssize_t phys_id_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct acpi_nfit_memory_map *memdev = to_nfit_memdev(dev);
+
+	return sprintf(buf, "%#x\n", memdev->physical_id);
+}
+static DEVICE_ATTR_RO(phys_id);
+
+static ssize_t vendor_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct acpi_nfit_control_region *dcr = to_nfit_dcr(dev);
+
+	return sprintf(buf, "%#x\n", dcr->vendor_id);
+}
+static DEVICE_ATTR_RO(vendor);
+
+static ssize_t rev_id_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct acpi_nfit_control_region *dcr = to_nfit_dcr(dev);
+
+	return sprintf(buf, "%#x\n", dcr->revision_id);
+}
+static DEVICE_ATTR_RO(rev_id);
+
+static ssize_t device_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct acpi_nfit_control_region *dcr = to_nfit_dcr(dev);
+
+	return sprintf(buf, "%#x\n", dcr->device_id);
+}
+static DEVICE_ATTR_RO(device);
+
+static ssize_t format_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct acpi_nfit_control_region *dcr = to_nfit_dcr(dev);
+
+	return sprintf(buf, "%#x\n", dcr->code);
+}
+static DEVICE_ATTR_RO(format);
+
+static ssize_t serial_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct acpi_nfit_control_region *dcr = to_nfit_dcr(dev);
+
+	return sprintf(buf, "%#x\n", dcr->serial_number);
+}
+static DEVICE_ATTR_RO(serial);
+
+static struct attribute *acpi_nfit_dimm_attributes[] = {
+	&dev_attr_handle.attr,
+	&dev_attr_phys_id.attr,
+	&dev_attr_vendor.attr,
+	&dev_attr_device.attr,
+	&dev_attr_format.attr,
+	&dev_attr_serial.attr,
+	&dev_attr_rev_id.attr,
+	NULL,
+};
+
+static umode_t acpi_nfit_dimm_attr_visible(struct kobject *kobj, struct attribute *a, int n)
+{
+	struct device *dev = container_of(kobj, struct device, kobj);
+
+	if (to_nfit_dcr(dev))
+		return a->mode;
+	else
+		return 0;
+}
+
+static struct attribute_group acpi_nfit_dimm_attribute_group = {
+	.name = "nfit",
+	.attrs = acpi_nfit_dimm_attributes,
+	.is_visible = acpi_nfit_dimm_attr_visible,
+};
+
+static const struct attribute_group *acpi_nfit_dimm_attribute_groups[] = {
+	&acpi_nfit_dimm_attribute_group,
+	NULL,
+};
+
+static struct nd_dimm *acpi_nfit_dimm_by_handle(struct acpi_nfit_desc *acpi_desc,
+		u32 device_handle)
+{
+	struct nfit_mem *nfit_mem;
+
+	list_for_each_entry(nfit_mem, &acpi_desc->dimms, list)
+		if (__to_nfit_memdev(nfit_mem)->device_handle == device_handle)
+			return nfit_mem->nd_dimm;
+
+	return NULL;
+}
+
+static int acpi_nfit_register_dimms(struct acpi_nfit_desc *acpi_desc)
+{
+	struct nfit_mem *nfit_mem;
+
+	list_for_each_entry(nfit_mem, &acpi_desc->dimms, list) {
+		struct nd_dimm *nd_dimm;
+		unsigned long flags = 0;
+		u32 device_handle;
+
+		device_handle = __to_nfit_memdev(nfit_mem)->device_handle;
+		nd_dimm = acpi_nfit_dimm_by_handle(acpi_desc, device_handle);
+		if (nd_dimm) {
+			/*
+			 * If for some reason we find multiple DCRs the
+			 * first one wins
+			 */
+			dev_err(acpi_desc->dev, "duplicate DCR detected: %s\n",
+					nd_dimm_name(nd_dimm));
+			continue;
+		}
+
+		if (nfit_mem->bdw && nfit_mem->memdev_pmem)
+			flags |= NDD_ALIASING;
+
+		nd_dimm = nd_dimm_create(acpi_desc->nd_bus, nfit_mem,
+				acpi_nfit_dimm_attribute_groups, flags);
+		if (!nd_dimm)
+			return -ENOMEM;
+
+		nfit_mem->nd_dimm = nd_dimm;
+	}
+
+	return 0;
+}
+
 static int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
 {
 	struct device *dev = acpi_desc->dev;
@@ -371,7 +529,7 @@ static int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
 	if (nfit_mem_init(acpi_desc) != 0)
 		return -ENOMEM;
 
-	return 0;
+	return acpi_nfit_register_dimms(acpi_desc);
 }
 
 static int acpi_nfit_add(struct acpi_device *adev)
diff --git a/drivers/acpi/nfit.h b/drivers/acpi/nfit.h
index b6c85d773ca1..9d4c1634cb0e 100644
--- a/drivers/acpi/nfit.h
+++ b/drivers/acpi/nfit.h
@@ -59,6 +59,7 @@ struct nfit_memdev {
 
 /* assembled tables for a given dimm/memory-device */
 struct nfit_mem {
+	struct nd_dimm *nd_dimm;
 	struct acpi_nfit_memory_map *memdev_dcr;
 	struct acpi_nfit_memory_map *memdev_pmem;
 	struct acpi_nfit_control_region *dcr;
diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index 34d1b58b3258..2954b9543bec 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -2,3 +2,4 @@ obj-$(CONFIG_LIBND) += libnd.o
 
 libnd-y := core.o
 libnd-y += bus.o
+libnd-y += dimm_devs.o
diff --git a/drivers/block/nd/bus.c b/drivers/block/nd/bus.c
index 635f2e926426..ee56aa1ab2ad 100644
--- a/drivers/block/nd/bus.c
+++ b/drivers/block/nd/bus.c
@@ -13,6 +13,7 @@
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 #include <linux/uaccess.h>
 #include <linux/fcntl.h>
+#include <linux/async.h>
 #include <linux/slab.h>
 #include <linux/fs.h>
 #include <linux/io.h>
@@ -21,6 +22,10 @@
 static int nd_bus_major;
 static struct class *nd_class;
 
+struct bus_type nd_bus_type = {
+	.name = "nd",
+};
+
 int nd_bus_create_ndctl(struct nd_bus *nd_bus)
 {
 	dev_t devt = MKDEV(nd_bus_major, nd_bus->id);
@@ -59,9 +64,13 @@ int __init nd_bus_init(void)
 {
 	int rc;
 
+	rc = bus_register(&nd_bus_type);
+	if (rc)
+		return rc;
+
 	rc = register_chrdev(0, "ndctl", &nd_bus_fops);
 	if (rc < 0)
-		return rc;
+		goto err_chrdev;
 	nd_bus_major = rc;
 
 	nd_class = class_create(THIS_MODULE, "nd");
@@ -72,6 +81,8 @@ int __init nd_bus_init(void)
 
  err_class:
 	unregister_chrdev(nd_bus_major, "ndctl");
+ err_chrdev:
+	bus_unregister(&nd_bus_type);
 
 	return rc;
 }
@@ -80,4 +91,5 @@ void __exit nd_bus_exit(void)
 {
 	class_destroy(nd_class);
 	unregister_chrdev(nd_bus_major, "ndctl");
+	bus_unregister(&nd_bus_type);
 }
diff --git a/drivers/block/nd/core.c b/drivers/block/nd/core.c
index 49b7ac8f7606..4d0e53ecdcb0 100644
--- a/drivers/block/nd/core.c
+++ b/drivers/block/nd/core.c
@@ -46,6 +46,19 @@ struct nd_bus_descriptor *to_nd_desc(struct nd_bus *nd_bus)
 }
 EXPORT_SYMBOL_GPL(to_nd_desc);
 
+struct nd_bus *walk_to_nd_bus(struct device *nd_dev)
+{
+	struct device *dev;
+
+	for (dev = nd_dev; dev; dev = dev->parent)
+		if (dev->release == nd_bus_release)
+			break;
+	dev_WARN_ONCE(nd_dev, !dev, "invalid dev, not on nd bus\n");
+	if (dev)
+		return to_nd_bus(dev);
+	return NULL;
+}
+
 static const char *nd_bus_provider(struct nd_bus *nd_bus)
 {
 	struct nd_bus_descriptor *nd_desc = nd_bus->nd_desc;
@@ -118,6 +131,21 @@ struct nd_bus *nd_bus_register(struct device *parent,
 }
 EXPORT_SYMBOL_GPL(nd_bus_register);
 
+static int child_unregister(struct device *dev, void *data)
+{
+	/*
+	 * the singular ndctl class device per bus needs to be
+	 * "device_destroy"ed, so skip it here
+	 *
+	 * i.e. remove classless children
+	 */
+	if (dev->class)
+		/* pass */;
+	else
+		device_unregister(dev);
+	return 0;
+}
+
 void nd_bus_unregister(struct nd_bus *nd_bus)
 {
 	if (!nd_bus)
@@ -127,6 +155,7 @@ void nd_bus_unregister(struct nd_bus *nd_bus)
 	list_del_init(&nd_bus->list);
 	mutex_unlock(&nd_bus_list_mutex);
 
+	device_for_each_child(&nd_bus->dev, NULL, child_unregister);
 	nd_bus_destroy_ndctl(nd_bus);
 
 	device_unregister(&nd_bus->dev);
diff --git a/drivers/block/nd/dimm_devs.c b/drivers/block/nd/dimm_devs.c
new file mode 100644
index 000000000000..19b081392f2f
--- /dev/null
+++ b/drivers/block/nd/dimm_devs.c
@@ -0,0 +1,92 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/device.h>
+#include <linux/slab.h>
+#include <linux/io.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include "nd-private.h"
+
+static DEFINE_IDA(dimm_ida);
+
+static void nd_dimm_release(struct device *dev)
+{
+	struct nd_dimm *nd_dimm = to_nd_dimm(dev);
+
+	ida_simple_remove(&dimm_ida, nd_dimm->id);
+	kfree(nd_dimm);
+}
+
+static struct device_type nd_dimm_device_type = {
+	.name = "nd_dimm",
+	.release = nd_dimm_release,
+};
+
+static bool is_nd_dimm(struct device *dev)
+{
+	return dev->type == &nd_dimm_device_type;
+}
+
+struct nd_dimm *to_nd_dimm(struct device *dev)
+{
+	struct nd_dimm *nd_dimm = container_of(dev, struct nd_dimm, dev);
+
+	WARN_ON(!is_nd_dimm(dev));
+	return nd_dimm;
+}
+EXPORT_SYMBOL_GPL(to_nd_dimm);
+
+const char *nd_dimm_name(struct nd_dimm *nd_dimm)
+{
+	return dev_name(&nd_dimm->dev);
+}
+EXPORT_SYMBOL_GPL(nd_dimm_name);
+
+void *nd_dimm_provider_data(struct nd_dimm *nd_dimm)
+{
+	return nd_dimm->provider_data;
+}
+EXPORT_SYMBOL_GPL(nd_dimm_provider_data);
+
+struct nd_dimm *nd_dimm_create(struct nd_bus *nd_bus, void *provider_data,
+		const struct attribute_group **groups, unsigned long flags)
+{
+	struct nd_dimm *nd_dimm = kzalloc(sizeof(*nd_dimm), GFP_KERNEL);
+	struct device *dev;
+
+	if (!nd_dimm)
+		return NULL;
+
+	nd_dimm->id = ida_simple_get(&dimm_ida, 0, 0, GFP_KERNEL);
+	if (nd_dimm->id < 0) {
+		kfree(nd_dimm);
+		return NULL;
+	}
+	nd_dimm->provider_data = provider_data;
+	nd_dimm->flags = flags;
+
+	dev = &nd_dimm->dev;
+	dev_set_name(dev, "nmem%d", nd_dimm->id);
+	dev->parent = &nd_bus->dev;
+	dev->type = &nd_dimm_device_type;
+	dev->bus = &nd_bus_type;
+	dev->groups = groups;
+	if (device_register(dev) != 0) {
+		put_device(dev);
+		return NULL;
+	}
+
+	return nd_dimm;
+}
+EXPORT_SYMBOL_GPL(nd_dimm_create);
diff --git a/drivers/block/nd/nd-private.h b/drivers/block/nd/nd-private.h
index 884601f65a15..251ecdd77153 100644
--- a/drivers/block/nd/nd-private.h
+++ b/drivers/block/nd/nd-private.h
@@ -15,6 +15,10 @@
 #include <linux/device.h>
 #include <linux/libnd.h>
 
+extern struct list_head nd_bus_list;
+extern struct mutex nd_bus_list_mutex;
+extern struct bus_type nd_bus_type;
+
 struct nd_bus {
 	struct nd_bus_descriptor *nd_desc;
 	struct list_head list;
@@ -22,6 +26,14 @@ struct nd_bus {
 	int id;
 };
 
+struct nd_dimm {
+	unsigned long flags;
+	void *provider_data;
+	struct device dev;
+	int id;
+};
+
+struct nd_bus *walk_to_nd_bus(struct device *nd_dev);
 int __init nd_bus_init(void);
 void __exit nd_bus_exit(void);
 int nd_bus_create_ndctl(struct nd_bus *nd_bus);
diff --git a/include/linux/libnd.h b/include/linux/libnd.h
index 04a97653d56c..76d5839fb50e 100644
--- a/include/linux/libnd.h
+++ b/include/linux/libnd.h
@@ -14,6 +14,12 @@
  */
 #ifndef __LIBND_H__
 #define __LIBND_H__
+
+enum {
+	/* when a dimm supports both PMEM and BLK access a label is required */
+	NDD_ALIASING = 1 << 0,
+};
+
 extern struct attribute_group nd_bus_attribute_group;
 
 struct nd_dimm;
@@ -35,5 +41,10 @@ struct nd_bus *nd_bus_register(struct device *parent,
 		struct nd_bus_descriptor *nfit_desc);
 void nd_bus_unregister(struct nd_bus *nd_bus);
 struct nd_bus *to_nd_bus(struct device *dev);
+struct nd_dimm *to_nd_dimm(struct device *dev);
 struct nd_bus_descriptor *to_nd_desc(struct nd_bus *nd_bus);
+const char *nd_dimm_name(struct nd_dimm *nd_dimm);
+void *nd_dimm_provider_data(struct nd_dimm *nd_dimm);
+struct nd_dimm *nd_dimm_create(struct nd_bus *nd_bus, void *provider_data,
+		const struct attribute_group **groups, unsigned long flags);
 #endif /* __LIBND_H__ */
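
For userspace that wants a static identifier rather than the dynamic
nmemX id, a minimal sketch of reading the 'handle' attribute follows.
It assumes the attribute appears at
/sys/bus/nd/devices/nmemX/nfit/handle (bus "nd", device "nmem%d", group
"nfit", as registered in this patch).  The helper names
parse_nfit_handle() and read_nfit_handle() are hypothetical and not part
of the series.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Parse the text of an nfit/handle attribute, e.g. "0x10\n".
 * Returns 0 and stores the value on success, -1 on malformed input.
 */
int parse_nfit_handle(const char *text, unsigned int *handle)
{
	char *end;
	unsigned long val;

	errno = 0;
	val = strtoul(text, &end, 0);	/* base 0 accepts "0x..." and plain digits */
	if (errno || end == text || (*end != '\n' && *end != '\0'))
		return -1;
	if (val > 0xffffffffUL)		/* device_handle is a u32 */
		return -1;
	*handle = (unsigned int)val;
	return 0;
}

/*
 * Read the handle of one dimm device, e.g. nmem = "nmem0".
 * The sysfs path layout is an assumption, see the note above.
 */
int read_nfit_handle(const char *nmem, unsigned int *handle)
{
	char path[256], buf[32];
	FILE *f;
	int rc = -1;

	snprintf(path, sizeof(path), "/sys/bus/nd/devices/%s/nfit/handle", nmem);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fgets(buf, sizeof(buf), f))
		rc = parse_nfit_handle(buf, handle);
	fclose(f);
	return rc;
}
```

Base 0 lets strtoul() accept the 0x prefix that "%#x" emits, so the
parser does not need to strip it by hand.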


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [PATCH v3 05/21] libnd: control (ioctl) messages for libnd bus and dimm devices
  2015-05-20 20:56 ` Dan Williams
@ 2015-05-20 20:56   ` Dan Williams
  -1 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2015-05-20 20:56 UTC (permalink / raw)
  To: axboe
  Cc: mingo, linux-nvdimm, neilb, gregkh, Rafael J. Wysocki,
	Robert Moore, linux-kernel, linux-acpi, jmoyer, Nicholas Moulin,
	hch

Most discovery/configuration of the libnd-subsystem is done via sysfs
attributes.  However, some libnd buses, particularly the ACPI.NFIT bus,
define a small set of messages that can be passed to the platform.  For
convenience we derive the initial libnd-ioctl command formats directly
from the NFIT DSM Interface Example formats.

    ND_CMD_SMART: media health and diagnostics
    ND_CMD_GET_CONFIG_SIZE: size of the label space
    ND_CMD_GET_CONFIG_DATA: read label space
    ND_CMD_SET_CONFIG_DATA: write label space
    ND_CMD_VENDOR: vendor-specific command passthrough
    ND_CMD_ARS_CAP: report address-range-scrubbing capabilities
    ND_CMD_ARS_START: initiate scrubbing
    ND_CMD_ARS_QUERY: report on scrubbing state
    ND_CMD_SMART_THRESHOLD: configure alarm thresholds for smart events

If a platform later defines commands outside this set, it is
straightforward to extend support to those formats.

Most of the commands target a specific dimm.  However, the
address-range-scrubbing commands target the bus.  The 'commands'
attribute in sysfs of a libnd-bus or a libnd-nmem (dimm device)
enumerates the supported commands for that object.

Cc: <linux-acpi@vger.kernel.org>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reported-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
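The dsm_mask handling in this patch amounts to a per-object bitmask of
implemented commands: bits are set at probe time (one acpi_check_dsm()
call per command) and acpi_nfit_ctl() returns -ENOTTY when the requested
bit is clear.  Below is a standalone sketch of that pattern.  It assumes
userspace stand-ins for set_bit()/test_bit() and a hypothetical
'implemented' array in place of the firmware query; none of these names
exist in the series.

```c
#include <errno.h>
#include <limits.h>

#define CMD_MASK_BITS ((unsigned int)(sizeof(unsigned long) * CHAR_BIT))

/* Stand-in for set_bit() on a single word; out-of-range bits are ignored. */
void cmd_mask_set(unsigned long *mask, unsigned int cmd)
{
	if (cmd < CMD_MASK_BITS)
		*mask |= 1UL << cmd;
}

/* Stand-in for test_bit() on a single word. */
int cmd_mask_test(unsigned long mask, unsigned int cmd)
{
	return cmd < CMD_MASK_BITS && ((mask >> cmd) & 1UL);
}

/*
 * Build the mask for commands first..last.  implemented[cmd] is nonzero
 * when the (hypothetical) probe reported that command as supported.
 */
unsigned long build_cmd_mask(const int *implemented, unsigned int first,
		unsigned int last)
{
	unsigned long mask = 0;
	unsigned int cmd;

	if (last >= CMD_MASK_BITS)
		last = CMD_MASK_BITS - 1;
	for (cmd = first; cmd <= last; cmd++)
		if (implemented[cmd])
			cmd_mask_set(&mask, cmd);
	return mask;
}

/* Gate a request the way the ioctl path does: unknown command => -ENOTTY. */
int gate_cmd(unsigned long mask, unsigned int cmd)
{
	return cmd_mask_test(mask, cmd) ? 0 : -ENOTTY;
}
```

Keeping one mask per bus and one per dimm is what lets the 'commands'
sysfs attribute and the ioctl path agree on what is supported.
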
 drivers/acpi/Kconfig          |   12 ++
 drivers/acpi/nfit.c           |  213 +++++++++++++++++++++++++++
 drivers/acpi/nfit.h           |    3 
 drivers/block/nd/bus.c        |  324 ++++++++++++++++++++++++++++++++++++++++-
 drivers/block/nd/core.c       |   16 ++
 drivers/block/nd/dimm_devs.c  |   38 ++++-
 drivers/block/nd/nd-private.h |    3 
 include/linux/libnd.h         |   25 +++
 include/uapi/linux/Kbuild     |    1 
 include/uapi/linux/ndctl.h    |  178 +++++++++++++++++++++++
 10 files changed, 803 insertions(+), 10 deletions(-)
 create mode 100644 include/uapi/linux/ndctl.h
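
The buffer convention used by acpi_nfit_ctl() in this patch can be
summarized as: the caller's buffer holds the input payload followed by
room for the output fields, and the return value is the number of bytes
left unfilled (0 when filled exactly, negative on error).  The following
is a simplified userspace sketch of that copy loop, not the driver code
itself.  The function name copy_cmd_output() and the flat field_size[]
array are assumptions made for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * buf:        in_len bytes of input, followed by space for output
 * buf_len:    total size of buf
 * out:        output produced by the platform, out_len bytes
 * field_size: sizes of the n_fields output fields, in order
 *
 * Returns the number of unfilled bytes in buf, or -ENXIO.
 */
int copy_cmd_output(unsigned char *buf, size_t buf_len, size_t in_len,
		const unsigned char *out, size_t out_len,
		const size_t *field_size, size_t n_fields)
{
	size_t offset = 0, i;

	for (i = 0; i < n_fields; i++) {
		size_t sz = field_size[i];

		if (offset + sz > out_len)
			break;		/* platform returned fewer fields */
		if (in_len + offset + sz > buf_len)
			return -ENXIO;	/* caller's buffer is too small */
		memcpy(buf + in_len + offset, out + offset, sz);
		offset += sz;
	}

	if (in_len + offset < buf_len) {
		/* at least one field (the status) must have been copied */
		if (i >= 1)
			return (int)(buf_len - in_len - offset);
		return -ENXIO;
	}
	return 0;
}
```

A positive return therefore tells the caller how much of the output
area to ignore, while an output with no valid first field is treated as
an error.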

diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
index 84d046d4ed17..0690045ba270 100644
--- a/drivers/acpi/Kconfig
+++ b/drivers/acpi/Kconfig
@@ -390,6 +390,18 @@ config ACPI_NFIT
 	  To compile this driver as a module, choose M here:
 	  the module will be called nfit.
 
+config ACPI_NFIT_DEBUG
+	bool "NFIT DSM debug"
+	depends on ACPI_NFIT
+	depends on DYNAMIC_DEBUG
+	default n
+	help
+	  Enabling this option causes the nfit driver to dump the
+	  input and output buffers of _DSM operations on the ACPI0012
+	  device and its children.  This can be very verbose, so leave
+	  it disabled unless you are debugging a hardware / firmware
+	  issue.
+
 source "drivers/acpi/apei/Kconfig"
 
 config ACPI_EXTLOG
diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index b26e1a4a59e3..b7c1c5a5b589 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -13,6 +13,7 @@
 #include <linux/list_sort.h>
 #include <linux/module.h>
 #include <linux/libnd.h>
+#include <linux/ndctl.h>
 #include <linux/list.h>
 #include <linux/acpi.h>
 #include "nfit.h"
@@ -24,11 +25,150 @@ static const u8 *to_nfit_uuid(enum nfit_uuids id)
 	return nfit_uuid[id];
 }
 
+static struct acpi_nfit_desc *to_acpi_nfit_desc(struct nd_bus_descriptor *nd_desc)
+{
+	return container_of(nd_desc, struct acpi_nfit_desc, nd_desc);
+}
+
+static struct acpi_device *to_acpi_dev(struct acpi_nfit_desc *acpi_desc)
+{
+	struct nd_bus_descriptor *nd_desc = &acpi_desc->nd_desc;
+
+	/*
+	 * If provider == 'ACPI.NFIT' we can assume 'dev' is a struct
+	 * acpi_device.
+	 */
+	if (!nd_desc->provider_name
+			|| strcmp(nd_desc->provider_name, "ACPI.NFIT") != 0)
+		return NULL;
+
+	return to_acpi_device(acpi_desc->dev);
+}
+
 static int acpi_nfit_ctl(struct nd_bus_descriptor *nd_desc,
 		struct nd_dimm *nd_dimm, unsigned int cmd, void *buf,
 		unsigned int buf_len)
 {
-	return -ENOTTY;
+	struct acpi_nfit_desc *acpi_desc = to_acpi_nfit_desc(nd_desc);
+	const struct nd_cmd_desc *desc = NULL;
+	union acpi_object in_obj, in_buf, *out_obj;
+	struct device *dev = acpi_desc->dev;
+	const char *cmd_name, *dimm_name;
+	unsigned long dsm_mask;
+	acpi_handle handle;
+	const u8 *uuid;
+	u32 offset;
+	int rc, i;
+
+	if (nd_dimm) {
+		struct nfit_mem *nfit_mem = nd_dimm_provider_data(nd_dimm);
+		struct acpi_device *adev = nfit_mem->adev;
+
+		if (!adev)
+			return -ENOTTY;
+		dimm_name = dev_name(&adev->dev);
+		cmd_name = nd_dimm_cmd_name(cmd);
+		dsm_mask = nfit_mem->dsm_mask;
+		desc = nd_cmd_dimm_desc(cmd);
+		uuid = to_nfit_uuid(NFIT_DEV_DIMM);
+		handle = adev->handle;
+	} else {
+		struct acpi_device *adev = to_acpi_dev(acpi_desc);
+
+		cmd_name = nd_bus_cmd_name(cmd);
+		dsm_mask = nd_desc->dsm_mask;
+		desc = nd_cmd_bus_desc(cmd);
+		uuid = to_nfit_uuid(NFIT_DEV_BUS);
+		handle = adev->handle;
+		dimm_name = "bus";
+	}
+
+	if (!desc || (cmd && (desc->out_num + desc->in_num == 0)))
+		return -ENOTTY;
+
+	if (!test_bit(cmd, &dsm_mask))
+		return -ENOTTY;
+
+	in_obj.type = ACPI_TYPE_PACKAGE;
+	in_obj.package.count = 1;
+	in_obj.package.elements = &in_buf;
+	in_buf.type = ACPI_TYPE_BUFFER;
+	in_buf.buffer.pointer = buf;
+	in_buf.buffer.length = 0;
+
+	/* libnd has already validated the input envelope */
+	for (i = 0; i < desc->in_num; i++)
+		in_buf.buffer.length += nd_cmd_in_size(nd_dimm, cmd, desc, i, buf);
+
+	if (IS_ENABLED(CONFIG_ACPI_NFIT_DEBUG)) {
+		dev_dbg(dev, "%s:%s cmd: %s input length: %d\n", __func__,
+				dimm_name, cmd_name, in_buf.buffer.length);
+		print_hex_dump_debug(cmd_name, DUMP_PREFIX_OFFSET, 4,
+				4, in_buf.buffer.pointer, min_t(u32, 128,
+					in_buf.buffer.length), true);
+	}
+
+	out_obj = acpi_evaluate_dsm(handle, uuid, 1, cmd, &in_obj);
+	if (!out_obj) {
+		dev_dbg(dev, "%s:%s _DSM failed cmd: %s\n", __func__, dimm_name,
+				cmd_name);
+		return -EINVAL;
+	}
+
+	if (out_obj->type != ACPI_TYPE_BUFFER) {
+		dev_dbg(dev, "%s:%s unexpected output object type cmd: %s type: %d\n",
+				__func__, dimm_name, cmd_name, out_obj->type);
+		rc = -EINVAL;
+		goto out;
+	}
+
+	if (IS_ENABLED(CONFIG_ACPI_NFIT_DEBUG)) {
+		dev_dbg(dev, "%s:%s cmd: %s output length: %d\n", __func__,
+				dimm_name, cmd_name, out_obj->buffer.length);
+		print_hex_dump_debug(cmd_name, DUMP_PREFIX_OFFSET, 4,
+				4, out_obj->buffer.pointer, min_t(u32, 128,
+					out_obj->buffer.length), true);
+	}
+
+	for (i = 0, offset = 0; i < desc->out_num; i++) {
+		u32 out_size = nd_cmd_out_size(nd_dimm, cmd, desc, i, buf,
+				(u32 *) out_obj->buffer.pointer);
+
+		if (offset + out_size > out_obj->buffer.length) {
+			dev_dbg(dev, "%s:%s output object underflow cmd: %s field: %d\n",
+					__func__, dimm_name, cmd_name, i);
+			break;
+		}
+
+		if (in_buf.buffer.length + offset + out_size > buf_len) {
+			dev_dbg(dev, "%s:%s output overrun cmd: %s field: %d\n",
+					__func__, dimm_name, cmd_name, i);
+			rc = -ENXIO;
+			goto out;
+		}
+		memcpy(buf + in_buf.buffer.length + offset,
+				out_obj->buffer.pointer + offset, out_size);
+		offset += out_size;
+	}
+	if (offset + in_buf.buffer.length < buf_len) {
+		if (i >= 1) {
+			/*
+			 * status valid, return the number of bytes left
+			 * unfilled in the output buffer
+			 */
+			rc = buf_len - offset - in_buf.buffer.length;
+		} else {
+			dev_err(dev, "%s:%s underrun cmd: %s buf_len: %d out_len: %d\n",
+					__func__, dimm_name, cmd_name, buf_len, offset);
+			rc = -ENXIO;
+		}
+	} else
+		rc = 0;
+
+ out:
+	ACPI_FREE(out_obj);
+
+	return rc;
 }
 
 static const char *spa_type_name(u16 type)
@@ -451,6 +591,7 @@ static struct attribute_group acpi_nfit_dimm_attribute_group = {
 };
 
 static const struct attribute_group *acpi_nfit_dimm_attribute_groups[] = {
+	&nd_dimm_attribute_group,
 	&acpi_nfit_dimm_attribute_group,
 	NULL,
 };
@@ -467,6 +608,50 @@ static struct nd_dimm *acpi_nfit_dimm_by_handle(struct acpi_nfit_desc *acpi_desc
 	return NULL;
 }
 
+static int acpi_nfit_add_dimm(struct acpi_nfit_desc *acpi_desc,
+		struct nfit_mem *nfit_mem, u32 device_handle)
+{
+	struct acpi_device *adev, *adev_dimm;
+	struct device *dev = acpi_desc->dev;
+	const u8 *uuid = to_nfit_uuid(NFIT_DEV_DIMM);
+	unsigned long long sta;
+	int i, rc = -ENODEV;
+	acpi_status status;
+
+	nfit_mem->dsm_mask = acpi_desc->dimm_dsm_force_en;
+	adev = to_acpi_dev(acpi_desc);
+	if (!adev)
+		return 0;
+
+	adev_dimm = acpi_find_child_device(adev, device_handle, false);
+	nfit_mem->adev = adev_dimm;
+	if (!adev_dimm) {
+		dev_err(dev, "no ACPI.NFIT device with _ADR %#x, disabling...\n",
+				device_handle);
+		return -ENODEV;
+	}
+
+	status = acpi_evaluate_integer(adev_dimm->handle, "_STA", NULL, &sta);
+	if (status == AE_NOT_FOUND) {
+		dev_dbg(dev, "%s missing _STA, assuming enabled...\n",
+				dev_name(&adev_dimm->dev));
+		rc = 0;
+	} else if (ACPI_FAILURE(status))
+		dev_err(dev, "%s failed to retrieve _STA, disabling...\n",
+				dev_name(&adev_dimm->dev));
+	else if ((sta & ACPI_STA_DEVICE_ENABLED) == 0)
+		dev_info(dev, "%s disabled by firmware\n",
+				dev_name(&adev_dimm->dev));
+	else
+		rc = 0;
+
+	for (i = ND_CMD_SMART; i <= ND_CMD_VENDOR; i++)
+		if (acpi_check_dsm(adev_dimm->handle, uuid, 1, 1ULL << i))
+			set_bit(i, &nfit_mem->dsm_mask);
+
+	return rc;
+}
+
 static int acpi_nfit_register_dimms(struct acpi_nfit_desc *acpi_desc)
 {
 	struct nfit_mem *nfit_mem;
@@ -475,6 +660,7 @@ static int acpi_nfit_register_dimms(struct acpi_nfit_desc *acpi_desc)
 		struct nd_dimm *nd_dimm;
 		unsigned long flags = 0;
 		u32 device_handle;
+		int rc;
 
 		device_handle = __to_nfit_memdev(nfit_mem)->device_handle;
 		nd_dimm = acpi_nfit_dimm_by_handle(acpi_desc, device_handle);
@@ -491,8 +677,13 @@ static int acpi_nfit_register_dimms(struct acpi_nfit_desc *acpi_desc)
 		if (nfit_mem->bdw && nfit_mem->memdev_pmem)
 			flags |= NDD_ALIASING;
 
+		rc = acpi_nfit_add_dimm(acpi_desc, nfit_mem, device_handle);
+		if (rc)
+			continue;
+
 		nd_dimm = nd_dimm_create(acpi_desc->nd_bus, nfit_mem,
-				acpi_nfit_dimm_attribute_groups, flags);
+				acpi_nfit_dimm_attribute_groups,
+				flags, &nfit_mem->dsm_mask);
 		if (!nd_dimm)
 			return -ENOMEM;
 
@@ -502,6 +693,22 @@ static int acpi_nfit_register_dimms(struct acpi_nfit_desc *acpi_desc)
 	return 0;
 }
 
+static void acpi_nfit_init_dsms(struct acpi_nfit_desc *acpi_desc)
+{
+	struct nd_bus_descriptor *nd_desc = &acpi_desc->nd_desc;
+	const u8 *uuid = to_nfit_uuid(NFIT_DEV_BUS);
+	struct acpi_device *adev;
+	int i;
+
+	adev = to_acpi_dev(acpi_desc);
+	if (!adev)
+		return;
+
+	for (i = ND_CMD_ARS_CAP; i <= ND_CMD_ARS_QUERY; i++)
+		if (acpi_check_dsm(adev->handle, uuid, 1, 1ULL << i))
+			set_bit(i, &nd_desc->dsm_mask);
+}
+
 static int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
 {
 	struct device *dev = acpi_desc->dev;
@@ -529,6 +736,8 @@ static int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
 	if (nfit_mem_init(acpi_desc) != 0)
 		return -ENOMEM;
 
+	acpi_nfit_init_dsms(acpi_desc);
+
 	return acpi_nfit_register_dimms(acpi_desc);
 }
 
diff --git a/drivers/acpi/nfit.h b/drivers/acpi/nfit.h
index 9d4c1634cb0e..cc496ba6bbd2 100644
--- a/drivers/acpi/nfit.h
+++ b/drivers/acpi/nfit.h
@@ -67,6 +67,8 @@ struct nfit_mem {
 	struct acpi_nfit_system_address *spa_dcr;
 	struct acpi_nfit_system_address *spa_bdw;
 	struct list_head list;
+	struct acpi_device *adev;
+	unsigned long dsm_mask;
 };
 
 struct acpi_nfit_desc {
@@ -79,6 +81,7 @@ struct acpi_nfit_desc {
 	struct list_head bdws;
 	struct nd_bus *nd_bus;
 	struct device *dev;
+	unsigned long dimm_dsm_force_en;
 };
 
 static inline struct acpi_nfit_memory_map *__to_nfit_memdev(struct nfit_mem *nfit_mem)
diff --git a/drivers/block/nd/bus.c b/drivers/block/nd/bus.c
index ee56aa1ab2ad..f072a9e0c1bd 100644
--- a/drivers/block/nd/bus.c
+++ b/drivers/block/nd/bus.c
@@ -11,14 +11,18 @@
  * General Public License for more details.
  */
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/vmalloc.h>
 #include <linux/uaccess.h>
 #include <linux/fcntl.h>
 #include <linux/async.h>
+#include <linux/ndctl.h>
 #include <linux/slab.h>
 #include <linux/fs.h>
 #include <linux/io.h>
+#include <linux/mm.h>
 #include "nd-private.h"
 
+int nd_dimm_major;
 static int nd_bus_major;
 static struct class *nd_class;
 
@@ -47,19 +51,323 @@ void nd_bus_destroy_ndctl(struct nd_bus *nd_bus)
 	device_destroy(nd_class, MKDEV(nd_bus_major, nd_bus->id));
 }
 
+static const struct nd_cmd_desc __nd_cmd_dimm_descs[] = {
+	[ND_CMD_IMPLEMENTED] = { },
+	[ND_CMD_SMART] = {
+		.out_num = 2,
+		.out_sizes = { 4, 8, },
+	},
+	[ND_CMD_SMART_THRESHOLD] = {
+		.out_num = 2,
+		.out_sizes = { 4, 8, },
+	},
+	[ND_CMD_DIMM_FLAGS] = {
+		.out_num = 2,
+		.out_sizes = { 4, 4 },
+	},
+	[ND_CMD_GET_CONFIG_SIZE] = {
+		.out_num = 3,
+		.out_sizes = { 4, 4, 4, },
+	},
+	[ND_CMD_GET_CONFIG_DATA] = {
+		.in_num = 2,
+		.in_sizes = { 4, 4, },
+		.out_num = 2,
+		.out_sizes = { 4, UINT_MAX, },
+	},
+	[ND_CMD_SET_CONFIG_DATA] = {
+		.in_num = 3,
+		.in_sizes = { 4, 4, UINT_MAX, },
+		.out_num = 1,
+		.out_sizes = { 4, },
+	},
+	[ND_CMD_VENDOR] = {
+		.in_num = 3,
+		.in_sizes = { 4, 4, UINT_MAX, },
+		.out_num = 3,
+		.out_sizes = { 4, 4, UINT_MAX, },
+	},
+};
+
+const struct nd_cmd_desc *nd_cmd_dimm_desc(int cmd)
+{
+	if (cmd < ARRAY_SIZE(__nd_cmd_dimm_descs))
+		return &__nd_cmd_dimm_descs[cmd];
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(nd_cmd_dimm_desc);
+
+static const struct nd_cmd_desc __nd_cmd_bus_descs[] = {
+	[ND_CMD_IMPLEMENTED] = { },
+	[ND_CMD_ARS_CAP] = {
+		.in_num = 2,
+		.in_sizes = { 8, 8, },
+		.out_num = 2,
+		.out_sizes = { 4, 4, },
+	},
+	[ND_CMD_ARS_START] = {
+		.in_num = 4,
+		.in_sizes = { 8, 8, 2, 6, },
+		.out_num = 1,
+		.out_sizes = { 4, },
+	},
+	[ND_CMD_ARS_QUERY] = {
+		.out_num = 2,
+		.out_sizes = { 4, UINT_MAX, },
+	},
+};
+
+const struct nd_cmd_desc *nd_cmd_bus_desc(int cmd)
+{
+	if (cmd < ARRAY_SIZE(__nd_cmd_bus_descs))
+		return &__nd_cmd_bus_descs[cmd];
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(nd_cmd_bus_desc);
+
+u32 nd_cmd_in_size(struct nd_dimm *nd_dimm, int cmd,
+		const struct nd_cmd_desc *desc, int idx, void *buf)
+{
+	if (idx >= desc->in_num)
+		return UINT_MAX;
+
+	if (desc->in_sizes[idx] < UINT_MAX)
+		return desc->in_sizes[idx];
+
+	if (nd_dimm && cmd == ND_CMD_SET_CONFIG_DATA && idx == 2) {
+		struct nd_cmd_set_config_hdr *hdr = buf;
+
+		return hdr->in_length;
+	} else if (nd_dimm && cmd == ND_CMD_VENDOR && idx == 2) {
+		struct nd_cmd_vendor_hdr *hdr = buf;
+
+		return hdr->in_length;
+	}
+
+	return UINT_MAX;
+}
+EXPORT_SYMBOL_GPL(nd_cmd_in_size);
+
+u32 nd_cmd_out_size(struct nd_dimm *nd_dimm, int cmd,
+		const struct nd_cmd_desc *desc, int idx, const u32 *in_field,
+		const u32 *out_field)
+{
+	if (idx >= desc->out_num)
+		return UINT_MAX;
+
+	if (desc->out_sizes[idx] < UINT_MAX)
+		return desc->out_sizes[idx];
+
+	if (nd_dimm && cmd == ND_CMD_GET_CONFIG_DATA && idx == 1)
+		return in_field[1];
+	else if (nd_dimm && cmd == ND_CMD_VENDOR && idx == 2)
+		return out_field[1];
+	else if (!nd_dimm && cmd == ND_CMD_ARS_QUERY && idx == 1)
+		return ND_CMD_ARS_QUERY_MAX;
+
+	return UINT_MAX;
+}
+EXPORT_SYMBOL_GPL(nd_cmd_out_size);
+
+static int __nd_ioctl(struct nd_bus *nd_bus, struct nd_dimm *nd_dimm,
+		int read_only, unsigned int ioctl_cmd, unsigned long arg)
+{
+	struct nd_bus_descriptor *nd_desc = nd_bus->nd_desc;
+	size_t buf_len = 0, in_len = 0, out_len = 0;
+	static char out_env[ND_CMD_MAX_ENVELOPE];
+	static char in_env[ND_CMD_MAX_ENVELOPE];
+	const struct nd_cmd_desc *desc = NULL;
+	unsigned int cmd = _IOC_NR(ioctl_cmd);
+	void __user *p = (void __user *) arg;
+	struct device *dev = &nd_bus->dev;
+	const char *cmd_name, *dimm_name;
+	unsigned long dsm_mask;
+	void *buf;
+	int rc, i;
+
+	if (nd_dimm) {
+		desc = nd_cmd_dimm_desc(cmd);
+		cmd_name = nd_dimm_cmd_name(cmd);
+		dsm_mask = nd_dimm->dsm_mask ? *(nd_dimm->dsm_mask) : 0;
+		dimm_name = dev_name(&nd_dimm->dev);
+	} else {
+		desc = nd_cmd_bus_desc(cmd);
+		cmd_name = nd_bus_cmd_name(cmd);
+		dsm_mask = nd_desc->dsm_mask;
+		dimm_name = "bus";
+	}
+
+	if (!desc || (desc->out_num + desc->in_num == 0) ||
+			!test_bit(cmd, &dsm_mask))
+		return -ENOTTY;
+
+	/* fail write commands (when read-only) */
+	if (read_only)
+		switch (ioctl_cmd) {
+		case ND_IOCTL_VENDOR:
+		case ND_IOCTL_SET_CONFIG_DATA:
+		case ND_IOCTL_ARS_START:
+			dev_dbg(&nd_bus->dev, "'%s' command while read-only.\n",
+					nd_dimm ? nd_dimm_cmd_name(cmd)
+					: nd_bus_cmd_name(cmd));
+			return -EPERM;
+		default:
+			break;
+		}
+
+	/* process an input envelope */
+	for (i = 0; i < desc->in_num; i++) {
+		u32 in_size, copy;
+
+		in_size = nd_cmd_in_size(nd_dimm, cmd, desc, i, in_env);
+		if (in_size == UINT_MAX) {
+			dev_err(dev, "%s:%s unknown input size cmd: %s field: %d\n",
+					__func__, dimm_name, cmd_name, i);
+			return -ENXIO;
+		}
+		if (!access_ok(VERIFY_READ, p + in_len, in_size))
+			return -EFAULT;
+		if (in_len < sizeof(in_env))
+			copy = min_t(u32, sizeof(in_env) - in_len, in_size);
+		else
+			copy = 0;
+		if (copy && copy_from_user(&in_env[in_len], p + in_len, copy))
+			return -EFAULT;
+		in_len += in_size;
+	}
+
+	/* process an output envelope */
+	for (i = 0; i < desc->out_num; i++) {
+		u32 out_size = nd_cmd_out_size(nd_dimm, cmd, desc, i,
+				(u32 *) in_env, (u32 *) out_env);
+		u32 copy;
+
+		if (out_size == UINT_MAX) {
+			dev_dbg(dev, "%s:%s unknown output size cmd: %s field: %d\n",
+					__func__, dimm_name, cmd_name, i);
+			return -EFAULT;
+		}
+		if (!access_ok(VERIFY_WRITE, p + in_len + out_len, out_size))
+			return -EFAULT;
+		if (out_len < sizeof(out_env))
+			copy = min_t(u32, sizeof(out_env) - out_len, out_size);
+		else
+			copy = 0;
+		if (copy && copy_from_user(&out_env[out_len], p + in_len + out_len,
+					copy))
+			return -EFAULT;
+		out_len += out_size;
+	}
+
+	buf_len = out_len + in_len;
+	if (!access_ok(VERIFY_WRITE, p, sizeof(buf_len)))
+		return -EFAULT;
+
+	if (buf_len > ND_IOCTL_MAX_BUFLEN) {
+		dev_dbg(dev, "%s:%s cmd: %s buf_len: %zd > %d\n", __func__,
+				dimm_name, cmd_name, buf_len,
+				ND_IOCTL_MAX_BUFLEN);
+		return -EINVAL;
+	}
+
+	buf = vmalloc(buf_len);
+	if (!buf)
+		return -ENOMEM;
+
+	if (copy_from_user(buf, p, buf_len)) {
+		rc = -EFAULT;
+		goto out;
+	}
+
+	rc = nd_desc->ndctl(nd_desc, nd_dimm, cmd, buf, buf_len);
+	if (rc < 0)
+		goto out;
+	if (copy_to_user(p, buf, buf_len))
+		rc = -EFAULT;
+ out:
+	vfree(buf);
+	return rc;
+}
+
 static long nd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 {
-	return -ENXIO;
+	long id = (long) file->private_data;
+	int rc = -ENXIO, read_only;
+	struct nd_bus *nd_bus;
+
+	read_only = (O_RDWR != (file->f_flags & O_ACCMODE));
+	mutex_lock(&nd_bus_list_mutex);
+	list_for_each_entry(nd_bus, &nd_bus_list, list) {
+		if (nd_bus->id == id) {
+			rc = __nd_ioctl(nd_bus, NULL, read_only, cmd, arg);
+			break;
+		}
+	}
+	mutex_unlock(&nd_bus_list_mutex);
+
+	return rc;
+}
+
+static int match_dimm(struct device *dev, void *data)
+{
+	long id = (long) data;
+
+	if (is_nd_dimm(dev)) {
+		struct nd_dimm *nd_dimm = to_nd_dimm(dev);
+
+		return nd_dimm->id == id;
+	}
+
+	return 0;
+}
+
+static long nd_dimm_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+	int rc = -ENXIO, read_only;
+	struct nd_bus *nd_bus;
+
+	read_only = (O_RDWR != (file->f_flags & O_ACCMODE));
+	mutex_lock(&nd_bus_list_mutex);
+	list_for_each_entry(nd_bus, &nd_bus_list, list) {
+		struct device *dev = device_find_child(&nd_bus->dev,
+				file->private_data, match_dimm);
+
+		if (!dev)
+			continue;
+
+		rc = __nd_ioctl(nd_bus, to_nd_dimm(dev), read_only, cmd, arg);
+		put_device(dev);
+		break;
+	}
+	mutex_unlock(&nd_bus_list_mutex);
+
+	return rc;
+}
+
+static int nd_open(struct inode *inode, struct file *file)
+{
+	long minor = iminor(inode);
+
+	file->private_data = (void *) minor;
+	return 0;
 }
 
 static const struct file_operations nd_bus_fops = {
 	.owner = THIS_MODULE,
-	.open = nonseekable_open,
+	.open = nd_open,
 	.unlocked_ioctl = nd_ioctl,
 	.compat_ioctl = nd_ioctl,
 	.llseek = noop_llseek,
 };
 
+static const struct file_operations nd_dimm_fops = {
+	.owner = THIS_MODULE,
+	.open = nd_open,
+	.unlocked_ioctl = nd_dimm_ioctl,
+	.compat_ioctl = nd_dimm_ioctl,
+	.llseek = noop_llseek,
+};
+
 int __init nd_bus_init(void)
 {
 	int rc;
@@ -70,9 +378,14 @@ int __init nd_bus_init(void)
 
 	rc = register_chrdev(0, "ndctl", &nd_bus_fops);
 	if (rc < 0)
-		goto err_chrdev;
+		goto err_bus_chrdev;
 	nd_bus_major = rc;
 
+	rc = register_chrdev(0, "dimmctl", &nd_dimm_fops);
+	if (rc < 0)
+		goto err_dimm_chrdev;
+	nd_dimm_major = rc;
+
 	nd_class = class_create(THIS_MODULE, "nd");
 	if (IS_ERR(nd_class))
 		goto err_class;
@@ -80,8 +393,10 @@ int __init nd_bus_init(void)
 	return 0;
 
  err_class:
+	unregister_chrdev(nd_dimm_major, "dimmctl");
+ err_dimm_chrdev:
 	unregister_chrdev(nd_bus_major, "ndctl");
- err_chrdev:
+ err_bus_chrdev:
 	bus_unregister(&nd_bus_type);
 
 	return rc;
@@ -91,5 +406,6 @@ void __exit nd_bus_exit(void)
 {
 	class_destroy(nd_class);
 	unregister_chrdev(nd_bus_major, "ndctl");
+	unregister_chrdev(nd_dimm_major, "dimmctl");
 	bus_unregister(&nd_bus_type);
 }
diff --git a/drivers/block/nd/core.c b/drivers/block/nd/core.c
index 4d0e53ecdcb0..d7a922913da2 100644
--- a/drivers/block/nd/core.c
+++ b/drivers/block/nd/core.c
@@ -14,6 +14,7 @@
 #include <linux/module.h>
 #include <linux/device.h>
 #include <linux/libnd.h>
+#include <linux/ndctl.h>
 #include <linux/mutex.h>
 #include <linux/slab.h>
 #include "nd-private.h"
@@ -59,6 +60,20 @@ struct nd_bus *walk_to_nd_bus(struct device *nd_dev)
 	return NULL;
 }
 
+static ssize_t commands_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	int cmd, len = 0;
+	struct nd_bus *nd_bus = to_nd_bus(dev);
+	struct nd_bus_descriptor *nd_desc = nd_bus->nd_desc;
+
+	for_each_set_bit(cmd, &nd_desc->dsm_mask, BITS_PER_LONG)
+		len += sprintf(buf + len, "%s ", nd_bus_cmd_name(cmd));
+	len += sprintf(buf + len, "\n");
+	return len;
+}
+static DEVICE_ATTR_RO(commands);
+
 static const char *nd_bus_provider(struct nd_bus *nd_bus)
 {
 	struct nd_bus_descriptor *nd_desc = nd_bus->nd_desc;
@@ -82,6 +97,7 @@ static ssize_t provider_show(struct device *dev,
 static DEVICE_ATTR_RO(provider);
 
 static struct attribute *nd_bus_attributes[] = {
+	&dev_attr_commands.attr,
 	&dev_attr_provider.attr,
 	NULL,
 };
diff --git a/drivers/block/nd/dimm_devs.c b/drivers/block/nd/dimm_devs.c
index 19b081392f2f..3fa26f61c3db 100644
--- a/drivers/block/nd/dimm_devs.c
+++ b/drivers/block/nd/dimm_devs.c
@@ -12,6 +12,7 @@
  */
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 #include <linux/device.h>
+#include <linux/ndctl.h>
 #include <linux/slab.h>
 #include <linux/io.h>
 #include <linux/fs.h>
@@ -33,7 +34,7 @@ static struct device_type nd_dimm_device_type = {
 	.release = nd_dimm_release,
 };
 
-static bool is_nd_dimm(struct device *dev)
+bool is_nd_dimm(struct device *dev)
 {
 	return dev->type == &nd_dimm_device_type;
 }
@@ -55,12 +56,41 @@ EXPORT_SYMBOL_GPL(nd_dimm_name);
 
 void *nd_dimm_provider_data(struct nd_dimm *nd_dimm)
 {
-	return nd_dimm->provider_data;
+	if (nd_dimm)
+		return nd_dimm->provider_data;
+	return NULL;
 }
 EXPORT_SYMBOL_GPL(nd_dimm_provider_data);
 
+static ssize_t commands_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_dimm *nd_dimm = to_nd_dimm(dev);
+	int cmd, len = 0;
+
+	if (!nd_dimm->dsm_mask)
+		return sprintf(buf, "\n");
+
+	for_each_set_bit(cmd, nd_dimm->dsm_mask, BITS_PER_LONG)
+		len += sprintf(buf + len, "%s ", nd_dimm_cmd_name(cmd));
+	len += sprintf(buf + len, "\n");
+	return len;
+}
+static DEVICE_ATTR_RO(commands);
+
+static struct attribute *nd_dimm_attributes[] = {
+	&dev_attr_commands.attr,
+	NULL,
+};
+
+struct attribute_group nd_dimm_attribute_group = {
+	.attrs = nd_dimm_attributes,
+};
+EXPORT_SYMBOL_GPL(nd_dimm_attribute_group);
+
 struct nd_dimm *nd_dimm_create(struct nd_bus *nd_bus, void *provider_data,
-		const struct attribute_group **groups, unsigned long flags)
+		const struct attribute_group **groups, unsigned long flags,
+		unsigned long *dsm_mask)
 {
 	struct nd_dimm *nd_dimm = kzalloc(sizeof(*nd_dimm), GFP_KERNEL);
 	struct device *dev;
@@ -75,12 +105,14 @@ struct nd_dimm *nd_dimm_create(struct nd_bus *nd_bus, void *provider_data,
 	}
 	nd_dimm->provider_data = provider_data;
 	nd_dimm->flags = flags;
+	nd_dimm->dsm_mask = dsm_mask;
 
 	dev = &nd_dimm->dev;
 	dev_set_name(dev, "nmem%d", nd_dimm->id);
 	dev->parent = &nd_bus->dev;
 	dev->type = &nd_dimm_device_type;
 	dev->bus = &nd_bus_type;
+	dev->devt = MKDEV(nd_dimm_major, nd_dimm->id);
 	dev->groups = groups;
 	if (device_register(dev) != 0) {
 		put_device(dev);
diff --git a/drivers/block/nd/nd-private.h b/drivers/block/nd/nd-private.h
index 251ecdd77153..c71a5f34355a 100644
--- a/drivers/block/nd/nd-private.h
+++ b/drivers/block/nd/nd-private.h
@@ -18,6 +18,7 @@
 extern struct list_head nd_bus_list;
 extern struct mutex nd_bus_list_mutex;
 extern struct bus_type nd_bus_type;
+extern int nd_dimm_major;
 
 struct nd_bus {
 	struct nd_bus_descriptor *nd_desc;
@@ -29,10 +30,12 @@ struct nd_bus {
 struct nd_dimm {
 	unsigned long flags;
 	void *provider_data;
+	unsigned long *dsm_mask;
 	struct device dev;
 	int id;
 };
 
+bool is_nd_dimm(struct device *dev);
 struct nd_bus *walk_to_nd_bus(struct device *nd_dev);
 int __init nd_bus_init(void);
 void __exit nd_bus_exit(void);
diff --git a/include/linux/libnd.h b/include/linux/libnd.h
index 76d5839fb50e..ca72c49ae376 100644
--- a/include/linux/libnd.h
+++ b/include/linux/libnd.h
@@ -14,13 +14,21 @@
  */
 #ifndef __LIBND_H__
 #define __LIBND_H__
+#include <linux/sizes.h>
 
 enum {
 	/* when a dimm supports both PMEM and BLK access a label is required */
 	NDD_ALIASING = 1 << 0,
+
+	/* need to set a limit somewhere, but yes, this is likely overkill */
+	ND_IOCTL_MAX_BUFLEN = SZ_4M,
+	ND_CMD_MAX_ELEM = 4,
+	ND_CMD_MAX_ENVELOPE = 16,
+	ND_CMD_ARS_QUERY_MAX = SZ_4K,
 };
 
 extern struct attribute_group nd_bus_attribute_group;
+extern struct attribute_group nd_dimm_attribute_group;
 
 struct nd_dimm;
 struct nd_bus_descriptor;
@@ -35,6 +43,13 @@ struct nd_bus_descriptor {
 	ndctl_fn ndctl;
 };
 
+struct nd_cmd_desc {
+	int in_num;
+	int out_num;
+	u32 in_sizes[ND_CMD_MAX_ELEM];
+	int out_sizes[ND_CMD_MAX_ELEM];
+};
+
 struct nd_bus;
 struct device;
 struct nd_bus *nd_bus_register(struct device *parent,
@@ -46,5 +61,13 @@ struct nd_bus_descriptor *to_nd_desc(struct nd_bus *nd_bus);
 const char *nd_dimm_name(struct nd_dimm *nd_dimm);
 void *nd_dimm_provider_data(struct nd_dimm *nd_dimm);
 struct nd_dimm *nd_dimm_create(struct nd_bus *nd_bus, void *provider_data,
-		const struct attribute_group **groups, unsigned long flags);
+		const struct attribute_group **groups, unsigned long flags,
+		unsigned long *dsm_mask);
+const struct nd_cmd_desc *nd_cmd_dimm_desc(int cmd);
+const struct nd_cmd_desc *nd_cmd_bus_desc(int cmd);
+u32 nd_cmd_in_size(struct nd_dimm *nd_dimm, int cmd,
+		const struct nd_cmd_desc *desc, int idx, void *buf);
+u32 nd_cmd_out_size(struct nd_dimm *nd_dimm, int cmd,
+		const struct nd_cmd_desc *desc, int idx, const u32 *in_field,
+		const u32 *out_field);
 #endif /* __LIBND_H__ */
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 68ceb97c458c..384e8d212b04 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -270,6 +270,7 @@ header-y += ncp_fs.h
 header-y += ncp.h
 header-y += ncp_mount.h
 header-y += ncp_no.h
+header-y += ndctl.h
 header-y += neighbour.h
 header-y += netconf.h
 header-y += netdevice.h
diff --git a/include/uapi/linux/ndctl.h b/include/uapi/linux/ndctl.h
new file mode 100644
index 000000000000..62c01bf76198
--- /dev/null
+++ b/include/uapi/linux/ndctl.h
@@ -0,0 +1,178 @@
+/*
+ * Copyright (c) 2014-2015, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU Lesser General Public License,
+ * version 2.1, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT ANY
+ * WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+ * FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License for
+ * more details.
+ */
+#ifndef __NDCTL_H__
+#define __NDCTL_H__
+
+#include <linux/types.h>
+
+struct nd_cmd_smart {
+	__u32 status;
+	__u8 data[128];
+} __packed;
+
+struct nd_cmd_smart_threshold {
+	__u32 status;
+	__u8 data[8];
+} __packed;
+
+struct nd_cmd_dimm_flags {
+	__u32 status;
+	__u32 flags;
+} __packed;
+
+struct nd_cmd_get_config_size {
+	__u32 status;
+	__u32 config_size;
+	__u32 max_xfer;
+} __packed;
+
+struct nd_cmd_get_config_data_hdr {
+	__u32 in_offset;
+	__u32 in_length;
+	__u32 status;
+	__u8 out_buf[0];
+} __packed;
+
+struct nd_cmd_set_config_hdr {
+	__u32 in_offset;
+	__u32 in_length;
+	__u8 in_buf[0];
+} __packed;
+
+struct nd_cmd_vendor_hdr {
+	__u32 opcode;
+	__u32 in_length;
+	__u8 in_buf[0];
+} __packed;
+
+struct nd_cmd_vendor_tail {
+	__u32 status;
+	__u32 out_length;
+	__u8 out_buf[0];
+} __packed;
+
+struct nd_cmd_ars_cap {
+	__u64 address;
+	__u64 length;
+	__u32 status;
+	__u32 max_ars_out;
+} __packed;
+
+struct nd_cmd_ars_start {
+	__u64 address;
+	__u64 length;
+	__u16 type;
+	__u8 reserved[6];
+	__u32 status;
+} __packed;
+
+struct nd_cmd_ars_query {
+	__u32 status;
+	__u32 out_length;
+	__u64 address;
+	__u64 length;
+	__u16 type;
+	__u32 num_records;
+	struct nd_ars_record {
+		__u32 handle;
+		__u32 flags;
+		__u64 err_address;
+		__u64 mask;
+	} __packed records[0];
+} __packed;
+
+enum {
+	ND_CMD_IMPLEMENTED = 0,
+
+	/* bus commands */
+	ND_CMD_ARS_CAP = 1,
+	ND_CMD_ARS_START = 2,
+	ND_CMD_ARS_QUERY = 3,
+
+	/* per-dimm commands */
+	ND_CMD_SMART = 1,
+	ND_CMD_SMART_THRESHOLD = 2,
+	ND_CMD_DIMM_FLAGS = 3,
+	ND_CMD_GET_CONFIG_SIZE = 4,
+	ND_CMD_GET_CONFIG_DATA = 5,
+	ND_CMD_SET_CONFIG_DATA = 6,
+	ND_CMD_VENDOR_EFFECT_LOG_SIZE = 7,
+	ND_CMD_VENDOR_EFFECT_LOG = 8,
+	ND_CMD_VENDOR = 9,
+};
+
+static inline const char *nd_bus_cmd_name(unsigned cmd)
+{
+	static const char * const names[] = {
+		[ND_CMD_ARS_CAP] = "ars_cap",
+		[ND_CMD_ARS_START] = "ars_start",
+		[ND_CMD_ARS_QUERY] = "ars_query",
+	};
+
+	if (cmd < ARRAY_SIZE(names) && names[cmd])
+		return names[cmd];
+	return "unknown";
+}
+
+static inline const char *nd_dimm_cmd_name(unsigned cmd)
+{
+	static const char * const names[] = {
+		[ND_CMD_SMART] = "smart",
+		[ND_CMD_SMART_THRESHOLD] = "smart_thresh",
+		[ND_CMD_DIMM_FLAGS] = "flags",
+		[ND_CMD_GET_CONFIG_SIZE] = "get_size",
+		[ND_CMD_GET_CONFIG_DATA] = "get_data",
+		[ND_CMD_SET_CONFIG_DATA] = "set_data",
+		[ND_CMD_VENDOR_EFFECT_LOG_SIZE] = "effect_size",
+		[ND_CMD_VENDOR_EFFECT_LOG] = "effect_log",
+		[ND_CMD_VENDOR] = "vendor",
+	};
+
+	if (cmd < ARRAY_SIZE(names) && names[cmd])
+		return names[cmd];
+	return "unknown";
+}
+
+#define ND_IOCTL 'N'
+
+#define ND_IOCTL_SMART			_IOWR(ND_IOCTL, ND_CMD_SMART,\
+					struct nd_cmd_smart)
+
+#define ND_IOCTL_SMART_THRESHOLD	_IOWR(ND_IOCTL, ND_CMD_SMART_THRESHOLD,\
+					struct nd_cmd_smart_threshold)
+
+#define ND_IOCTL_DIMM_FLAGS		_IOWR(ND_IOCTL, ND_CMD_DIMM_FLAGS,\
+					struct nd_cmd_dimm_flags)
+
+#define ND_IOCTL_GET_CONFIG_SIZE	_IOWR(ND_IOCTL, ND_CMD_GET_CONFIG_SIZE,\
+					struct nd_cmd_get_config_size)
+
+#define ND_IOCTL_GET_CONFIG_DATA	_IOWR(ND_IOCTL, ND_CMD_GET_CONFIG_DATA,\
+					struct nd_cmd_get_config_data_hdr)
+
+#define ND_IOCTL_SET_CONFIG_DATA	_IOWR(ND_IOCTL, ND_CMD_SET_CONFIG_DATA,\
+					struct nd_cmd_set_config_hdr)
+
+#define ND_IOCTL_VENDOR			_IOWR(ND_IOCTL, ND_CMD_VENDOR,\
+					struct nd_cmd_vendor_hdr)
+
+#define ND_IOCTL_ARS_CAP		_IOWR(ND_IOCTL, ND_CMD_ARS_CAP,\
+					struct nd_cmd_ars_cap)
+
+#define ND_IOCTL_ARS_START		_IOWR(ND_IOCTL, ND_CMD_ARS_START,\
+					struct nd_cmd_ars_start)
+
+#define ND_IOCTL_ARS_QUERY		_IOWR(ND_IOCTL, ND_CMD_ARS_QUERY,\
+					struct nd_cmd_ars_query)
+
+#endif /* __NDCTL_H__ */



* [PATCH v3 05/21] libnd: control (ioctl) messages for libnd bus and dimm devices
@ 2015-05-20 20:56   ` Dan Williams
From: Dan Williams @ 2015-05-20 20:56 UTC (permalink / raw)
  To: axboe
  Cc: mingo, linux-nvdimm, neilb, gregkh, Rafael J. Wysocki,
	Robert Moore, linux-kernel, linux-acpi, jmoyer, Nicholas Moulin,
	hch

Most discovery/configuration of the libnd-subsystem is done via sysfs
attributes.  However, some libnd buses, particularly the ACPI.NFIT bus,
define a small set of messages that can be passed to the platform.  For
convenience we derive the initial libnd-ioctl command formats directly
from the NFIT DSM Interface Example formats.

    ND_CMD_SMART: media health and diagnostics
    ND_CMD_GET_CONFIG_SIZE: size of the label space
    ND_CMD_GET_CONFIG_DATA: read label space
    ND_CMD_SET_CONFIG_DATA: write label space
    ND_CMD_VENDOR: vendor-specific command passthrough
    ND_CMD_ARS_CAP: report address-range-scrubbing capabilities
    ND_CMD_ARS_START: initiate scrubbing
    ND_CMD_ARS_QUERY: report on scrubbing state
    ND_CMD_SMART_THRESHOLD: configure alarm thresholds for smart events

If a platform later defines commands that differ from this set, it is
straightforward to extend support to those formats.
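
To illustrate why extension is cheap, here is a minimal userspace model
of the descriptor-table idea used in this patch: each command format is
a list of input field sizes followed by output field sizes, so a new
format is one more table entry.  The names cmd_format, VARIABLE and
envelope_len() are made up for this sketch and are not part of the
patch; only the get_data sizes ({ 4, 4 } in, { 4, variable } out) are
taken from it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ELEM 4
#define VARIABLE UINT32_MAX /* size is taken from a request field at run time */

/* Hypothetical userspace model of a command format: input fields, then output. */
struct cmd_format {
	int in_num;
	int out_num;
	uint32_t in_sizes[MAX_ELEM];
	uint32_t out_sizes[MAX_ELEM];
};

/* Mirrors the get_data entry in the patch: in { 4, 4 }, out { 4, variable }. */
static const struct cmd_format get_config_data = {
	.in_num = 2,
	.in_sizes = { 4, 4 },
	.out_num = 2,
	.out_sizes = { 4, VARIABLE },
};

/*
 * Total buffer length for one call: every input field followed by every
 * output field.  'var_len' stands in for the length that a variable-sized
 * field takes from the request (in_length in the get_data case).
 */
static size_t envelope_len(const struct cmd_format *fmt, uint32_t var_len)
{
	size_t len = 0;
	int i;

	assert(fmt && fmt->in_num <= MAX_ELEM && fmt->out_num <= MAX_ELEM);
	for (i = 0; i < fmt->in_num; i++)
		len += fmt->in_sizes[i] == VARIABLE ? var_len : fmt->in_sizes[i];
	for (i = 0; i < fmt->out_num; i++)
		len += fmt->out_sizes[i] == VARIABLE ? var_len : fmt->out_sizes[i];
	return len;
}
```

A caller reading 256 bytes of label space would size its buffer with
envelope_len(&get_config_data, 256), that is, 12 bytes of fixed fields
plus the payload.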

Most of the commands target a specific dimm.  However, the
address-range-scrubbing commands target the bus.  The 'commands'
attribute in sysfs of a libnd-bus or a libnd-nmem (dimm device)
enumerates the supported commands for that object.
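
As a rough sketch of how a tool might consume that attribute, the
helper below checks whether a command name appears in the text of a
'commands' file.  It assumes the format produced by commands_show() in
this patch (names separated by spaces, terminated by a newline); the
helper name nd_commands_has() is hypothetical, and reading the sysfs
file itself is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if 'name' appears as a whole token in 'list', where 'list'
 * is the text of a 'commands' attribute, e.g. "smart get_size get_data \n".
 * Hypothetical helper for illustration only.
 */
static bool nd_commands_has(const char *list, const char *name)
{
	size_t name_len;

	assert(list && name);
	name_len = strlen(name);
	if (name_len == 0)
		return false;

	while (*list) {
		size_t tok_len;

		/* skip separators, then measure the next token */
		list += strspn(list, " \n");
		tok_len = strcspn(list, " \n");
		if (tok_len == name_len && memcmp(list, name, name_len) == 0)
			return true;
		list += tok_len;
	}
	return false;
}
```

Matching whole tokens matters here because some names are prefixes of
others ("smart" and "smart_thresh"), so a plain substring search would
report false positives.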

Cc: <linux-acpi@vger.kernel.org>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reported-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/acpi/Kconfig          |   12 ++
 drivers/acpi/nfit.c           |  213 +++++++++++++++++++++++++++
 drivers/acpi/nfit.h           |    3 
 drivers/block/nd/bus.c        |  324 ++++++++++++++++++++++++++++++++++++++++-
 drivers/block/nd/core.c       |   16 ++
 drivers/block/nd/dimm_devs.c  |   38 ++++-
 drivers/block/nd/nd-private.h |    3 
 include/linux/libnd.h         |   25 +++
 include/uapi/linux/Kbuild     |    1 
 include/uapi/linux/ndctl.h    |  178 +++++++++++++++++++++++
 10 files changed, 803 insertions(+), 10 deletions(-)
 create mode 100644 include/uapi/linux/ndctl.h

diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
index 84d046d4ed17..0690045ba270 100644
--- a/drivers/acpi/Kconfig
+++ b/drivers/acpi/Kconfig
@@ -390,6 +390,18 @@ config ACPI_NFIT
 	  To compile this driver as a module, choose M here:
 	  the module will be called nfit.
 
+config ACPI_NFIT_DEBUG
+	bool "NFIT DSM debug"
+	depends on ACPI_NFIT
+	depends on DYNAMIC_DEBUG
+	default n
+	help
+	  Enabling this option causes the nfit driver to dump the
+	  input and output buffers of _DSM operations on the ACPI0012
+	  device and its children.  This can be very verbose, so leave
+	  it disabled unless you are debugging a hardware / firmware
+	  issue.
+
 source "drivers/acpi/apei/Kconfig"
 
 config ACPI_EXTLOG
diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index b26e1a4a59e3..b7c1c5a5b589 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -13,6 +13,7 @@
 #include <linux/list_sort.h>
 #include <linux/module.h>
 #include <linux/libnd.h>
+#include <linux/ndctl.h>
 #include <linux/list.h>
 #include <linux/acpi.h>
 #include "nfit.h"
@@ -24,11 +25,150 @@ static const u8 *to_nfit_uuid(enum nfit_uuids id)
 	return nfit_uuid[id];
 }
 
+static struct acpi_nfit_desc *to_acpi_nfit_desc(struct nd_bus_descriptor *nd_desc)
+{
+	return container_of(nd_desc, struct acpi_nfit_desc, nd_desc);
+}
+
+static struct acpi_device *to_acpi_dev(struct acpi_nfit_desc *acpi_desc)
+{
+	struct nd_bus_descriptor *nd_desc = &acpi_desc->nd_desc;
+
+	/*
+	 * If provider == 'ACPI.NFIT' we can assume 'dev' is a struct
+	 * acpi_device.
+	 */
+	if (!nd_desc->provider_name
+			|| strcmp(nd_desc->provider_name, "ACPI.NFIT") != 0)
+		return NULL;
+
+	return to_acpi_device(acpi_desc->dev);
+}
+
 static int acpi_nfit_ctl(struct nd_bus_descriptor *nd_desc,
 		struct nd_dimm *nd_dimm, unsigned int cmd, void *buf,
 		unsigned int buf_len)
 {
-	return -ENOTTY;
+	struct acpi_nfit_desc *acpi_desc = to_acpi_nfit_desc(nd_desc);
+	const struct nd_cmd_desc *desc = NULL;
+	union acpi_object in_obj, in_buf, *out_obj;
+	struct device *dev = acpi_desc->dev;
+	const char *cmd_name, *dimm_name;
+	unsigned long dsm_mask;
+	acpi_handle handle;
+	const u8 *uuid;
+	u32 offset;
+	int rc, i;
+
+	if (nd_dimm) {
+		struct nfit_mem *nfit_mem = nd_dimm_provider_data(nd_dimm);
+		struct acpi_device *adev = nfit_mem->adev;
+
+		if (!adev)
+			return -ENOTTY;
+		dimm_name = dev_name(&adev->dev);
+		cmd_name = nd_dimm_cmd_name(cmd);
+		dsm_mask = nfit_mem->dsm_mask;
+		desc = nd_cmd_dimm_desc(cmd);
+		uuid = to_nfit_uuid(NFIT_DEV_DIMM);
+		handle = adev->handle;
+	} else {
+		struct acpi_device *adev = to_acpi_dev(acpi_desc);
+
+		cmd_name = nd_bus_cmd_name(cmd);
+		dsm_mask = nd_desc->dsm_mask;
+		desc = nd_cmd_bus_desc(cmd);
+		uuid = to_nfit_uuid(NFIT_DEV_BUS);
+		handle = adev->handle;
+		dimm_name = "bus";
+	}
+
+	if (!desc || (cmd && (desc->out_num + desc->in_num == 0)))
+		return -ENOTTY;
+
+	if (!test_bit(cmd, &dsm_mask))
+		return -ENOTTY;
+
+	in_obj.type = ACPI_TYPE_PACKAGE;
+	in_obj.package.count = 1;
+	in_obj.package.elements = &in_buf;
+	in_buf.type = ACPI_TYPE_BUFFER;
+	in_buf.buffer.pointer = buf;
+	in_buf.buffer.length = 0;
+
+	/* libnd has already validated the input envelope */
+	for (i = 0; i < desc->in_num; i++)
+		in_buf.buffer.length += nd_cmd_in_size(nd_dimm, cmd, desc, i, buf);
+
+	if (IS_ENABLED(CONFIG_ACPI_NFIT_DEBUG)) {
+		dev_dbg(dev, "%s:%s cmd: %s input length: %d\n", __func__,
+				dimm_name, cmd_name, in_buf.buffer.length);
+		print_hex_dump_debug(cmd_name, DUMP_PREFIX_OFFSET, 4,
+				4, in_buf.buffer.pointer, min_t(u32, 128,
+					in_buf.buffer.length), true);
+	}
+
+	out_obj = acpi_evaluate_dsm(handle, uuid, 1, cmd, &in_obj);
+	if (!out_obj) {
+		dev_dbg(dev, "%s:%s _DSM failed cmd: %s\n", __func__, dimm_name,
+				cmd_name);
+		return -EINVAL;
+	}
+
+	if (out_obj->type != ACPI_TYPE_BUFFER) {
+		dev_dbg(dev, "%s:%s unexpected output object type cmd: %s type: %d\n",
+				__func__, dimm_name, cmd_name, out_obj->type);
+		rc = -EINVAL;
+		goto out;
+	}
+
+	if (IS_ENABLED(CONFIG_ACPI_NFIT_DEBUG)) {
+		dev_dbg(dev, "%s:%s cmd: %s output length: %d\n", __func__,
+				dimm_name, cmd_name, out_obj->buffer.length);
+		print_hex_dump_debug(cmd_name, DUMP_PREFIX_OFFSET, 4,
+				4, out_obj->buffer.pointer, min_t(u32, 128,
+					out_obj->buffer.length), true);
+	}
+
+	for (i = 0, offset = 0; i < desc->out_num; i++) {
+		u32 out_size = nd_cmd_out_size(nd_dimm, cmd, desc, i, buf,
+				(u32 *) out_obj->buffer.pointer);
+
+		if (offset + out_size > out_obj->buffer.length) {
+			dev_dbg(dev, "%s:%s output object underflow cmd: %s field: %d\n",
+					__func__, dimm_name, cmd_name, i);
+			break;
+		}
+
+		if (in_buf.buffer.length + offset + out_size > buf_len) {
+			dev_dbg(dev, "%s:%s output overrun cmd: %s field: %d\n",
+					__func__, dimm_name, cmd_name, i);
+			rc = -ENXIO;
+			goto out;
+		}
+		memcpy(buf + in_buf.buffer.length + offset,
+				out_obj->buffer.pointer + offset, out_size);
+		offset += out_size;
+	}
+	if (offset + in_buf.buffer.length < buf_len) {
+		if (i >= 1) {
+			/*
+			 * status valid, return the number of bytes left
+			 * unfilled in the output buffer
+			 */
+			rc = buf_len - offset - in_buf.buffer.length;
+		} else {
+			dev_err(dev, "%s:%s underrun cmd: %s buf_len: %d out_len: %d\n",
+					__func__, dimm_name, cmd_name, buf_len, offset);
+			rc = -ENXIO;
+		}
+	} else
+		rc = 0;
+
+ out:
+	ACPI_FREE(out_obj);
+
+	return rc;
 }
 
 static const char *spa_type_name(u16 type)
@@ -451,6 +591,7 @@ static struct attribute_group acpi_nfit_dimm_attribute_group = {
 };
 
 static const struct attribute_group *acpi_nfit_dimm_attribute_groups[] = {
+	&nd_dimm_attribute_group,
 	&acpi_nfit_dimm_attribute_group,
 	NULL,
 };
@@ -467,6 +608,50 @@ static struct nd_dimm *acpi_nfit_dimm_by_handle(struct acpi_nfit_desc *acpi_desc
 	return NULL;
 }
 
+static int acpi_nfit_add_dimm(struct acpi_nfit_desc *acpi_desc,
+		struct nfit_mem *nfit_mem, u32 device_handle)
+{
+	struct acpi_device *adev, *adev_dimm;
+	struct device *dev = acpi_desc->dev;
+	const u8 *uuid = to_nfit_uuid(NFIT_DEV_DIMM);
+	unsigned long long sta;
+	int i, rc = -ENODEV;
+	acpi_status status;
+
+	nfit_mem->dsm_mask = acpi_desc->dimm_dsm_force_en;
+	adev = to_acpi_dev(acpi_desc);
+	if (!adev)
+		return 0;
+
+	adev_dimm = acpi_find_child_device(adev, device_handle, false);
+	nfit_mem->adev = adev_dimm;
+	if (!adev_dimm) {
+		dev_err(dev, "no ACPI.NFIT device with _ADR %#x, disabling...\n",
+				device_handle);
+		return -ENODEV;
+	}
+
+	status = acpi_evaluate_integer(adev_dimm->handle, "_STA", NULL, &sta);
+	if (status == AE_NOT_FOUND) {
+		dev_dbg(dev, "%s missing _STA, assuming enabled...\n",
+				dev_name(&adev_dimm->dev));
+		rc = 0;
+	} else if (ACPI_FAILURE(status))
+		dev_err(dev, "%s failed to retrieve _STA, disabling...\n",
+				dev_name(&adev_dimm->dev));
+	else if ((sta & ACPI_STA_DEVICE_ENABLED) == 0)
+		dev_info(dev, "%s disabled by firmware\n",
+				dev_name(&adev_dimm->dev));
+	else
+		rc = 0;
+
+	for (i = ND_CMD_SMART; i <= ND_CMD_VENDOR; i++)
+		if (acpi_check_dsm(adev_dimm->handle, uuid, 1, 1ULL << i))
+			set_bit(i, &nfit_mem->dsm_mask);
+
+	return rc;
+}
+
 static int acpi_nfit_register_dimms(struct acpi_nfit_desc *acpi_desc)
 {
 	struct nfit_mem *nfit_mem;
@@ -475,6 +660,7 @@ static int acpi_nfit_register_dimms(struct acpi_nfit_desc *acpi_desc)
 		struct nd_dimm *nd_dimm;
 		unsigned long flags = 0;
 		u32 device_handle;
+		int rc;
 
 		device_handle = __to_nfit_memdev(nfit_mem)->device_handle;
 		nd_dimm = acpi_nfit_dimm_by_handle(acpi_desc, device_handle);
@@ -491,8 +677,13 @@ static int acpi_nfit_register_dimms(struct acpi_nfit_desc *acpi_desc)
 		if (nfit_mem->bdw && nfit_mem->memdev_pmem)
 			flags |= NDD_ALIASING;
 
+		rc = acpi_nfit_add_dimm(acpi_desc, nfit_mem, device_handle);
+		if (rc)
+			continue;
+
 		nd_dimm = nd_dimm_create(acpi_desc->nd_bus, nfit_mem,
-				acpi_nfit_dimm_attribute_groups, flags);
+				acpi_nfit_dimm_attribute_groups,
+				flags, &nfit_mem->dsm_mask);
 		if (!nd_dimm)
 			return -ENOMEM;
 
@@ -502,6 +693,22 @@ static int acpi_nfit_register_dimms(struct acpi_nfit_desc *acpi_desc)
 	return 0;
 }
 
+static void acpi_nfit_init_dsms(struct acpi_nfit_desc *acpi_desc)
+{
+	struct nd_bus_descriptor *nd_desc = &acpi_desc->nd_desc;
+	const u8 *uuid = to_nfit_uuid(NFIT_DEV_BUS);
+	struct acpi_device *adev;
+	int i;
+
+	adev = to_acpi_dev(acpi_desc);
+	if (!adev)
+		return;
+
+	for (i = ND_CMD_ARS_CAP; i <= ND_CMD_ARS_QUERY; i++)
+		if (acpi_check_dsm(adev->handle, uuid, 1, 1ULL << i))
+			set_bit(i, &nd_desc->dsm_mask);
+}
+
 static int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
 {
 	struct device *dev = acpi_desc->dev;
@@ -529,6 +736,8 @@ static int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
 	if (nfit_mem_init(acpi_desc) != 0)
 		return -ENOMEM;
 
+	acpi_nfit_init_dsms(acpi_desc);
+
 	return acpi_nfit_register_dimms(acpi_desc);
 }
 
diff --git a/drivers/acpi/nfit.h b/drivers/acpi/nfit.h
index 9d4c1634cb0e..cc496ba6bbd2 100644
--- a/drivers/acpi/nfit.h
+++ b/drivers/acpi/nfit.h
@@ -67,6 +67,8 @@ struct nfit_mem {
 	struct acpi_nfit_system_address *spa_dcr;
 	struct acpi_nfit_system_address *spa_bdw;
 	struct list_head list;
+	struct acpi_device *adev;
+	unsigned long dsm_mask;
 };
 
 struct acpi_nfit_desc {
@@ -79,6 +81,7 @@ struct acpi_nfit_desc {
 	struct list_head bdws;
 	struct nd_bus *nd_bus;
 	struct device *dev;
+	unsigned long dimm_dsm_force_en;
 };
 
 static inline struct acpi_nfit_memory_map *__to_nfit_memdev(struct nfit_mem *nfit_mem)
diff --git a/drivers/block/nd/bus.c b/drivers/block/nd/bus.c
index ee56aa1ab2ad..f072a9e0c1bd 100644
--- a/drivers/block/nd/bus.c
+++ b/drivers/block/nd/bus.c
@@ -11,14 +11,18 @@
  * General Public License for more details.
  */
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/vmalloc.h>
 #include <linux/uaccess.h>
 #include <linux/fcntl.h>
 #include <linux/async.h>
+#include <linux/ndctl.h>
 #include <linux/slab.h>
 #include <linux/fs.h>
 #include <linux/io.h>
+#include <linux/mm.h>
 #include "nd-private.h"
 
+int nd_dimm_major;
 static int nd_bus_major;
 static struct class *nd_class;
 
@@ -47,19 +51,323 @@ void nd_bus_destroy_ndctl(struct nd_bus *nd_bus)
 	device_destroy(nd_class, MKDEV(nd_bus_major, nd_bus->id));
 }
 
+static const struct nd_cmd_desc __nd_cmd_dimm_descs[] = {
+	[ND_CMD_IMPLEMENTED] = { },
+	[ND_CMD_SMART] = {
+		.out_num = 2,
+		.out_sizes = { 4, 128, },
+	},
+	[ND_CMD_SMART_THRESHOLD] = {
+		.out_num = 2,
+		.out_sizes = { 4, 8, },
+	},
+	[ND_CMD_DIMM_FLAGS] = {
+		.out_num = 2,
+		.out_sizes = { 4, 4 },
+	},
+	[ND_CMD_GET_CONFIG_SIZE] = {
+		.out_num = 3,
+		.out_sizes = { 4, 4, 4, },
+	},
+	[ND_CMD_GET_CONFIG_DATA] = {
+		.in_num = 2,
+		.in_sizes = { 4, 4, },
+		.out_num = 2,
+		.out_sizes = { 4, UINT_MAX, },
+	},
+	[ND_CMD_SET_CONFIG_DATA] = {
+		.in_num = 3,
+		.in_sizes = { 4, 4, UINT_MAX, },
+		.out_num = 1,
+		.out_sizes = { 4, },
+	},
+	[ND_CMD_VENDOR] = {
+		.in_num = 3,
+		.in_sizes = { 4, 4, UINT_MAX, },
+		.out_num = 3,
+		.out_sizes = { 4, 4, UINT_MAX, },
+	},
+};
+
+const struct nd_cmd_desc *nd_cmd_dimm_desc(int cmd)
+{
+	if (cmd < ARRAY_SIZE(__nd_cmd_dimm_descs))
+		return &__nd_cmd_dimm_descs[cmd];
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(nd_cmd_dimm_desc);
+
+static const struct nd_cmd_desc __nd_cmd_bus_descs[] = {
+	[ND_CMD_IMPLEMENTED] = { },
+	[ND_CMD_ARS_CAP] = {
+		.in_num = 2,
+		.in_sizes = { 8, 8, },
+		.out_num = 2,
+		.out_sizes = { 4, 4, },
+	},
+	[ND_CMD_ARS_START] = {
+		.in_num = 4,
+		.in_sizes = { 8, 8, 2, 6, },
+		.out_num = 1,
+		.out_sizes = { 4, },
+	},
+	[ND_CMD_ARS_QUERY] = {
+		.out_num = 2,
+		.out_sizes = { 4, UINT_MAX, },
+	},
+};
+
+const struct nd_cmd_desc *nd_cmd_bus_desc(int cmd)
+{
+	if (cmd < ARRAY_SIZE(__nd_cmd_bus_descs))
+		return &__nd_cmd_bus_descs[cmd];
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(nd_cmd_bus_desc);
+
+u32 nd_cmd_in_size(struct nd_dimm *nd_dimm, int cmd,
+		const struct nd_cmd_desc *desc, int idx, void *buf)
+{
+	if (idx >= desc->in_num)
+		return UINT_MAX;
+
+	if (desc->in_sizes[idx] < UINT_MAX)
+		return desc->in_sizes[idx];
+
+	if (nd_dimm && cmd == ND_CMD_SET_CONFIG_DATA && idx == 2) {
+		struct nd_cmd_set_config_hdr *hdr = buf;
+
+		return hdr->in_length;
+	} else if (nd_dimm && cmd == ND_CMD_VENDOR && idx == 2) {
+		struct nd_cmd_vendor_hdr *hdr = buf;
+
+		return hdr->in_length;
+	}
+
+	return UINT_MAX;
+}
+EXPORT_SYMBOL_GPL(nd_cmd_in_size);
+
+u32 nd_cmd_out_size(struct nd_dimm *nd_dimm, int cmd,
+		const struct nd_cmd_desc *desc, int idx, const u32 *in_field,
+		const u32 *out_field)
+{
+	if (idx >= desc->out_num)
+		return UINT_MAX;
+
+	if (desc->out_sizes[idx] < UINT_MAX)
+		return desc->out_sizes[idx];
+
+	if (nd_dimm && cmd == ND_CMD_GET_CONFIG_DATA && idx == 1)
+		return in_field[1];
+	else if (nd_dimm && cmd == ND_CMD_VENDOR && idx == 2)
+		return out_field[1];
+	else if (!nd_dimm && cmd == ND_CMD_ARS_QUERY && idx == 1)
+		return ND_CMD_ARS_QUERY_MAX;
+
+	return UINT_MAX;
+}
+EXPORT_SYMBOL_GPL(nd_cmd_out_size);
+
+static int __nd_ioctl(struct nd_bus *nd_bus, struct nd_dimm *nd_dimm,
+		int read_only, unsigned int ioctl_cmd, unsigned long arg)
+{
+	struct nd_bus_descriptor *nd_desc = nd_bus->nd_desc;
+	size_t buf_len = 0, in_len = 0, out_len = 0;
+	static char out_env[ND_CMD_MAX_ENVELOPE];
+	static char in_env[ND_CMD_MAX_ENVELOPE];
+	const struct nd_cmd_desc *desc = NULL;
+	unsigned int cmd = _IOC_NR(ioctl_cmd);
+	void __user *p = (void __user *) arg;
+	struct device *dev = &nd_bus->dev;
+	const char *cmd_name, *dimm_name;
+	unsigned long dsm_mask;
+	void *buf;
+	int rc, i;
+
+	if (nd_dimm) {
+		desc = nd_cmd_dimm_desc(cmd);
+		cmd_name = nd_dimm_cmd_name(cmd);
+		dsm_mask = nd_dimm->dsm_mask ? *(nd_dimm->dsm_mask) : 0;
+		dimm_name = dev_name(&nd_dimm->dev);
+	} else {
+		desc = nd_cmd_bus_desc(cmd);
+		cmd_name = nd_bus_cmd_name(cmd);
+		dsm_mask = nd_desc->dsm_mask;
+		dimm_name = "bus";
+	}
+
+	if (!desc || (desc->out_num + desc->in_num == 0) ||
+			!test_bit(cmd, &dsm_mask))
+		return -ENOTTY;
+
+	/* fail write commands (when read-only) */
+	if (read_only)
+		switch (ioctl_cmd) {
+		case ND_IOCTL_VENDOR:
+		case ND_IOCTL_SET_CONFIG_DATA:
+		case ND_IOCTL_ARS_START:
+			dev_dbg(&nd_bus->dev, "'%s' command while read-only.\n",
+					nd_dimm ? nd_dimm_cmd_name(cmd)
+					: nd_bus_cmd_name(cmd));
+			return -EPERM;
+		default:
+			break;
+		}
+
+	/* process an input envelope */
+	for (i = 0; i < desc->in_num; i++) {
+		u32 in_size, copy;
+
+		in_size = nd_cmd_in_size(nd_dimm, cmd, desc, i, in_env);
+		if (in_size == UINT_MAX) {
+			dev_err(dev, "%s:%s unknown input size cmd: %s field: %d\n",
+					__func__, dimm_name, cmd_name, i);
+			return -ENXIO;
+		}
+		if (!access_ok(VERIFY_READ, p + in_len, in_size))
+			return -EFAULT;
+		if (in_len < sizeof(in_env))
+			copy = min_t(u32, sizeof(in_env) - in_len, in_size);
+		else
+			copy = 0;
+		if (copy && copy_from_user(&in_env[in_len], p + in_len, copy))
+			return -EFAULT;
+		in_len += in_size;
+	}
+
+	/* process an output envelope */
+	for (i = 0; i < desc->out_num; i++) {
+		u32 out_size = nd_cmd_out_size(nd_dimm, cmd, desc, i,
+				(u32 *) in_env, (u32 *) out_env);
+		u32 copy;
+
+		if (out_size == UINT_MAX) {
+			dev_dbg(dev, "%s:%s unknown output size cmd: %s field: %d\n",
+					__func__, dimm_name, cmd_name, i);
+			return -EFAULT;
+		}
+		if (!access_ok(VERIFY_WRITE, p + in_len + out_len, out_size))
+			return -EFAULT;
+		if (out_len < sizeof(out_env))
+			copy = min_t(u32, sizeof(out_env) - out_len, out_size);
+		else
+			copy = 0;
+		if (copy && copy_from_user(&out_env[out_len], p + in_len + out_len,
+					copy))
+			return -EFAULT;
+		out_len += out_size;
+	}
+
+	buf_len = out_len + in_len;
+	if (!access_ok(VERIFY_WRITE, p, sizeof(buf_len)))
+		return -EFAULT;
+
+	if (buf_len > ND_IOCTL_MAX_BUFLEN) {
+		dev_dbg(dev, "%s:%s cmd: %s buf_len: %zd > %d\n", __func__,
+				dimm_name, cmd_name, buf_len,
+				ND_IOCTL_MAX_BUFLEN);
+		return -EINVAL;
+	}
+
+	buf = vmalloc(buf_len);
+	if (!buf)
+		return -ENOMEM;
+
+	if (copy_from_user(buf, p, buf_len)) {
+		rc = -EFAULT;
+		goto out;
+	}
+
+	rc = nd_desc->ndctl(nd_desc, nd_dimm, cmd, buf, buf_len);
+	if (rc < 0)
+		goto out;
+	if (copy_to_user(p, buf, buf_len))
+		rc = -EFAULT;
+ out:
+	vfree(buf);
+	return rc;
+}
+
 static long nd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 {
-	return -ENXIO;
+	long id = (long) file->private_data;
+	int rc = -ENXIO, read_only;
+	struct nd_bus *nd_bus;
+
+	read_only = (O_RDWR != (file->f_flags & O_ACCMODE));
+	mutex_lock(&nd_bus_list_mutex);
+	list_for_each_entry(nd_bus, &nd_bus_list, list) {
+		if (nd_bus->id == id) {
+			rc = __nd_ioctl(nd_bus, NULL, read_only, cmd, arg);
+			break;
+		}
+	}
+	mutex_unlock(&nd_bus_list_mutex);
+
+	return rc;
+}
+
+static int match_dimm(struct device *dev, void *data)
+{
+	long id = (long) data;
+
+	if (is_nd_dimm(dev)) {
+		struct nd_dimm *nd_dimm = to_nd_dimm(dev);
+
+		return nd_dimm->id == id;
+	}
+
+	return 0;
+}
+
+static long nd_dimm_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+	int rc = -ENXIO, read_only;
+	struct nd_bus *nd_bus;
+
+	read_only = (O_RDWR != (file->f_flags & O_ACCMODE));
+	mutex_lock(&nd_bus_list_mutex);
+	list_for_each_entry(nd_bus, &nd_bus_list, list) {
+		struct device *dev = device_find_child(&nd_bus->dev,
+				file->private_data, match_dimm);
+
+		if (!dev)
+			continue;
+
+		rc = __nd_ioctl(nd_bus, to_nd_dimm(dev), read_only, cmd, arg);
+		put_device(dev);
+		break;
+	}
+	mutex_unlock(&nd_bus_list_mutex);
+
+	return rc;
+}
+
+static int nd_open(struct inode *inode, struct file *file)
+{
+	long minor = iminor(inode);
+
+	file->private_data = (void *) minor;
+	return 0;
 }
 
 static const struct file_operations nd_bus_fops = {
 	.owner = THIS_MODULE,
-	.open = nonseekable_open,
+	.open = nd_open,
 	.unlocked_ioctl = nd_ioctl,
 	.compat_ioctl = nd_ioctl,
 	.llseek = noop_llseek,
 };
 
+static const struct file_operations nd_dimm_fops = {
+	.owner = THIS_MODULE,
+	.open = nd_open,
+	.unlocked_ioctl = nd_dimm_ioctl,
+	.compat_ioctl = nd_dimm_ioctl,
+	.llseek = noop_llseek,
+};
+
 int __init nd_bus_init(void)
 {
 	int rc;
@@ -70,9 +378,14 @@ int __init nd_bus_init(void)
 
 	rc = register_chrdev(0, "ndctl", &nd_bus_fops);
 	if (rc < 0)
-		goto err_chrdev;
+		goto err_bus_chrdev;
 	nd_bus_major = rc;
 
+	rc = register_chrdev(0, "dimmctl", &nd_dimm_fops);
+	if (rc < 0)
+		goto err_dimm_chrdev;
+	nd_dimm_major = rc;
+
 	nd_class = class_create(THIS_MODULE, "nd");
 	if (IS_ERR(nd_class))
 		goto err_class;
@@ -80,8 +393,10 @@ int __init nd_bus_init(void)
 	return 0;
 
  err_class:
+	unregister_chrdev(nd_dimm_major, "dimmctl");
+ err_dimm_chrdev:
 	unregister_chrdev(nd_bus_major, "ndctl");
- err_chrdev:
+ err_bus_chrdev:
 	bus_unregister(&nd_bus_type);
 
 	return rc;
@@ -91,5 +406,6 @@ void __exit nd_bus_exit(void)
 {
 	class_destroy(nd_class);
 	unregister_chrdev(nd_bus_major, "ndctl");
+	unregister_chrdev(nd_dimm_major, "dimmctl");
 	bus_unregister(&nd_bus_type);
 }
diff --git a/drivers/block/nd/core.c b/drivers/block/nd/core.c
index 4d0e53ecdcb0..d7a922913da2 100644
--- a/drivers/block/nd/core.c
+++ b/drivers/block/nd/core.c
@@ -14,6 +14,7 @@
 #include <linux/module.h>
 #include <linux/device.h>
 #include <linux/libnd.h>
+#include <linux/ndctl.h>
 #include <linux/mutex.h>
 #include <linux/slab.h>
 #include "nd-private.h"
@@ -59,6 +60,20 @@ struct nd_bus *walk_to_nd_bus(struct device *nd_dev)
 	return NULL;
 }
 
+static ssize_t commands_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	int cmd, len = 0;
+	struct nd_bus *nd_bus = to_nd_bus(dev);
+	struct nd_bus_descriptor *nd_desc = nd_bus->nd_desc;
+
+	for_each_set_bit(cmd, &nd_desc->dsm_mask, BITS_PER_LONG)
+		len += sprintf(buf + len, "%s ", nd_bus_cmd_name(cmd));
+	len += sprintf(buf + len, "\n");
+	return len;
+}
+static DEVICE_ATTR_RO(commands);
+
 static const char *nd_bus_provider(struct nd_bus *nd_bus)
 {
 	struct nd_bus_descriptor *nd_desc = nd_bus->nd_desc;
@@ -82,6 +97,7 @@ static ssize_t provider_show(struct device *dev,
 static DEVICE_ATTR_RO(provider);
 
 static struct attribute *nd_bus_attributes[] = {
+	&dev_attr_commands.attr,
 	&dev_attr_provider.attr,
 	NULL,
 };
diff --git a/drivers/block/nd/dimm_devs.c b/drivers/block/nd/dimm_devs.c
index 19b081392f2f..3fa26f61c3db 100644
--- a/drivers/block/nd/dimm_devs.c
+++ b/drivers/block/nd/dimm_devs.c
@@ -12,6 +12,7 @@
  */
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 #include <linux/device.h>
+#include <linux/ndctl.h>
 #include <linux/slab.h>
 #include <linux/io.h>
 #include <linux/fs.h>
@@ -33,7 +34,7 @@ static struct device_type nd_dimm_device_type = {
 	.release = nd_dimm_release,
 };
 
-static bool is_nd_dimm(struct device *dev)
+bool is_nd_dimm(struct device *dev)
 {
 	return dev->type == &nd_dimm_device_type;
 }
@@ -55,12 +56,41 @@ EXPORT_SYMBOL_GPL(nd_dimm_name);
 
 void *nd_dimm_provider_data(struct nd_dimm *nd_dimm)
 {
-	return nd_dimm->provider_data;
+	if (nd_dimm)
+		return nd_dimm->provider_data;
+	return NULL;
 }
 EXPORT_SYMBOL_GPL(nd_dimm_provider_data);
 
+static ssize_t commands_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_dimm *nd_dimm = to_nd_dimm(dev);
+	int cmd, len = 0;
+
+	if (!nd_dimm->dsm_mask)
+		return sprintf(buf, "\n");
+
+	for_each_set_bit(cmd, nd_dimm->dsm_mask, BITS_PER_LONG)
+		len += sprintf(buf + len, "%s ", nd_dimm_cmd_name(cmd));
+	len += sprintf(buf + len, "\n");
+	return len;
+}
+static DEVICE_ATTR_RO(commands);
+
+static struct attribute *nd_dimm_attributes[] = {
+	&dev_attr_commands.attr,
+	NULL,
+};
+
+struct attribute_group nd_dimm_attribute_group = {
+	.attrs = nd_dimm_attributes,
+};
+EXPORT_SYMBOL_GPL(nd_dimm_attribute_group);
+
 struct nd_dimm *nd_dimm_create(struct nd_bus *nd_bus, void *provider_data,
-		const struct attribute_group **groups, unsigned long flags)
+		const struct attribute_group **groups, unsigned long flags,
+		unsigned long *dsm_mask)
 {
 	struct nd_dimm *nd_dimm = kzalloc(sizeof(*nd_dimm), GFP_KERNEL);
 	struct device *dev;
@@ -75,12 +105,14 @@ struct nd_dimm *nd_dimm_create(struct nd_bus *nd_bus, void *provider_data,
 	}
 	nd_dimm->provider_data = provider_data;
 	nd_dimm->flags = flags;
+	nd_dimm->dsm_mask = dsm_mask;
 
 	dev = &nd_dimm->dev;
 	dev_set_name(dev, "nmem%d", nd_dimm->id);
 	dev->parent = &nd_bus->dev;
 	dev->type = &nd_dimm_device_type;
 	dev->bus = &nd_bus_type;
+	dev->devt = MKDEV(nd_dimm_major, nd_dimm->id);
 	dev->groups = groups;
 	if (device_register(dev) != 0) {
 		put_device(dev);
diff --git a/drivers/block/nd/nd-private.h b/drivers/block/nd/nd-private.h
index 251ecdd77153..c71a5f34355a 100644
--- a/drivers/block/nd/nd-private.h
+++ b/drivers/block/nd/nd-private.h
@@ -18,6 +18,7 @@
 extern struct list_head nd_bus_list;
 extern struct mutex nd_bus_list_mutex;
 extern struct bus_type nd_bus_type;
+extern int nd_dimm_major;
 
 struct nd_bus {
 	struct nd_bus_descriptor *nd_desc;
@@ -29,10 +30,12 @@ struct nd_bus {
 struct nd_dimm {
 	unsigned long flags;
 	void *provider_data;
+	unsigned long *dsm_mask;
 	struct device dev;
 	int id;
 };
 
+bool is_nd_dimm(struct device *dev);
 struct nd_bus *walk_to_nd_bus(struct device *nd_dev);
 int __init nd_bus_init(void);
 void __exit nd_bus_exit(void);
diff --git a/include/linux/libnd.h b/include/linux/libnd.h
index 76d5839fb50e..ca72c49ae376 100644
--- a/include/linux/libnd.h
+++ b/include/linux/libnd.h
@@ -14,13 +14,21 @@
  */
 #ifndef __LIBND_H__
 #define __LIBND_H__
+#include <linux/sizes.h>
 
 enum {
 	/* when a dimm supports both PMEM and BLK access a label is required */
 	NDD_ALIASING = 1 << 0,
+
+	/* need to set a limit somewhere, but yes, this is likely overkill */
+	ND_IOCTL_MAX_BUFLEN = SZ_4M,
+	ND_CMD_MAX_ELEM = 4,
+	ND_CMD_MAX_ENVELOPE = 16,
+	ND_CMD_ARS_QUERY_MAX = SZ_4K,
 };
 
 extern struct attribute_group nd_bus_attribute_group;
+extern struct attribute_group nd_dimm_attribute_group;
 
 struct nd_dimm;
 struct nd_bus_descriptor;
@@ -35,6 +43,13 @@ struct nd_bus_descriptor {
 	ndctl_fn ndctl;
 };
 
+struct nd_cmd_desc {
+	int in_num;
+	int out_num;
+	u32 in_sizes[ND_CMD_MAX_ELEM];
+	int out_sizes[ND_CMD_MAX_ELEM];
+};
+
 struct nd_bus;
 struct device;
 struct nd_bus *nd_bus_register(struct device *parent,
@@ -46,5 +61,13 @@ struct nd_bus_descriptor *to_nd_desc(struct nd_bus *nd_bus);
 const char *nd_dimm_name(struct nd_dimm *nd_dimm);
 void *nd_dimm_provider_data(struct nd_dimm *nd_dimm);
 struct nd_dimm *nd_dimm_create(struct nd_bus *nd_bus, void *provider_data,
-		const struct attribute_group **groups, unsigned long flags);
+		const struct attribute_group **groups, unsigned long flags,
+		unsigned long *dsm_mask);
+const struct nd_cmd_desc *nd_cmd_dimm_desc(int cmd);
+const struct nd_cmd_desc *nd_cmd_bus_desc(int cmd);
+u32 nd_cmd_in_size(struct nd_dimm *nd_dimm, int cmd,
+		const struct nd_cmd_desc *desc, int idx, void *buf);
+u32 nd_cmd_out_size(struct nd_dimm *nd_dimm, int cmd,
+		const struct nd_cmd_desc *desc, int idx, const u32 *in_field,
+		const u32 *out_field);
 #endif /* __LIBND_H__ */
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 68ceb97c458c..384e8d212b04 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -270,6 +270,7 @@ header-y += ncp_fs.h
 header-y += ncp.h
 header-y += ncp_mount.h
 header-y += ncp_no.h
+header-y += ndctl.h
 header-y += neighbour.h
 header-y += netconf.h
 header-y += netdevice.h
diff --git a/include/uapi/linux/ndctl.h b/include/uapi/linux/ndctl.h
new file mode 100644
index 000000000000..62c01bf76198
--- /dev/null
+++ b/include/uapi/linux/ndctl.h
@@ -0,0 +1,178 @@
+/*
+ * Copyright (c) 2014-2015, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU Lesser General Public License,
+ * version 2.1, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT ANY
+ * WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+ * FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License for
+ * more details.
+ */
+#ifndef __NDCTL_H__
+#define __NDCTL_H__
+
+#include <linux/types.h>
+
+struct nd_cmd_smart {
+	__u32 status;
+	__u8 data[128];
+} __packed;
+
+struct nd_cmd_smart_threshold {
+	__u32 status;
+	__u8 data[8];
+} __packed;
+
+struct nd_cmd_dimm_flags {
+	__u32 status;
+	__u32 flags;
+} __packed;
+
+struct nd_cmd_get_config_size {
+	__u32 status;
+	__u32 config_size;
+	__u32 max_xfer;
+} __packed;
+
+struct nd_cmd_get_config_data_hdr {
+	__u32 in_offset;
+	__u32 in_length;
+	__u32 status;
+	__u8 out_buf[0];
+} __packed;
+
+struct nd_cmd_set_config_hdr {
+	__u32 in_offset;
+	__u32 in_length;
+	__u8 in_buf[0];
+} __packed;
+
+struct nd_cmd_vendor_hdr {
+	__u32 opcode;
+	__u32 in_length;
+	__u8 in_buf[0];
+} __packed;
+
+struct nd_cmd_vendor_tail {
+	__u32 status;
+	__u32 out_length;
+	__u8 out_buf[0];
+} __packed;
+
+struct nd_cmd_ars_cap {
+	__u64 address;
+	__u64 length;
+	__u32 status;
+	__u32 max_ars_out;
+} __packed;
+
+struct nd_cmd_ars_start {
+	__u64 address;
+	__u64 length;
+	__u16 type;
+	__u8 reserved[6];
+	__u32 status;
+} __packed;
+
+struct nd_cmd_ars_query {
+	__u32 status;
+	__u32 out_length;
+	__u64 address;
+	__u64 length;
+	__u16 type;
+	__u32 num_records;
+	struct nd_ars_record {
+		__u32 handle;
+		__u32 flags;
+		__u64 err_address;
+		__u64 mask;
+	} __packed records[0];
+} __packed;
+
+enum {
+	ND_CMD_IMPLEMENTED = 0,
+
+	/* bus commands */
+	ND_CMD_ARS_CAP = 1,
+	ND_CMD_ARS_START = 2,
+	ND_CMD_ARS_QUERY = 3,
+
+	/* per-dimm commands */
+	ND_CMD_SMART = 1,
+	ND_CMD_SMART_THRESHOLD = 2,
+	ND_CMD_DIMM_FLAGS = 3,
+	ND_CMD_GET_CONFIG_SIZE = 4,
+	ND_CMD_GET_CONFIG_DATA = 5,
+	ND_CMD_SET_CONFIG_DATA = 6,
+	ND_CMD_VENDOR_EFFECT_LOG_SIZE = 7,
+	ND_CMD_VENDOR_EFFECT_LOG = 8,
+	ND_CMD_VENDOR = 9,
+};
+
+static inline const char *nd_bus_cmd_name(unsigned cmd)
+{
+	static const char * const names[] = {
+		[ND_CMD_ARS_CAP] = "ars_cap",
+		[ND_CMD_ARS_START] = "ars_start",
+		[ND_CMD_ARS_QUERY] = "ars_query",
+	};
+
+	if (cmd < ARRAY_SIZE(names) && names[cmd])
+		return names[cmd];
+	return "unknown";
+}
+
+static inline const char *nd_dimm_cmd_name(unsigned cmd)
+{
+	static const char * const names[] = {
+		[ND_CMD_SMART] = "smart",
+		[ND_CMD_SMART_THRESHOLD] = "smart_thresh",
+		[ND_CMD_DIMM_FLAGS] = "flags",
+		[ND_CMD_GET_CONFIG_SIZE] = "get_size",
+		[ND_CMD_GET_CONFIG_DATA] = "get_data",
+		[ND_CMD_SET_CONFIG_DATA] = "set_data",
+		[ND_CMD_VENDOR_EFFECT_LOG_SIZE] = "effect_size",
+		[ND_CMD_VENDOR_EFFECT_LOG] = "effect_log",
+		[ND_CMD_VENDOR] = "vendor",
+	};
+
+	if (cmd < ARRAY_SIZE(names) && names[cmd])
+		return names[cmd];
+	return "unknown";
+}
+
+#define ND_IOCTL 'N'
+
+#define ND_IOCTL_SMART			_IOWR(ND_IOCTL, ND_CMD_SMART,\
+					struct nd_cmd_smart)
+
+#define ND_IOCTL_SMART_THRESHOLD	_IOWR(ND_IOCTL, ND_CMD_SMART_THRESHOLD,\
+					struct nd_cmd_smart_threshold)
+
+#define ND_IOCTL_DIMM_FLAGS		_IOWR(ND_IOCTL, ND_CMD_DIMM_FLAGS,\
+					struct nd_cmd_dimm_flags)
+
+#define ND_IOCTL_GET_CONFIG_SIZE	_IOWR(ND_IOCTL, ND_CMD_GET_CONFIG_SIZE,\
+					struct nd_cmd_get_config_size)
+
+#define ND_IOCTL_GET_CONFIG_DATA	_IOWR(ND_IOCTL, ND_CMD_GET_CONFIG_DATA,\
+					struct nd_cmd_get_config_data_hdr)
+
+#define ND_IOCTL_SET_CONFIG_DATA	_IOWR(ND_IOCTL, ND_CMD_SET_CONFIG_DATA,\
+					struct nd_cmd_set_config_hdr)
+
+#define ND_IOCTL_VENDOR			_IOWR(ND_IOCTL, ND_CMD_VENDOR,\
+					struct nd_cmd_vendor_hdr)
+
+#define ND_IOCTL_ARS_CAP		_IOWR(ND_IOCTL, ND_CMD_ARS_CAP,\
+					struct nd_cmd_ars_cap)
+
+#define ND_IOCTL_ARS_START		_IOWR(ND_IOCTL, ND_CMD_ARS_START,\
+					struct nd_cmd_ars_start)
+
+#define ND_IOCTL_ARS_QUERY		_IOWR(ND_IOCTL, ND_CMD_ARS_QUERY,\
+					struct nd_cmd_ars_query)
+
+#endif /* __NDCTL_H__ */
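
As a reading aid for the envelope handling in __nd_ioctl(): the sketch
below works out the total buffer length for ND_CMD_GET_CONFIG_DATA from
the descriptor table above (two fixed 4-byte inputs, a 4-byte status,
then a variable-size output field that takes its size from the second
input field, in_length).  It is a userspace approximation, not the
kernel code; every sketch_* name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plays the role UINT_MAX has in the kernel tables: "size not fixed". */
#define SKETCH_SIZE_UNKNOWN UINT32_MAX
#define SKETCH_MAX_ELEM 4

/* Hypothetical userspace mirror of struct nd_cmd_desc. */
struct sketch_cmd_desc {
	int in_num;
	int out_num;
	uint32_t in_sizes[SKETCH_MAX_ELEM];
	uint32_t out_sizes[SKETCH_MAX_ELEM];
};

/* Field sizes copied from the ND_CMD_GET_CONFIG_DATA table entry. */
static const struct sketch_cmd_desc sketch_get_config_data = {
	.in_num = 2,
	.in_sizes = { 4, 4, },
	.out_num = 2,
	.out_sizes = { 4, SKETCH_SIZE_UNKNOWN, },
};

/*
 * Sum the input fields, then the output fields.  The one variable-size
 * output field is resolved from in_length, matching the in_field[1]
 * lookup in nd_cmd_out_size().
 */
static size_t sketch_get_config_data_buf_len(uint32_t in_length)
{
	const struct sketch_cmd_desc *desc = &sketch_get_config_data;
	size_t len = 0;
	int i;

	assert(desc->in_num <= SKETCH_MAX_ELEM);
	assert(desc->out_num <= SKETCH_MAX_ELEM);

	for (i = 0; i < desc->in_num; i++)
		len += desc->in_sizes[i];

	for (i = 0; i < desc->out_num; i++) {
		uint32_t sz = desc->out_sizes[i];

		if (sz == SKETCH_SIZE_UNKNOWN)
			sz = in_length;
		len += sz;
	}

	return len;
}
```

A caller asking for 256 bytes of config data would therefore size its
buffer at 4 + 4 + 4 + 256 bytes.  The kernel side additionally rejects
any total above ND_IOCTL_MAX_BUFLEN, which this sketch does not model.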



* [PATCH v3 06/21] libnd, nd_dimm: dimm driver and base libnd device-driver infrastructure
  2015-05-20 20:56 ` Dan Williams
@ 2015-05-20 20:56   ` Dan Williams
  -1 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2015-05-20 20:56 UTC (permalink / raw)
  To: axboe
  Cc: linux-nvdimm, neilb, gregkh, linux-kernel, mingo, linux-acpi,
	jmoyer, hch

* Implement the device-model infrastructure for loading modules and
  attaching drivers to nd devices.  This is a simple association of an
  nd-device-type number with a driver that carries a bitmask of the
  device types it supports.  To facilitate userspace bind/unbind
  operations, 'modalias' and 'devtype', which also appear in the uevent,
  are added as generic sysfs attributes for all nd devices.  The
  device-type number exists to support sub-types within a given parent
  devtype, be it a vendor-specific sub-type or otherwise.

* The first consumer of this infrastructure is the driver
  for dimm devices.  It simply uses control messages to retrieve and
  store the configuration-data image (label set) from each dimm.
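
The type-number/bitmask association described above can be sketched in
plain userspace C as follows.  This is only an illustration of the
test_bit() check in nd_bus_match(); the sketch_* names and the numeric
type values are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device-type numbers; 0 means "no known type". */
enum sketch_nd_type {
	SKETCH_ND_DIMM = 1,
	SKETCH_ND_OTHER = 2,	/* stand-in for some later device type */
};

/* Hypothetical stand-in for struct nd_device_driver. */
struct sketch_nd_driver {
	const char *name;
	unsigned long type;	/* bit N set => driver handles device type N */
};

/* A driver matches a device when the bit for the device's type is set. */
static bool sketch_nd_match(int dev_type, const struct sketch_nd_driver *drv)
{
	assert(drv != NULL);

	if (dev_type < 0 || dev_type >= (int)(sizeof(drv->type) * CHAR_BIT))
		return false;

	return (drv->type >> dev_type) & 1UL;
}
```

Under these assumptions a driver declared with
type = 1UL << SKETCH_ND_DIMM binds to dimm devices only.  A device whose
type resolves to 0 matches nothing unless a driver deliberately sets
bit 0.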

Note: nd_device_register() arranges for asynchronous registration of
      nd bus devices by default.
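
The reference counting behind the asynchronous registration path can be
modelled with a toy counter.  The sketch assumes, as in the driver
core, that device_initialize() leaves the device holding one reference.
All sketch_* names are hypothetical, and add_ok merely fakes the result
of device_add().

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct device: only the reference count is modelled. */
struct sketch_dev {
	int refs;
	bool added;
};

static void sketch_get(struct sketch_dev *dev)
{
	dev->refs++;
}

static void sketch_put(struct sketch_dev *dev)
{
	assert(dev->refs > 0);
	dev->refs--;
}

/* Caller side: models device_initialize() plus get_device(). */
static void sketch_register(struct sketch_dev *dev)
{
	dev->refs = 1;		/* initial reference */
	dev->added = false;
	sketch_get(dev);	/* pin the device for the async callback */
}

/* Async side: models nd_async_device_register(). */
static void sketch_async_register(struct sketch_dev *dev, bool add_ok)
{
	if (add_ok)
		dev->added = true;
	else
		sketch_put(dev);	/* failure: drop the initial reference */
	sketch_put(dev);		/* drop the reference taken for the callback */
}
```

On success the device is left with its single initial reference; on
failure both references are dropped, which is why the callback calls
put_device() twice on the error path.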

Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/acpi/nfit.c           |   13 ++-
 drivers/block/nd/Makefile     |    1 
 drivers/block/nd/bus.c        |  168 +++++++++++++++++++++++++++++++++++++++++
 drivers/block/nd/core.c       |   43 ++++++++++
 drivers/block/nd/dimm.c       |   92 ++++++++++++++++++++++
 drivers/block/nd/dimm_devs.c  |  136 ++++++++++++++++++++++++++++++++-
 drivers/block/nd/nd-private.h |    8 +-
 drivers/block/nd/nd.h         |   34 ++++++++
 include/linux/libnd.h         |    2 
 include/linux/nd.h            |   39 ++++++++++
 include/uapi/linux/ndctl.h    |    6 +
 11 files changed, 527 insertions(+), 15 deletions(-)
 create mode 100644 drivers/block/nd/dimm.c
 create mode 100644 drivers/block/nd/nd.h
 create mode 100644 include/linux/nd.h

diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index b7c1c5a5b589..c75f4bf1c230 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -18,6 +18,10 @@
 #include <linux/acpi.h>
 #include "nfit.h"
 
+static bool force_enable_dimms;
+module_param(force_enable_dimms, bool, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(force_enable_dimms, "Ignore _STA (ACPI DIMM device) status");
+
 static u8 nfit_uuid[NFIT_UUID_MAX][16];
 
 static const u8 *to_nfit_uuid(enum nfit_uuids id)
@@ -592,6 +596,7 @@ static struct attribute_group acpi_nfit_dimm_attribute_group = {
 
 static const struct attribute_group *acpi_nfit_dimm_attribute_groups[] = {
 	&nd_dimm_attribute_group,
+	&nd_device_attribute_group,
 	&acpi_nfit_dimm_attribute_group,
 	NULL,
 };
@@ -628,7 +633,7 @@ static int acpi_nfit_add_dimm(struct acpi_nfit_desc *acpi_desc,
 	if (!adev_dimm) {
 		dev_err(dev, "no ACPI.NFIT device with _ADR %#x, disabling...\n",
 				device_handle);
-		return -ENODEV;
+		return force_enable_dimms ? 0 : -ENODEV;
 	}
 
 	status = acpi_evaluate_integer(adev_dimm->handle, "_STA", NULL, &sta);
@@ -649,12 +654,13 @@ static int acpi_nfit_add_dimm(struct acpi_nfit_desc *acpi_desc,
 		if (acpi_check_dsm(adev_dimm->handle, uuid, 1, 1ULL << i))
 			set_bit(i, &nfit_mem->dsm_mask);
 
-	return rc;
+	return force_enable_dimms ? 0 : rc;
 }
 
 static int acpi_nfit_register_dimms(struct acpi_nfit_desc *acpi_desc)
 {
 	struct nfit_mem *nfit_mem;
+	int dimm_count = 0;
 
 	list_for_each_entry(nfit_mem, &acpi_desc->dimms, list) {
 		struct nd_dimm *nd_dimm;
@@ -688,9 +694,10 @@ static int acpi_nfit_register_dimms(struct acpi_nfit_desc *acpi_desc)
 			return -ENOMEM;
 
 		nfit_mem->nd_dimm = nd_dimm;
+		dimm_count++;
 	}
 
-	return 0;
+	return nd_bus_validate_dimm_count(acpi_desc->nd_bus, dimm_count);
 }
 
 static void acpi_nfit_init_dsms(struct acpi_nfit_desc *acpi_desc)
diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index 2954b9543bec..d9ef4496e8d3 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -3,3 +3,4 @@ obj-$(CONFIG_LIBND) += libnd.o
 libnd-y := core.o
 libnd-y += bus.o
 libnd-y += dimm_devs.o
+libnd-y += dimm.o
diff --git a/drivers/block/nd/bus.c b/drivers/block/nd/bus.c
index f072a9e0c1bd..3f5cdbc24973 100644
--- a/drivers/block/nd/bus.c
+++ b/drivers/block/nd/bus.c
@@ -16,19 +16,183 @@
 #include <linux/fcntl.h>
 #include <linux/async.h>
 #include <linux/ndctl.h>
+#include <linux/sched.h>
 #include <linux/slab.h>
 #include <linux/fs.h>
 #include <linux/io.h>
 #include <linux/mm.h>
+#include <linux/nd.h>
 #include "nd-private.h"
+#include "nd.h"
 
 int nd_dimm_major;
 static int nd_bus_major;
 static struct class *nd_class;
 
-struct bus_type nd_bus_type = {
+static int to_nd_device_type(struct device *dev)
+{
+	if (is_nd_dimm(dev))
+		return ND_DEVICE_DIMM;
+
+	return 0;
+}
+
+static int nd_bus_uevent(struct device *dev, struct kobj_uevent_env *env)
+{
+	return add_uevent_var(env, "MODALIAS=" ND_DEVICE_MODALIAS_FMT,
+			to_nd_device_type(dev));
+}
+
+static int nd_bus_match(struct device *dev, struct device_driver *drv)
+{
+	struct nd_device_driver *nd_drv = to_nd_device_driver(drv);
+
+	return test_bit(to_nd_device_type(dev), &nd_drv->type);
+}
+
+static int nd_bus_probe(struct device *dev)
+{
+	struct nd_device_driver *nd_drv = to_nd_device_driver(dev->driver);
+	struct nd_bus *nd_bus = walk_to_nd_bus(dev);
+	int rc;
+
+	rc = nd_drv->probe(dev);
+	dev_dbg(&nd_bus->dev, "%s.probe(%s) = %d\n", dev->driver->name,
+			dev_name(dev), rc);
+	return rc;
+}
+
+static int nd_bus_remove(struct device *dev)
+{
+	struct nd_device_driver *nd_drv = to_nd_device_driver(dev->driver);
+	struct nd_bus *nd_bus = walk_to_nd_bus(dev);
+	int rc;
+
+	rc = nd_drv->remove(dev);
+	dev_dbg(&nd_bus->dev, "%s.remove(%s) = %d\n", dev->driver->name,
+			dev_name(dev), rc);
+	return rc;
+}
+
+static struct bus_type nd_bus_type = {
 	.name = "nd",
+	.uevent = nd_bus_uevent,
+	.match = nd_bus_match,
+	.probe = nd_bus_probe,
+	.remove = nd_bus_remove,
+};
+
+static ASYNC_DOMAIN_EXCLUSIVE(nd_async_domain);
+
+void nd_synchronize(void)
+{
+	async_synchronize_full_domain(&nd_async_domain);
+}
+EXPORT_SYMBOL_GPL(nd_synchronize);
+
+static void nd_async_device_register(void *d, async_cookie_t cookie)
+{
+	struct device *dev = d;
+
+	if (device_add(dev) != 0) {
+		dev_err(dev, "%s: failed\n", __func__);
+		put_device(dev);
+	}
+	put_device(dev);
+}
+
+static void nd_async_device_unregister(void *d, async_cookie_t cookie)
+{
+	struct device *dev = d;
+
+	device_unregister(dev);
+	put_device(dev);
+}
+
+void nd_device_register(struct device *dev)
+{
+	dev->bus = &nd_bus_type;
+	device_initialize(dev);
+	get_device(dev);
+	async_schedule_domain(nd_async_device_register, dev,
+			&nd_async_domain);
+}
+EXPORT_SYMBOL(nd_device_register);
+
+void nd_device_unregister(struct device *dev, enum nd_async_mode mode)
+{
+	switch (mode) {
+	case ND_ASYNC:
+		get_device(dev);
+		async_schedule_domain(nd_async_device_unregister, dev,
+				&nd_async_domain);
+		break;
+	case ND_SYNC:
+		nd_synchronize();
+		device_unregister(dev);
+		break;
+	}
+}
+EXPORT_SYMBOL(nd_device_unregister);
+
+/**
+ * __nd_driver_register() - register a region or a namespace driver
+ * @nd_drv: driver to register
+ * @owner: automatically set by nd_driver_register() macro
+ * @mod_name: automatically set by nd_driver_register() macro
+ */
+int __nd_driver_register(struct nd_device_driver *nd_drv, struct module *owner,
+		const char *mod_name)
+{
+	struct device_driver *drv = &nd_drv->drv;
+
+	if (!nd_drv->type) {
+		pr_debug("driver type bitmask not set (%pf)\n",
+				__builtin_return_address(0));
+		return -EINVAL;
+	}
+
+	if (!nd_drv->probe || !nd_drv->remove) {
+		pr_debug("->probe() and ->remove() must be specified\n");
+		return -EINVAL;
+	}
+
+	drv->bus = &nd_bus_type;
+	drv->owner = owner;
+	drv->mod_name = mod_name;
+
+	return driver_register(drv);
+}
+EXPORT_SYMBOL(__nd_driver_register);
+
+static ssize_t modalias_show(struct device *dev, struct device_attribute *attr,
+		char *buf)
+{
+	return sprintf(buf, ND_DEVICE_MODALIAS_FMT "\n",
+			to_nd_device_type(dev));
+}
+static DEVICE_ATTR_RO(modalias);
+
+static ssize_t devtype_show(struct device *dev, struct device_attribute *attr,
+		char *buf)
+{
+	return sprintf(buf, "%s\n", dev->type->name);
+}
+static DEVICE_ATTR_RO(devtype);
+
+static struct attribute *nd_device_attributes[] = {
+	&dev_attr_modalias.attr,
+	&dev_attr_devtype.attr,
+	NULL,
+};
+
+/**
+ * nd_device_attribute_group - generic attributes for all devices on an nd bus
+ */
+struct attribute_group nd_device_attribute_group = {
+	.attrs = nd_device_attributes,
 };
+EXPORT_SYMBOL_GPL(nd_device_attribute_group);
 
 int nd_bus_create_ndctl(struct nd_bus *nd_bus)
 {
@@ -402,7 +566,7 @@ int __init nd_bus_init(void)
 	return rc;
 }
 
-void __exit nd_bus_exit(void)
+void nd_bus_exit(void)
 {
 	class_destroy(nd_class);
 	unregister_chrdev(nd_bus_major, "ndctl");
diff --git a/drivers/block/nd/core.c b/drivers/block/nd/core.c
index d7a922913da2..a3dd3a22ce92 100644
--- a/drivers/block/nd/core.c
+++ b/drivers/block/nd/core.c
@@ -18,6 +18,7 @@
 #include <linux/mutex.h>
 #include <linux/slab.h>
 #include "nd-private.h"
+#include "nd.h"
 
 LIST_HEAD(nd_bus_list);
 DEFINE_MUTEX(nd_bus_list_mutex);
@@ -96,8 +97,33 @@ static ssize_t provider_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(provider);
 
+static int flush_namespaces(struct device *dev, void *data)
+{
+	device_lock(dev);
+	device_unlock(dev);
+	return 0;
+}
+
+static int flush_regions_dimms(struct device *dev, void *data)
+{
+	device_lock(dev);
+	device_unlock(dev);
+	device_for_each_child(dev, NULL, flush_namespaces);
+	return 0;
+}
+
+static ssize_t wait_probe_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	nd_synchronize();
+	device_for_each_child(dev, NULL, flush_regions_dimms);
+	return sprintf(buf, "1\n");
+}
+static DEVICE_ATTR_RO(wait_probe);
+
 static struct attribute *nd_bus_attributes[] = {
 	&dev_attr_commands.attr,
+	&dev_attr_wait_probe.attr,
 	&dev_attr_provider.attr,
 	NULL,
 };
@@ -158,7 +184,7 @@ static int child_unregister(struct device *dev, void *data)
 	if (dev->class)
 		/* pass */;
 	else
-		device_unregister(dev);
+		nd_device_unregister(dev, ND_SYNC);
 	return 0;
 }
 
@@ -171,6 +197,7 @@ void nd_bus_unregister(struct nd_bus *nd_bus)
 	list_del_init(&nd_bus->list);
 	mutex_unlock(&nd_bus_list_mutex);
 
+	nd_synchronize();
 	device_for_each_child(&nd_bus->dev, NULL, child_unregister);
 	nd_bus_destroy_ndctl(nd_bus);
 
@@ -180,12 +207,24 @@ EXPORT_SYMBOL_GPL(nd_bus_unregister);
 
 static __init int libnd_init(void)
 {
-	return nd_bus_init();
+	int rc;
+
+	rc = nd_bus_init();
+	if (rc)
+		return rc;
+	rc = nd_dimm_init();
+	if (rc)
+		goto err_dimm;
+	return 0;
+ err_dimm:
+	nd_bus_exit();
+	return rc;
 }
 
 static __exit void libnd_exit(void)
 {
 	WARN_ON(!list_empty(&nd_bus_list));
+	nd_dimm_exit();
 	nd_bus_exit();
 }
 
diff --git a/drivers/block/nd/dimm.c b/drivers/block/nd/dimm.c
new file mode 100644
index 000000000000..1665b7d69e3a
--- /dev/null
+++ b/drivers/block/nd/dimm.c
@@ -0,0 +1,92 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/vmalloc.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/sizes.h>
+#include <linux/ndctl.h>
+#include <linux/slab.h>
+#include <linux/mm.h>
+#include <linux/nd.h>
+#include "nd.h"
+
+static void free_data(struct nd_dimm_drvdata *ndd)
+{
+	if (!ndd)
+		return;
+
+	if (ndd->data && is_vmalloc_addr(ndd->data))
+		vfree(ndd->data);
+	else
+		kfree(ndd->data);
+	kfree(ndd);
+}
+
+static int nd_dimm_probe(struct device *dev)
+{
+	struct nd_dimm_drvdata *ndd;
+	int rc;
+
+	ndd = kzalloc(sizeof(*ndd), GFP_KERNEL);
+	if (!ndd)
+		return -ENOMEM;
+
+	dev_set_drvdata(dev, ndd);
+	ndd->dev = dev;
+
+	rc = nd_dimm_init_nsarea(ndd);
+	if (rc)
+		goto err;
+
+	rc = nd_dimm_init_config_data(ndd);
+	if (rc)
+		goto err;
+
+	dev_dbg(dev, "config data size: %d\n", ndd->nsarea.config_size);
+
+	return 0;
+
+ err:
+	free_data(ndd);
+	return rc;
+}
+
+static int nd_dimm_remove(struct device *dev)
+{
+	struct nd_dimm_drvdata *ndd = dev_get_drvdata(dev);
+
+	free_data(ndd);
+
+	return 0;
+}
+
+static struct nd_device_driver nd_dimm_driver = {
+	.probe = nd_dimm_probe,
+	.remove = nd_dimm_remove,
+	.drv = {
+		.name = "nd_dimm",
+	},
+	.type = ND_DRIVER_DIMM,
+};
+
+int __init nd_dimm_init(void)
+{
+	return nd_driver_register(&nd_dimm_driver);
+}
+
+void __exit nd_dimm_exit(void)
+{
+	driver_unregister(&nd_dimm_driver.drv);
+}
+
+MODULE_ALIAS_ND_DEVICE(ND_DEVICE_DIMM);
diff --git a/drivers/block/nd/dimm_devs.c b/drivers/block/nd/dimm_devs.c
index 3fa26f61c3db..33b6d5336096 100644
--- a/drivers/block/nd/dimm_devs.c
+++ b/drivers/block/nd/dimm_devs.c
@@ -11,6 +11,7 @@
  * General Public License for more details.
  */
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/vmalloc.h>
 #include <linux/device.h>
 #include <linux/ndctl.h>
 #include <linux/slab.h>
@@ -18,9 +19,115 @@
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include "nd-private.h"
+#include "nd.h"
 
 static DEFINE_IDA(dimm_ida);
 
+/*
+ * Retrieve bus and dimm handle and return if this bus supports
+ * get_config_data commands
+ */
+static int __validate_dimm(struct nd_dimm_drvdata *ndd)
+{
+	struct nd_dimm *nd_dimm;
+
+	if (!ndd)
+		return -EINVAL;
+
+	nd_dimm = to_nd_dimm(ndd->dev);
+
+	if (!nd_dimm->dsm_mask)
+		return -ENXIO;
+	if (!test_bit(ND_CMD_GET_CONFIG_DATA, nd_dimm->dsm_mask))
+		return -ENXIO;
+
+	return 0;
+}
+
+static int validate_dimm(struct nd_dimm_drvdata *ndd)
+{
+	int rc = __validate_dimm(ndd);
+
+	if (rc && ndd)
+		dev_dbg(ndd->dev, "%pf: %s error: %d\n",
+				__builtin_return_address(0), __func__, rc);
+	return rc;
+}
+
+/**
+ * nd_dimm_init_nsarea - determine the geometry of a dimm's namespace area
+ * @ndd: dimm driver data to initialize
+ */
+int nd_dimm_init_nsarea(struct nd_dimm_drvdata *ndd)
+{
+	struct nd_cmd_get_config_size *cmd = &ndd->nsarea;
+	struct nd_bus *nd_bus = walk_to_nd_bus(ndd->dev);
+	struct nd_bus_descriptor *nd_desc;
+	int rc = validate_dimm(ndd);
+
+	if (rc)
+		return rc;
+
+	if (cmd->config_size)
+		return 0; /* already valid */
+
+	memset(cmd, 0, sizeof(*cmd));
+	nd_desc = nd_bus->nd_desc;
+	return nd_desc->ndctl(nd_desc, to_nd_dimm(ndd->dev),
+			ND_CMD_GET_CONFIG_SIZE, cmd, sizeof(*cmd));
+}
+
+int nd_dimm_init_config_data(struct nd_dimm_drvdata *ndd)
+{
+	struct nd_bus *nd_bus = walk_to_nd_bus(ndd->dev);
+	struct nd_cmd_get_config_data_hdr *cmd;
+	struct nd_bus_descriptor *nd_desc;
+	int rc = validate_dimm(ndd);
+	u32 max_cmd_size, config_size;
+	size_t offset;
+
+	if (rc)
+		return rc;
+
+	if (ndd->data)
+		return 0;
+
+	if (ndd->nsarea.status || ndd->nsarea.max_xfer == 0)
+		return -ENXIO;
+
+	ndd->data = kmalloc(ndd->nsarea.config_size, GFP_KERNEL);
+	if (!ndd->data)
+		ndd->data = vmalloc(ndd->nsarea.config_size);
+
+	if (!ndd->data)
+		return -ENOMEM;
+
+	max_cmd_size = min_t(u32, PAGE_SIZE, ndd->nsarea.max_xfer);
+	cmd = kzalloc(max_cmd_size + sizeof(*cmd), GFP_KERNEL);
+	if (!cmd)
+		return -ENOMEM;
+
+	nd_desc = nd_bus->nd_desc;
+	for (config_size = ndd->nsarea.config_size, offset = 0;
+			config_size; config_size -= cmd->in_length,
+			offset += cmd->in_length) {
+		cmd->in_length = min(config_size, max_cmd_size);
+		cmd->in_offset = offset;
+		rc = nd_desc->ndctl(nd_desc, to_nd_dimm(ndd->dev),
+				ND_CMD_GET_CONFIG_DATA, cmd,
+				cmd->in_length + sizeof(*cmd));
+		if (rc || cmd->status) {
+			rc = -ENXIO;
+			break;
+		}
+		memcpy(ndd->data + offset, cmd->out_buf, cmd->in_length);
+	}
+	dev_dbg(ndd->dev, "%s: len: %zd rc: %d\n", __func__, offset, rc);
+	kfree(cmd);
+
+	return rc;
+}
+
 static void nd_dimm_release(struct device *dev)
 {
 	struct nd_dimm *nd_dimm = to_nd_dimm(dev);
@@ -111,14 +218,33 @@ struct nd_dimm *nd_dimm_create(struct nd_bus *nd_bus, void *provider_data,
 	dev_set_name(dev, "nmem%d", nd_dimm->id);
 	dev->parent = &nd_bus->dev;
 	dev->type = &nd_dimm_device_type;
-	dev->bus = &nd_bus_type;
 	dev->devt = MKDEV(nd_dimm_major, nd_dimm->id);
 	dev->groups = groups;
-	if (device_register(dev) != 0) {
-		put_device(dev);
-		return NULL;
-	}
+	nd_device_register(dev);
 
 	return nd_dimm;
 }
 EXPORT_SYMBOL_GPL(nd_dimm_create);
+
+static int count_dimms(struct device *dev, void *c)
+{
+	int *count = c;
+
+	if (is_nd_dimm(dev))
+		(*count)++;
+	return 0;
+}
+
+int nd_bus_validate_dimm_count(struct nd_bus *nd_bus, int dimm_count)
+{
+	int count = 0;
+	/* Flush any possible dimm registration failures */
+	nd_synchronize();
+
+	device_for_each_child(&nd_bus->dev, &count, count_dimms);
+	dev_dbg(&nd_bus->dev, "%s: count: %d\n", __func__, count);
+	if (count != dimm_count)
+		return -ENXIO;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(nd_bus_validate_dimm_count);
diff --git a/drivers/block/nd/nd-private.h b/drivers/block/nd/nd-private.h
index c71a5f34355a..a333f3401ca7 100644
--- a/drivers/block/nd/nd-private.h
+++ b/drivers/block/nd/nd-private.h
@@ -17,7 +17,6 @@
 
 extern struct list_head nd_bus_list;
 extern struct mutex nd_bus_list_mutex;
-extern struct bus_type nd_bus_type;
 extern int nd_dimm_major;
 
 struct nd_bus {
@@ -35,10 +34,13 @@ struct nd_dimm {
 	int id;
 };
 
-bool is_nd_dimm(struct device *dev);
 struct nd_bus *walk_to_nd_bus(struct device *nd_dev);
 int __init nd_bus_init(void);
-void __exit nd_bus_exit(void);
+void nd_bus_exit(void);
+int __init nd_dimm_init(void);
+void __exit nd_dimm_exit(void);
 int nd_bus_create_ndctl(struct nd_bus *nd_bus);
 void nd_bus_destroy_ndctl(struct nd_bus *nd_bus);
+void nd_synchronize(void);
+bool is_nd_dimm(struct device *dev);
 #endif /* __ND_PRIVATE_H__ */
diff --git a/drivers/block/nd/nd.h b/drivers/block/nd/nd.h
new file mode 100644
index 000000000000..1a5a081ce640
--- /dev/null
+++ b/drivers/block/nd/nd.h
@@ -0,0 +1,34 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#ifndef __ND_H__
+#define __ND_H__
+#include <linux/device.h>
+#include <linux/mutex.h>
+#include <linux/ndctl.h>
+
+struct nd_dimm_drvdata {
+	struct device *dev;
+	struct nd_cmd_get_config_size nsarea;
+	void *data;
+};
+
+enum nd_async_mode {
+	ND_SYNC,
+	ND_ASYNC,
+};
+
+void nd_device_register(struct device *dev);
+void nd_device_unregister(struct device *dev, enum nd_async_mode mode);
+int nd_dimm_init_nsarea(struct nd_dimm_drvdata *ndd);
+int nd_dimm_init_config_data(struct nd_dimm_drvdata *ndd);
+#endif /* __ND_H__ */
diff --git a/include/linux/libnd.h b/include/linux/libnd.h
index ca72c49ae376..0d7e82401e4b 100644
--- a/include/linux/libnd.h
+++ b/include/linux/libnd.h
@@ -29,6 +29,7 @@ enum {
 
 extern struct attribute_group nd_bus_attribute_group;
 extern struct attribute_group nd_dimm_attribute_group;
+extern struct attribute_group nd_device_attribute_group;
 
 struct nd_dimm;
 struct nd_bus_descriptor;
@@ -70,4 +71,5 @@ u32 nd_cmd_in_size(struct nd_dimm *nd_dimm, int cmd,
 u32 nd_cmd_out_size(struct nd_dimm *nd_dimm, int cmd,
 		const struct nd_cmd_desc *desc, int idx, const u32 *in_field,
 		const u32 *out_field);
+int nd_bus_validate_dimm_count(struct nd_bus *nd_bus, int dimm_count);
 #endif /* __LIBND_H__ */
diff --git a/include/linux/nd.h b/include/linux/nd.h
new file mode 100644
index 000000000000..e074f67e53a3
--- /dev/null
+++ b/include/linux/nd.h
@@ -0,0 +1,39 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#ifndef __LINUX_ND_H__
+#define __LINUX_ND_H__
+#include <linux/ndctl.h>
+#include <linux/device.h>
+
+struct nd_device_driver {
+	struct device_driver drv;
+	unsigned long type;
+	int (*probe)(struct device *dev);
+	int (*remove)(struct device *dev);
+};
+
+static inline struct nd_device_driver *to_nd_device_driver(
+		struct device_driver *drv)
+{
+	return container_of(drv, struct nd_device_driver, drv);
+}
+
+#define MODULE_ALIAS_ND_DEVICE(type) \
+	MODULE_ALIAS("nd:t" __stringify(type) "*")
+#define ND_DEVICE_MODALIAS_FMT "nd:t%d"
+
+int __must_check __nd_driver_register(struct nd_device_driver *nd_drv,
+		struct module *module, const char *mod_name);
+#define nd_driver_register(driver) \
+	__nd_driver_register(driver, THIS_MODULE, KBUILD_MODNAME)
+#endif /* __LINUX_ND_H__ */
diff --git a/include/uapi/linux/ndctl.h b/include/uapi/linux/ndctl.h
index 62c01bf76198..1ccd2c633193 100644
--- a/include/uapi/linux/ndctl.h
+++ b/include/uapi/linux/ndctl.h
@@ -175,4 +175,10 @@ static inline const char *nd_dimm_cmd_name(unsigned cmd)
 #define ND_IOCTL_ARS_QUERY		_IOWR(ND_IOCTL, ND_CMD_ARS_QUERY,\
 					struct nd_cmd_ars_query)
 
+
+#define ND_DEVICE_DIMM 1            /* nd_dimm: container for "config data" */
+
+enum nd_driver_flags {
+	ND_DRIVER_DIMM            = 1 << ND_DEVICE_DIMM,
+};
 #endif /* __NDCTL_H__ */



* [PATCH v3 06/21] libnd, nd_dimm: dimm driver and base libnd device-driver infrastructure
@ 2015-05-20 20:56   ` Dan Williams
From: Dan Williams @ 2015-05-20 20:56 UTC (permalink / raw)
  To: axboe
  Cc: linux-nvdimm, neilb, gregkh, linux-kernel, mingo, linux-acpi,
	jmoyer, hch

* Implement the device-model infrastructure for loading modules and
  attaching drivers to nd devices.  This is a simple association of an
  nd-device-type number with a driver that has a bitmask of supported
  device types.  To facilitate userspace bind/unbind operations, the
  'modalias' and 'devtype' attributes, which also appear in the uevent,
  are added as generic sysfs attributes for all nd devices.  The
  device-type number exists to support sub-types within a given parent
  devtype, be it a vendor-specific sub-type or otherwise.

* The first consumer of this infrastructure is the driver for dimm
  devices.  It simply uses control messages to retrieve the
  configuration-data image (label set) from each dimm and cache it in
  memory.
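
For reviewers who want the type-number/bitmask association spelled out,
here is a minimal userspace sketch of the idea behind nd_bus_match() and
the "nd:t%d" modalias.  It is not the kernel code: struct example_driver,
example_match(), example_modalias() and EXAMPLE_DEVICE_OTHER are
hypothetical names invented for the illustration, and only
ND_DEVICE_DIMM (1) comes from this patch.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Device-type number; ND_DEVICE_DIMM (1) is the only one in this patch. */
#define ND_DEVICE_DIMM 1
/* Hypothetical second type, used only to show a non-matching device. */
#define EXAMPLE_DEVICE_OTHER 2

/* Userspace stand-in for struct nd_device_driver: name plus type bitmask. */
struct example_driver {
	const char *name;
	unsigned long type;	/* bit N set => driver handles device type N */
};

/* Same idea as the test_bit() check in nd_bus_match(). */
static int example_match(int dev_type, const struct example_driver *drv)
{
	/* The type number must index a valid bit of the mask. */
	assert(dev_type >= 0 &&
	       (size_t)dev_type < sizeof(drv->type) * CHAR_BIT);
	return (int)((drv->type >> dev_type) & 1UL);
}

/* Formats an alias the way ND_DEVICE_MODALIAS_FMT ("nd:t%d") does. */
static int example_modalias(char *buf, size_t len, int dev_type)
{
	return snprintf(buf, len, "nd:t%d", dev_type);
}
```

A driver whose mask is 1 << ND_DEVICE_DIMM matches a device of type 1
and nothing else; a driver that wants several device types would OR
the corresponding bits together.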

Note: nd_device_register() arranges for asynchronous registration of
      nd bus devices by default.
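
The dimm driver's retrieval of the configuration-data image comes down
to a bounded-chunk read loop: ask for the total size and the maximum
transfer size, then copy the image one chunk at a time.  The following
is a rough userspace sketch of that loop under simplifying assumptions.
struct example_dimm, example_get_config_data() and
example_read_config() are hypothetical stand-ins for the bus ->ndctl()
commands, and the sketch leaves out the PAGE_SIZE clamp and the
per-command status field that nd_dimm_init_config_data() handles.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a dimm's label area and its transfer limit. */
struct example_dimm {
	const uint8_t *config;	/* backing image */
	uint32_t config_size;	/* total bytes in the image */
	uint32_t max_xfer;	/* largest chunk one command may move */
};

/* Hypothetical transfer primitive: copy 'len' bytes starting at 'offset'. */
static int example_get_config_data(const struct example_dimm *dimm,
		uint32_t offset, uint32_t len, uint8_t *out)
{
	if (len > dimm->max_xfer || offset > dimm->config_size ||
			len > dimm->config_size - offset)
		return -1;
	memcpy(out, dimm->config + offset, len);
	return 0;
}

/* Read the whole image in chunks; returns bytes read, or -1 on error. */
static long example_read_config(const struct example_dimm *dimm,
		uint8_t *image)
{
	uint32_t remaining, offset = 0;

	assert(dimm && image);
	if (dimm->max_xfer == 0)
		return -1;	/* the patch returns -ENXIO in this case */

	for (remaining = dimm->config_size; remaining; ) {
		uint32_t len = remaining < dimm->max_xfer ?
				remaining : dimm->max_xfer;

		if (example_get_config_data(dimm, offset, len, image + offset))
			return -1;
		offset += len;
		remaining -= len;
	}
	return (long)offset;
}
```

With a 10-byte image and max_xfer of 4 the loop issues three transfers
of 4, 4 and 2 bytes.  Rejecting max_xfer == 0 up front matters because
a zero-length chunk would otherwise never shrink 'remaining'.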

Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/acpi/nfit.c           |   13 ++-
 drivers/block/nd/Makefile     |    1 
 drivers/block/nd/bus.c        |  168 +++++++++++++++++++++++++++++++++++++++++
 drivers/block/nd/core.c       |   43 ++++++++++
 drivers/block/nd/dimm.c       |   92 ++++++++++++++++++++++
 drivers/block/nd/dimm_devs.c  |  136 ++++++++++++++++++++++++++++++++-
 drivers/block/nd/nd-private.h |    8 +-
 drivers/block/nd/nd.h         |   34 ++++++++
 include/linux/libnd.h         |    2 
 include/linux/nd.h            |   39 ++++++++++
 include/uapi/linux/ndctl.h    |    6 +
 11 files changed, 527 insertions(+), 15 deletions(-)
 create mode 100644 drivers/block/nd/dimm.c
 create mode 100644 drivers/block/nd/nd.h
 create mode 100644 include/linux/nd.h

diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index b7c1c5a5b589..c75f4bf1c230 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -18,6 +18,10 @@
 #include <linux/acpi.h>
 #include "nfit.h"
 
+static bool force_enable_dimms;
+module_param(force_enable_dimms, bool, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(force_enable_dimms, "Ignore _STA (ACPI DIMM device) status");
+
 static u8 nfit_uuid[NFIT_UUID_MAX][16];
 
 static const u8 *to_nfit_uuid(enum nfit_uuids id)
@@ -592,6 +596,7 @@ static struct attribute_group acpi_nfit_dimm_attribute_group = {
 
 static const struct attribute_group *acpi_nfit_dimm_attribute_groups[] = {
 	&nd_dimm_attribute_group,
+	&nd_device_attribute_group,
 	&acpi_nfit_dimm_attribute_group,
 	NULL,
 };
@@ -628,7 +633,7 @@ static int acpi_nfit_add_dimm(struct acpi_nfit_desc *acpi_desc,
 	if (!adev_dimm) {
 		dev_err(dev, "no ACPI.NFIT device with _ADR %#x, disabling...\n",
 				device_handle);
-		return -ENODEV;
+		return force_enable_dimms ? 0 : -ENODEV;
 	}
 
 	status = acpi_evaluate_integer(adev_dimm->handle, "_STA", NULL, &sta);
@@ -649,12 +654,13 @@ static int acpi_nfit_add_dimm(struct acpi_nfit_desc *acpi_desc,
 		if (acpi_check_dsm(adev_dimm->handle, uuid, 1, 1ULL << i))
 			set_bit(i, &nfit_mem->dsm_mask);
 
-	return rc;
+	return force_enable_dimms ? 0 : rc;
 }
 
 static int acpi_nfit_register_dimms(struct acpi_nfit_desc *acpi_desc)
 {
 	struct nfit_mem *nfit_mem;
+	int dimm_count = 0;
 
 	list_for_each_entry(nfit_mem, &acpi_desc->dimms, list) {
 		struct nd_dimm *nd_dimm;
@@ -688,9 +694,10 @@ static int acpi_nfit_register_dimms(struct acpi_nfit_desc *acpi_desc)
 			return -ENOMEM;
 
 		nfit_mem->nd_dimm = nd_dimm;
+		dimm_count++;
 	}
 
-	return 0;
+	return nd_bus_validate_dimm_count(acpi_desc->nd_bus, dimm_count);
 }
 
 static void acpi_nfit_init_dsms(struct acpi_nfit_desc *acpi_desc)
diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index 2954b9543bec..d9ef4496e8d3 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -3,3 +3,4 @@ obj-$(CONFIG_LIBND) += libnd.o
 libnd-y := core.o
 libnd-y += bus.o
 libnd-y += dimm_devs.o
+libnd-y += dimm.o
diff --git a/drivers/block/nd/bus.c b/drivers/block/nd/bus.c
index f072a9e0c1bd..3f5cdbc24973 100644
--- a/drivers/block/nd/bus.c
+++ b/drivers/block/nd/bus.c
@@ -16,19 +16,183 @@
 #include <linux/fcntl.h>
 #include <linux/async.h>
 #include <linux/ndctl.h>
+#include <linux/sched.h>
 #include <linux/slab.h>
 #include <linux/fs.h>
 #include <linux/io.h>
 #include <linux/mm.h>
+#include <linux/nd.h>
 #include "nd-private.h"
+#include "nd.h"
 
 int nd_dimm_major;
 static int nd_bus_major;
 static struct class *nd_class;
 
-struct bus_type nd_bus_type = {
+static int to_nd_device_type(struct device *dev)
+{
+	if (is_nd_dimm(dev))
+		return ND_DEVICE_DIMM;
+
+	return 0;
+}
+
+static int nd_bus_uevent(struct device *dev, struct kobj_uevent_env *env)
+{
+	return add_uevent_var(env, "MODALIAS=" ND_DEVICE_MODALIAS_FMT,
+			to_nd_device_type(dev));
+}
+
+static int nd_bus_match(struct device *dev, struct device_driver *drv)
+{
+	struct nd_device_driver *nd_drv = to_nd_device_driver(drv);
+
+	return test_bit(to_nd_device_type(dev), &nd_drv->type);
+}
+
+static int nd_bus_probe(struct device *dev)
+{
+	struct nd_device_driver *nd_drv = to_nd_device_driver(dev->driver);
+	struct nd_bus *nd_bus = walk_to_nd_bus(dev);
+	int rc;
+
+	rc = nd_drv->probe(dev);
+	dev_dbg(&nd_bus->dev, "%s.probe(%s) = %d\n", dev->driver->name,
+			dev_name(dev), rc);
+	return rc;
+}
+
+static int nd_bus_remove(struct device *dev)
+{
+	struct nd_device_driver *nd_drv = to_nd_device_driver(dev->driver);
+	struct nd_bus *nd_bus = walk_to_nd_bus(dev);
+	int rc;
+
+	rc = nd_drv->remove(dev);
+	dev_dbg(&nd_bus->dev, "%s.remove(%s) = %d\n", dev->driver->name,
+			dev_name(dev), rc);
+	return rc;
+}
+
+static struct bus_type nd_bus_type = {
 	.name = "nd",
+	.uevent = nd_bus_uevent,
+	.match = nd_bus_match,
+	.probe = nd_bus_probe,
+	.remove = nd_bus_remove,
+};
+
+static ASYNC_DOMAIN_EXCLUSIVE(nd_async_domain);
+
+void nd_synchronize(void)
+{
+	async_synchronize_full_domain(&nd_async_domain);
+}
+EXPORT_SYMBOL_GPL(nd_synchronize);
+
+static void nd_async_device_register(void *d, async_cookie_t cookie)
+{
+	struct device *dev = d;
+
+	if (device_add(dev) != 0) {
+		dev_err(dev, "%s: failed\n", __func__);
+		put_device(dev);
+	}
+	put_device(dev);
+}
+
+static void nd_async_device_unregister(void *d, async_cookie_t cookie)
+{
+	struct device *dev = d;
+
+	device_unregister(dev);
+	put_device(dev);
+}
+
+void nd_device_register(struct device *dev)
+{
+	dev->bus = &nd_bus_type;
+	device_initialize(dev);
+	get_device(dev);
+	async_schedule_domain(nd_async_device_register, dev,
+			&nd_async_domain);
+}
+EXPORT_SYMBOL(nd_device_register);
+
+void nd_device_unregister(struct device *dev, enum nd_async_mode mode)
+{
+	switch (mode) {
+	case ND_ASYNC:
+		get_device(dev);
+		async_schedule_domain(nd_async_device_unregister, dev,
+				&nd_async_domain);
+		break;
+	case ND_SYNC:
+		nd_synchronize();
+		device_unregister(dev);
+		break;
+	}
+}
+EXPORT_SYMBOL(nd_device_unregister);
+
+/**
+ * __nd_driver_register() - register a region or a namespace driver
+ * @nd_drv: driver to register
+ * @owner: automatically set by nd_driver_register() macro
+ * @mod_name: automatically set by nd_driver_register() macro
+ */
+int __nd_driver_register(struct nd_device_driver *nd_drv, struct module *owner,
+		const char *mod_name)
+{
+	struct device_driver *drv = &nd_drv->drv;
+
+	if (!nd_drv->type) {
+		pr_debug("driver type bitmask not set (%pf)\n",
+				__builtin_return_address(0));
+		return -EINVAL;
+	}
+
+	if (!nd_drv->probe || !nd_drv->remove) {
+		pr_debug("->probe() and ->remove() must be specified\n");
+		return -EINVAL;
+	}
+
+	drv->bus = &nd_bus_type;
+	drv->owner = owner;
+	drv->mod_name = mod_name;
+
+	return driver_register(drv);
+}
+EXPORT_SYMBOL(__nd_driver_register);
+
+static ssize_t modalias_show(struct device *dev, struct device_attribute *attr,
+		char *buf)
+{
+	return sprintf(buf, ND_DEVICE_MODALIAS_FMT "\n",
+			to_nd_device_type(dev));
+}
+static DEVICE_ATTR_RO(modalias);
+
+static ssize_t devtype_show(struct device *dev, struct device_attribute *attr,
+		char *buf)
+{
+	return sprintf(buf, "%s\n", dev->type->name);
+}
+DEVICE_ATTR_RO(devtype);
+
+static struct attribute *nd_device_attributes[] = {
+	&dev_attr_modalias.attr,
+	&dev_attr_devtype.attr,
+	NULL,
+};
+
+/**
+ * nd_device_attribute_group - generic attributes for all devices on an nd bus
+ */
+struct attribute_group nd_device_attribute_group = {
+	.attrs = nd_device_attributes,
 };
+EXPORT_SYMBOL_GPL(nd_device_attribute_group);
 
 int nd_bus_create_ndctl(struct nd_bus *nd_bus)
 {
@@ -402,7 +566,7 @@ int __init nd_bus_init(void)
 	return rc;
 }
 
-void __exit nd_bus_exit(void)
+void nd_bus_exit(void)
 {
 	class_destroy(nd_class);
 	unregister_chrdev(nd_bus_major, "ndctl");
diff --git a/drivers/block/nd/core.c b/drivers/block/nd/core.c
index d7a922913da2..a3dd3a22ce92 100644
--- a/drivers/block/nd/core.c
+++ b/drivers/block/nd/core.c
@@ -18,6 +18,7 @@
 #include <linux/mutex.h>
 #include <linux/slab.h>
 #include "nd-private.h"
+#include "nd.h"
 
 LIST_HEAD(nd_bus_list);
 DEFINE_MUTEX(nd_bus_list_mutex);
@@ -96,8 +97,33 @@ static ssize_t provider_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(provider);
 
+static int flush_namespaces(struct device *dev, void *data)
+{
+	device_lock(dev);
+	device_unlock(dev);
+	return 0;
+}
+
+static int flush_regions_dimms(struct device *dev, void *data)
+{
+	device_lock(dev);
+	device_unlock(dev);
+	device_for_each_child(dev, NULL, flush_namespaces);
+	return 0;
+}
+
+static ssize_t wait_probe_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	nd_synchronize();
+	device_for_each_child(dev, NULL, flush_regions_dimms);
+	return sprintf(buf, "1\n");
+}
+static DEVICE_ATTR_RO(wait_probe);
+
 static struct attribute *nd_bus_attributes[] = {
 	&dev_attr_commands.attr,
+	&dev_attr_wait_probe.attr,
 	&dev_attr_provider.attr,
 	NULL,
 };
@@ -158,7 +184,7 @@ static int child_unregister(struct device *dev, void *data)
 	if (dev->class)
 		/* pass */;
 	else
-		device_unregister(dev);
+		nd_device_unregister(dev, ND_SYNC);
 	return 0;
 }
 
@@ -171,6 +197,7 @@ void nd_bus_unregister(struct nd_bus *nd_bus)
 	list_del_init(&nd_bus->list);
 	mutex_unlock(&nd_bus_list_mutex);
 
+	nd_synchronize();
 	device_for_each_child(&nd_bus->dev, NULL, child_unregister);
 	nd_bus_destroy_ndctl(nd_bus);
 
@@ -180,12 +207,24 @@ EXPORT_SYMBOL_GPL(nd_bus_unregister);
 
 static __init int libnd_init(void)
 {
-	return nd_bus_init();
+	int rc;
+
+	rc = nd_bus_init();
+	if (rc)
+		return rc;
+	rc = nd_dimm_init();
+	if (rc)
+		goto err_dimm;
+	return 0;
+ err_dimm:
+	nd_bus_exit();
+	return rc;
 }
 
 static __exit void libnd_exit(void)
 {
 	WARN_ON(!list_empty(&nd_bus_list));
+	nd_dimm_exit();
 	nd_bus_exit();
 }
 
diff --git a/drivers/block/nd/dimm.c b/drivers/block/nd/dimm.c
new file mode 100644
index 000000000000..1665b7d69e3a
--- /dev/null
+++ b/drivers/block/nd/dimm.c
@@ -0,0 +1,92 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/vmalloc.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/sizes.h>
+#include <linux/ndctl.h>
+#include <linux/slab.h>
+#include <linux/mm.h>
+#include <linux/nd.h>
+#include "nd.h"
+
+static void free_data(struct nd_dimm_drvdata *ndd)
+{
+	if (!ndd)
+		return;
+
+	if (ndd->data && is_vmalloc_addr(ndd->data))
+		vfree(ndd->data);
+	else
+		kfree(ndd->data);
+	kfree(ndd);
+}
+
+static int nd_dimm_probe(struct device *dev)
+{
+	struct nd_dimm_drvdata *ndd;
+	int rc;
+
+	ndd = kzalloc(sizeof(*ndd), GFP_KERNEL);
+	if (!ndd)
+		return -ENOMEM;
+
+	dev_set_drvdata(dev, ndd);
+	ndd->dev = dev;
+
+	rc = nd_dimm_init_nsarea(ndd);
+	if (rc)
+		goto err;
+
+	rc = nd_dimm_init_config_data(ndd);
+	if (rc)
+		goto err;
+
+	dev_dbg(dev, "config data size: %d\n", ndd->nsarea.config_size);
+
+	return 0;
+
+ err:
+	free_data(ndd);
+	return rc;
+}
+
+static int nd_dimm_remove(struct device *dev)
+{
+	struct nd_dimm_drvdata *ndd = dev_get_drvdata(dev);
+
+	free_data(ndd);
+
+	return 0;
+}
+
+static struct nd_device_driver nd_dimm_driver = {
+	.probe = nd_dimm_probe,
+	.remove = nd_dimm_remove,
+	.drv = {
+		.name = "nd_dimm",
+	},
+	.type = ND_DRIVER_DIMM,
+};
+
+int __init nd_dimm_init(void)
+{
+	return nd_driver_register(&nd_dimm_driver);
+}
+
+void __exit nd_dimm_exit(void)
+{
+	driver_unregister(&nd_dimm_driver.drv);
+}
+
+MODULE_ALIAS_ND_DEVICE(ND_DEVICE_DIMM);
diff --git a/drivers/block/nd/dimm_devs.c b/drivers/block/nd/dimm_devs.c
index 3fa26f61c3db..33b6d5336096 100644
--- a/drivers/block/nd/dimm_devs.c
+++ b/drivers/block/nd/dimm_devs.c
@@ -11,6 +11,7 @@
  * General Public License for more details.
  */
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/vmalloc.h>
 #include <linux/device.h>
 #include <linux/ndctl.h>
 #include <linux/slab.h>
@@ -18,9 +19,115 @@
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include "nd-private.h"
+#include "nd.h"
 
 static DEFINE_IDA(dimm_ida);
 
+/*
+ * Retrieve bus and dimm handle and return if this bus supports
+ * get_config_data commands
+ */
+static int __validate_dimm(struct nd_dimm_drvdata *ndd)
+{
+	struct nd_dimm *nd_dimm;
+
+	if (!ndd)
+		return -EINVAL;
+
+	nd_dimm = to_nd_dimm(ndd->dev);
+
+	if (!nd_dimm->dsm_mask)
+		return -ENXIO;
+	if (!test_bit(ND_CMD_GET_CONFIG_DATA, nd_dimm->dsm_mask))
+		return -ENXIO;
+
+	return 0;
+}
+
+static int validate_dimm(struct nd_dimm_drvdata *ndd)
+{
+	int rc = __validate_dimm(ndd);
+
+	if (rc && ndd)
+		dev_dbg(ndd->dev, "%pf: %s error: %d\n",
+				__builtin_return_address(0), __func__, rc);
+	return rc;
+}
+
+/**
+ * nd_dimm_init_nsarea - determine the geometry of a dimm's namespace area
+ * @ndd: dimm driver data to initialize
+ */
+int nd_dimm_init_nsarea(struct nd_dimm_drvdata *ndd)
+{
+	struct nd_cmd_get_config_size *cmd = &ndd->nsarea;
+	struct nd_bus *nd_bus = walk_to_nd_bus(ndd->dev);
+	struct nd_bus_descriptor *nd_desc;
+	int rc = validate_dimm(ndd);
+
+	if (rc)
+		return rc;
+
+	if (cmd->config_size)
+		return 0; /* already valid */
+
+	memset(cmd, 0, sizeof(*cmd));
+	nd_desc = nd_bus->nd_desc;
+	return nd_desc->ndctl(nd_desc, to_nd_dimm(ndd->dev),
+			ND_CMD_GET_CONFIG_SIZE, cmd, sizeof(*cmd));
+}
+
+int nd_dimm_init_config_data(struct nd_dimm_drvdata *ndd)
+{
+	struct nd_bus *nd_bus = walk_to_nd_bus(ndd->dev);
+	struct nd_cmd_get_config_data_hdr *cmd;
+	struct nd_bus_descriptor *nd_desc;
+	int rc = validate_dimm(ndd);
+	u32 max_cmd_size, config_size;
+	size_t offset;
+
+	if (rc)
+		return rc;
+
+	if (ndd->data)
+		return 0;
+
+	if (ndd->nsarea.status || ndd->nsarea.max_xfer == 0)
+		return -ENXIO;
+
+	ndd->data = kmalloc(ndd->nsarea.config_size, GFP_KERNEL);
+	if (!ndd->data)
+		ndd->data = vmalloc(ndd->nsarea.config_size);
+
+	if (!ndd->data)
+		return -ENOMEM;
+
+	max_cmd_size = min_t(u32, PAGE_SIZE, ndd->nsarea.max_xfer);
+	cmd = kzalloc(max_cmd_size + sizeof(*cmd), GFP_KERNEL);
+	if (!cmd)
+		return -ENOMEM;
+
+	nd_desc = nd_bus->nd_desc;
+	for (config_size = ndd->nsarea.config_size, offset = 0;
+			config_size; config_size -= cmd->in_length,
+			offset += cmd->in_length) {
+		cmd->in_length = min(config_size, max_cmd_size);
+		cmd->in_offset = offset;
+		rc = nd_desc->ndctl(nd_desc, to_nd_dimm(ndd->dev),
+				ND_CMD_GET_CONFIG_DATA, cmd,
+				cmd->in_length + sizeof(*cmd));
+		if (rc || cmd->status) {
+			rc = -ENXIO;
+			break;
+		}
+		memcpy(ndd->data + offset, cmd->out_buf, cmd->in_length);
+	}
+	dev_dbg(ndd->dev, "%s: len: %zd rc: %d\n", __func__, offset, rc);
+	kfree(cmd);
+
+	return rc;
+}
+
 static void nd_dimm_release(struct device *dev)
 {
 	struct nd_dimm *nd_dimm = to_nd_dimm(dev);
@@ -111,14 +218,33 @@ struct nd_dimm *nd_dimm_create(struct nd_bus *nd_bus, void *provider_data,
 	dev_set_name(dev, "nmem%d", nd_dimm->id);
 	dev->parent = &nd_bus->dev;
 	dev->type = &nd_dimm_device_type;
-	dev->bus = &nd_bus_type;
 	dev->devt = MKDEV(nd_dimm_major, nd_dimm->id);
 	dev->groups = groups;
-	if (device_register(dev) != 0) {
-		put_device(dev);
-		return NULL;
-	}
+	nd_device_register(dev);
 
 	return nd_dimm;
 }
 EXPORT_SYMBOL_GPL(nd_dimm_create);
+
+static int count_dimms(struct device *dev, void *c)
+{
+	int *count = c;
+
+	if (is_nd_dimm(dev))
+		(*count)++;
+	return 0;
+}
+
+int nd_bus_validate_dimm_count(struct nd_bus *nd_bus, int dimm_count)
+{
+	int count = 0;
+	/* Flush any possible dimm registration failures */
+	nd_synchronize();
+
+	device_for_each_child(&nd_bus->dev, &count, count_dimms);
+	dev_dbg(&nd_bus->dev, "%s: count: %d\n", __func__, count);
+	if (count != dimm_count)
+		return -ENXIO;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(nd_bus_validate_dimm_count);
diff --git a/drivers/block/nd/nd-private.h b/drivers/block/nd/nd-private.h
index c71a5f34355a..a333f3401ca7 100644
--- a/drivers/block/nd/nd-private.h
+++ b/drivers/block/nd/nd-private.h
@@ -17,7 +17,6 @@
 
 extern struct list_head nd_bus_list;
 extern struct mutex nd_bus_list_mutex;
-extern struct bus_type nd_bus_type;
 extern int nd_dimm_major;
 
 struct nd_bus {
@@ -35,10 +34,13 @@ struct nd_dimm {
 	int id;
 };
 
-bool is_nd_dimm(struct device *dev);
 struct nd_bus *walk_to_nd_bus(struct device *nd_dev);
 int __init nd_bus_init(void);
-void __exit nd_bus_exit(void);
+void nd_bus_exit(void);
+int __init nd_dimm_init(void);
+void __exit nd_dimm_exit(void);
 int nd_bus_create_ndctl(struct nd_bus *nd_bus);
 void nd_bus_destroy_ndctl(struct nd_bus *nd_bus);
+void nd_synchronize(void);
+bool is_nd_dimm(struct device *dev);
 #endif /* __ND_PRIVATE_H__ */
diff --git a/drivers/block/nd/nd.h b/drivers/block/nd/nd.h
new file mode 100644
index 000000000000..1a5a081ce640
--- /dev/null
+++ b/drivers/block/nd/nd.h
@@ -0,0 +1,34 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#ifndef __ND_H__
+#define __ND_H__
+#include <linux/device.h>
+#include <linux/mutex.h>
+#include <linux/ndctl.h>
+
+struct nd_dimm_drvdata {
+	struct device *dev;
+	struct nd_cmd_get_config_size nsarea;
+	void *data;
+};
+
+enum nd_async_mode {
+	ND_SYNC,
+	ND_ASYNC,
+};
+
+void nd_device_register(struct device *dev);
+void nd_device_unregister(struct device *dev, enum nd_async_mode mode);
+int nd_dimm_init_nsarea(struct nd_dimm_drvdata *ndd);
+int nd_dimm_init_config_data(struct nd_dimm_drvdata *ndd);
+#endif /* __ND_H__ */
diff --git a/include/linux/libnd.h b/include/linux/libnd.h
index ca72c49ae376..0d7e82401e4b 100644
--- a/include/linux/libnd.h
+++ b/include/linux/libnd.h
@@ -29,6 +29,7 @@ enum {
 
 extern struct attribute_group nd_bus_attribute_group;
 extern struct attribute_group nd_dimm_attribute_group;
+extern struct attribute_group nd_device_attribute_group;
 
 struct nd_dimm;
 struct nd_bus_descriptor;
@@ -70,4 +71,5 @@ u32 nd_cmd_in_size(struct nd_dimm *nd_dimm, int cmd,
 u32 nd_cmd_out_size(struct nd_dimm *nd_dimm, int cmd,
 		const struct nd_cmd_desc *desc, int idx, const u32 *in_field,
 		const u32 *out_field);
+int nd_bus_validate_dimm_count(struct nd_bus *nd_bus, int dimm_count);
 #endif /* __LIBND_H__ */
diff --git a/include/linux/nd.h b/include/linux/nd.h
new file mode 100644
index 000000000000..e074f67e53a3
--- /dev/null
+++ b/include/linux/nd.h
@@ -0,0 +1,39 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#ifndef __LINUX_ND_H__
+#define __LINUX_ND_H__
+#include <linux/ndctl.h>
+#include <linux/device.h>
+
+struct nd_device_driver {
+	struct device_driver drv;
+	unsigned long type;
+	int (*probe)(struct device *dev);
+	int (*remove)(struct device *dev);
+};
+
+static inline struct nd_device_driver *to_nd_device_driver(
+		struct device_driver *drv)
+{
+	return container_of(drv, struct nd_device_driver, drv);
+}
+
+#define MODULE_ALIAS_ND_DEVICE(type) \
+	MODULE_ALIAS("nd:t" __stringify(type) "*")
+#define ND_DEVICE_MODALIAS_FMT "nd:t%d"
+
+int __must_check __nd_driver_register(struct nd_device_driver *nd_drv,
+		struct module *module, const char *mod_name);
+#define nd_driver_register(driver) \
+	__nd_driver_register(driver, THIS_MODULE, KBUILD_MODNAME)
+#endif /* __LINUX_ND_H__ */
diff --git a/include/uapi/linux/ndctl.h b/include/uapi/linux/ndctl.h
index 62c01bf76198..1ccd2c633193 100644
--- a/include/uapi/linux/ndctl.h
+++ b/include/uapi/linux/ndctl.h
@@ -175,4 +175,10 @@ static inline const char *nd_dimm_cmd_name(unsigned cmd)
 #define ND_IOCTL_ARS_QUERY		_IOWR(ND_IOCTL, ND_CMD_ARS_QUERY,\
 					struct nd_cmd_ars_query)
 
+
+#define ND_DEVICE_DIMM 1            /* nd_dimm: container for "config data" */
+
+enum nd_driver_flags {
+	ND_DRIVER_DIMM            = 1 << ND_DEVICE_DIMM,
+};
 #endif /* __NDCTL_H__ */


* [PATCH v3 07/21] libnd, nfit: regions (block-data-window, persistent memory, volatile memory)
  2015-05-20 20:56 ` Dan Williams
@ 2015-05-20 20:56   ` Dan Williams
  -1 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2015-05-20 20:56 UTC (permalink / raw)
  To: axboe
  Cc: mingo, linux-nvdimm, neilb, gregkh, Rafael J. Wysocki,
	Robert Moore, linux-kernel, linux-acpi, jmoyer, hch

A "region" device represents the maximum capacity of a BLK range (mmio
block-data-window(s)), or a PMEM range (DAX-capable persistent memory or
volatile memory), without regard for aliasing.  Aliasing, in the
dimm-local address space (DPA), is resolved by metadata on a dimm to
designate which exclusive interface will access the aliased DPA ranges.
Support for the per-dimm metadata/label arrives in a subsequent
patch.

The name format of "region" devices is "regionN" where, like dimms, N is
a global ida index assigned at discovery time.  This id is not reliable
across reboots nor in the presence of hotplug.  Look to attributes of
the region or static id-data of the sub-namespace to generate a
persistent name.  However, if the platform configuration does not change
it is reasonable to expect the same region id to be assigned at the next
boot.

"region"s have 2 generic attributes, "size" and "mappingN", where:
- size: the BLK accessible capacity or the span of the
  system physical address range in the case of PMEM.

- mappingN: a tuple describing a dimm's contribution to the region's
  capacity in the format (<nmemX>,<dpa>,<size>).  For a
  PMEM-region there will be at least one mapping per dimm in the interleave
  set.  For a BLK-region there is only "mapping0" listing the starting
  DPA of the BLK-region and the available DPA capacity of that space
  (matches "size" above).
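
For illustration only (not part of this patch): a minimal user-space
sketch of consuming one "mappingN" line.  It assumes the decimal
"<nmemX>,<dpa>,<size>" layout described above; struct mapping_info and
parse_mapping() are hypothetical names that do not exist in libnd or
ndctl.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical user-space view of one "mappingN" attribute. */
struct mapping_info {
	char dimm[32];			/* dimm device name, e.g. "nmem0" */
	unsigned long long dpa;		/* starting dimm-physical address */
	unsigned long long size;	/* bytes contributed to the region */
};

/*
 * Parse a "<nmemX>,<dpa>,<size>" line as read from a mappingN
 * attribute.  Returns 0 on success, -1 if the line is malformed.
 */
static int parse_mapping(const char *line, struct mapping_info *out)
{
	assert(line != NULL && out != NULL);

	/* %31[^,] reads the dimm name up to, but not including, the comma */
	if (sscanf(line, "%31[^,],%llu,%llu", out->dimm, &out->dpa,
			&out->size) != 3)
		return -1;
	return 0;
}
```

A caller would read the attribute file into a buffer and hand it to
parse_mapping(); for a BLK-region the parsed size is expected to match
the region's "size" attribute.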

The max number of mappings per "region" is hard coded per the constraints of
sysfs attribute groups.  That said, the number of mappings per region should
never exceed the maximum number of possible dimms in the system.  If the
current number turns out to not be enough then the "mappings" attribute
clarifies how many there are supposed to be. "32 should be enough for
anybody...".
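
For illustration only (not part of this patch): a sketch of how a
user-space consumer might combine the "mappings" count with the
hard-coded limit when walking the mappingN attributes.  The helper
mapping_attr_path() and the region_dir argument are hypothetical; only
the limit of 32 and the "mappingN" naming come from this patch.

```c
#include <assert.h>
#include <stdio.h>

#define ND_MAX_MAPPINGS 32	/* mirrors the limit added to libnd.h */

/*
 * Build the path of the idx-th mapping attribute below region_dir.
 * Returns 0 on success, or -1 when idx is outside the range reported
 * by "mappings", is at or beyond the number of attributes sysfs can
 * expose, or the path does not fit in the buffer.
 */
static int mapping_attr_path(char *path, size_t len, const char *region_dir,
		int mappings, int idx)
{
	int n;

	assert(path != NULL && region_dir != NULL);

	if (idx < 0 || idx >= mappings || idx >= ND_MAX_MAPPINGS)
		return -1;
	n = snprintf(path, len, "%s/mapping%d", region_dir, idx);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

If "mappings" ever reports more than ND_MAX_MAPPINGS, the indices past
the limit have no attribute to read, which is the condition the
paragraph above says the "mappings" attribute is meant to expose.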

Cc: Neil Brown <neilb@suse.de>
Cc: <linux-acpi@vger.kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/acpi/nfit.c            |  130 ++++++++++++++++++
 drivers/block/nd/Makefile      |    1 
 drivers/block/nd/nd-private.h  |    3 
 drivers/block/nd/nd.h          |   11 +
 drivers/block/nd/region_devs.c |  294 ++++++++++++++++++++++++++++++++++++++++
 include/linux/libnd.h          |   25 +++
 6 files changed, 463 insertions(+), 1 deletion(-)
 create mode 100644 drivers/block/nd/region_devs.c

diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index c75f4bf1c230..c510c7b4a6c0 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -716,11 +716,135 @@ static void acpi_nfit_init_dsms(struct acpi_nfit_desc *acpi_desc)
 			set_bit(i, &nd_desc->dsm_mask);
 }
 
+static ssize_t range_index_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_region *nd_region = to_nd_region(dev);
+	struct nfit_spa *nfit_spa = nd_region_provider_data(nd_region);
+
+	return sprintf(buf, "%d\n", nfit_spa->spa->range_index);
+}
+static DEVICE_ATTR_RO(range_index);
+
+static struct attribute *acpi_nfit_region_attributes[] = {
+	&dev_attr_range_index.attr,
+	NULL,
+};
+
+static struct attribute_group acpi_nfit_region_attribute_group = {
+	.name = "nfit",
+	.attrs = acpi_nfit_region_attributes,
+};
+
+static const struct attribute_group *acpi_nfit_region_attribute_groups[] = {
+	&nd_region_attribute_group,
+	&nd_mapping_attribute_group,
+	&acpi_nfit_region_attribute_group,
+	NULL,
+};
+
+static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
+		struct nfit_spa *nfit_spa)
+{
+	static struct nd_mapping nd_mappings[ND_MAX_MAPPINGS];
+	struct acpi_nfit_system_address *spa = nfit_spa->spa;
+	struct nfit_memdev *nfit_memdev;
+	struct nd_region_desc ndr_desc;
+	int spa_type, count = 0;
+	struct resource res;
+	u16 range_index;
+
+	spa_type = nfit_spa_type(spa);
+	range_index = spa->range_index;
+	if (range_index == 0) {
+		dev_dbg(acpi_desc->dev, "%s: detected invalid spa index\n",
+				__func__);
+		return 0;
+	}
+
+	memset(&res, 0, sizeof(res));
+	memset(&nd_mappings, 0, sizeof(nd_mappings));
+	memset(&ndr_desc, 0, sizeof(ndr_desc));
+	res.start = spa->address;
+	res.end = res.start + spa->length - 1;
+	ndr_desc.res = &res;
+	ndr_desc.provider_data = nfit_spa;
+	ndr_desc.attr_groups = acpi_nfit_region_attribute_groups;
+	list_for_each_entry(nfit_memdev, &acpi_desc->memdevs, list) {
+		struct acpi_nfit_memory_map *memdev = nfit_memdev->memdev;
+		struct nd_mapping *nd_mapping;
+		struct nd_dimm *nd_dimm;
+
+		if (memdev->range_index != range_index)
+			continue;
+		if (count >= ND_MAX_MAPPINGS) {
+			dev_err(acpi_desc->dev, "spa%d exceeds max mappings %d\n",
+					range_index, ND_MAX_MAPPINGS);
+			return -ENXIO;
+		}
+		nd_dimm = acpi_nfit_dimm_by_handle(acpi_desc, memdev->device_handle);
+		if (!nd_dimm) {
+			dev_err(acpi_desc->dev, "spa%d dimm: %#x not found\n",
+					range_index, memdev->device_handle);
+			return -ENODEV;
+		}
+		nd_mapping = &nd_mappings[count++];
+		nd_mapping->nd_dimm = nd_dimm;
+		if (spa_type == NFIT_SPA_PM || spa_type == NFIT_SPA_VOLATILE) {
+			nd_mapping->start = memdev->address;
+			nd_mapping->size = memdev->region_size;
+		} else if (spa_type == NFIT_SPA_DCR) {
+			struct nfit_mem *nfit_mem;
+			int blk_valid = 1;
+
+			nfit_mem = nd_dimm_provider_data(nd_dimm);
+			if (!nfit_mem || !nfit_mem->bdw) {
+				dev_dbg(acpi_desc->dev, "%s: spa%d missing bdw\n",
+						nd_dimm_name(nd_dimm), range_index);
+				blk_valid = 0;
+			} else {
+				nd_mapping->size = nfit_mem->bdw->capacity;
+				nd_mapping->start = nfit_mem->bdw->start_address;
+			}
+
+			ndr_desc.nd_mapping = nd_mapping;
+			ndr_desc.num_mappings = blk_valid;
+			if (!nd_blk_region_create(acpi_desc->nd_bus, &ndr_desc))
+				return -ENOMEM;
+		}
+	}
+
+	ndr_desc.nd_mapping = nd_mappings;
+	ndr_desc.num_mappings = count;
+	if (spa_type == NFIT_SPA_PM) {
+		if (!nd_pmem_region_create(acpi_desc->nd_bus, &ndr_desc))
+			return -ENOMEM;
+	} else if (spa_type == NFIT_SPA_VOLATILE) {
+		if (!nd_volatile_region_create(acpi_desc->nd_bus, &ndr_desc))
+			return -ENOMEM;
+	}
+	return 0;
+}
+
+static int acpi_nfit_register_regions(struct acpi_nfit_desc *acpi_desc)
+{
+	struct nfit_spa *nfit_spa;
+
+	list_for_each_entry(nfit_spa, &acpi_desc->spas, list) {
+		int rc = acpi_nfit_register_region(acpi_desc, nfit_spa);
+
+		if (rc)
+			return rc;
+	}
+	return 0;
+}
+
 static int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
 {
 	struct device *dev = acpi_desc->dev;
 	const void *end;
 	u8 *data;
+	int rc;
 
 	INIT_LIST_HEAD(&acpi_desc->spas);
 	INIT_LIST_HEAD(&acpi_desc->dcrs);
@@ -745,7 +869,11 @@ static int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
 
 	acpi_nfit_init_dsms(acpi_desc);
 
-	return acpi_nfit_register_dimms(acpi_desc);
+	rc = acpi_nfit_register_dimms(acpi_desc);
+	if (rc)
+		return rc;
+
+	return acpi_nfit_register_regions(acpi_desc);
 }
 
 static int acpi_nfit_add(struct acpi_device *adev)
diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index d9ef4496e8d3..43fdf4b206d6 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -4,3 +4,4 @@ libnd-y := core.o
 libnd-y += bus.o
 libnd-y += dimm_devs.o
 libnd-y += dimm.o
+libnd-y += region_devs.o
diff --git a/drivers/block/nd/nd-private.h b/drivers/block/nd/nd-private.h
index a333f3401ca7..8fee471e8dfc 100644
--- a/drivers/block/nd/nd-private.h
+++ b/drivers/block/nd/nd-private.h
@@ -42,5 +42,8 @@ void __exit nd_dimm_exit(void);
 int nd_bus_create_ndctl(struct nd_bus *nd_bus);
 void nd_bus_destroy_ndctl(struct nd_bus *nd_bus);
 void nd_synchronize(void);
+int nd_bus_register_dimms(struct nd_bus *nd_bus);
+int nd_bus_register_regions(struct nd_bus *nd_bus);
+int nd_match_dimm(struct device *dev, void *data);
 bool is_nd_dimm(struct device *dev);
 #endif /* __ND_PRIVATE_H__ */
diff --git a/drivers/block/nd/nd.h b/drivers/block/nd/nd.h
index 1a5a081ce640..d08871ceb3cf 100644
--- a/drivers/block/nd/nd.h
+++ b/drivers/block/nd/nd.h
@@ -13,6 +13,7 @@
 #ifndef __ND_H__
 #define __ND_H__
 #include <linux/device.h>
+#include <linux/libnd.h>
 #include <linux/mutex.h>
 #include <linux/ndctl.h>
 
@@ -22,6 +23,16 @@ struct nd_dimm_drvdata {
 	void *data;
 };
 
+struct nd_region {
+	struct device dev;
+	u16 ndr_mappings;
+	u64 ndr_size;
+	u64 ndr_start;
+	int id;
+	void *provider_data;
+	struct nd_mapping mapping[0];
+};
+
 enum nd_async_mode {
 	ND_SYNC,
 	ND_ASYNC,
diff --git a/drivers/block/nd/region_devs.c b/drivers/block/nd/region_devs.c
new file mode 100644
index 000000000000..12a5415acfcc
--- /dev/null
+++ b/drivers/block/nd/region_devs.c
@@ -0,0 +1,294 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/slab.h>
+#include <linux/io.h>
+#include "nd-private.h"
+#include "nd.h"
+
+static DEFINE_IDA(region_ida);
+
+static void nd_region_release(struct device *dev)
+{
+	struct nd_region *nd_region = to_nd_region(dev);
+	u16 i;
+
+	for (i = 0; i < nd_region->ndr_mappings; i++) {
+		struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+		struct nd_dimm *nd_dimm = nd_mapping->nd_dimm;
+
+		put_device(&nd_dimm->dev);
+	}
+	ida_simple_remove(&region_ida, nd_region->id);
+	kfree(nd_region);
+}
+
+static struct device_type nd_blk_device_type = {
+	.name = "nd_blk",
+	.release = nd_region_release,
+};
+
+static struct device_type nd_pmem_device_type = {
+	.name = "nd_pmem",
+	.release = nd_region_release,
+};
+
+static struct device_type nd_volatile_device_type = {
+	.name = "nd_volatile",
+	.release = nd_region_release,
+};
+
+static bool is_nd_pmem(struct device *dev)
+{
+	return dev ? dev->type == &nd_pmem_device_type : false;
+}
+
+struct nd_region *to_nd_region(struct device *dev)
+{
+	struct nd_region *nd_region = container_of(dev, struct nd_region, dev);
+
+	WARN_ON(dev->type->release != nd_region_release);
+	return nd_region;
+}
+EXPORT_SYMBOL_GPL(to_nd_region);
+
+static ssize_t size_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_region *nd_region = to_nd_region(dev);
+	unsigned long long size = 0;
+
+	if (is_nd_pmem(dev)) {
+		size = nd_region->ndr_size;
+	} else if (nd_region->ndr_mappings == 1) {
+		struct nd_mapping *nd_mapping = &nd_region->mapping[0];
+
+		size = nd_mapping->size;
+	}
+
+	return sprintf(buf, "%llu\n", size);
+}
+static DEVICE_ATTR_RO(size);
+
+static ssize_t mappings_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_region *nd_region = to_nd_region(dev);
+
+	return sprintf(buf, "%d\n", nd_region->ndr_mappings);
+}
+static DEVICE_ATTR_RO(mappings);
+
+static struct attribute *nd_region_attributes[] = {
+	&dev_attr_size.attr,
+	&dev_attr_mappings.attr,
+	NULL,
+};
+
+struct attribute_group nd_region_attribute_group = {
+	.attrs = nd_region_attributes,
+};
+EXPORT_SYMBOL_GPL(nd_region_attribute_group);
+
+static ssize_t mappingN(struct device *dev, char *buf, int n)
+{
+	struct nd_region *nd_region = to_nd_region(dev);
+	struct nd_mapping *nd_mapping;
+	struct nd_dimm *nd_dimm;
+
+	if (n >= nd_region->ndr_mappings)
+		return -ENXIO;
+	nd_mapping = &nd_region->mapping[n];
+	nd_dimm = nd_mapping->nd_dimm;
+
+	return sprintf(buf, "%s,%llu,%llu\n", dev_name(&nd_dimm->dev),
+			nd_mapping->start, nd_mapping->size);
+}
+
+#define REGION_MAPPING(idx) \
+static ssize_t mapping##idx##_show(struct device *dev,		\
+		struct device_attribute *attr, char *buf)	\
+{								\
+	return mappingN(dev, buf, idx);				\
+}								\
+static DEVICE_ATTR_RO(mapping##idx)
+
+/*
+ * 32 should be enough for a while, even in the presence of socket
+ * interleave a 32-way interleave set is a degenerate case.
+ */
+REGION_MAPPING(0);
+REGION_MAPPING(1);
+REGION_MAPPING(2);
+REGION_MAPPING(3);
+REGION_MAPPING(4);
+REGION_MAPPING(5);
+REGION_MAPPING(6);
+REGION_MAPPING(7);
+REGION_MAPPING(8);
+REGION_MAPPING(9);
+REGION_MAPPING(10);
+REGION_MAPPING(11);
+REGION_MAPPING(12);
+REGION_MAPPING(13);
+REGION_MAPPING(14);
+REGION_MAPPING(15);
+REGION_MAPPING(16);
+REGION_MAPPING(17);
+REGION_MAPPING(18);
+REGION_MAPPING(19);
+REGION_MAPPING(20);
+REGION_MAPPING(21);
+REGION_MAPPING(22);
+REGION_MAPPING(23);
+REGION_MAPPING(24);
+REGION_MAPPING(25);
+REGION_MAPPING(26);
+REGION_MAPPING(27);
+REGION_MAPPING(28);
+REGION_MAPPING(29);
+REGION_MAPPING(30);
+REGION_MAPPING(31);
+
+static umode_t nd_mapping_visible(struct kobject *kobj, struct attribute *a, int n)
+{
+	struct device *dev = container_of(kobj, struct device, kobj);
+	struct nd_region *nd_region = to_nd_region(dev);
+
+	if (n < nd_region->ndr_mappings)
+		return a->mode;
+	return 0;
+}
+
+static struct attribute *nd_mapping_attributes[] = {
+	&dev_attr_mapping0.attr,
+	&dev_attr_mapping1.attr,
+	&dev_attr_mapping2.attr,
+	&dev_attr_mapping3.attr,
+	&dev_attr_mapping4.attr,
+	&dev_attr_mapping5.attr,
+	&dev_attr_mapping6.attr,
+	&dev_attr_mapping7.attr,
+	&dev_attr_mapping8.attr,
+	&dev_attr_mapping9.attr,
+	&dev_attr_mapping10.attr,
+	&dev_attr_mapping11.attr,
+	&dev_attr_mapping12.attr,
+	&dev_attr_mapping13.attr,
+	&dev_attr_mapping14.attr,
+	&dev_attr_mapping15.attr,
+	&dev_attr_mapping16.attr,
+	&dev_attr_mapping17.attr,
+	&dev_attr_mapping18.attr,
+	&dev_attr_mapping19.attr,
+	&dev_attr_mapping20.attr,
+	&dev_attr_mapping21.attr,
+	&dev_attr_mapping22.attr,
+	&dev_attr_mapping23.attr,
+	&dev_attr_mapping24.attr,
+	&dev_attr_mapping25.attr,
+	&dev_attr_mapping26.attr,
+	&dev_attr_mapping27.attr,
+	&dev_attr_mapping28.attr,
+	&dev_attr_mapping29.attr,
+	&dev_attr_mapping30.attr,
+	&dev_attr_mapping31.attr,
+	NULL,
+};
+
+struct attribute_group nd_mapping_attribute_group = {
+	.is_visible = nd_mapping_visible,
+	.attrs = nd_mapping_attributes,
+};
+EXPORT_SYMBOL_GPL(nd_mapping_attribute_group);
+
+void *nd_region_provider_data(struct nd_region *nd_region)
+{
+	return nd_region->provider_data;
+}
+EXPORT_SYMBOL_GPL(nd_region_provider_data);
+
+static noinline struct nd_region *nd_region_create(struct nd_bus *nd_bus,
+		struct nd_region_desc *ndr_desc, struct device_type *dev_type)
+{
+	struct nd_region *nd_region;
+	struct device *dev;
+	u16 i;
+
+	for (i = 0; i < ndr_desc->num_mappings; i++) {
+		struct nd_mapping *nd_mapping = &ndr_desc->nd_mapping[i];
+		struct nd_dimm *nd_dimm = nd_mapping->nd_dimm;
+
+		if ((nd_mapping->start | nd_mapping->size) % SZ_4K) {
+			dev_err(&nd_bus->dev, "%pf: %s mapping%d is not 4K aligned\n",
+					__builtin_return_address(0),
+					dev_name(&nd_dimm->dev), i);
+
+			return NULL;
+		}
+	}
+
+	nd_region = kzalloc(sizeof(struct nd_region)
+			+ sizeof(struct nd_mapping) * ndr_desc->num_mappings,
+			GFP_KERNEL);
+	if (!nd_region)
+		return NULL;
+	nd_region->id = ida_simple_get(&region_ida, 0, 0, GFP_KERNEL);
+	if (nd_region->id < 0) {
+		kfree(nd_region);
+		return NULL;
+	}
+
+	memcpy(nd_region->mapping, ndr_desc->nd_mapping,
+			sizeof(struct nd_mapping) * ndr_desc->num_mappings);
+	for (i = 0; i < ndr_desc->num_mappings; i++) {
+		struct nd_mapping *nd_mapping = &ndr_desc->nd_mapping[i];
+		struct nd_dimm *nd_dimm = nd_mapping->nd_dimm;
+
+		get_device(&nd_dimm->dev);
+	}
+	nd_region->ndr_mappings = ndr_desc->num_mappings;
+	nd_region->provider_data = ndr_desc->provider_data;
+	dev = &nd_region->dev;
+	dev_set_name(dev, "region%d", nd_region->id);
+	dev->parent = &nd_bus->dev;
+	dev->type = dev_type;
+	dev->groups = ndr_desc->attr_groups;
+	nd_region->ndr_size = resource_size(ndr_desc->res);
+	nd_region->ndr_start = ndr_desc->res->start;
+	nd_device_register(dev);
+
+	return nd_region;
+}
+
+struct nd_region *nd_pmem_region_create(struct nd_bus *nd_bus,
+		struct nd_region_desc *ndr_desc)
+{
+	return nd_region_create(nd_bus, ndr_desc, &nd_pmem_device_type);
+}
+EXPORT_SYMBOL_GPL(nd_pmem_region_create);
+
+struct nd_region *nd_blk_region_create(struct nd_bus *nd_bus,
+		struct nd_region_desc *ndr_desc)
+{
+	if (ndr_desc->num_mappings > 1)
+		return NULL;
+	return nd_region_create(nd_bus, ndr_desc, &nd_blk_device_type);
+}
+EXPORT_SYMBOL_GPL(nd_blk_region_create);
+
+struct nd_region *nd_volatile_region_create(struct nd_bus *nd_bus,
+		struct nd_region_desc *ndr_desc)
+{
+	return nd_region_create(nd_bus, ndr_desc, &nd_volatile_device_type);
+}
+EXPORT_SYMBOL_GPL(nd_volatile_region_create);
diff --git a/include/linux/libnd.h b/include/linux/libnd.h
index 0d7e82401e4b..f45407727216 100644
--- a/include/linux/libnd.h
+++ b/include/linux/libnd.h
@@ -25,11 +25,14 @@ enum {
 	ND_CMD_MAX_ELEM = 4,
 	ND_CMD_MAX_ENVELOPE = 16,
 	ND_CMD_ARS_QUERY_MAX = SZ_4K,
+	ND_MAX_MAPPINGS = 32,
 };
 
 extern struct attribute_group nd_bus_attribute_group;
 extern struct attribute_group nd_dimm_attribute_group;
 extern struct attribute_group nd_device_attribute_group;
+extern struct attribute_group nd_region_attribute_group;
+extern struct attribute_group nd_mapping_attribute_group;
 
 struct nd_dimm;
 struct nd_bus_descriptor;
@@ -37,6 +40,12 @@ typedef int (*ndctl_fn)(struct nd_bus_descriptor *nd_desc,
 		struct nd_dimm *nd_dimm, unsigned int cmd, void *buf,
 		unsigned int buf_len);
 
+struct nd_mapping {
+	struct nd_dimm *nd_dimm;
+	u64 start;
+	u64 size;
+};
+
 struct nd_bus_descriptor {
 	const struct attribute_group **attr_groups;
 	unsigned long dsm_mask;
@@ -51,6 +60,14 @@ struct nd_cmd_desc {
 	int out_sizes[ND_CMD_MAX_ELEM];
 };
 
+struct nd_region_desc {
+	struct resource *res;
+	struct nd_mapping *nd_mapping;
+	u16 num_mappings;
+	const struct attribute_group **attr_groups;
+	void *provider_data;
+};
+
 struct nd_bus;
 struct device;
 struct nd_bus *nd_bus_register(struct device *parent,
@@ -58,9 +75,11 @@ struct nd_bus *nd_bus_register(struct device *parent,
 void nd_bus_unregister(struct nd_bus *nd_bus);
 struct nd_bus *to_nd_bus(struct device *dev);
 struct nd_dimm *to_nd_dimm(struct device *dev);
+struct nd_region *to_nd_region(struct device *dev);
 struct nd_bus_descriptor *to_nd_desc(struct nd_bus *nd_bus);
 const char *nd_dimm_name(struct nd_dimm *nd_dimm);
 void *nd_dimm_provider_data(struct nd_dimm *nd_dimm);
+void *nd_region_provider_data(struct nd_region *nd_region);
 struct nd_dimm *nd_dimm_create(struct nd_bus *nd_bus, void *provider_data,
 		const struct attribute_group **groups, unsigned long flags,
 		unsigned long *dsm_mask);
@@ -72,4 +91,10 @@ u32 nd_cmd_out_size(struct nd_dimm *nd_dimm, int cmd,
 		const struct nd_cmd_desc *desc, int idx, const u32 *in_field,
 		const u32 *out_field);
 int nd_bus_validate_dimm_count(struct nd_bus *nd_bus, int dimm_count);
+struct nd_region *nd_pmem_region_create(struct nd_bus *nd_bus,
+		struct nd_region_desc *ndr_desc);
+struct nd_region *nd_blk_region_create(struct nd_bus *nd_bus,
+		struct nd_region_desc *ndr_desc);
+struct nd_region *nd_volatile_region_create(struct nd_bus *nd_bus,
+		struct nd_region_desc *ndr_desc);
 #endif /* __LIBND_H__ */


* [PATCH v3 07/21] libnd, nfit: regions (block-data-window, persistent memory, volatile memory)
@ 2015-05-20 20:56   ` Dan Williams
  0 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2015-05-20 20:56 UTC (permalink / raw)
  To: axboe
  Cc: mingo, linux-nvdimm, neilb, gregkh, Rafael J. Wysocki,
	Robert Moore, linux-kernel, linux-acpi, jmoyer, hch

A "region" device represents the maximum capacity of a BLK range (mmio
block-data-window(s)), or a PMEM range (DAX-capable persistent memory or
volatile memory), without regard for aliasing.  Aliasing, in the
dimm-local address space (DPA), is resolved by metadata on a dimm to
designate which exclusive interface will access the aliased DPA ranges.
Support for the per-dimm metadata/label arrives in a subsequent
patch.

The name format of "region" devices is "regionN" where, like dimms, N is
a global ida index assigned at discovery time.  This id is not reliable
across reboots nor in the presence of hotplug.  Look to attributes of
the region or static id-data of the sub-namespace to generate a
persistent name.  However, if the platform configuration does not change
it is reasonable to expect the same region id to be assigned at the next
boot.

"region"s have 2 generic attributes, "size" and "mappingN", where:
- size: the BLK accessible capacity or the span of the
  system physical address range in the case of PMEM.

- mappingN: a tuple describing a dimm's contribution to the region's
  capacity in the format (<nmemX>,<dpa>,<size>).  For a
  PMEM-region there will be at least one mapping per dimm in the interleave
  set.  For a BLK-region there is only "mapping0" listing the starting
  DPA of the BLK-region and the available DPA capacity of that space
  (matches "size" above).

The max number of mappings per "region" is hard coded per the constraints of
sysfs attribute groups.  That said, the number of mappings per region should
never exceed the maximum number of possible dimms in the system.  If the
current number turns out to not be enough then the "mappings" attribute
clarifies how many there are supposed to be. "32 should be enough for
anybody...".

Cc: Neil Brown <neilb@suse.de>
Cc: <linux-acpi@vger.kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/acpi/nfit.c            |  130 ++++++++++++++++++
 drivers/block/nd/Makefile      |    1 
 drivers/block/nd/nd-private.h  |    3 
 drivers/block/nd/nd.h          |   11 +
 drivers/block/nd/region_devs.c |  294 ++++++++++++++++++++++++++++++++++++++++
 include/linux/libnd.h          |   25 +++
 6 files changed, 463 insertions(+), 1 deletion(-)
 create mode 100644 drivers/block/nd/region_devs.c

diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index c75f4bf1c230..c510c7b4a6c0 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -716,11 +716,135 @@ static void acpi_nfit_init_dsms(struct acpi_nfit_desc *acpi_desc)
 			set_bit(i, &nd_desc->dsm_mask);
 }
 
+static ssize_t range_index_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_region *nd_region = to_nd_region(dev);
+	struct nfit_spa *nfit_spa = nd_region_provider_data(nd_region);
+
+	return sprintf(buf, "%d\n", nfit_spa->spa->range_index);
+}
+static DEVICE_ATTR_RO(range_index);
+
+static struct attribute *acpi_nfit_region_attributes[] = {
+	&dev_attr_range_index.attr,
+	NULL,
+};
+
+static struct attribute_group acpi_nfit_region_attribute_group = {
+	.name = "nfit",
+	.attrs = acpi_nfit_region_attributes,
+};
+
+static const struct attribute_group *acpi_nfit_region_attribute_groups[] = {
+	&nd_region_attribute_group,
+	&nd_mapping_attribute_group,
+	&acpi_nfit_region_attribute_group,
+	NULL,
+};
+
+static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
+		struct nfit_spa *nfit_spa)
+{
+	static struct nd_mapping nd_mappings[ND_MAX_MAPPINGS];
+	struct acpi_nfit_system_address *spa = nfit_spa->spa;
+	struct nfit_memdev *nfit_memdev;
+	struct nd_region_desc ndr_desc;
+	int spa_type, count = 0;
+	struct resource res;
+	u16 range_index;
+
+	spa_type = nfit_spa_type(spa);
+	range_index = spa->range_index;
+	if (range_index == 0) {
+		dev_dbg(acpi_desc->dev, "%s: detected invalid spa index\n",
+				__func__);
+		return 0;
+	}
+
+	memset(&res, 0, sizeof(res));
+	memset(&nd_mappings, 0, sizeof(nd_mappings));
+	memset(&ndr_desc, 0, sizeof(ndr_desc));
+	res.start = spa->address;
+	res.end = res.start + spa->length - 1;
+	ndr_desc.res = &res;
+	ndr_desc.provider_data = nfit_spa;
+	ndr_desc.attr_groups = acpi_nfit_region_attribute_groups;
+	list_for_each_entry(nfit_memdev, &acpi_desc->memdevs, list) {
+		struct acpi_nfit_memory_map *memdev = nfit_memdev->memdev;
+		struct nd_mapping *nd_mapping;
+		struct nd_dimm *nd_dimm;
+
+		if (memdev->range_index != range_index)
+			continue;
+		if (count >= ND_MAX_MAPPINGS) {
+			dev_err(acpi_desc->dev, "spa%d exceeds max mappings %d\n",
+					range_index, ND_MAX_MAPPINGS);
+			return -ENXIO;
+		}
+		nd_dimm = acpi_nfit_dimm_by_handle(acpi_desc, memdev->device_handle);
+		if (!nd_dimm) {
+			dev_err(acpi_desc->dev, "spa%d dimm: %#x not found\n",
+					range_index, memdev->device_handle);
+			return -ENODEV;
+		}
+		nd_mapping = &nd_mappings[count++];
+		nd_mapping->nd_dimm = nd_dimm;
+		if (spa_type == NFIT_SPA_PM || spa_type == NFIT_SPA_VOLATILE) {
+			nd_mapping->start = memdev->address;
+			nd_mapping->size = memdev->region_size;
+		} else if (spa_type == NFIT_SPA_DCR) {
+			struct nfit_mem *nfit_mem;
+			int blk_valid = 1;
+
+			nfit_mem = nd_dimm_provider_data(nd_dimm);
+			if (!nfit_mem || !nfit_mem->bdw) {
+				dev_dbg(acpi_desc->dev, "%s: spa%d missing bdw\n",
+						nd_dimm_name(nd_dimm), range_index);
+				blk_valid = 0;
+			} else {
+				nd_mapping->size = nfit_mem->bdw->capacity;
+				nd_mapping->start = nfit_mem->bdw->start_address;
+			}
+
+			ndr_desc.nd_mapping = nd_mapping;
+			ndr_desc.num_mappings = blk_valid;
+			if (!nd_blk_region_create(acpi_desc->nd_bus, &ndr_desc))
+				return -ENOMEM;
+		}
+	}
+
+	ndr_desc.nd_mapping = nd_mappings;
+	ndr_desc.num_mappings = count;
+	if (spa_type == NFIT_SPA_PM) {
+		if (!nd_pmem_region_create(acpi_desc->nd_bus, &ndr_desc))
+			return -ENOMEM;
+	} else if (spa_type == NFIT_SPA_VOLATILE) {
+		if (!nd_volatile_region_create(acpi_desc->nd_bus, &ndr_desc))
+			return -ENOMEM;
+	}
+	return 0;
+}
+
+static int acpi_nfit_register_regions(struct acpi_nfit_desc *acpi_desc)
+{
+	struct nfit_spa *nfit_spa;
+
+	list_for_each_entry(nfit_spa, &acpi_desc->spas, list) {
+		int rc = acpi_nfit_register_region(acpi_desc, nfit_spa);
+
+		if (rc)
+			return rc;
+	}
+	return 0;
+}
+
 static int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
 {
 	struct device *dev = acpi_desc->dev;
 	const void *end;
 	u8 *data;
+	int rc;
 
 	INIT_LIST_HEAD(&acpi_desc->spas);
 	INIT_LIST_HEAD(&acpi_desc->dcrs);
@@ -745,7 +869,11 @@ static int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
 
 	acpi_nfit_init_dsms(acpi_desc);
 
-	return acpi_nfit_register_dimms(acpi_desc);
+	rc = acpi_nfit_register_dimms(acpi_desc);
+	if (rc)
+		return rc;
+
+	return acpi_nfit_register_regions(acpi_desc);
 }
 
 static int acpi_nfit_add(struct acpi_device *adev)
diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index d9ef4496e8d3..43fdf4b206d6 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -4,3 +4,4 @@ libnd-y := core.o
 libnd-y += bus.o
 libnd-y += dimm_devs.o
 libnd-y += dimm.o
+libnd-y += region_devs.o
diff --git a/drivers/block/nd/nd-private.h b/drivers/block/nd/nd-private.h
index a333f3401ca7..8fee471e8dfc 100644
--- a/drivers/block/nd/nd-private.h
+++ b/drivers/block/nd/nd-private.h
@@ -42,5 +42,8 @@ void __exit nd_dimm_exit(void);
 int nd_bus_create_ndctl(struct nd_bus *nd_bus);
 void nd_bus_destroy_ndctl(struct nd_bus *nd_bus);
 void nd_synchronize(void);
+int nd_bus_register_dimms(struct nd_bus *nd_bus);
+int nd_bus_register_regions(struct nd_bus *nd_bus);
+int nd_match_dimm(struct device *dev, void *data);
 bool is_nd_dimm(struct device *dev);
 #endif /* __ND_PRIVATE_H__ */
diff --git a/drivers/block/nd/nd.h b/drivers/block/nd/nd.h
index 1a5a081ce640..d08871ceb3cf 100644
--- a/drivers/block/nd/nd.h
+++ b/drivers/block/nd/nd.h
@@ -13,6 +13,7 @@
 #ifndef __ND_H__
 #define __ND_H__
 #include <linux/device.h>
+#include <linux/libnd.h>
 #include <linux/mutex.h>
 #include <linux/ndctl.h>
 
@@ -22,6 +23,16 @@ struct nd_dimm_drvdata {
 	void *data;
 };
 
+struct nd_region {
+	struct device dev;
+	u16 ndr_mappings;
+	u64 ndr_size;
+	u64 ndr_start;
+	int id;
+	void *provider_data;
+	struct nd_mapping mapping[0];
+};
+
 enum nd_async_mode {
 	ND_SYNC,
 	ND_ASYNC,
diff --git a/drivers/block/nd/region_devs.c b/drivers/block/nd/region_devs.c
new file mode 100644
index 000000000000..12a5415acfcc
--- /dev/null
+++ b/drivers/block/nd/region_devs.c
@@ -0,0 +1,294 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/slab.h>
+#include <linux/io.h>
+#include "nd-private.h"
+#include "nd.h"
+
+static DEFINE_IDA(region_ida);
+
+static void nd_region_release(struct device *dev)
+{
+	struct nd_region *nd_region = to_nd_region(dev);
+	u16 i;
+
+	for (i = 0; i < nd_region->ndr_mappings; i++) {
+		struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+		struct nd_dimm *nd_dimm = nd_mapping->nd_dimm;
+
+		put_device(&nd_dimm->dev);
+	}
+	ida_simple_remove(&region_ida, nd_region->id);
+	kfree(nd_region);
+}
+
+static struct device_type nd_blk_device_type = {
+	.name = "nd_blk",
+	.release = nd_region_release,
+};
+
+static struct device_type nd_pmem_device_type = {
+	.name = "nd_pmem",
+	.release = nd_region_release,
+};
+
+static struct device_type nd_volatile_device_type = {
+	.name = "nd_volatile",
+	.release = nd_region_release,
+};
+
+static bool is_nd_pmem(struct device *dev)
+{
+	return dev ? dev->type == &nd_pmem_device_type : false;
+}
+
+struct nd_region *to_nd_region(struct device *dev)
+{
+	struct nd_region *nd_region = container_of(dev, struct nd_region, dev);
+
+	WARN_ON(dev->type->release != nd_region_release);
+	return nd_region;
+}
+EXPORT_SYMBOL_GPL(to_nd_region);
+
+static ssize_t size_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_region *nd_region = to_nd_region(dev);
+	unsigned long long size = 0;
+
+	if (is_nd_pmem(dev)) {
+		size = nd_region->ndr_size;
+	} else if (nd_region->ndr_mappings == 1) {
+		struct nd_mapping *nd_mapping = &nd_region->mapping[0];
+
+		size = nd_mapping->size;
+	}
+
+	return sprintf(buf, "%llu\n", size);
+}
+static DEVICE_ATTR_RO(size);
+
+static ssize_t mappings_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_region *nd_region = to_nd_region(dev);
+
+	return sprintf(buf, "%d\n", nd_region->ndr_mappings);
+}
+static DEVICE_ATTR_RO(mappings);
+
+static struct attribute *nd_region_attributes[] = {
+	&dev_attr_size.attr,
+	&dev_attr_mappings.attr,
+	NULL,
+};
+
+struct attribute_group nd_region_attribute_group = {
+	.attrs = nd_region_attributes,
+};
+EXPORT_SYMBOL_GPL(nd_region_attribute_group);
+
+static ssize_t mappingN(struct device *dev, char *buf, int n)
+{
+	struct nd_region *nd_region = to_nd_region(dev);
+	struct nd_mapping *nd_mapping;
+	struct nd_dimm *nd_dimm;
+
+	if (n >= nd_region->ndr_mappings)
+		return -ENXIO;
+	nd_mapping = &nd_region->mapping[n];
+	nd_dimm = nd_mapping->nd_dimm;
+
+	return sprintf(buf, "%s,%llu,%llu\n", dev_name(&nd_dimm->dev),
+			nd_mapping->start, nd_mapping->size);
+}
+
+#define REGION_MAPPING(idx) \
+static ssize_t mapping##idx##_show(struct device *dev,		\
+		struct device_attribute *attr, char *buf)	\
+{								\
+	return mappingN(dev, buf, idx);				\
+}								\
+static DEVICE_ATTR_RO(mapping##idx)
+
+/*
+ * 32 should be enough for a while; even in the presence of socket
+ * interleave, a 32-way interleave set is a degenerate case.
+ */
+REGION_MAPPING(0);
+REGION_MAPPING(1);
+REGION_MAPPING(2);
+REGION_MAPPING(3);
+REGION_MAPPING(4);
+REGION_MAPPING(5);
+REGION_MAPPING(6);
+REGION_MAPPING(7);
+REGION_MAPPING(8);
+REGION_MAPPING(9);
+REGION_MAPPING(10);
+REGION_MAPPING(11);
+REGION_MAPPING(12);
+REGION_MAPPING(13);
+REGION_MAPPING(14);
+REGION_MAPPING(15);
+REGION_MAPPING(16);
+REGION_MAPPING(17);
+REGION_MAPPING(18);
+REGION_MAPPING(19);
+REGION_MAPPING(20);
+REGION_MAPPING(21);
+REGION_MAPPING(22);
+REGION_MAPPING(23);
+REGION_MAPPING(24);
+REGION_MAPPING(25);
+REGION_MAPPING(26);
+REGION_MAPPING(27);
+REGION_MAPPING(28);
+REGION_MAPPING(29);
+REGION_MAPPING(30);
+REGION_MAPPING(31);
+
+static umode_t nd_mapping_visible(struct kobject *kobj, struct attribute *a, int n)
+{
+	struct device *dev = container_of(kobj, struct device, kobj);
+	struct nd_region *nd_region = to_nd_region(dev);
+
+	if (n < nd_region->ndr_mappings)
+		return a->mode;
+	return 0;
+}
+
+static struct attribute *nd_mapping_attributes[] = {
+	&dev_attr_mapping0.attr,
+	&dev_attr_mapping1.attr,
+	&dev_attr_mapping2.attr,
+	&dev_attr_mapping3.attr,
+	&dev_attr_mapping4.attr,
+	&dev_attr_mapping5.attr,
+	&dev_attr_mapping6.attr,
+	&dev_attr_mapping7.attr,
+	&dev_attr_mapping8.attr,
+	&dev_attr_mapping9.attr,
+	&dev_attr_mapping10.attr,
+	&dev_attr_mapping11.attr,
+	&dev_attr_mapping12.attr,
+	&dev_attr_mapping13.attr,
+	&dev_attr_mapping14.attr,
+	&dev_attr_mapping15.attr,
+	&dev_attr_mapping16.attr,
+	&dev_attr_mapping17.attr,
+	&dev_attr_mapping18.attr,
+	&dev_attr_mapping19.attr,
+	&dev_attr_mapping20.attr,
+	&dev_attr_mapping21.attr,
+	&dev_attr_mapping22.attr,
+	&dev_attr_mapping23.attr,
+	&dev_attr_mapping24.attr,
+	&dev_attr_mapping25.attr,
+	&dev_attr_mapping26.attr,
+	&dev_attr_mapping27.attr,
+	&dev_attr_mapping28.attr,
+	&dev_attr_mapping29.attr,
+	&dev_attr_mapping30.attr,
+	&dev_attr_mapping31.attr,
+	NULL,
+};
+
+struct attribute_group nd_mapping_attribute_group = {
+	.is_visible = nd_mapping_visible,
+	.attrs = nd_mapping_attributes,
+};
+EXPORT_SYMBOL_GPL(nd_mapping_attribute_group);
+
+void *nd_region_provider_data(struct nd_region *nd_region)
+{
+	return nd_region->provider_data;
+}
+EXPORT_SYMBOL_GPL(nd_region_provider_data);
+
+static noinline struct nd_region *nd_region_create(struct nd_bus *nd_bus,
+		struct nd_region_desc *ndr_desc, struct device_type *dev_type)
+{
+	struct nd_region *nd_region;
+	struct device *dev;
+	u16 i;
+
+	for (i = 0; i < ndr_desc->num_mappings; i++) {
+		struct nd_mapping *nd_mapping = &ndr_desc->nd_mapping[i];
+		struct nd_dimm *nd_dimm = nd_mapping->nd_dimm;
+
+		if ((nd_mapping->start | nd_mapping->size) % SZ_4K) {
+			dev_err(&nd_bus->dev, "%pf: %s mapping%d is not 4K aligned\n",
+					__builtin_return_address(0),
+					dev_name(&nd_dimm->dev), i);
+
+			return NULL;
+		}
+	}
+
+	nd_region = kzalloc(sizeof(struct nd_region)
+			+ sizeof(struct nd_mapping) * ndr_desc->num_mappings,
+			GFP_KERNEL);
+	if (!nd_region)
+		return NULL;
+	nd_region->id = ida_simple_get(&region_ida, 0, 0, GFP_KERNEL);
+	if (nd_region->id < 0) {
+		kfree(nd_region);
+		return NULL;
+	}
+
+	memcpy(nd_region->mapping, ndr_desc->nd_mapping,
+			sizeof(struct nd_mapping) * ndr_desc->num_mappings);
+	for (i = 0; i < ndr_desc->num_mappings; i++) {
+		struct nd_mapping *nd_mapping = &ndr_desc->nd_mapping[i];
+		struct nd_dimm *nd_dimm = nd_mapping->nd_dimm;
+
+		get_device(&nd_dimm->dev);
+	}
+	nd_region->ndr_mappings = ndr_desc->num_mappings;
+	nd_region->provider_data = ndr_desc->provider_data;
+	dev = &nd_region->dev;
+	dev_set_name(dev, "region%d", nd_region->id);
+	dev->parent = &nd_bus->dev;
+	dev->type = dev_type;
+	dev->groups = ndr_desc->attr_groups;
+	nd_region->ndr_size = resource_size(ndr_desc->res);
+	nd_region->ndr_start = ndr_desc->res->start;
+	nd_device_register(dev);
+
+	return nd_region;
+}
+
+struct nd_region *nd_pmem_region_create(struct nd_bus *nd_bus,
+		struct nd_region_desc *ndr_desc)
+{
+	return nd_region_create(nd_bus, ndr_desc, &nd_pmem_device_type);
+}
+EXPORT_SYMBOL_GPL(nd_pmem_region_create);
+
+struct nd_region *nd_blk_region_create(struct nd_bus *nd_bus,
+		struct nd_region_desc *ndr_desc)
+{
+	if (ndr_desc->num_mappings > 1)
+		return NULL;
+	return nd_region_create(nd_bus, ndr_desc, &nd_blk_device_type);
+}
+EXPORT_SYMBOL_GPL(nd_blk_region_create);
+
+struct nd_region *nd_volatile_region_create(struct nd_bus *nd_bus,
+		struct nd_region_desc *ndr_desc)
+{
+	return nd_region_create(nd_bus, ndr_desc, &nd_volatile_device_type);
+}
+EXPORT_SYMBOL_GPL(nd_volatile_region_create);
diff --git a/include/linux/libnd.h b/include/linux/libnd.h
index 0d7e82401e4b..f45407727216 100644
--- a/include/linux/libnd.h
+++ b/include/linux/libnd.h
@@ -25,11 +25,14 @@ enum {
 	ND_CMD_MAX_ELEM = 4,
 	ND_CMD_MAX_ENVELOPE = 16,
 	ND_CMD_ARS_QUERY_MAX = SZ_4K,
+	ND_MAX_MAPPINGS = 32,
 };
 
 extern struct attribute_group nd_bus_attribute_group;
 extern struct attribute_group nd_dimm_attribute_group;
 extern struct attribute_group nd_device_attribute_group;
+extern struct attribute_group nd_region_attribute_group;
+extern struct attribute_group nd_mapping_attribute_group;
 
 struct nd_dimm;
 struct nd_bus_descriptor;
@@ -37,6 +40,12 @@ typedef int (*ndctl_fn)(struct nd_bus_descriptor *nd_desc,
 		struct nd_dimm *nd_dimm, unsigned int cmd, void *buf,
 		unsigned int buf_len);
 
+struct nd_mapping {
+	struct nd_dimm *nd_dimm;
+	u64 start;
+	u64 size;
+};
+
 struct nd_bus_descriptor {
 	const struct attribute_group **attr_groups;
 	unsigned long dsm_mask;
@@ -51,6 +60,14 @@ struct nd_cmd_desc {
 	int out_sizes[ND_CMD_MAX_ELEM];
 };
 
+struct nd_region_desc {
+	struct resource *res;
+	struct nd_mapping *nd_mapping;
+	u16 num_mappings;
+	const struct attribute_group **attr_groups;
+	void *provider_data;
+};
+
 struct nd_bus;
 struct device;
 struct nd_bus *nd_bus_register(struct device *parent,
@@ -58,9 +75,11 @@ struct nd_bus *nd_bus_register(struct device *parent,
 void nd_bus_unregister(struct nd_bus *nd_bus);
 struct nd_bus *to_nd_bus(struct device *dev);
 struct nd_dimm *to_nd_dimm(struct device *dev);
+struct nd_region *to_nd_region(struct device *dev);
 struct nd_bus_descriptor *to_nd_desc(struct nd_bus *nd_bus);
 const char *nd_dimm_name(struct nd_dimm *nd_dimm);
 void *nd_dimm_provider_data(struct nd_dimm *nd_dimm);
+void *nd_region_provider_data(struct nd_region *nd_region);
 struct nd_dimm *nd_dimm_create(struct nd_bus *nd_bus, void *provider_data,
 		const struct attribute_group **groups, unsigned long flags,
 		unsigned long *dsm_mask);
@@ -72,4 +91,10 @@ u32 nd_cmd_out_size(struct nd_dimm *nd_dimm, int cmd,
 		const struct nd_cmd_desc *desc, int idx, const u32 *in_field,
 		const u32 *out_field);
 int nd_bus_validate_dimm_count(struct nd_bus *nd_bus, int dimm_count);
+struct nd_region *nd_pmem_region_create(struct nd_bus *nd_bus,
+		struct nd_region_desc *ndr_desc);
+struct nd_region *nd_blk_region_create(struct nd_bus *nd_bus,
+		struct nd_region_desc *ndr_desc);
+struct nd_region *nd_volatile_region_create(struct nd_bus *nd_bus,
+		struct nd_region_desc *ndr_desc);
 #endif /* __LIBND_H__ */
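
The alignment test in nd_region_create() checks a mapping's start and
size in a single expression.  The following is a minimal userspace
sketch of that idiom, not code from the patch: SKETCH_SZ_4K and
mapping_is_4k_aligned() are names made up for illustration, and
SKETCH_SZ_4K is assumed to equal the kernel's SZ_4K (4096).

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the kernel's SZ_4K; hypothetical name for this sketch. */
#define SKETCH_SZ_4K 0x1000ULL

/*
 * OR-ing start and size leaves a low bit set if either value has one,
 * so a single modulo tests both values for 4K alignment.
 */
static bool mapping_is_4k_aligned(uint64_t start, uint64_t size)
{
	return ((start | size) % SKETCH_SZ_4K) == 0;
}
```

Because 4K is a power of two, the OR of two values is a multiple of 4K
only when both values are, which is why one modulo suffices.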



* [PATCH v3 08/21] libnd: support for legacy (non-aliasing) nvdimms
  2015-05-20 20:56 ` Dan Williams
@ 2015-05-20 20:56   ` Dan Williams
From: Dan Williams @ 2015-05-20 20:56 UTC (permalink / raw)
  To: axboe
  Cc: linux-nvdimm, neilb, gregkh, linux-kernel, mingo, linux-acpi,
	jmoyer, hch

The libnd region driver is an intermediary driver that translates
non-volatile "region"s into "namespace" sub-devices that are surfaced by
persistent memory block-device drivers (PMEM and BLK).

ACPI 6 introduces the concept that a given nvdimm may simultaneously
offer multiple access modes to its media: direct PMEM load/store
access and windowed BLK mode.  Most existing nvdimms implement a PMEM
interface and some offer a BLK-like mode, but none offer both in the
way ACPI 6 defines.  If an nvdimm is single-interfaced, there is no
need for dimm metadata labels.  For these devices we can take the
region boundaries directly to create a child namespace device
(nd_namespace_io).

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
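Not part of the commit: a minimal userspace sketch of how a region is
classified into a namespace type, following the logic of
nd_region_to_namespace_type() in the region_devs.c hunk.  The enum
sketch_region_kind, the aliasing[] array and sketch_namespace_type()
are assumptions made for illustration; the patch itself compares
dev->type pointers and tests NDD_ALIASING in nd_dimm->flags.  The
ND_DEVICE_NAMESPACE_* values are copied from the ndctl.h hunk.

```c
#include <stddef.h>

/* Values copied from the include/uapi/linux/ndctl.h hunk in this patch. */
#define ND_DEVICE_NAMESPACE_IO   4
#define ND_DEVICE_NAMESPACE_PMEM 5
#define ND_DEVICE_NAMESPACE_BLK  6

/* Hypothetical simplified region kinds; the patch uses device_type pointers. */
enum sketch_region_kind { SKETCH_PMEM, SKETCH_BLK, SKETCH_VOLATILE };

/*
 * aliasing[i] is nonzero when the dimm behind mapping i would have the
 * NDD_ALIASING flag set.  Returns 0 for region kinds that carry no
 * namespace type, mirroring the fall-through return in the patch.
 */
static int sketch_namespace_type(enum sketch_region_kind kind,
		const int *aliasing, size_t num_mappings)
{
	size_t i;

	if (kind == SKETCH_BLK)
		return ND_DEVICE_NAMESPACE_BLK;
	if (kind != SKETCH_PMEM)
		return 0;
	for (i = 0; i < num_mappings; i++)
		if (aliasing[i])
			return ND_DEVICE_NAMESPACE_PMEM;
	/* legacy case: the region boundaries describe the namespace */
	return ND_DEVICE_NAMESPACE_IO;
}
```

A PMEM region whose dimms are all non-aliasing therefore resolves to
ND_DEVICE_NAMESPACE_IO, which is the only type this patch creates
child devices for.
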
 drivers/acpi/nfit.c               |    1 
 drivers/block/nd/Makefile         |    2 +
 drivers/block/nd/bus.c            |   26 +++++++++
 drivers/block/nd/core.c           |   44 ++++++++++++++-
 drivers/block/nd/dimm.c           |    2 -
 drivers/block/nd/namespace_devs.c |  111 +++++++++++++++++++++++++++++++++++++
 drivers/block/nd/nd-private.h     |   10 +++
 drivers/block/nd/nd.h             |   11 ++++
 drivers/block/nd/region.c         |   93 +++++++++++++++++++++++++++++++
 drivers/block/nd/region_devs.c    |   66 ++++++++++++++++++++++
 include/linux/libnd.h             |    6 +-
 include/linux/nd.h                |   10 +++
 include/uapi/linux/ndctl.h        |   10 +++
 13 files changed, 383 insertions(+), 9 deletions(-)
 create mode 100644 drivers/block/nd/namespace_devs.c
 create mode 100644 drivers/block/nd/region.c

diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index c510c7b4a6c0..aa719ef0418f 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -739,6 +739,7 @@ static struct attribute_group acpi_nfit_region_attribute_group = {
 static const struct attribute_group *acpi_nfit_region_attribute_groups[] = {
 	&nd_region_attribute_group,
 	&nd_mapping_attribute_group,
+	&nd_device_attribute_group,
 	&acpi_nfit_region_attribute_group,
 	NULL,
 };
diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index 43fdf4b206d6..235d9e6be94a 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -5,3 +5,5 @@ libnd-y += bus.o
 libnd-y += dimm_devs.o
 libnd-y += dimm.o
 libnd-y += region_devs.o
+libnd-y += region.o
+libnd-y += namespace_devs.o
diff --git a/drivers/block/nd/bus.c b/drivers/block/nd/bus.c
index 3f5cdbc24973..d2a62a6142f3 100644
--- a/drivers/block/nd/bus.c
+++ b/drivers/block/nd/bus.c
@@ -13,6 +13,7 @@
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 #include <linux/vmalloc.h>
 #include <linux/uaccess.h>
+#include <linux/module.h>
 #include <linux/fcntl.h>
 #include <linux/async.h>
 #include <linux/ndctl.h>
@@ -33,6 +34,12 @@ static int to_nd_device_type(struct device *dev)
 {
 	if (is_nd_dimm(dev))
 		return ND_DEVICE_DIMM;
+	else if (is_nd_pmem(dev))
+		return ND_DEVICE_REGION_PMEM;
+	else if (is_nd_blk(dev))
+		return ND_DEVICE_REGION_BLK;
+	else if (is_nd_pmem(dev->parent) || is_nd_blk(dev->parent))
+		return nd_region_to_namespace_type(to_nd_region(dev->parent));
 
 	return 0;
 }
@@ -50,27 +57,46 @@ static int nd_bus_match(struct device *dev, struct device_driver *drv)
 	return test_bit(to_nd_device_type(dev), &nd_drv->type);
 }
 
+static struct module *to_bus_provider(struct device *dev)
+{
+	/* pin bus providers while regions are enabled */
+	if (is_nd_pmem(dev) || is_nd_blk(dev)) {
+		struct nd_bus *nd_bus = walk_to_nd_bus(dev);
+
+		return nd_bus->module;
+	}
+	return NULL;
+}
+
 static int nd_bus_probe(struct device *dev)
 {
 	struct nd_device_driver *nd_drv = to_nd_device_driver(dev->driver);
+	struct module *provider = to_bus_provider(dev);
 	struct nd_bus *nd_bus = walk_to_nd_bus(dev);
 	int rc;
 
+	if (!try_module_get(provider))
+		return -ENXIO;
+
 	rc = nd_drv->probe(dev);
 	dev_dbg(&nd_bus->dev, "%s.probe(%s) = %d\n", dev->driver->name,
 			dev_name(dev), rc);
+	if (rc != 0)
+		module_put(provider);
 	return rc;
 }
 
 static int nd_bus_remove(struct device *dev)
 {
 	struct nd_device_driver *nd_drv = to_nd_device_driver(dev->driver);
+	struct module *provider = to_bus_provider(dev);
 	struct nd_bus *nd_bus = walk_to_nd_bus(dev);
 	int rc;
 
 	rc = nd_drv->remove(dev);
 	dev_dbg(&nd_bus->dev, "%s.remove(%s) = %d\n", dev->driver->name,
 			dev_name(dev), rc);
+	module_put(provider);
 	return rc;
 }
 
diff --git a/drivers/block/nd/core.c b/drivers/block/nd/core.c
index a3dd3a22ce92..7bf88fb124b7 100644
--- a/drivers/block/nd/core.c
+++ b/drivers/block/nd/core.c
@@ -24,6 +24,36 @@ LIST_HEAD(nd_bus_list);
 DEFINE_MUTEX(nd_bus_list_mutex);
 static DEFINE_IDA(nd_ida);
 
+void nd_bus_lock(struct device *dev)
+{
+	struct nd_bus *nd_bus = walk_to_nd_bus(dev);
+
+	if (!nd_bus)
+		return;
+	mutex_lock(&nd_bus->reconfig_mutex);
+}
+EXPORT_SYMBOL(nd_bus_lock);
+
+void nd_bus_unlock(struct device *dev)
+{
+	struct nd_bus *nd_bus = walk_to_nd_bus(dev);
+
+	if (!nd_bus)
+		return;
+	mutex_unlock(&nd_bus->reconfig_mutex);
+}
+EXPORT_SYMBOL(nd_bus_unlock);
+
+bool is_nd_bus_locked(struct device *dev)
+{
+	struct nd_bus *nd_bus = walk_to_nd_bus(dev);
+
+	if (!nd_bus)
+		return false;
+	return mutex_is_locked(&nd_bus->reconfig_mutex);
+}
+EXPORT_SYMBOL(is_nd_bus_locked);
+
 static void nd_bus_release(struct device *dev)
 {
 	struct nd_bus *nd_bus = container_of(dev, struct nd_bus, dev);
@@ -133,8 +163,8 @@ struct attribute_group nd_bus_attribute_group = {
 };
 EXPORT_SYMBOL_GPL(nd_bus_attribute_group);
 
-struct nd_bus *nd_bus_register(struct device *parent,
-		struct nd_bus_descriptor *nd_desc)
+struct nd_bus *__nd_bus_register(struct device *parent,
+		struct nd_bus_descriptor *nd_desc, struct module *module)
 {
 	struct nd_bus *nd_bus = kzalloc(sizeof(*nd_bus), GFP_KERNEL);
 	int rc;
@@ -143,11 +173,13 @@ struct nd_bus *nd_bus_register(struct device *parent,
 		return NULL;
 	INIT_LIST_HEAD(&nd_bus->list);
 	nd_bus->id = ida_simple_get(&nd_ida, 0, 0, GFP_KERNEL);
+	mutex_init(&nd_bus->reconfig_mutex);
 	if (nd_bus->id < 0) {
 		kfree(nd_bus);
 		return NULL;
 	}
 	nd_bus->nd_desc = nd_desc;
+	nd_bus->module = module;
 	nd_bus->dev.parent = parent;
 	nd_bus->dev.release = nd_bus_release;
 	nd_bus->dev.groups = nd_desc->attr_groups;
@@ -171,7 +203,7 @@ struct nd_bus *nd_bus_register(struct device *parent,
 	put_device(&nd_bus->dev);
 	return NULL;
 }
-EXPORT_SYMBOL_GPL(nd_bus_register);
+EXPORT_SYMBOL_GPL(__nd_bus_register);
 
 static int child_unregister(struct device *dev, void *data)
 {
@@ -215,7 +247,12 @@ static __init int libnd_init(void)
 	rc = nd_dimm_init();
 	if (rc)
 		goto err_dimm;
+	rc = nd_region_init();
+	if (rc)
+		goto err_region;
 	return 0;
+ err_region:
+	nd_dimm_exit();
  err_dimm:
 	nd_bus_exit();
 	return rc;
@@ -224,6 +261,7 @@ static __init int libnd_init(void)
 static __exit void libnd_exit(void)
 {
 	WARN_ON(!list_empty(&nd_bus_list));
+	nd_region_exit();
 	nd_dimm_exit();
 	nd_bus_exit();
 }
diff --git a/drivers/block/nd/dimm.c b/drivers/block/nd/dimm.c
index 1665b7d69e3a..c4df1a32a68b 100644
--- a/drivers/block/nd/dimm.c
+++ b/drivers/block/nd/dimm.c
@@ -84,7 +84,7 @@ int __init nd_dimm_init(void)
 	return nd_driver_register(&nd_dimm_driver);
 }
 
-void __exit nd_dimm_exit(void)
+void nd_dimm_exit(void)
 {
 	driver_unregister(&nd_dimm_driver.drv);
 }
diff --git a/drivers/block/nd/namespace_devs.c b/drivers/block/nd/namespace_devs.c
new file mode 100644
index 000000000000..8fbdf68c64d8
--- /dev/null
+++ b/drivers/block/nd/namespace_devs.c
@@ -0,0 +1,111 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/slab.h>
+#include <linux/nd.h>
+#include "nd.h"
+
+static void namespace_io_release(struct device *dev)
+{
+	struct nd_namespace_io *nsio = to_nd_namespace_io(dev);
+
+	kfree(nsio);
+}
+
+static struct device_type namespace_io_device_type = {
+	.name = "nd_namespace_io",
+	.release = namespace_io_release,
+};
+
+static ssize_t nstype_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_region *nd_region = to_nd_region(dev->parent);
+
+	return sprintf(buf, "%d\n", nd_region_to_namespace_type(nd_region));
+}
+static DEVICE_ATTR_RO(nstype);
+
+static struct attribute *nd_namespace_attributes[] = {
+	&dev_attr_nstype.attr,
+	NULL,
+};
+
+static struct attribute_group nd_namespace_attribute_group = {
+	.attrs = nd_namespace_attributes,
+};
+
+static const struct attribute_group *nd_namespace_attribute_groups[] = {
+	&nd_device_attribute_group,
+	&nd_namespace_attribute_group,
+	NULL,
+};
+
+static struct device **create_namespace_io(struct nd_region *nd_region)
+{
+	struct nd_namespace_io *nsio;
+	struct device *dev, **devs;
+	struct resource *res;
+
+	nsio = kzalloc(sizeof(*nsio), GFP_KERNEL);
+	if (!nsio)
+		return NULL;
+
+	devs = kcalloc(2, sizeof(struct device *), GFP_KERNEL);
+	if (!devs) {
+		kfree(nsio);
+		return NULL;
+	}
+
+	dev = &nsio->dev;
+	dev->type = &namespace_io_device_type;
+	res = &nsio->res;
+	res->name = dev_name(&nd_region->dev);
+	res->flags = IORESOURCE_MEM;
+	res->start = nd_region->ndr_start;
+	res->end = res->start + nd_region->ndr_size - 1;
+
+	devs[0] = dev;
+	return devs;
+}
+
+int nd_region_register_namespaces(struct nd_region *nd_region, int *err)
+{
+	struct device **devs = NULL;
+	int i;
+
+	*err = 0;
+	switch (nd_region_to_namespace_type(nd_region)) {
+	case ND_DEVICE_NAMESPACE_IO:
+		devs = create_namespace_io(nd_region);
+		break;
+	default:
+		break;
+	}
+
+	if (!devs)
+		return -ENODEV;
+
+	for (i = 0; devs[i]; i++) {
+		struct device *dev = devs[i];
+
+		dev_set_name(dev, "namespace%d.%d", nd_region->id, i);
+		dev->parent = &nd_region->dev;
+		dev->groups = nd_namespace_attribute_groups;
+		nd_device_register(dev);
+	}
+	kfree(devs);
+
+	return i;
+}
diff --git a/drivers/block/nd/nd-private.h b/drivers/block/nd/nd-private.h
index 8fee471e8dfc..8ef3a1b50f44 100644
--- a/drivers/block/nd/nd-private.h
+++ b/drivers/block/nd/nd-private.h
@@ -21,9 +21,11 @@ extern int nd_dimm_major;
 
 struct nd_bus {
 	struct nd_bus_descriptor *nd_desc;
+	struct module *module;
 	struct list_head list;
 	struct device dev;
 	int id;
+	struct mutex reconfig_mutex;
 };
 
 struct nd_dimm {
@@ -34,16 +36,20 @@ struct nd_dimm {
 	int id;
 };
 
+bool is_nd_dimm(struct device *dev);
+bool is_nd_blk(struct device *dev);
+bool is_nd_pmem(struct device *dev);
 struct nd_bus *walk_to_nd_bus(struct device *nd_dev);
 int __init nd_bus_init(void);
 void nd_bus_exit(void);
 int __init nd_dimm_init(void);
-void __exit nd_dimm_exit(void);
+int __init nd_region_init(void);
+void nd_dimm_exit(void);
+void nd_region_exit(void);
 int nd_bus_create_ndctl(struct nd_bus *nd_bus);
 void nd_bus_destroy_ndctl(struct nd_bus *nd_bus);
 void nd_synchronize(void);
 int nd_bus_register_dimms(struct nd_bus *nd_bus);
 int nd_bus_register_regions(struct nd_bus *nd_bus);
 int nd_match_dimm(struct device *dev, void *data);
-bool is_nd_dimm(struct device *dev);
 #endif /* __ND_PRIVATE_H__ */
diff --git a/drivers/block/nd/nd.h b/drivers/block/nd/nd.h
index d08871ceb3cf..72f4d7b76059 100644
--- a/drivers/block/nd/nd.h
+++ b/drivers/block/nd/nd.h
@@ -23,6 +23,11 @@ struct nd_dimm_drvdata {
 	void *data;
 };
 
+struct nd_region_namespaces {
+	int count;
+	int active;
+};
+
 struct nd_region {
 	struct device dev;
 	u16 ndr_mappings;
@@ -42,4 +47,10 @@ void nd_device_register(struct device *dev);
 void nd_device_unregister(struct device *dev, enum nd_async_mode mode);
 int nd_dimm_init_nsarea(struct nd_dimm_drvdata *ndd);
 int nd_dimm_init_config_data(struct nd_dimm_drvdata *ndd);
+struct nd_region *to_nd_region(struct device *dev);
+int nd_region_to_namespace_type(struct nd_region *nd_region);
+int nd_region_register_namespaces(struct nd_region *nd_region, int *err);
+void nd_bus_lock(struct device *dev);
+void nd_bus_unlock(struct device *dev);
+bool is_nd_bus_locked(struct device *dev);
 #endif /* __ND_H__ */
diff --git a/drivers/block/nd/region.c b/drivers/block/nd/region.c
new file mode 100644
index 000000000000..7e58b2a700c2
--- /dev/null
+++ b/drivers/block/nd/region.c
@@ -0,0 +1,93 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/nd.h>
+#include "nd.h"
+
+static int nd_region_probe(struct device *dev)
+{
+	int err;
+	struct nd_region_namespaces *num_ns;
+	struct nd_region *nd_region = to_nd_region(dev);
+	int rc = nd_region_register_namespaces(nd_region, &err);
+
+	num_ns = devm_kzalloc(dev, sizeof(*num_ns), GFP_KERNEL);
+	if (!num_ns)
+		return -ENOMEM;
+
+	if (rc < 0)
+		return rc;
+
+	num_ns->active = rc;
+	num_ns->count = rc + err;
+	dev_set_drvdata(dev, num_ns);
+
+	if (err == 0)
+		return 0;
+
+	if (rc == err)
+		return -ENODEV;
+
+	/*
+	 * Given multiple namespaces per region, we do not want to
+	 * disable all the successfully registered peer namespaces upon
+	 * a single registration failure.  If userspace is missing a
+	 * namespace that it expects it can disable/re-enable the region
+	 * to retry discovery after correcting the failure.
+	 * <regionX>/init_namespaces returns the current
+	 * "<async-registered>/<total>" namespace count.
+	 */
+	dev_err(dev, "failed to register %d namespace%s, continuing...\n",
+			err, err == 1 ? "" : "s");
+	return 0;
+}
+
+static int child_unregister(struct device *dev, void *data)
+{
+	nd_device_unregister(dev, ND_SYNC);
+	return 0;
+}
+
+static int nd_region_remove(struct device *dev)
+{
+	/* flush attribute readers and disable */
+	nd_bus_lock(dev);
+	dev_set_drvdata(dev, NULL);
+	nd_bus_unlock(dev);
+
+	device_for_each_child(dev, NULL, child_unregister);
+	return 0;
+}
+
+static struct nd_device_driver nd_region_driver = {
+	.probe = nd_region_probe,
+	.remove = nd_region_remove,
+	.drv = {
+		.name = "nd_region",
+	},
+	.type = ND_DRIVER_REGION_BLK | ND_DRIVER_REGION_PMEM,
+};
+
+int __init nd_region_init(void)
+{
+	return nd_driver_register(&nd_region_driver);
+}
+
+void __exit nd_region_exit(void)
+{
+	driver_unregister(&nd_region_driver.drv);
+}
+
+MODULE_ALIAS_ND_DEVICE(ND_DEVICE_REGION_PMEM);
+MODULE_ALIAS_ND_DEVICE(ND_DEVICE_REGION_BLK);
diff --git a/drivers/block/nd/region_devs.c b/drivers/block/nd/region_devs.c
index 12a5415acfcc..fdc58e333b78 100644
--- a/drivers/block/nd/region_devs.c
+++ b/drivers/block/nd/region_devs.c
@@ -47,11 +47,16 @@ static struct device_type nd_volatile_device_type = {
 	.release = nd_region_release,
 };
 
-static bool is_nd_pmem(struct device *dev)
+bool is_nd_pmem(struct device *dev)
 {
 	return dev ? dev->type == &nd_pmem_device_type : false;
 }
 
+bool is_nd_blk(struct device *dev)
+{
+	return dev ? dev->type == &nd_blk_device_type : false;
+}
+
 struct nd_region *to_nd_region(struct device *dev)
 {
 	struct nd_region *nd_region = container_of(dev, struct nd_region, dev);
@@ -61,6 +66,37 @@ struct nd_region *to_nd_region(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(to_nd_region);
 
+/**
+ * nd_region_to_namespace_type() - region to an integer namespace type
+ * @nd_region: region-device to interrogate
+ *
+ * This is also the 'nstype' attribute of a region, an input to the
+ * MODALIAS for namespace devices, and the bit number an nd_bus uses to
+ * match namespace devices with namespace drivers.
+ */
+int nd_region_to_namespace_type(struct nd_region *nd_region)
+{
+	if (is_nd_pmem(&nd_region->dev)) {
+		u16 i, alias;
+
+		for (i = 0, alias = 0; i < nd_region->ndr_mappings; i++) {
+			struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+			struct nd_dimm *nd_dimm = nd_mapping->nd_dimm;
+
+			if (nd_dimm->flags & NDD_ALIASING)
+				alias++;
+		}
+		if (alias)
+			return ND_DEVICE_NAMESPACE_PMEM;
+		else
+			return ND_DEVICE_NAMESPACE_IO;
+	} else if (is_nd_blk(&nd_region->dev)) {
+		return ND_DEVICE_NAMESPACE_BLK;
+	}
+
+	return 0;
+}
+
 static ssize_t size_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
@@ -88,9 +124,37 @@ static ssize_t mappings_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(mappings);
 
+static ssize_t nstype_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_region *nd_region = to_nd_region(dev);
+
+	return sprintf(buf, "%d\n", nd_region_to_namespace_type(nd_region));
+}
+static DEVICE_ATTR_RO(nstype);
+
+static ssize_t init_namespaces_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_region_namespaces *num_ns = dev_get_drvdata(dev);
+	ssize_t rc;
+
+	nd_bus_lock(dev);
+	if (num_ns)
+		rc = sprintf(buf, "%d/%d\n", num_ns->active, num_ns->count);
+	else
+		rc = -ENXIO;
+	nd_bus_unlock(dev);
+
+	return rc;
+}
+static DEVICE_ATTR_RO(init_namespaces);
+
 static struct attribute *nd_region_attributes[] = {
 	&dev_attr_size.attr,
+	&dev_attr_nstype.attr,
 	&dev_attr_mappings.attr,
+	&dev_attr_init_namespaces.attr,
 	NULL,
 };
 
diff --git a/include/linux/libnd.h b/include/linux/libnd.h
index f45407727216..6747da2c7cb6 100644
--- a/include/linux/libnd.h
+++ b/include/linux/libnd.h
@@ -70,8 +70,10 @@ struct nd_region_desc {
 
 struct nd_bus;
 struct device;
-struct nd_bus *nd_bus_register(struct device *parent,
-		struct nd_bus_descriptor *nfit_desc);
+struct nd_bus *__nd_bus_register(struct device *parent,
+		struct nd_bus_descriptor *nfit_desc, struct module *module);
+#define nd_bus_register(parent, desc) \
+	__nd_bus_register(parent, desc, THIS_MODULE)
 void nd_bus_unregister(struct nd_bus *nd_bus);
 struct nd_bus *to_nd_bus(struct device *dev);
 struct nd_dimm *to_nd_dimm(struct device *dev);
diff --git a/include/linux/nd.h b/include/linux/nd.h
index e074f67e53a3..da70e9962197 100644
--- a/include/linux/nd.h
+++ b/include/linux/nd.h
@@ -26,6 +26,16 @@ static inline struct nd_device_driver *to_nd_device_driver(
 		struct device_driver *drv)
 {
 	return container_of(drv, struct nd_device_driver, drv);
+}
+
+struct nd_namespace_io {
+	struct device dev;
+	struct resource res;
+};
+
+static inline struct nd_namespace_io *to_nd_namespace_io(struct device *dev)
+{
+	return container_of(dev, struct nd_namespace_io, dev);
 }
 
 #define MODULE_ALIAS_ND_DEVICE(type) \
diff --git a/include/uapi/linux/ndctl.h b/include/uapi/linux/ndctl.h
index 1ccd2c633193..5ffa319f3408 100644
--- a/include/uapi/linux/ndctl.h
+++ b/include/uapi/linux/ndctl.h
@@ -177,8 +177,18 @@ static inline const char *nd_dimm_cmd_name(unsigned cmd)
 
 
 #define ND_DEVICE_DIMM 1            /* nd_dimm: container for "config data" */
+#define ND_DEVICE_REGION_PMEM 2     /* nd_region: (parent of pmem namespaces) */
+#define ND_DEVICE_REGION_BLK 3      /* nd_region: (parent of blk namespaces) */
+#define ND_DEVICE_NAMESPACE_IO 4    /* legacy persistent memory */
+#define ND_DEVICE_NAMESPACE_PMEM 5  /* persistent memory namespace (may alias) */
+#define ND_DEVICE_NAMESPACE_BLK 6   /* block-data-window namespace (may alias) */
 
 enum nd_driver_flags {
 	ND_DRIVER_DIMM            = 1 << ND_DEVICE_DIMM,
+	ND_DRIVER_REGION_PMEM     = 1 << ND_DEVICE_REGION_PMEM,
+	ND_DRIVER_REGION_BLK      = 1 << ND_DEVICE_REGION_BLK,
+	ND_DRIVER_NAMESPACE_IO    = 1 << ND_DEVICE_NAMESPACE_IO,
+	ND_DRIVER_NAMESPACE_PMEM  = 1 << ND_DEVICE_NAMESPACE_PMEM,
+	ND_DRIVER_NAMESPACE_BLK   = 1 << ND_DEVICE_NAMESPACE_BLK,
 };
 #endif /* __NDCTL_H__ */
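
For reference, nd_bus_match() pairs a device with a driver by using the
device type number as a bit index into the driver's type mask.  Below
is a minimal userspace sketch of that scheme, assuming a plain
shift-and-mask in place of the kernel's test_bit();
sketch_driver_matches() and SKETCH_REGION_DRIVER_MASK are hypothetical
names, while the ND_DEVICE_* and ND_DRIVER_* values are copied from the
ndctl.h hunk above.

```c
#include <stdbool.h>

/* Device type numbers, copied from include/uapi/linux/ndctl.h above. */
#define ND_DEVICE_DIMM         1
#define ND_DEVICE_REGION_PMEM  2
#define ND_DEVICE_REGION_BLK   3
#define ND_DEVICE_NAMESPACE_IO 4

/* Each driver flag is the bit whose index is the matching device type. */
enum {
	ND_DRIVER_DIMM         = 1 << ND_DEVICE_DIMM,
	ND_DRIVER_REGION_PMEM  = 1 << ND_DEVICE_REGION_PMEM,
	ND_DRIVER_REGION_BLK   = 1 << ND_DEVICE_REGION_BLK,
	ND_DRIVER_NAMESPACE_IO = 1 << ND_DEVICE_NAMESPACE_IO,
};

/* The mask nd_region_driver declares in its .type field. */
#define SKETCH_REGION_DRIVER_MASK \
	((unsigned long)(ND_DRIVER_REGION_BLK | ND_DRIVER_REGION_PMEM))

/* Userspace stand-in for test_bit(device_type, &nd_drv->type). */
static bool sketch_driver_matches(unsigned long driver_mask, int device_type)
{
	return (driver_mask >> device_type) & 1UL;
}
```

A device whose type resolves to 0, the fall-through value of
to_nd_device_type(), never binds under this scheme, since none of the
ND_DRIVER_* flags sets bit 0.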



 
 	return 0;
 }
@@ -50,27 +57,46 @@ static int nd_bus_match(struct device *dev, struct device_driver *drv)
 	return test_bit(to_nd_device_type(dev), &nd_drv->type);
 }
 
+static struct module *to_bus_provider(struct device *dev)
+{
+	/* pin bus providers while regions are enabled */
+	if (is_nd_pmem(dev) || is_nd_blk(dev)) {
+		struct nd_bus *nd_bus = walk_to_nd_bus(dev);
+
+		return nd_bus->module;
+	}
+	return NULL;
+}
+
 static int nd_bus_probe(struct device *dev)
 {
 	struct nd_device_driver *nd_drv = to_nd_device_driver(dev->driver);
+	struct module *provider = to_bus_provider(dev);
 	struct nd_bus *nd_bus = walk_to_nd_bus(dev);
 	int rc;
 
+	if (!try_module_get(provider))
+		return -ENXIO;
+
 	rc = nd_drv->probe(dev);
 	dev_dbg(&nd_bus->dev, "%s.probe(%s) = %d\n", dev->driver->name,
 			dev_name(dev), rc);
+	if (rc != 0)
+		module_put(provider);
 	return rc;
 }
 
 static int nd_bus_remove(struct device *dev)
 {
 	struct nd_device_driver *nd_drv = to_nd_device_driver(dev->driver);
+	struct module *provider = to_bus_provider(dev);
 	struct nd_bus *nd_bus = walk_to_nd_bus(dev);
 	int rc;
 
 	rc = nd_drv->remove(dev);
 	dev_dbg(&nd_bus->dev, "%s.remove(%s) = %d\n", dev->driver->name,
 			dev_name(dev), rc);
+	module_put(provider);
 	return rc;
 }
 
diff --git a/drivers/block/nd/core.c b/drivers/block/nd/core.c
index a3dd3a22ce92..7bf88fb124b7 100644
--- a/drivers/block/nd/core.c
+++ b/drivers/block/nd/core.c
@@ -24,6 +24,36 @@ LIST_HEAD(nd_bus_list);
 DEFINE_MUTEX(nd_bus_list_mutex);
 static DEFINE_IDA(nd_ida);
 
+void nd_bus_lock(struct device *dev)
+{
+	struct nd_bus *nd_bus = walk_to_nd_bus(dev);
+
+	if (!nd_bus)
+		return;
+	mutex_lock(&nd_bus->reconfig_mutex);
+}
+EXPORT_SYMBOL(nd_bus_lock);
+
+void nd_bus_unlock(struct device *dev)
+{
+	struct nd_bus *nd_bus = walk_to_nd_bus(dev);
+
+	if (!nd_bus)
+		return;
+	mutex_unlock(&nd_bus->reconfig_mutex);
+}
+EXPORT_SYMBOL(nd_bus_unlock);
+
+bool is_nd_bus_locked(struct device *dev)
+{
+	struct nd_bus *nd_bus = walk_to_nd_bus(dev);
+
+	if (!nd_bus)
+		return false;
+	return mutex_is_locked(&nd_bus->reconfig_mutex);
+}
+EXPORT_SYMBOL(is_nd_bus_locked);
+
 static void nd_bus_release(struct device *dev)
 {
 	struct nd_bus *nd_bus = container_of(dev, struct nd_bus, dev);
@@ -133,8 +163,8 @@ struct attribute_group nd_bus_attribute_group = {
 };
 EXPORT_SYMBOL_GPL(nd_bus_attribute_group);
 
-struct nd_bus *nd_bus_register(struct device *parent,
-		struct nd_bus_descriptor *nd_desc)
+struct nd_bus *__nd_bus_register(struct device *parent,
+		struct nd_bus_descriptor *nd_desc, struct module *module)
 {
 	struct nd_bus *nd_bus = kzalloc(sizeof(*nd_bus), GFP_KERNEL);
 	int rc;
@@ -143,11 +173,13 @@ struct nd_bus *nd_bus_register(struct device *parent,
 		return NULL;
 	INIT_LIST_HEAD(&nd_bus->list);
 	nd_bus->id = ida_simple_get(&nd_ida, 0, 0, GFP_KERNEL);
+	mutex_init(&nd_bus->reconfig_mutex);
 	if (nd_bus->id < 0) {
 		kfree(nd_bus);
 		return NULL;
 	}
 	nd_bus->nd_desc = nd_desc;
+	nd_bus->module = module;
 	nd_bus->dev.parent = parent;
 	nd_bus->dev.release = nd_bus_release;
 	nd_bus->dev.groups = nd_desc->attr_groups;
@@ -171,7 +203,7 @@ struct nd_bus *nd_bus_register(struct device *parent,
 	put_device(&nd_bus->dev);
 	return NULL;
 }
-EXPORT_SYMBOL_GPL(nd_bus_register);
+EXPORT_SYMBOL_GPL(__nd_bus_register);
 
 static int child_unregister(struct device *dev, void *data)
 {
@@ -215,7 +247,12 @@ static __init int libnd_init(void)
 	rc = nd_dimm_init();
 	if (rc)
 		goto err_dimm;
+	rc = nd_region_init();
+	if (rc)
+		goto err_region;
 	return 0;
+ err_region:
+	nd_dimm_exit();
  err_dimm:
 	nd_bus_exit();
 	return rc;
@@ -224,6 +261,7 @@ static __init int libnd_init(void)
 static __exit void libnd_exit(void)
 {
 	WARN_ON(!list_empty(&nd_bus_list));
+	nd_region_exit();
 	nd_dimm_exit();
 	nd_bus_exit();
 }
diff --git a/drivers/block/nd/dimm.c b/drivers/block/nd/dimm.c
index 1665b7d69e3a..c4df1a32a68b 100644
--- a/drivers/block/nd/dimm.c
+++ b/drivers/block/nd/dimm.c
@@ -84,7 +84,7 @@ int __init nd_dimm_init(void)
 	return nd_driver_register(&nd_dimm_driver);
 }
 
-void __exit nd_dimm_exit(void)
+void nd_dimm_exit(void)
 {
 	driver_unregister(&nd_dimm_driver.drv);
 }
diff --git a/drivers/block/nd/namespace_devs.c b/drivers/block/nd/namespace_devs.c
new file mode 100644
index 000000000000..8fbdf68c64d8
--- /dev/null
+++ b/drivers/block/nd/namespace_devs.c
@@ -0,0 +1,111 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/slab.h>
+#include <linux/nd.h>
+#include "nd.h"
+
+static void namespace_io_release(struct device *dev)
+{
+	struct nd_namespace_io *nsio = to_nd_namespace_io(dev);
+
+	kfree(nsio);
+}
+
+static struct device_type namespace_io_device_type = {
+	.name = "nd_namespace_io",
+	.release = namespace_io_release,
+};
+
+static ssize_t nstype_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_region *nd_region = to_nd_region(dev->parent);
+
+	return sprintf(buf, "%d\n", nd_region_to_namespace_type(nd_region));
+}
+static DEVICE_ATTR_RO(nstype);
+
+static struct attribute *nd_namespace_attributes[] = {
+	&dev_attr_nstype.attr,
+	NULL,
+};
+
+static struct attribute_group nd_namespace_attribute_group = {
+	.attrs = nd_namespace_attributes,
+};
+
+static const struct attribute_group *nd_namespace_attribute_groups[] = {
+	&nd_device_attribute_group,
+	&nd_namespace_attribute_group,
+	NULL,
+};
+
+static struct device **create_namespace_io(struct nd_region *nd_region)
+{
+	struct nd_namespace_io *nsio;
+	struct device *dev, **devs;
+	struct resource *res;
+
+	nsio = kzalloc(sizeof(*nsio), GFP_KERNEL);
+	if (!nsio)
+		return NULL;
+
+	devs = kcalloc(2, sizeof(struct device *), GFP_KERNEL);
+	if (!devs) {
+		kfree(nsio);
+		return NULL;
+	}
+
+	dev = &nsio->dev;
+	dev->type = &namespace_io_device_type;
+	res = &nsio->res;
+	res->name = dev_name(&nd_region->dev);
+	res->flags = IORESOURCE_MEM;
+	res->start = nd_region->ndr_start;
+	res->end = res->start + nd_region->ndr_size - 1;
+
+	devs[0] = dev;
+	return devs;
+}
+
+int nd_region_register_namespaces(struct nd_region *nd_region, int *err)
+{
+	struct device **devs = NULL;
+	int i;
+
+	*err = 0;
+	switch (nd_region_to_namespace_type(nd_region)) {
+	case ND_DEVICE_NAMESPACE_IO:
+		devs = create_namespace_io(nd_region);
+		break;
+	default:
+		break;
+	}
+
+	if (!devs)
+		return -ENODEV;
+
+	for (i = 0; devs[i]; i++) {
+		struct device *dev = devs[i];
+
+		dev_set_name(dev, "namespace%d.%d", nd_region->id, i);
+		dev->parent = &nd_region->dev;
+		dev->groups = nd_namespace_attribute_groups;
+		nd_device_register(dev);
+	}
+	kfree(devs);
+
+	return i;
+}
diff --git a/drivers/block/nd/nd-private.h b/drivers/block/nd/nd-private.h
index 8fee471e8dfc..8ef3a1b50f44 100644
--- a/drivers/block/nd/nd-private.h
+++ b/drivers/block/nd/nd-private.h
@@ -21,9 +21,11 @@ extern int nd_dimm_major;
 
 struct nd_bus {
 	struct nd_bus_descriptor *nd_desc;
+	struct module *module;
 	struct list_head list;
 	struct device dev;
 	int id;
+	struct mutex reconfig_mutex;
 };
 
 struct nd_dimm {
@@ -34,16 +36,20 @@ struct nd_dimm {
 	int id;
 };
 
+bool is_nd_dimm(struct device *dev);
+bool is_nd_blk(struct device *dev);
+bool is_nd_pmem(struct device *dev);
 struct nd_bus *walk_to_nd_bus(struct device *nd_dev);
 int __init nd_bus_init(void);
 void nd_bus_exit(void);
 int __init nd_dimm_init(void);
-void __exit nd_dimm_exit(void);
+int __init nd_region_init(void);
+void nd_dimm_exit(void);
+void nd_region_exit(void);
 int nd_bus_create_ndctl(struct nd_bus *nd_bus);
 void nd_bus_destroy_ndctl(struct nd_bus *nd_bus);
 void nd_synchronize(void);
 int nd_bus_register_dimms(struct nd_bus *nd_bus);
 int nd_bus_register_regions(struct nd_bus *nd_bus);
 int nd_match_dimm(struct device *dev, void *data);
-bool is_nd_dimm(struct device *dev);
 #endif /* __ND_PRIVATE_H__ */
diff --git a/drivers/block/nd/nd.h b/drivers/block/nd/nd.h
index d08871ceb3cf..72f4d7b76059 100644
--- a/drivers/block/nd/nd.h
+++ b/drivers/block/nd/nd.h
@@ -23,6 +23,11 @@ struct nd_dimm_drvdata {
 	void *data;
 };
 
+struct nd_region_namespaces {
+	int count;
+	int active;
+};
+
 struct nd_region {
 	struct device dev;
 	u16 ndr_mappings;
@@ -42,4 +47,10 @@ void nd_device_register(struct device *dev);
 void nd_device_unregister(struct device *dev, enum nd_async_mode mode);
 int nd_dimm_init_nsarea(struct nd_dimm_drvdata *ndd);
 int nd_dimm_init_config_data(struct nd_dimm_drvdata *ndd);
+struct nd_region *to_nd_region(struct device *dev);
+int nd_region_to_namespace_type(struct nd_region *nd_region);
+int nd_region_register_namespaces(struct nd_region *nd_region, int *err);
+void nd_bus_lock(struct device *dev);
+void nd_bus_unlock(struct device *dev);
+bool is_nd_bus_locked(struct device *dev);
 #endif /* __ND_H__ */
diff --git a/drivers/block/nd/region.c b/drivers/block/nd/region.c
new file mode 100644
index 000000000000..7e58b2a700c2
--- /dev/null
+++ b/drivers/block/nd/region.c
@@ -0,0 +1,93 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/nd.h>
+#include "nd.h"
+
+static int nd_region_probe(struct device *dev)
+{
+	int err;
+	struct nd_region_namespaces *num_ns;
+	struct nd_region *nd_region = to_nd_region(dev);
+	int rc = nd_region_register_namespaces(nd_region, &err);
+
+	num_ns = devm_kzalloc(dev, sizeof(*num_ns), GFP_KERNEL);
+	if (!num_ns)
+		return -ENOMEM;
+
+	if (rc < 0)
+		return rc;
+
+	num_ns->active = rc;
+	num_ns->count = rc + err;
+	dev_set_drvdata(dev, num_ns);
+
+	if (err == 0)
+		return 0;
+
+	if (rc == err)
+		return -ENODEV;
+
+	/*
+	 * Given multiple namespaces per region, we do not want to
+	 * disable all the successfully registered peer namespaces upon
+	 * a single registration failure.  If userspace is missing a
+	 * namespace that it expects it can disable/re-enable the region
+	 * to retry discovery after correcting the failure.
+	 * <regionX>/init_namespaces returns the current
+	 * "<async-registered>/<total>" namespace count.
+	 */
+	dev_err(dev, "failed to register %d namespace%s, continuing...\n",
+			err, err == 1 ? "" : "s");
+	return 0;
+}
+
+static int child_unregister(struct device *dev, void *data)
+{
+	nd_device_unregister(dev, ND_SYNC);
+	return 0;
+}
+
+static int nd_region_remove(struct device *dev)
+{
+	/* flush attribute readers and disable */
+	nd_bus_lock(dev);
+	dev_set_drvdata(dev, NULL);
+	nd_bus_unlock(dev);
+
+	device_for_each_child(dev, NULL, child_unregister);
+	return 0;
+}
+
+static struct nd_device_driver nd_region_driver = {
+	.probe = nd_region_probe,
+	.remove = nd_region_remove,
+	.drv = {
+		.name = "nd_region",
+	},
+	.type = ND_DRIVER_REGION_BLK | ND_DRIVER_REGION_PMEM,
+};
+
+int __init nd_region_init(void)
+{
+	return nd_driver_register(&nd_region_driver);
+}
+
+void __exit nd_region_exit(void)
+{
+	driver_unregister(&nd_region_driver.drv);
+}
+
+MODULE_ALIAS_ND_DEVICE(ND_DEVICE_REGION_PMEM);
+MODULE_ALIAS_ND_DEVICE(ND_DEVICE_REGION_BLK);
diff --git a/drivers/block/nd/region_devs.c b/drivers/block/nd/region_devs.c
index 12a5415acfcc..fdc58e333b78 100644
--- a/drivers/block/nd/region_devs.c
+++ b/drivers/block/nd/region_devs.c
@@ -47,11 +47,16 @@ static struct device_type nd_volatile_device_type = {
 	.release = nd_region_release,
 };
 
-static bool is_nd_pmem(struct device *dev)
+bool is_nd_pmem(struct device *dev)
 {
 	return dev ? dev->type == &nd_pmem_device_type : false;
 }
 
+bool is_nd_blk(struct device *dev)
+{
+	return dev ? dev->type == &nd_blk_device_type : false;
+}
+
 struct nd_region *to_nd_region(struct device *dev)
 {
 	struct nd_region *nd_region = container_of(dev, struct nd_region, dev);
@@ -61,6 +66,37 @@ struct nd_region *to_nd_region(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(to_nd_region);
 
+/**
+ * nd_region_to_namespace_type() - region to an integer namespace type
+ * @nd_region: region-device to interrogate
+ *
+ * The result is also the region's 'nstype' attribute, an input to the
+ * MODALIAS for namespace devices, and the bit number an nd_bus uses to
+ * match namespace devices with namespace drivers.
+ */
+int nd_region_to_namespace_type(struct nd_region *nd_region)
+{
+	if (is_nd_pmem(&nd_region->dev)) {
+		u16 i, alias;
+
+		for (i = 0, alias = 0; i < nd_region->ndr_mappings; i++) {
+			struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+			struct nd_dimm *nd_dimm = nd_mapping->nd_dimm;
+
+			if (nd_dimm->flags & NDD_ALIASING)
+				alias++;
+		}
+		if (alias)
+			return ND_DEVICE_NAMESPACE_PMEM;
+		else
+			return ND_DEVICE_NAMESPACE_IO;
+	} else if (is_nd_blk(&nd_region->dev)) {
+		return ND_DEVICE_NAMESPACE_BLK;
+	}
+
+	return 0;
+}
+
 static ssize_t size_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
@@ -88,9 +124,37 @@ static ssize_t mappings_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(mappings);
 
+static ssize_t nstype_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_region *nd_region = to_nd_region(dev);
+
+	return sprintf(buf, "%d\n", nd_region_to_namespace_type(nd_region));
+}
+static DEVICE_ATTR_RO(nstype);
+
+static ssize_t init_namespaces_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_region_namespaces *num_ns = dev_get_drvdata(dev);
+	ssize_t rc;
+
+	nd_bus_lock(dev);
+	if (num_ns)
+		rc = sprintf(buf, "%d/%d\n", num_ns->active, num_ns->count);
+	else
+		rc = -ENXIO;
+	nd_bus_unlock(dev);
+
+	return rc;
+}
+static DEVICE_ATTR_RO(init_namespaces);
+
 static struct attribute *nd_region_attributes[] = {
 	&dev_attr_size.attr,
+	&dev_attr_nstype.attr,
 	&dev_attr_mappings.attr,
+	&dev_attr_init_namespaces.attr,
 	NULL,
 };
 
diff --git a/include/linux/libnd.h b/include/linux/libnd.h
index f45407727216..6747da2c7cb6 100644
--- a/include/linux/libnd.h
+++ b/include/linux/libnd.h
@@ -70,8 +70,10 @@ struct nd_region_desc {
 
 struct nd_bus;
 struct device;
-struct nd_bus *nd_bus_register(struct device *parent,
-		struct nd_bus_descriptor *nfit_desc);
+struct nd_bus *__nd_bus_register(struct device *parent,
+		struct nd_bus_descriptor *nfit_desc, struct module *module);
+#define nd_bus_register(parent, desc) \
+	__nd_bus_register(parent, desc, THIS_MODULE)
 void nd_bus_unregister(struct nd_bus *nd_bus);
 struct nd_bus *to_nd_bus(struct device *dev);
 struct nd_dimm *to_nd_dimm(struct device *dev);
diff --git a/include/linux/nd.h b/include/linux/nd.h
index e074f67e53a3..da70e9962197 100644
--- a/include/linux/nd.h
+++ b/include/linux/nd.h
@@ -26,6 +26,16 @@ static inline struct nd_device_driver *to_nd_device_driver(
 		struct device_driver *drv)
 {
 	return container_of(drv, struct nd_device_driver, drv);
+}
+
+struct nd_namespace_io {
+	struct device dev;
+	struct resource res;
+};
+
+static inline struct nd_namespace_io *to_nd_namespace_io(struct device *dev)
+{
+	return container_of(dev, struct nd_namespace_io, dev);
 }
 
 #define MODULE_ALIAS_ND_DEVICE(type) \
diff --git a/include/uapi/linux/ndctl.h b/include/uapi/linux/ndctl.h
index 1ccd2c633193..5ffa319f3408 100644
--- a/include/uapi/linux/ndctl.h
+++ b/include/uapi/linux/ndctl.h
@@ -177,8 +177,18 @@ static inline const char *nd_dimm_cmd_name(unsigned cmd)
 
 
 #define ND_DEVICE_DIMM 1            /* nd_dimm: container for "config data" */
+#define ND_DEVICE_REGION_PMEM 2     /* nd_region: (parent of pmem namespaces) */
+#define ND_DEVICE_REGION_BLK 3      /* nd_region: (parent of blk namespaces) */
+#define ND_DEVICE_NAMESPACE_IO 4    /* legacy persistent memory */
+#define ND_DEVICE_NAMESPACE_PMEM 5  /* persistent memory namespace (may alias) */
+#define ND_DEVICE_NAMESPACE_BLK 6   /* block-data-window namespace (may alias) */
 
 enum nd_driver_flags {
 	ND_DRIVER_DIMM            = 1 << ND_DEVICE_DIMM,
+	ND_DRIVER_REGION_PMEM     = 1 << ND_DEVICE_REGION_PMEM,
+	ND_DRIVER_REGION_BLK      = 1 << ND_DEVICE_REGION_BLK,
+	ND_DRIVER_NAMESPACE_IO    = 1 << ND_DEVICE_NAMESPACE_IO,
+	ND_DRIVER_NAMESPACE_PMEM  = 1 << ND_DEVICE_NAMESPACE_PMEM,
+	ND_DRIVER_NAMESPACE_BLK   = 1 << ND_DEVICE_NAMESPACE_BLK,
 };
 #endif /* __NDCTL_H__ */



* [PATCH v3 09/21] libnd, nd_pmem: add libnd support to the pmem driver
  2015-05-20 20:56 ` Dan Williams
@ 2015-05-20 20:57   ` Dan Williams
  -1 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2015-05-20 20:57 UTC (permalink / raw)
  To: axboe
  Cc: mingo, Boaz Harrosh, linux-nvdimm, neilb, gregkh, linux-kernel,
	Andy Lutomirski, Jens Axboe, linux-acpi, jmoyer, H. Peter Anvin,
	hch

nd_pmem attaches to the persistent memory namespaces that libnd emits
for its regions and, like the original pmem driver, presents the
system-physical-address range as a block device.
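
For a feel of what "presents the range as a block device" amounts to,
the sketch below derives the disk name, first minor, and capacity from
a namespace range and the parent region id, following the assignments
in pmem_alloc() in this patch.  It is a userspace approximation:
struct sketch_disk and sketch_pmem_disk() are hypothetical, and the
size is assumed to be the inclusive length of the resource.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define PMEM_MINORS 16	/* minors reserved per disk in drivers/block/nd/pmem.c */

/* Hypothetical summary of the gendisk fields the driver fills in. */
struct sketch_disk {
	char name[32];
	int first_minor;
	uint64_t sectors;	/* capacity in 512-byte sectors */
};

/*
 * The region id replaces the old global pmem_index counter, so the
 * disk name and minor range stay stable for a given region.
 */
static struct sketch_disk sketch_pmem_disk(uint64_t start, uint64_t end,
		int region_id)
{
	struct sketch_disk d;

	assert(end >= start && region_id >= 0);
	snprintf(d.name, sizeof(d.name), "pmem%d", region_id);
	d.first_minor = PMEM_MINORS * region_id;
	d.sectors = (end - start + 1) >> 9;	/* assumed: size = end - start + 1 */
	return d;
}
```

Under these assumptions a 1GiB namespace on region 2 comes out as
"pmem2" with first minor 32 and 2097152 sectors.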

The existing e820-type-12 to pmem setup is converted to a full libnd bus
that emits an nd_namespace_io device.

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/kernel/pmem.c    |    2 -
 drivers/block/Kconfig     |   19 ++++-----
 drivers/block/Makefile    |    2 -
 drivers/block/e820_pmem.c |  100 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/block/nd/Kconfig  |   17 ++++++++
 drivers/block/nd/Makefile |    3 +
 drivers/block/nd/pmem.c   |   60 +++++++++++++--------------
 7 files changed, 159 insertions(+), 44 deletions(-)
 create mode 100644 drivers/block/e820_pmem.c
 rename drivers/block/{pmem.c => nd/pmem.c} (85%)

diff --git a/arch/x86/kernel/pmem.c b/arch/x86/kernel/pmem.c
index 3420c874ddc5..279328c42f87 100644
--- a/arch/x86/kernel/pmem.c
+++ b/arch/x86/kernel/pmem.c
@@ -13,7 +13,7 @@ static __init void register_pmem_device(struct resource *res)
 	struct platform_device *pdev;
 	int error;
 
-	pdev = platform_device_alloc("pmem", PLATFORM_DEVID_AUTO);
+	pdev = platform_device_alloc("e820_pmem", PLATFORM_DEVID_AUTO);
 	if (!pdev)
 		return;
 
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index dfe40e5ca9bd..4c2cfb91755f 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -323,6 +323,14 @@ config BLK_DEV_NVME
 
 source "drivers/block/nd/Kconfig"
 
+config E820_PMEM
+	tristate "E820 defined Persistent Memory (legacy)"
+	depends on PHYS_ADDR_T_64BIT
+	depends on X86_PMEM_LEGACY
+	default m if X86_PMEM_LEGACY
+	select ND_DEVICES
+	select LIBND
+
 config BLK_DEV_SKD
 	tristate "STEC S1120 Block Driver"
 	depends on PCI
@@ -406,17 +414,6 @@ config BLK_DEV_RAM_DAX
 	  and will prevent RAM block device backing store memory from being
 	  allocated from highmem (only a problem for highmem systems).
 
-config BLK_DEV_PMEM
-	tristate "Persistent memory block device support"
-	help
-	  Saying Y here will allow you to use a contiguous range of reserved
-	  memory as one or more persistent block devices.
-
-	  To compile this driver as a module, choose M here: the module will be
-	  called 'pmem'.
-
-	  If unsure, say N.
-
 config CDROM_PKTCDVD
 	tristate "Packet writing on CD/DVD media"
 	depends on !UML
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 07a6acecf4d8..4cd5f8a919d8 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -14,7 +14,6 @@ obj-$(CONFIG_PS3_VRAM)		+= ps3vram.o
 obj-$(CONFIG_ATARI_FLOPPY)	+= ataflop.o
 obj-$(CONFIG_AMIGA_Z2RAM)	+= z2ram.o
 obj-$(CONFIG_BLK_DEV_RAM)	+= brd.o
-obj-$(CONFIG_BLK_DEV_PMEM)	+= pmem.o
 obj-$(CONFIG_BLK_DEV_LOOP)	+= loop.o
 obj-$(CONFIG_BLK_CPQ_DA)	+= cpqarray.o
 obj-$(CONFIG_BLK_CPQ_CISS_DA)  += cciss.o
@@ -25,6 +24,7 @@ obj-$(CONFIG_MG_DISK)		+= mg_disk.o
 obj-$(CONFIG_SUNVDC)		+= sunvdc.o
 obj-$(CONFIG_BLK_DEV_NVME)	+= nvme.o
 obj-$(CONFIG_ND_DEVICES)	+= nd/
+obj-$(CONFIG_E820_PMEM)		+= e820_pmem.o
 obj-$(CONFIG_BLK_DEV_SKD)	+= skd.o
 obj-$(CONFIG_BLK_DEV_OSD)	+= osdblk.o
 
diff --git a/drivers/block/e820_pmem.c b/drivers/block/e820_pmem.c
new file mode 100644
index 000000000000..48c33e43f39e
--- /dev/null
+++ b/drivers/block/e820_pmem.c
@@ -0,0 +1,100 @@
+/*
+ * libnd e820 support
+ *
+ * Copyright (c) 2014-2015, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+#include <linux/platform_device.h>
+#include <linux/module.h>
+#include <linux/libnd.h>
+
+static const struct attribute_group *e820_pmem_attribute_groups[] = {
+	&nd_bus_attribute_group,
+	NULL,
+};
+
+static const struct attribute_group *e820_pmem_region_attribute_groups[] = {
+	&nd_region_attribute_group,
+	&nd_device_attribute_group,
+	NULL,
+};
+
+static int e820_pmem_probe(struct platform_device *pdev)
+{
+	struct nd_bus_descriptor *nd_desc;
+	struct nd_region_desc ndr_desc;
+	struct nd_bus *nd_bus;
+	struct resource *res;
+
+	if (WARN_ON(pdev->num_resources > 1))
+		return -ENXIO;
+
+	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+	if (!res)
+		return -ENXIO;
+
+	nd_desc = devm_kzalloc(&pdev->dev, sizeof(*nd_desc), GFP_KERNEL);
+	if (!nd_desc)
+		return -ENOMEM;
+
+	nd_desc->attr_groups = e820_pmem_attribute_groups;
+	nd_desc->provider_name = "e820";
+	nd_bus = nd_bus_register(&pdev->dev, nd_desc);
+	if (!nd_bus)
+		return -ENXIO;
+
+	memset(&ndr_desc, 0, sizeof(ndr_desc));
+	ndr_desc.res = res;
+	ndr_desc.attr_groups = e820_pmem_region_attribute_groups;
+	if (!nd_pmem_region_create(nd_bus, &ndr_desc)) {
+		nd_bus_unregister(nd_bus);
+		return -ENXIO;
+	}
+
+	platform_set_drvdata(pdev, nd_bus);
+
+	return 0;
+}
+
+static int e820_pmem_remove(struct platform_device *pdev)
+{
+	struct nd_bus *nd_bus = platform_get_drvdata(pdev);
+
+	nd_bus_unregister(nd_bus);
+
+	return 0;
+}
+
+static struct platform_driver e820_pmem_driver = {
+	.probe		= e820_pmem_probe,
+	.remove		= e820_pmem_remove,
+	.driver		= {
+		.owner	= THIS_MODULE,
+		.name	= "e820_pmem",
+	},
+};
+
+MODULE_ALIAS("platform:e820_pmem*");
+
+static int __init e820_pmem_init(void)
+{
+	return platform_driver_register(&e820_pmem_driver);
+}
+module_init(e820_pmem_init);
+
+static void e820_pmem_exit(void)
+{
+	platform_driver_unregister(&e820_pmem_driver);
+}
+module_exit(e820_pmem_exit);
+
+MODULE_AUTHOR("Intel Corporation");
+MODULE_LICENSE("GPL v2");
diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
index 9b909c21afa1..03f572f0e3d0 100644
--- a/drivers/block/nd/Kconfig
+++ b/drivers/block/nd/Kconfig
@@ -17,4 +17,21 @@ if ND_DEVICES
 config LIBND
 	tristate
 
+config BLK_DEV_PMEM
+	tristate "PMEM: Persistent memory block device support"
+	depends on LIBND
+	default LIBND
+	help
+	  Memory ranges for PMEM are described by either an NFIT
+	  (NVDIMM Firmware Interface Table, see CONFIG_NFIT_ACPI), a
+	  non-standard OEM-specific E820 memory type (type-12, see
+	  CONFIG_X86_PMEM_LEGACY), or it is manually specified by the
+	  'memmap=nn[KMG]!ss[KMG]' kernel command line (see
+	  Documentation/kernel-parameters.txt).  This driver converts
+	  these persistent memory ranges into block devices that are
+	  capable of DAX (direct-access) file system mappings.  See
+	  Documentation/blockdev/libnd.txt for more details.
+
+	  Say Y if you want to use an NVDIMM described by NFIT.
+
 endif
diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index 235d9e6be94a..6f539f01fa82 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -1,4 +1,7 @@
 obj-$(CONFIG_LIBND) += libnd.o
+obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
+
+nd_pmem-y := pmem.o
 
 libnd-y := core.o
 libnd-y += bus.o
diff --git a/drivers/block/pmem.c b/drivers/block/nd/pmem.c
similarity index 85%
rename from drivers/block/pmem.c
rename to drivers/block/nd/pmem.c
index eabf4a8d0085..529a1444a918 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/nd/pmem.c
@@ -1,7 +1,7 @@
 /*
  * Persistent Memory Driver
  *
- * Copyright (c) 2014, Intel Corporation.
+ * Copyright (c) 2014-2015, Intel Corporation.
  * Copyright (c) 2015, Christoph Hellwig <hch@lst.de>.
  * Copyright (c) 2015, Boaz Harrosh <boaz@plexistor.com>.
  *
@@ -23,6 +23,8 @@
 #include <linux/module.h>
 #include <linux/moduleparam.h>
 #include <linux/slab.h>
+#include <linux/nd.h>
+#include "nd.h"
 
 #define PMEM_MINORS		16
 
@@ -37,7 +39,6 @@ struct pmem_device {
 };
 
 static int pmem_major;
-static atomic_t pmem_index;
 
 static void pmem_do_bvec(struct pmem_device *pmem, struct page *page,
 			unsigned int len, unsigned int off, int rw,
@@ -118,11 +119,11 @@ static const struct block_device_operations pmem_fops = {
 	.direct_access =	pmem_direct_access,
 };
 
-static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res)
+static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res, int id)
 {
 	struct pmem_device *pmem;
 	struct gendisk *disk;
-	int idx, err;
+	int err;
 
 	err = -ENOMEM;
 	pmem = kzalloc(sizeof(*pmem), GFP_KERNEL);
@@ -159,15 +160,13 @@ static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res)
 	if (!disk)
 		goto out_free_queue;
 
-	idx = atomic_inc_return(&pmem_index) - 1;
-
 	disk->major		= pmem_major;
-	disk->first_minor	= PMEM_MINORS * idx;
+	disk->first_minor	= PMEM_MINORS * id;
 	disk->fops		= &pmem_fops;
 	disk->private_data	= pmem;
 	disk->queue		= pmem->pmem_queue;
 	disk->flags		= GENHD_FL_EXT_DEVT;
-	sprintf(disk->disk_name, "pmem%d", idx);
+	sprintf(disk->disk_name, "pmem%d", id);
 	disk->driverfs_dev = dev;
 	set_capacity(disk, pmem->size >> 9);
 	pmem->pmem_disk = disk;
@@ -198,42 +197,38 @@ static void pmem_free(struct pmem_device *pmem)
 	kfree(pmem);
 }
 
-static int pmem_probe(struct platform_device *pdev)
+static int nd_pmem_probe(struct device *dev)
 {
+	struct nd_region *nd_region = to_nd_region(dev->parent);
+	struct nd_namespace_io *nsio = to_nd_namespace_io(dev);
 	struct pmem_device *pmem;
-	struct resource *res;
-
-	if (WARN_ON(pdev->num_resources > 1))
-		return -ENXIO;
-
-	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
-	if (!res)
-		return -ENXIO;
 
-	pmem = pmem_alloc(&pdev->dev, res);
+	pmem = pmem_alloc(dev, &nsio->res, nd_region->id);
 	if (IS_ERR(pmem))
 		return PTR_ERR(pmem);
 
-	platform_set_drvdata(pdev, pmem);
+	dev_set_drvdata(dev, pmem);
 
 	return 0;
 }
 
-static int pmem_remove(struct platform_device *pdev)
+static int nd_pmem_remove(struct device *dev)
 {
-	struct pmem_device *pmem = platform_get_drvdata(pdev);
+	struct pmem_device *pmem = dev_get_drvdata(dev);
 
 	pmem_free(pmem);
 	return 0;
 }
 
-static struct platform_driver pmem_driver = {
-	.probe		= pmem_probe,
-	.remove		= pmem_remove,
-	.driver		= {
-		.owner	= THIS_MODULE,
-		.name	= "pmem",
+MODULE_ALIAS("pmem");
+MODULE_ALIAS_ND_DEVICE(ND_DEVICE_NAMESPACE_IO);
+static struct nd_device_driver nd_pmem_driver = {
+	.probe = nd_pmem_probe,
+	.remove = nd_pmem_remove,
+	.drv = {
+		.name = "pmem",
 	},
+	.type = ND_DRIVER_NAMESPACE_IO,
 };
 
 static int __init pmem_init(void)
@@ -244,16 +239,19 @@ static int __init pmem_init(void)
 	if (pmem_major < 0)
 		return pmem_major;
 
-	error = platform_driver_register(&pmem_driver);
-	if (error)
+	error = nd_driver_register(&nd_pmem_driver);
+	if (error) {
 		unregister_blkdev(pmem_major, "pmem");
-	return error;
+		return error;
+	}
+
+	return 0;
 }
 module_init(pmem_init);
 
 static void pmem_exit(void)
 {
-	platform_driver_unregister(&pmem_driver);
+	driver_unregister(&nd_pmem_driver.drv);
 	unregister_blkdev(pmem_major, "pmem");
 }
 module_exit(pmem_exit);



 obj-$(CONFIG_AMIGA_Z2RAM)	+= z2ram.o
 obj-$(CONFIG_BLK_DEV_RAM)	+= brd.o
-obj-$(CONFIG_BLK_DEV_PMEM)	+= pmem.o
 obj-$(CONFIG_BLK_DEV_LOOP)	+= loop.o
 obj-$(CONFIG_BLK_CPQ_DA)	+= cpqarray.o
 obj-$(CONFIG_BLK_CPQ_CISS_DA)  += cciss.o
@@ -25,6 +24,7 @@ obj-$(CONFIG_MG_DISK)		+= mg_disk.o
 obj-$(CONFIG_SUNVDC)		+= sunvdc.o
 obj-$(CONFIG_BLK_DEV_NVME)	+= nvme.o
 obj-$(CONFIG_ND_DEVICES)	+= nd/
+obj-$(CONFIG_E820_PMEM)		+= e820_pmem.o
 obj-$(CONFIG_BLK_DEV_SKD)	+= skd.o
 obj-$(CONFIG_BLK_DEV_OSD)	+= osdblk.o
 
diff --git a/drivers/block/e820_pmem.c b/drivers/block/e820_pmem.c
new file mode 100644
index 000000000000..48c33e43f39e
--- /dev/null
+++ b/drivers/block/e820_pmem.c
@@ -0,0 +1,100 @@
+/*
+ * libnd e820 support
+ *
+ * Copyright (c) 2014-2015, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+#include <linux/platform_device.h>
+#include <linux/module.h>
+#include <linux/libnd.h>
+
+static const struct attribute_group *e820_pmem_attribute_groups[] = {
+	&nd_bus_attribute_group,
+	NULL,
+};
+
+static const struct attribute_group *e820_pmem_region_attribute_groups[] = {
+	&nd_region_attribute_group,
+	&nd_device_attribute_group,
+	NULL,
+};
+
+static int e820_pmem_probe(struct platform_device *pdev)
+{
+	struct nd_bus_descriptor *nd_desc;
+	struct nd_region_desc ndr_desc;
+	struct nd_bus *nd_bus;
+	struct resource *res;
+
+	if (WARN_ON(pdev->num_resources > 1))
+		return -ENXIO;
+
+	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+	if (!res)
+		return -ENXIO;
+
+	nd_desc = devm_kzalloc(&pdev->dev, sizeof(*nd_desc), GFP_KERNEL);
+	if (!nd_desc)
+		return -ENOMEM;
+
+	nd_desc->attr_groups = e820_pmem_attribute_groups;
+	nd_desc->provider_name = "e820";
+	nd_bus = nd_bus_register(&pdev->dev, nd_desc);
+	if (!nd_bus)
+		return -ENXIO;
+
+	memset(&ndr_desc, 0, sizeof(ndr_desc));
+	ndr_desc.res = res;
+	ndr_desc.attr_groups = e820_pmem_region_attribute_groups;
+	if (!nd_pmem_region_create(nd_bus, &ndr_desc)) {
+		nd_bus_unregister(nd_bus);
+		return -ENXIO;
+	}
+
+	platform_set_drvdata(pdev, nd_bus);
+
+	return 0;
+}
+
+static int e820_pmem_remove(struct platform_device *pdev)
+{
+	struct nd_bus *nd_bus = platform_get_drvdata(pdev);
+
+	nd_bus_unregister(nd_bus);
+
+	return 0;
+}
+
+static struct platform_driver e820_pmem_driver = {
+	.probe		= e820_pmem_probe,
+	.remove		= e820_pmem_remove,
+	.driver		= {
+		.owner	= THIS_MODULE,
+		.name	= "e820_pmem",
+	},
+};
+
+MODULE_ALIAS("platform:e820_pmem*");
+
+static int __init e820_pmem_init(void)
+{
+	return platform_driver_register(&e820_pmem_driver);
+}
+module_init(e820_pmem_init);
+
+static void e820_pmem_exit(void)
+{
+	platform_driver_unregister(&e820_pmem_driver);
+}
+module_exit(e820_pmem_exit);
+
+MODULE_AUTHOR("Intel Corporation");
+MODULE_LICENSE("GPL v2");
diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
index 9b909c21afa1..03f572f0e3d0 100644
--- a/drivers/block/nd/Kconfig
+++ b/drivers/block/nd/Kconfig
@@ -17,4 +17,21 @@ if ND_DEVICES
 config LIBND
 	tristate
 
+config BLK_DEV_PMEM
+	tristate "PMEM: Persistent memory block device support"
+	depends on LIBND
+	default LIBND
+	help
+	  Memory ranges for PMEM are described by either an NFIT
+	  (NVDIMM Firmware Interface Table, see CONFIG_NFIT_ACPI), a
+	  non-standard OEM-specific E820 memory type (type-12, see
+	  CONFIG_X86_PMEM_LEGACY), or it is manually specified by the
+	  'memmap=nn[KMG]!ss[KMG]' kernel command line (see
+	  Documentation/kernel-parameters.txt).  This driver converts
+	  these persistent memory ranges into block devices that are
+	  capable of DAX (direct-access) file system mappings.  See
+	  Documentation/blockdev/libnd.txt for more details.
+
+	  Say Y if you want to use a NVDIMM described by NFIT
+
 endif
diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index 235d9e6be94a..6f539f01fa82 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -1,4 +1,7 @@
 obj-$(CONFIG_LIBND) += libnd.o
+obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
+
+nd_pmem-y := pmem.o
 
 libnd-y := core.o
 libnd-y += bus.o
diff --git a/drivers/block/pmem.c b/drivers/block/nd/pmem.c
similarity index 85%
rename from drivers/block/pmem.c
rename to drivers/block/nd/pmem.c
index eabf4a8d0085..529a1444a918 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/nd/pmem.c
@@ -1,7 +1,7 @@
 /*
  * Persistent Memory Driver
  *
- * Copyright (c) 2014, Intel Corporation.
+ * Copyright (c) 2014-2015, Intel Corporation.
  * Copyright (c) 2015, Christoph Hellwig <hch@lst.de>.
  * Copyright (c) 2015, Boaz Harrosh <boaz@plexistor.com>.
  *
@@ -23,6 +23,8 @@
 #include <linux/module.h>
 #include <linux/moduleparam.h>
 #include <linux/slab.h>
+#include <linux/nd.h>
+#include "nd.h"
 
 #define PMEM_MINORS		16
 
@@ -37,7 +39,6 @@ struct pmem_device {
 };
 
 static int pmem_major;
-static atomic_t pmem_index;
 
 static void pmem_do_bvec(struct pmem_device *pmem, struct page *page,
 			unsigned int len, unsigned int off, int rw,
@@ -118,11 +119,11 @@ static const struct block_device_operations pmem_fops = {
 	.direct_access =	pmem_direct_access,
 };
 
-static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res)
+static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res, int id)
 {
 	struct pmem_device *pmem;
 	struct gendisk *disk;
-	int idx, err;
+	int err;
 
 	err = -ENOMEM;
 	pmem = kzalloc(sizeof(*pmem), GFP_KERNEL);
@@ -159,15 +160,13 @@ static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res)
 	if (!disk)
 		goto out_free_queue;
 
-	idx = atomic_inc_return(&pmem_index) - 1;
-
 	disk->major		= pmem_major;
-	disk->first_minor	= PMEM_MINORS * idx;
+	disk->first_minor	= PMEM_MINORS * id;
 	disk->fops		= &pmem_fops;
 	disk->private_data	= pmem;
 	disk->queue		= pmem->pmem_queue;
 	disk->flags		= GENHD_FL_EXT_DEVT;
-	sprintf(disk->disk_name, "pmem%d", idx);
+	sprintf(disk->disk_name, "pmem%d", id);
 	disk->driverfs_dev = dev;
 	set_capacity(disk, pmem->size >> 9);
 	pmem->pmem_disk = disk;
@@ -198,42 +197,38 @@ static void pmem_free(struct pmem_device *pmem)
 	kfree(pmem);
 }
 
-static int pmem_probe(struct platform_device *pdev)
+static int nd_pmem_probe(struct device *dev)
 {
+	struct nd_region *nd_region = to_nd_region(dev->parent);
+	struct nd_namespace_io *nsio = to_nd_namespace_io(dev);
 	struct pmem_device *pmem;
-	struct resource *res;
-
-	if (WARN_ON(pdev->num_resources > 1))
-		return -ENXIO;
-
-	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
-	if (!res)
-		return -ENXIO;
 
-	pmem = pmem_alloc(&pdev->dev, res);
+	pmem = pmem_alloc(dev, &nsio->res, nd_region->id);
 	if (IS_ERR(pmem))
 		return PTR_ERR(pmem);
 
-	platform_set_drvdata(pdev, pmem);
+	dev_set_drvdata(dev, pmem);
 
 	return 0;
 }
 
-static int pmem_remove(struct platform_device *pdev)
+static int nd_pmem_remove(struct device *dev)
 {
-	struct pmem_device *pmem = platform_get_drvdata(pdev);
+	struct pmem_device *pmem = dev_get_drvdata(dev);
 
 	pmem_free(pmem);
 	return 0;
 }
 
-static struct platform_driver pmem_driver = {
-	.probe		= pmem_probe,
-	.remove		= pmem_remove,
-	.driver		= {
-		.owner	= THIS_MODULE,
-		.name	= "pmem",
+MODULE_ALIAS("pmem");
+MODULE_ALIAS_ND_DEVICE(ND_DEVICE_NAMESPACE_IO);
+static struct nd_device_driver nd_pmem_driver = {
+	.probe = nd_pmem_probe,
+	.remove = nd_pmem_remove,
+	.drv = {
+		.name = "pmem",
 	},
+	.type = ND_DRIVER_NAMESPACE_IO,
 };
 
 static int __init pmem_init(void)
@@ -244,16 +239,19 @@ static int __init pmem_init(void)
 	if (pmem_major < 0)
 		return pmem_major;
 
-	error = platform_driver_register(&pmem_driver);
-	if (error)
+	error = nd_driver_register(&nd_pmem_driver);
+	if (error) {
 		unregister_blkdev(pmem_major, "pmem");
-	return error;
+		return error;
+	}
+
+	return 0;
 }
 module_init(pmem_init);
 
 static void pmem_exit(void)
 {
-	platform_driver_unregister(&pmem_driver);
+	driver_unregister(&nd_pmem_driver.drv);
 	unregister_blkdev(pmem_major, "pmem");
 }
 module_exit(pmem_exit);


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [PATCH v3 10/21] pmem: Dynamically allocate partition numbers
  2015-05-20 20:56 ` Dan Williams
@ 2015-05-20 20:57   ` Dan Williams
  -1 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2015-05-20 20:57 UTC (permalink / raw)
  To: axboe
  Cc: linux-nvdimm, neilb, gregkh, linux-kernel, mingo, linux-acpi,
	jmoyer, Ross Zwisler, hch

From: Ross Zwisler <ross.zwisler@linux.intel.com>

Dynamically allocate minor numbers for partitions instead of statically
preallocating them.

This gives us a simpler minors scheme and keeps the major number
consistent when moving past partition 15.  Here's what happens with the
current code, where partitions 16 and up land on a different major:

 pmem0      249:0    0 63.5G  0 rom
 ├─pmem0p1  249:1    0    1G  0 part
 ├─pmem0p2  249:2    0    1G  0 part
 ├─pmem0p3  249:3    0    1G  0 part
 ├─pmem0p4  249:4    0    1G  0 part
 ├─pmem0p5  249:5    0    1G  0 part
 ├─pmem0p6  249:6    0    1G  0 part
 ├─pmem0p7  249:7    0    1G  0 part
 ├─pmem0p8  249:8    0    1G  0 part
 ├─pmem0p9  249:9    0    1G  0 part
 ├─pmem0p10 249:10   0    1G  0 part
 ├─pmem0p11 249:11   0    1G  0 part
 ├─pmem0p12 249:12   0    1G  0 part
 ├─pmem0p13 249:13   0    1G  0 part
 ├─pmem0p14 249:14   0    1G  0 part
 ├─pmem0p15 249:15   0    1G  0 part
 ├─pmem0p16 259:0    0    1G  0 part
 ├─pmem0p17 259:1    0    1G  0 part
 └─pmem0p18 259:2    0    1G  0 part
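
For illustration only, the split in the listing above can be modeled in
a few lines of userspace C.  This is a sketch under assumptions, not the
block core's code: part_devt(), struct devt_example and the two major
constants are hypothetical names, with 249 and 259 taken from the
listing.  It assumes a partition whose number is below the disk's
preallocated minor count gets major:first_minor+partno, and that
anything else is handed the next free minor on the extended major.

```c
/* Majors as they appear in the lsblk listing; 249 is dynamically assigned. */
#define PMEM_MAJOR_EXAMPLE	249
#define EXT_MAJOR_EXAMPLE	259

struct devt_example {
	int major;
	int minor;
};

/*
 * Hypothetical model of device-number selection for one partition.
 * disk_minors is the value passed to alloc_disk(); partno 0 is the
 * whole disk.  *next_ext_minor is a simple counter standing in for the
 * extended-devt allocator.
 */
static struct devt_example part_devt(int disk_minors, int first_minor,
		int partno, int *next_ext_minor)
{
	struct devt_example d;

	if (partno < disk_minors) {
		/* statically preallocated range on the driver's major */
		d.major = PMEM_MAJOR_EXAMPLE;
		d.minor = first_minor + partno;
	} else {
		/* dynamically allocated from the extended major */
		d.major = EXT_MAJOR_EXAMPLE;
		d.minor = (*next_ext_minor)++;
	}
	return d;
}
```

With disk_minors = 16 this model reproduces the jump from 249:15 to
259:0 at partition 16.  With disk_minors = 0, as in this patch, every
device including the whole disk takes the dynamic branch, so the major
stays the same across all partitions.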

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/block/nd/pmem.c |    6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/block/nd/pmem.c b/drivers/block/nd/pmem.c
index 529a1444a918..fc34677d0f48 100644
--- a/drivers/block/nd/pmem.c
+++ b/drivers/block/nd/pmem.c
@@ -26,8 +26,6 @@
 #include <linux/nd.h>
 #include "nd.h"
 
-#define PMEM_MINORS		16
-
 struct pmem_device {
 	struct request_queue	*pmem_queue;
 	struct gendisk		*pmem_disk;
@@ -156,12 +154,12 @@ static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res,
 	blk_queue_max_hw_sectors(pmem->pmem_queue, 1024);
 	blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY);
 
-	disk = alloc_disk(PMEM_MINORS);
+	disk = alloc_disk(0);
 	if (!disk)
 		goto out_free_queue;
 
 	disk->major		= pmem_major;
-	disk->first_minor	= PMEM_MINORS * id;
+	disk->first_minor	= 0;
 	disk->fops		= &pmem_fops;
 	disk->private_data	= pmem;
 	disk->queue		= pmem->pmem_queue;


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [PATCH v3 11/21] libnd, nfit: add interleave-set state-tracking infrastructure
  2015-05-20 20:56 ` Dan Williams
@ 2015-05-20 20:57   ` Dan Williams
  -1 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2015-05-20 20:57 UTC (permalink / raw)
  To: axboe
  Cc: mingo, linux-nvdimm, neilb, gregkh, Rafael J. Wysocki,
	Robert Moore, linux-kernel, linux-acpi, jmoyer, hch

On platforms that have firmware support for reading/writing per-dimm
label space, a portion of the dimm may be accessible via an interleave
set PMEM mapping in addition to the dimm's BLK (block-data-window
aperture(s)) interface.  A label, stored in a "configuration data
region" on the dimm, disambiguates which dimm addresses are accessed
through which exclusive interface.

Add infrastructure that allows the kernel to block modifications to a
label in the set while any member dimm is active.  Note that this is
meant only for enforcing "no modifications of active labels" via the
coarse ioctl command.  Adding/deleting namespaces from an active
interleave set is always possible via sysfs.
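
As a rough userspace model of that gating (a sketch only: struct
dimm_example, region_notify() and clear_to_send() are hypothetical
stand-ins, and the bus lock and probe-idle wait are left out), each dimm
carries a busy count that is raised when a region mapping it probes
successfully and dropped when that region is removed.  A label write is
refused while the count is non-zero.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-dimm state tracked by the bus. */
struct dimm_example {
	int busy;
};

/*
 * After a region driver's probe or remove returns rc, adjust the busy
 * count of every dimm in the region.  A failed probe/remove (rc != 0)
 * leaves the counts untouched.
 */
static void region_notify(struct dimm_example **dimms, int n, int rc,
		bool probe)
{
	int i;

	if (rc)
		return;
	for (i = 0; i < n; i++)
		dimms[i]->busy += probe ? 1 : -1;
}

/* Only the "set config data" command is gated on the dimm being idle. */
static int clear_to_send(const struct dimm_example *dimm, bool is_set_config)
{
	if (!dimm || !is_set_config)
		return 0;
	return dimm->busy ? -EBUSY : 0;
}
```

In this model a dimm that belongs to two active regions stays busy until
both have been removed, which is why a counter is used rather than a
flag.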

Another aspect of tracking interleave sets is tracking their integrity
when DIMMs in a set are physically re-ordered.  For this purpose we
generate an "interleave-set cookie" that can be recorded in a label and
validated against the current configuration.  It is the bus provider
implementation's responsibility to calculate the interleave set cookie
and attach it to a given region.
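
A minimal userspace sketch of one way a provider might derive such a
cookie, assuming each member dimm is described by its region offset and
serial number: sort the entries by region offset so discovery order does
not matter, then run a Fletcher-style checksum over the result.  The
names with an _example suffix are hypothetical, and the comparison here
is numeric, which is an assumption of this sketch; any consistent total
order serves the same purpose.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-dimm record: where the dimm sits in the set, and who it is. */
struct set_map_example {
	uint64_t region_offset;
	uint32_t serial_number;
	uint32_t pad;
};

/* Fletcher-style checksum over 32-bit words, native byte order. */
static uint64_t fletcher64_example(const void *addr, size_t len)
{
	const unsigned char *p = addr;
	uint32_t word, lo32 = 0;
	uint64_t hi32 = 0;
	size_t i;

	for (i = 0; i + sizeof(word) <= len; i += sizeof(word)) {
		memcpy(&word, p + i, sizeof(word));
		lo32 += word;
		hi32 += lo32;
	}
	return hi32 << 32 | lo32;
}

static int cmp_map_example(const void *a, const void *b)
{
	const struct set_map_example *m0 = a, *m1 = b;

	if (m0->region_offset < m1->region_offset)
		return -1;
	return m0->region_offset > m1->region_offset;
}

/* Sorts maps in place, then checksums the sorted array. */
static uint64_t set_cookie_example(struct set_map_example *maps, size_t n)
{
	qsort(maps, n, sizeof(*maps), cmp_map_example);
	return fletcher64_example(maps, n * sizeof(*maps));
}
```

Because the entries are sorted before hashing, listing the same dimms in
a different order yields the same cookie, while moving a dimm to a
different offset in the set changes it.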

Cc: Neil Brown <neilb@suse.de>
Cc: <linux-acpi@vger.kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/acpi/nfit.c            |   91 ++++++++++++++++++++++++++++++++++++++++
 drivers/block/nd/bus.c         |   41 ++++++++++++++++++
 drivers/block/nd/core.c        |   17 +++++++
 drivers/block/nd/dimm_devs.c   |   19 ++++++++
 drivers/block/nd/nd-private.h  |   10 ++++
 drivers/block/nd/nd.h          |    1 
 drivers/block/nd/region_devs.c |   85 +++++++++++++++++++++++++++++++++++++
 include/linux/libnd.h          |    6 +++
 8 files changed, 266 insertions(+), 4 deletions(-)

diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index aa719ef0418f..7c4d47492372 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -16,6 +16,7 @@
 #include <linux/ndctl.h>
 #include <linux/list.h>
 #include <linux/acpi.h>
+#include <linux/sort.h>
 #include "nfit.h"
 
 static bool force_enable_dimms;
@@ -744,6 +745,91 @@ static const struct attribute_group *acpi_nfit_region_attribute_groups[] = {
 	NULL,
 };
 
+/* enough info to uniquely specify an interleave set */
+struct nfit_set_info {
+	struct nfit_set_info_map {
+		u64 region_offset;
+		u32 serial_number;
+		u32 pad;
+	} mapping[0];
+};
+
+static size_t sizeof_nfit_set_info(int num_mappings)
+{
+	return sizeof(struct nfit_set_info)
+		+ num_mappings * sizeof(struct nfit_set_info_map);
+}
+
+static int cmp_map(const void *m0, const void *m1)
+{
+	const struct nfit_set_info_map *map0 = m0;
+	const struct nfit_set_info_map *map1 = m1;
+
+	return memcmp(&map0->region_offset, &map1->region_offset,
+			sizeof(u64));
+}
+
+/* Retrieve the nth entry referencing this spa */
+static struct acpi_nfit_memory_map *memdev_from_spa(
+		struct acpi_nfit_desc *acpi_desc, u16 range_index, int n)
+{
+	struct nfit_memdev *nfit_memdev;
+
+	list_for_each_entry(nfit_memdev, &acpi_desc->memdevs, list)
+		if (nfit_memdev->memdev->range_index == range_index)
+			if (n-- == 0)
+				return nfit_memdev->memdev;
+	return NULL;
+}
+
+static int acpi_nfit_init_interleave_set(struct acpi_nfit_desc *acpi_desc,
+		struct nd_region_desc *ndr_desc,
+		struct acpi_nfit_system_address *spa)
+{
+	u16 num_mappings = ndr_desc->num_mappings;
+	int i, spa_type = nfit_spa_type(spa);
+	struct device *dev = acpi_desc->dev;
+	struct nd_interleave_set *nd_set;
+	struct nfit_set_info *info;
+
+	if (spa_type == NFIT_SPA_PM || spa_type == NFIT_SPA_VOLATILE)
+		/* pass */;
+	else
+		return 0;
+
+	nd_set = devm_kzalloc(dev, sizeof(*nd_set), GFP_KERNEL);
+	if (!nd_set)
+		return -ENOMEM;
+
+	info = devm_kzalloc(dev, sizeof_nfit_set_info(num_mappings), GFP_KERNEL);
+	if (!info)
+		return -ENOMEM;
+	for (i = 0; i < num_mappings; i++) {
+		struct nd_mapping *nd_mapping = &ndr_desc->nd_mapping[i];
+		struct nfit_set_info_map *map = &info->mapping[i];
+		struct nd_dimm *nd_dimm = nd_mapping->nd_dimm;
+		struct nfit_mem *nfit_mem = nd_dimm_provider_data(nd_dimm);
+		struct acpi_nfit_memory_map *memdev = memdev_from_spa(acpi_desc,
+				spa->range_index, i);
+
+		if (!memdev || !nfit_mem->dcr) {
+			dev_err(dev, "%s: failed to find DCR\n", __func__);
+			return -ENODEV;
+		}
+
+		map->region_offset = memdev->region_offset;
+		map->serial_number = nfit_mem->dcr->serial_number;
+	}
+
+	sort(&info->mapping[0], num_mappings, sizeof(struct nfit_set_info_map),
+			cmp_map, NULL);
+	nd_set->cookie = nd_fletcher64(info, sizeof_nfit_set_info(num_mappings), 0);
+	ndr_desc->nd_set = nd_set;
+	devm_kfree(dev, info);
+
+	return 0;
+}
+
 static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
 		struct nfit_spa *nfit_spa)
 {
@@ -751,7 +837,7 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
 	struct acpi_nfit_system_address *spa = nfit_spa->spa;
 	struct nfit_memdev *nfit_memdev;
 	struct nd_region_desc ndr_desc;
-	int spa_type, count = 0;
+	int spa_type, count = 0, rc;
 	struct resource res;
 	u16 range_index;
 
@@ -817,6 +903,9 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
 
 	ndr_desc.nd_mapping = nd_mappings;
 	ndr_desc.num_mappings = count;
+	rc = acpi_nfit_init_interleave_set(acpi_desc, &ndr_desc, spa);
+	if (rc)
+		return rc;
 	if (spa_type == NFIT_SPA_PM) {
 		if (!nd_pmem_region_create(acpi_desc->nd_bus, &ndr_desc))
 			return -ENOMEM;
diff --git a/drivers/block/nd/bus.c b/drivers/block/nd/bus.c
index d2a62a6142f3..63b5182cf766 100644
--- a/drivers/block/nd/bus.c
+++ b/drivers/block/nd/bus.c
@@ -78,7 +78,10 @@ static int nd_bus_probe(struct device *dev)
 	if (!try_module_get(provider))
 		return -ENXIO;
 
+	nd_region_probe_start(nd_bus, dev);
 	rc = nd_drv->probe(dev);
+	nd_region_probe_end(nd_bus, dev, rc);
+
 	dev_dbg(&nd_bus->dev, "%s.probe(%s) = %d\n", dev->driver->name,
 			dev_name(dev), rc);
 	if (rc != 0)
@@ -94,6 +97,8 @@ static int nd_bus_remove(struct device *dev)
 	int rc;
 
 	rc = nd_drv->remove(dev);
+	nd_region_notify_remove(nd_bus, dev, rc);
+
 	dev_dbg(&nd_bus->dev, "%s.remove(%s) = %d\n", dev->driver->name,
 			dev_name(dev), rc);
 	module_put(provider);
@@ -359,6 +364,33 @@ u32 nd_cmd_out_size(struct nd_dimm *nd_dimm, int cmd,
 }
 EXPORT_SYMBOL_GPL(nd_cmd_out_size);
 
+static void wait_nd_bus_probe_idle(struct nd_bus *nd_bus)
+{
+	do {
+		if (nd_bus->probe_active == 0)
+			break;
+		nd_bus_unlock(&nd_bus->dev);
+		wait_event(nd_bus->probe_wait, nd_bus->probe_active == 0);
+		nd_bus_lock(&nd_bus->dev);
+	} while (true);
+}
+
+/* set_config requires an idle interleave set */
+static int nd_cmd_clear_to_send(struct nd_dimm *nd_dimm, unsigned int cmd)
+{
+	struct nd_bus *nd_bus;
+
+	if (!nd_dimm || cmd != ND_CMD_SET_CONFIG_DATA)
+		return 0;
+
+	nd_bus = walk_to_nd_bus(&nd_dimm->dev);
+	wait_nd_bus_probe_idle(nd_bus);
+
+	if (atomic_read(&nd_dimm->busy))
+		return -EBUSY;
+	return 0;
+}
+
 static int __nd_ioctl(struct nd_bus *nd_bus, struct nd_dimm *nd_dimm,
 		int read_only, unsigned int ioctl_cmd, unsigned long arg)
 {
@@ -469,11 +501,18 @@ static int __nd_ioctl(struct nd_bus *nd_bus, struct nd_dimm *nd_dimm,
 		goto out;
 	}
 
+	nd_bus_lock(&nd_bus->dev);
+	rc = nd_cmd_clear_to_send(nd_dimm, cmd);
+	if (rc)
+		goto out_unlock;
+
 	rc = nd_desc->ndctl(nd_desc, nd_dimm, cmd, buf, buf_len);
 	if (rc < 0)
-		goto out;
+		goto out_unlock;
 	if (copy_to_user(p, buf, buf_len))
 		rc = -EFAULT;
+ out_unlock:
+	nd_bus_unlock(&nd_bus->dev);
  out:
 	vfree(buf);
 	return rc;
diff --git a/drivers/block/nd/core.c b/drivers/block/nd/core.c
index 7bf88fb124b7..38fb8f4c9a2c 100644
--- a/drivers/block/nd/core.c
+++ b/drivers/block/nd/core.c
@@ -54,6 +54,22 @@ bool is_nd_bus_locked(struct device *dev)
 }
 EXPORT_SYMBOL(is_nd_bus_locked);
 
+u64 nd_fletcher64(void *addr, size_t len, bool le)
+{
+	u32 *buf = addr;
+	u32 lo32 = 0;
+	u64 hi32 = 0;
+	int i;
+
+	for (i = 0; i < len / sizeof(u32); i++) {
+		lo32 += le ? le32_to_cpu(buf[i]) : buf[i];
+		hi32 += lo32;
+	}
+
+	return hi32 << 32 | lo32;
+}
+EXPORT_SYMBOL_GPL(nd_fletcher64);
+
 static void nd_bus_release(struct device *dev)
 {
 	struct nd_bus *nd_bus = container_of(dev, struct nd_bus, dev);
@@ -172,6 +188,7 @@ struct nd_bus *__nd_bus_register(struct device *parent,
 	if (!nd_bus)
 		return NULL;
 	INIT_LIST_HEAD(&nd_bus->list);
+	init_waitqueue_head(&nd_bus->probe_wait);
 	nd_bus->id = ida_simple_get(&nd_ida, 0, 0, GFP_KERNEL);
 	mutex_init(&nd_bus->reconfig_mutex);
 	if (nd_bus->id < 0) {
diff --git a/drivers/block/nd/dimm_devs.c b/drivers/block/nd/dimm_devs.c
index 33b6d5336096..8981adc59ba4 100644
--- a/drivers/block/nd/dimm_devs.c
+++ b/drivers/block/nd/dimm_devs.c
@@ -185,7 +185,24 @@ static ssize_t commands_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(commands);
 
+static ssize_t state_show(struct device *dev, struct device_attribute *attr,
+		char *buf)
+{
+	struct nd_dimm *nd_dimm = to_nd_dimm(dev);
+
+	/*
+	 * The state may be in the process of changing, userspace should
+	 * quiesce probing if it wants a static answer
+	 */
+	nd_bus_lock(dev);
+	nd_bus_unlock(dev);
+	return sprintf(buf, "%s\n", atomic_read(&nd_dimm->busy)
+			? "active" : "idle");
+}
+static DEVICE_ATTR_RO(state);
+
 static struct attribute *nd_dimm_attributes[] = {
+	&dev_attr_state.attr,
 	&dev_attr_commands.attr,
 	NULL,
 };
@@ -213,7 +230,7 @@ struct nd_dimm *nd_dimm_create(struct nd_bus *nd_bus, void *provider_data,
 	nd_dimm->provider_data = provider_data;
 	nd_dimm->flags = flags;
 	nd_dimm->dsm_mask = dsm_mask;
-
+	atomic_set(&nd_dimm->busy, 0);
 	dev = &nd_dimm->dev;
 	dev_set_name(dev, "nmem%d", nd_dimm->id);
 	dev->parent = &nd_bus->dev;
diff --git a/drivers/block/nd/nd-private.h b/drivers/block/nd/nd-private.h
index 8ef3a1b50f44..67f28011dfa5 100644
--- a/drivers/block/nd/nd-private.h
+++ b/drivers/block/nd/nd-private.h
@@ -14,6 +14,8 @@
 #define __ND_PRIVATE_H__
 #include <linux/device.h>
 #include <linux/libnd.h>
+#include <linux/sizes.h>
+#include <linux/mutex.h>
 
 extern struct list_head nd_bus_list;
 extern struct mutex nd_bus_list_mutex;
@@ -21,10 +23,11 @@ extern int nd_dimm_major;
 
 struct nd_bus {
 	struct nd_bus_descriptor *nd_desc;
+	wait_queue_head_t probe_wait;
 	struct module *module;
 	struct list_head list;
 	struct device dev;
-	int id;
+	int id, probe_active;
 	struct mutex reconfig_mutex;
 };
 
@@ -33,6 +36,7 @@ struct nd_dimm {
 	void *provider_data;
 	unsigned long *dsm_mask;
 	struct device dev;
+	atomic_t busy;
 	int id;
 };
 
@@ -46,10 +50,14 @@ int __init nd_dimm_init(void);
 int __init nd_region_init(void);
 void nd_dimm_exit(void);
 int nd_region_exit(void);
+void nd_region_probe_start(struct nd_bus *nd_bus, struct device *dev);
+void nd_region_probe_end(struct nd_bus *nd_bus, struct device *dev, int rc);
+void nd_region_notify_remove(struct nd_bus *nd_bus, struct device *dev, int rc);
 int nd_bus_create_ndctl(struct nd_bus *nd_bus);
 void nd_bus_destroy_ndctl(struct nd_bus *nd_bus);
 void nd_synchronize(void);
 int nd_bus_register_dimms(struct nd_bus *nd_bus);
 int nd_bus_register_regions(struct nd_bus *nd_bus);
+int nd_bus_init_interleave_sets(struct nd_bus *nd_bus);
 int nd_match_dimm(struct device *dev, void *data);
 #endif /* __ND_PRIVATE_H__ */
diff --git a/drivers/block/nd/nd.h b/drivers/block/nd/nd.h
index 72f4d7b76059..905183d45799 100644
--- a/drivers/block/nd/nd.h
+++ b/drivers/block/nd/nd.h
@@ -35,6 +35,7 @@ struct nd_region {
 	u64 ndr_start;
 	int id;
 	void *provider_data;
+	struct nd_interleave_set *nd_set;
 	struct nd_mapping mapping[0];
 };
 
diff --git a/drivers/block/nd/region_devs.c b/drivers/block/nd/region_devs.c
index fdc58e333b78..221e6342b6ca 100644
--- a/drivers/block/nd/region_devs.c
+++ b/drivers/block/nd/region_devs.c
@@ -10,7 +10,10 @@
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  * General Public License for more details.
  */
+#include <linux/scatterlist.h>
+#include <linux/sched.h>
 #include <linux/slab.h>
+#include <linux/sort.h>
 #include <linux/io.h>
 #include "nd-private.h"
 #include "nd.h"
@@ -133,6 +136,21 @@ static ssize_t nstype_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(nstype);
 
+static ssize_t set_cookie_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_region *nd_region = to_nd_region(dev);
+	struct nd_interleave_set *nd_set = nd_region->nd_set;
+
+	if (is_nd_pmem(dev) && nd_set)
+		/* pass, should be precluded by nd_region_visible */;
+	else
+		return -ENXIO;
+
+	return sprintf(buf, "%#llx\n", nd_set->cookie);
+}
+static DEVICE_ATTR_RO(set_cookie);
+
 static ssize_t init_namespaces_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
@@ -154,15 +172,81 @@ static struct attribute *nd_region_attributes[] = {
 	&dev_attr_size.attr,
 	&dev_attr_nstype.attr,
 	&dev_attr_mappings.attr,
+	&dev_attr_set_cookie.attr,
 	&dev_attr_init_namespaces.attr,
 	NULL,
 };
 
+static umode_t nd_region_visible(struct kobject *kobj, struct attribute *a, int n)
+{
+	struct device *dev = container_of(kobj, typeof(*dev), kobj);
+	struct nd_region *nd_region = to_nd_region(dev);
+	struct nd_interleave_set *nd_set = nd_region->nd_set;
+
+	if (a != &dev_attr_set_cookie.attr)
+		return a->mode;
+
+	if (is_nd_pmem(dev) && nd_set)
+		return a->mode;
+
+	return 0;
+}
+
 struct attribute_group nd_region_attribute_group = {
 	.attrs = nd_region_attributes,
+	.is_visible = nd_region_visible,
 };
 EXPORT_SYMBOL_GPL(nd_region_attribute_group);
 
+/*
+ * Upon successful probe/remove, take/release a reference on the
+ * associated interleave set (if present)
+ */
+static void nd_region_notify_driver_action(struct nd_bus *nd_bus,
+		struct device *dev, int rc, bool probe)
+{
+	if (rc)
+		return;
+
+	if (is_nd_pmem(dev) || is_nd_blk(dev)) {
+		struct nd_region *nd_region = to_nd_region(dev);
+		int i;
+
+		for (i = 0; i < nd_region->ndr_mappings; i++) {
+			struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+			struct nd_dimm *nd_dimm = nd_mapping->nd_dimm;
+
+			if (probe)
+				atomic_inc(&nd_dimm->busy);
+			else
+				atomic_dec(&nd_dimm->busy);
+		}
+	}
+}
+
+void nd_region_probe_start(struct nd_bus *nd_bus, struct device *dev)
+{
+	nd_bus_lock(&nd_bus->dev);
+	nd_bus->probe_active++;
+	nd_bus_unlock(&nd_bus->dev);
+}
+
+void nd_region_probe_end(struct nd_bus *nd_bus, struct device *dev, int rc)
+{
+	nd_bus_lock(&nd_bus->dev);
+	nd_region_notify_driver_action(nd_bus, dev, rc, true);
+	if (--nd_bus->probe_active == 0)
+		wake_up(&nd_bus->probe_wait);
+	nd_bus_unlock(&nd_bus->dev);
+}
+
+void nd_region_notify_remove(struct nd_bus *nd_bus, struct device *dev, int rc)
+{
+	nd_bus_lock(dev);
+	nd_region_notify_driver_action(nd_bus, dev, rc, false);
+	nd_bus_unlock(dev);
+}
+
 static ssize_t mappingN(struct device *dev, char *buf, int n)
 {
 	struct nd_region *nd_region = to_nd_region(dev);
@@ -322,6 +406,7 @@ static noinline struct nd_region *nd_region_create(struct nd_bus *nd_bus,
 	}
 	nd_region->ndr_mappings = ndr_desc->num_mappings;
 	nd_region->provider_data = ndr_desc->provider_data;
+	nd_region->nd_set = ndr_desc->nd_set;
 	dev = &nd_region->dev;
 	dev_set_name(dev, "region%d", nd_region->id);
 	dev->parent = &nd_bus->dev;
diff --git a/include/linux/libnd.h b/include/linux/libnd.h
index 6747da2c7cb6..52f669faacfd 100644
--- a/include/linux/libnd.h
+++ b/include/linux/libnd.h
@@ -60,11 +60,16 @@ struct nd_cmd_desc {
 	int out_sizes[ND_CMD_MAX_ELEM];
 };
 
+struct nd_interleave_set {
+	u64 cookie;
+};
+
 struct nd_region_desc {
 	struct resource *res;
 	struct nd_mapping *nd_mapping;
 	u16 num_mappings;
 	const struct attribute_group **attr_groups;
+	struct nd_interleave_set *nd_set;
 	void *provider_data;
 };
 
@@ -99,4 +104,5 @@ struct nd_region *nd_blk_region_create(struct nd_bus *nd_bus,
 		struct nd_region_desc *ndr_desc);
 struct nd_region *nd_volatile_region_create(struct nd_bus *nd_bus,
 		struct nd_region_desc *ndr_desc);
+u64 nd_fletcher64(void *addr, size_t len, bool le);
 #endif /* __LIBND_H__ */


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [PATCH v3 11/21] libnd, nfit: add interleave-set state-tracking infrastructure
@ 2015-05-20 20:57   ` Dan Williams
  0 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2015-05-20 20:57 UTC (permalink / raw)
  To: axboe
  Cc: mingo, linux-nvdimm, neilb, gregkh, Rafael J. Wysocki,
	Robert Moore, linux-kernel, linux-acpi, jmoyer, hch

On platforms that have firmware support for reading/writing per-dimm
label space, a portion of the dimm may be accessible via an interleave
set PMEM mapping in addition to the dimm's BLK (block-data-window
aperture(s)) interface.  A label, stored in a "configuration data
region" on the dimm, disambiguates which dimm addresses are accessed
through which exclusive interface.

Add infrastructure that allows the kernel to block modifications to a
label in the set while any member dimm is active.  Note that this is
meant only for enforcing "no modifications of active labels" via the
coarse ioctl command.  Adding/deleting namespaces from an active
interleave set is always possible via sysfs.

Another aspect of tracking interleave sets is tracking their integrity
when DIMMs in a set are physically re-ordered.  For this purpose we
generate an "interleave-set cookie" that can be recorded in a label and
validated against the current configuration.  It is the bus provider
implementation's responsibility to calculate the interleave set cookie
and attach it to a given region.
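
As a rough illustration of the cookie idea, here is a minimal userspace
sketch, not the code in this patch.  The names set_map, fletcher64 and
set_cookie are hypothetical stand-ins.  It sorts the per-dimm
(region_offset, serial_number) tuples by offset and checksums the
result, so the cookie does not depend on enumeration order but does
change when dimms trade positions.  Note one deliberate difference: the
patch compares offsets with memcmp(), while this sketch compares them
numerically, so the two would not necessarily produce the same cookie
value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct nfit_set_info_map in the patch. */
struct set_map {
	uint64_t region_offset;
	uint32_t serial_number;
	uint32_t pad;
};

/* Fletcher-style checksum over native-endian 32-bit words. */
static uint64_t fletcher64(const void *addr, size_t len)
{
	const uint32_t *buf = addr;
	uint32_t lo32 = 0;
	uint64_t hi32 = 0;
	size_t i;

	assert(len % sizeof(uint32_t) == 0);
	for (i = 0; i < len / sizeof(uint32_t); i++) {
		lo32 += buf[i];
		hi32 += lo32;
	}
	return hi32 << 32 | lo32;
}

/* Numeric ordering by region offset (assumption: differs from memcmp()). */
static int cmp_map(const void *a, const void *b)
{
	const struct set_map *m0 = a, *m1 = b;

	if (m0->region_offset < m1->region_offset)
		return -1;
	return m0->region_offset > m1->region_offset;
}

/* Sort the mappings in place, then checksum them to form the cookie. */
static uint64_t set_cookie(struct set_map *maps, size_t n)
{
	qsort(maps, n, sizeof(*maps), cmp_map);
	return fletcher64(maps, n * sizeof(*maps));
}
```

Two descriptions of the same set that differ only in enumeration order
yield the same cookie, while swapping the serial numbers between two
offsets (a physical re-order) yields a different one.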

Cc: Neil Brown <neilb@suse.de>
Cc: <linux-acpi@vger.kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/acpi/nfit.c            |   91 ++++++++++++++++++++++++++++++++++++++++
 drivers/block/nd/bus.c         |   41 ++++++++++++++++++
 drivers/block/nd/core.c        |   17 +++++++
 drivers/block/nd/dimm_devs.c   |   19 ++++++++
 drivers/block/nd/nd-private.h  |   10 ++++
 drivers/block/nd/nd.h          |    1 
 drivers/block/nd/region_devs.c |   85 +++++++++++++++++++++++++++++++++++++
 include/linux/libnd.h          |    6 +++
 8 files changed, 266 insertions(+), 4 deletions(-)

diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index aa719ef0418f..7c4d47492372 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -16,6 +16,7 @@
 #include <linux/ndctl.h>
 #include <linux/list.h>
 #include <linux/acpi.h>
+#include <linux/sort.h>
 #include "nfit.h"
 
 static bool force_enable_dimms;
@@ -744,6 +745,91 @@ static const struct attribute_group *acpi_nfit_region_attribute_groups[] = {
 	NULL,
 };
 
+/* enough info to uniquely specify an interleave set */
+struct nfit_set_info {
+	struct nfit_set_info_map {
+		u64 region_offset;
+		u32 serial_number;
+		u32 pad;
+	} mapping[0];
+};
+
+static size_t sizeof_nfit_set_info(int num_mappings)
+{
+	return sizeof(struct nfit_set_info)
+		+ num_mappings * sizeof(struct nfit_set_info_map);
+}
+
+static int cmp_map(const void *m0, const void *m1)
+{
+	const struct nfit_set_info_map *map0 = m0;
+	const struct nfit_set_info_map *map1 = m1;
+
+	return memcmp(&map0->region_offset, &map1->region_offset,
+			sizeof(u64));
+}
+
+/* Retrieve the nth entry referencing this spa */
+static struct acpi_nfit_memory_map *memdev_from_spa(
+		struct acpi_nfit_desc *acpi_desc, u16 range_index, int n)
+{
+	struct nfit_memdev *nfit_memdev;
+
+	list_for_each_entry(nfit_memdev, &acpi_desc->memdevs, list)
+		if (nfit_memdev->memdev->range_index == range_index)
+			if (n-- == 0)
+				return nfit_memdev->memdev;
+	return NULL;
+}
+
+static int acpi_nfit_init_interleave_set(struct acpi_nfit_desc *acpi_desc,
+		struct nd_region_desc *ndr_desc,
+		struct acpi_nfit_system_address *spa)
+{
+	u16 num_mappings = ndr_desc->num_mappings;
+	int i, spa_type = nfit_spa_type(spa);
+	struct device *dev = acpi_desc->dev;
+	struct nd_interleave_set *nd_set;
+	struct nfit_set_info *info;
+
+	if (spa_type == NFIT_SPA_PM || spa_type == NFIT_SPA_VOLATILE)
+		/* pass */;
+	else
+		return 0;
+
+	nd_set = devm_kzalloc(dev, sizeof(*nd_set), GFP_KERNEL);
+	if (!nd_set)
+		return -ENOMEM;
+
+	info = devm_kzalloc(dev, sizeof_nfit_set_info(num_mappings), GFP_KERNEL);
+	if (!info)
+		return -ENOMEM;
+	for (i = 0; i < num_mappings; i++) {
+		struct nd_mapping *nd_mapping = &ndr_desc->nd_mapping[i];
+		struct nfit_set_info_map *map = &info->mapping[i];
+		struct nd_dimm *nd_dimm = nd_mapping->nd_dimm;
+		struct nfit_mem *nfit_mem = nd_dimm_provider_data(nd_dimm);
+		struct acpi_nfit_memory_map *memdev = memdev_from_spa(acpi_desc,
+				spa->range_index, i);
+
+		if (!memdev || !nfit_mem->dcr) {
+			dev_err(dev, "%s: failed to find DCR\n", __func__);
+			return -ENODEV;
+		}
+
+		map->region_offset = memdev->region_offset;
+		map->serial_number = nfit_mem->dcr->serial_number;
+	}
+
+	sort(&info->mapping[0], num_mappings, sizeof(struct nfit_set_info_map),
+			cmp_map, NULL);
+	nd_set->cookie = nd_fletcher64(info, sizeof_nfit_set_info(num_mappings), 0);
+	ndr_desc->nd_set = nd_set;
+	devm_kfree(dev, info);
+
+	return 0;
+}
+
 static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
 		struct nfit_spa *nfit_spa)
 {
@@ -751,7 +837,7 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
 	struct acpi_nfit_system_address *spa = nfit_spa->spa;
 	struct nfit_memdev *nfit_memdev;
 	struct nd_region_desc ndr_desc;
-	int spa_type, count = 0;
+	int spa_type, count = 0, rc;
 	struct resource res;
 	u16 range_index;
 
@@ -817,6 +903,9 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
 
 	ndr_desc.nd_mapping = nd_mappings;
 	ndr_desc.num_mappings = count;
+	rc = acpi_nfit_init_interleave_set(acpi_desc, &ndr_desc, spa);
+	if (rc)
+		return rc;
 	if (spa_type == NFIT_SPA_PM) {
 		if (!nd_pmem_region_create(acpi_desc->nd_bus, &ndr_desc))
 			return -ENOMEM;
diff --git a/drivers/block/nd/bus.c b/drivers/block/nd/bus.c
index d2a62a6142f3..63b5182cf766 100644
--- a/drivers/block/nd/bus.c
+++ b/drivers/block/nd/bus.c
@@ -78,7 +78,10 @@ static int nd_bus_probe(struct device *dev)
 	if (!try_module_get(provider))
 		return -ENXIO;
 
+	nd_region_probe_start(nd_bus, dev);
 	rc = nd_drv->probe(dev);
+	nd_region_probe_end(nd_bus, dev, rc);
+
 	dev_dbg(&nd_bus->dev, "%s.probe(%s) = %d\n", dev->driver->name,
 			dev_name(dev), rc);
 	if (rc != 0)
@@ -94,6 +97,8 @@ static int nd_bus_remove(struct device *dev)
 	int rc;
 
 	rc = nd_drv->remove(dev);
+	nd_region_notify_remove(nd_bus, dev, rc);
+
 	dev_dbg(&nd_bus->dev, "%s.remove(%s) = %d\n", dev->driver->name,
 			dev_name(dev), rc);
 	module_put(provider);
@@ -359,6 +364,33 @@ u32 nd_cmd_out_size(struct nd_dimm *nd_dimm, int cmd,
 }
 EXPORT_SYMBOL_GPL(nd_cmd_out_size);
 
+static void wait_nd_bus_probe_idle(struct nd_bus *nd_bus)
+{
+	do {
+		if (nd_bus->probe_active == 0)
+			break;
+		nd_bus_unlock(&nd_bus->dev);
+		wait_event(nd_bus->probe_wait, nd_bus->probe_active == 0);
+		nd_bus_lock(&nd_bus->dev);
+	} while (true);
+}
+
+/* set_config requires an idle interleave set */
+static int nd_cmd_clear_to_send(struct nd_dimm *nd_dimm, unsigned int cmd)
+{
+	struct nd_bus *nd_bus;
+
+	if (!nd_dimm || cmd != ND_CMD_SET_CONFIG_DATA)
+		return 0;
+
+	nd_bus = walk_to_nd_bus(&nd_dimm->dev);
+	wait_nd_bus_probe_idle(nd_bus);
+
+	if (atomic_read(&nd_dimm->busy))
+		return -EBUSY;
+	return 0;
+}
+
 static int __nd_ioctl(struct nd_bus *nd_bus, struct nd_dimm *nd_dimm,
 		int read_only, unsigned int ioctl_cmd, unsigned long arg)
 {
@@ -469,11 +501,18 @@ static int __nd_ioctl(struct nd_bus *nd_bus, struct nd_dimm *nd_dimm,
 		goto out;
 	}
 
+	nd_bus_lock(&nd_bus->dev);
+	rc = nd_cmd_clear_to_send(nd_dimm, cmd);
+	if (rc)
+		goto out_unlock;
+
 	rc = nd_desc->ndctl(nd_desc, nd_dimm, cmd, buf, buf_len);
 	if (rc < 0)
-		goto out;
+		goto out_unlock;
 	if (copy_to_user(p, buf, buf_len))
 		rc = -EFAULT;
+ out_unlock:
+	nd_bus_unlock(&nd_bus->dev);
  out:
 	vfree(buf);
 	return rc;
diff --git a/drivers/block/nd/core.c b/drivers/block/nd/core.c
index 7bf88fb124b7..38fb8f4c9a2c 100644
--- a/drivers/block/nd/core.c
+++ b/drivers/block/nd/core.c
@@ -54,6 +54,22 @@ bool is_nd_bus_locked(struct device *dev)
 }
 EXPORT_SYMBOL(is_nd_bus_locked);
 
+u64 nd_fletcher64(void *addr, size_t len, bool le)
+{
+	u32 *buf = addr;
+	u32 lo32 = 0;
+	u64 hi32 = 0;
+	int i;
+
+	for (i = 0; i < len / sizeof(u32); i++) {
+		lo32 += le ? le32_to_cpu(buf[i]) : buf[i];
+		hi32 += lo32;
+	}
+
+	return hi32 << 32 | lo32;
+}
+EXPORT_SYMBOL_GPL(nd_fletcher64);
+
 static void nd_bus_release(struct device *dev)
 {
 	struct nd_bus *nd_bus = container_of(dev, struct nd_bus, dev);
@@ -172,6 +188,7 @@ struct nd_bus *__nd_bus_register(struct device *parent,
 	if (!nd_bus)
 		return NULL;
 	INIT_LIST_HEAD(&nd_bus->list);
+	init_waitqueue_head(&nd_bus->probe_wait);
 	nd_bus->id = ida_simple_get(&nd_ida, 0, 0, GFP_KERNEL);
 	mutex_init(&nd_bus->reconfig_mutex);
 	if (nd_bus->id < 0) {
diff --git a/drivers/block/nd/dimm_devs.c b/drivers/block/nd/dimm_devs.c
index 33b6d5336096..8981adc59ba4 100644
--- a/drivers/block/nd/dimm_devs.c
+++ b/drivers/block/nd/dimm_devs.c
@@ -185,7 +185,24 @@ static ssize_t commands_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(commands);
 
+static ssize_t state_show(struct device *dev, struct device_attribute *attr,
+		char *buf)
+{
+	struct nd_dimm *nd_dimm = to_nd_dimm(dev);
+
+	/*
+	 * The state may be in the process of changing, userspace should
+	 * quiesce probing if it wants a static answer
+	 */
+	nd_bus_lock(dev);
+	nd_bus_unlock(dev);
+	return sprintf(buf, "%s\n", atomic_read(&nd_dimm->busy)
+			? "active" : "idle");
+}
+static DEVICE_ATTR_RO(state);
+
 static struct attribute *nd_dimm_attributes[] = {
+	&dev_attr_state.attr,
 	&dev_attr_commands.attr,
 	NULL,
 };
@@ -213,7 +230,7 @@ struct nd_dimm *nd_dimm_create(struct nd_bus *nd_bus, void *provider_data,
 	nd_dimm->provider_data = provider_data;
 	nd_dimm->flags = flags;
 	nd_dimm->dsm_mask = dsm_mask;
-
+	atomic_set(&nd_dimm->busy, 0);
 	dev = &nd_dimm->dev;
 	dev_set_name(dev, "nmem%d", nd_dimm->id);
 	dev->parent = &nd_bus->dev;
diff --git a/drivers/block/nd/nd-private.h b/drivers/block/nd/nd-private.h
index 8ef3a1b50f44..67f28011dfa5 100644
--- a/drivers/block/nd/nd-private.h
+++ b/drivers/block/nd/nd-private.h
@@ -14,6 +14,8 @@
 #define __ND_PRIVATE_H__
 #include <linux/device.h>
 #include <linux/libnd.h>
+#include <linux/sizes.h>
+#include <linux/mutex.h>
 
 extern struct list_head nd_bus_list;
 extern struct mutex nd_bus_list_mutex;
@@ -21,10 +23,11 @@ extern int nd_dimm_major;
 
 struct nd_bus {
 	struct nd_bus_descriptor *nd_desc;
+	wait_queue_head_t probe_wait;
 	struct module *module;
 	struct list_head list;
 	struct device dev;
-	int id;
+	int id, probe_active;
 	struct mutex reconfig_mutex;
 };
 
@@ -33,6 +36,7 @@ struct nd_dimm {
 	void *provider_data;
 	unsigned long *dsm_mask;
 	struct device dev;
+	atomic_t busy;
 	int id;
 };
 
@@ -46,10 +50,14 @@ int __init nd_dimm_init(void);
 int __init nd_region_init(void);
 void nd_dimm_exit(void);
 int nd_region_exit(void);
+void nd_region_probe_start(struct nd_bus *nd_bus, struct device *dev);
+void nd_region_probe_end(struct nd_bus *nd_bus, struct device *dev, int rc);
+void nd_region_notify_remove(struct nd_bus *nd_bus, struct device *dev, int rc);
 int nd_bus_create_ndctl(struct nd_bus *nd_bus);
 void nd_bus_destroy_ndctl(struct nd_bus *nd_bus);
 void nd_synchronize(void);
 int nd_bus_register_dimms(struct nd_bus *nd_bus);
 int nd_bus_register_regions(struct nd_bus *nd_bus);
+int nd_bus_init_interleave_sets(struct nd_bus *nd_bus);
 int nd_match_dimm(struct device *dev, void *data);
 #endif /* __ND_PRIVATE_H__ */
diff --git a/drivers/block/nd/nd.h b/drivers/block/nd/nd.h
index 72f4d7b76059..905183d45799 100644
--- a/drivers/block/nd/nd.h
+++ b/drivers/block/nd/nd.h
@@ -35,6 +35,7 @@ struct nd_region {
 	u64 ndr_start;
 	int id;
 	void *provider_data;
+	struct nd_interleave_set *nd_set;
 	struct nd_mapping mapping[0];
 };
 
diff --git a/drivers/block/nd/region_devs.c b/drivers/block/nd/region_devs.c
index fdc58e333b78..221e6342b6ca 100644
--- a/drivers/block/nd/region_devs.c
+++ b/drivers/block/nd/region_devs.c
@@ -10,7 +10,10 @@
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  * General Public License for more details.
  */
+#include <linux/scatterlist.h>
+#include <linux/sched.h>
 #include <linux/slab.h>
+#include <linux/sort.h>
 #include <linux/io.h>
 #include "nd-private.h"
 #include "nd.h"
@@ -133,6 +136,21 @@ static ssize_t nstype_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(nstype);
 
+static ssize_t set_cookie_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_region *nd_region = to_nd_region(dev);
+	struct nd_interleave_set *nd_set = nd_region->nd_set;
+
+	if (is_nd_pmem(dev) && nd_set)
+		/* pass, should be precluded by nd_region_visible */;
+	else
+		return -ENXIO;
+
+	return sprintf(buf, "%#llx\n", nd_set->cookie);
+}
+static DEVICE_ATTR_RO(set_cookie);
+
 static ssize_t init_namespaces_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
@@ -154,15 +172,81 @@ static struct attribute *nd_region_attributes[] = {
 	&dev_attr_size.attr,
 	&dev_attr_nstype.attr,
 	&dev_attr_mappings.attr,
+	&dev_attr_set_cookie.attr,
 	&dev_attr_init_namespaces.attr,
 	NULL,
 };
 
+static umode_t nd_region_visible(struct kobject *kobj, struct attribute *a, int n)
+{
+	struct device *dev = container_of(kobj, typeof(*dev), kobj);
+	struct nd_region *nd_region = to_nd_region(dev);
+	struct nd_interleave_set *nd_set = nd_region->nd_set;
+
+	if (a != &dev_attr_set_cookie.attr)
+		return a->mode;
+
+	if (is_nd_pmem(dev) && nd_set)
+		return a->mode;
+
+	return 0;
+}
+
 struct attribute_group nd_region_attribute_group = {
 	.attrs = nd_region_attributes,
+	.is_visible = nd_region_visible,
 };
 EXPORT_SYMBOL_GPL(nd_region_attribute_group);
 
+/*
+ * Upon successful probe/remove, take/release a reference on the
+ * associated interleave set (if present)
+ */
+static void nd_region_notify_driver_action(struct nd_bus *nd_bus,
+		struct device *dev, int rc, bool probe)
+{
+	if (rc)
+		return;
+
+	if (is_nd_pmem(dev) || is_nd_blk(dev)) {
+		struct nd_region *nd_region = to_nd_region(dev);
+		int i;
+
+		for (i = 0; i < nd_region->ndr_mappings; i++) {
+			struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+			struct nd_dimm *nd_dimm = nd_mapping->nd_dimm;
+
+			if (probe)
+				atomic_inc(&nd_dimm->busy);
+			else
+				atomic_dec(&nd_dimm->busy);
+		}
+	}
+}
+
+void nd_region_probe_start(struct nd_bus *nd_bus, struct device *dev)
+{
+	nd_bus_lock(&nd_bus->dev);
+	nd_bus->probe_active++;
+	nd_bus_unlock(&nd_bus->dev);
+}
+
+void nd_region_probe_end(struct nd_bus *nd_bus, struct device *dev, int rc)
+{
+	nd_bus_lock(&nd_bus->dev);
+	nd_region_notify_driver_action(nd_bus, dev, rc, true);
+	if (--nd_bus->probe_active == 0)
+		wake_up(&nd_bus->probe_wait);
+	nd_bus_unlock(&nd_bus->dev);
+}
+
+void nd_region_notify_remove(struct nd_bus *nd_bus, struct device *dev, int rc)
+{
+	nd_bus_lock(dev);
+	nd_region_notify_driver_action(nd_bus, dev, rc, false);
+	nd_bus_unlock(dev);
+}
+
 static ssize_t mappingN(struct device *dev, char *buf, int n)
 {
 	struct nd_region *nd_region = to_nd_region(dev);
@@ -322,6 +406,7 @@ static noinline struct nd_region *nd_region_create(struct nd_bus *nd_bus,
 	}
 	nd_region->ndr_mappings = ndr_desc->num_mappings;
 	nd_region->provider_data = ndr_desc->provider_data;
+	nd_region->nd_set = ndr_desc->nd_set;
 	dev = &nd_region->dev;
 	dev_set_name(dev, "region%d", nd_region->id);
 	dev->parent = &nd_bus->dev;
diff --git a/include/linux/libnd.h b/include/linux/libnd.h
index 6747da2c7cb6..52f669faacfd 100644
--- a/include/linux/libnd.h
+++ b/include/linux/libnd.h
@@ -60,11 +60,16 @@ struct nd_cmd_desc {
 	int out_sizes[ND_CMD_MAX_ELEM];
 };
 
+struct nd_interleave_set {
+	u64 cookie;
+};
+
 struct nd_region_desc {
 	struct resource *res;
 	struct nd_mapping *nd_mapping;
 	u16 num_mappings;
 	const struct attribute_group **attr_groups;
+	struct nd_interleave_set *nd_set;
 	void *provider_data;
 };
 
@@ -99,4 +104,5 @@ struct nd_region *nd_blk_region_create(struct nd_bus *nd_bus,
 		struct nd_region_desc *ndr_desc);
 struct nd_region *nd_volatile_region_create(struct nd_bus *nd_bus,
 		struct nd_region_desc *ndr_desc);
+u64 nd_fletcher64(void *addr, size_t len, bool le);
 #endif /* __LIBND_H__ */


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [PATCH v3 12/21] libnd: namespace indices: read and validate
  2015-05-20 20:56 ` Dan Williams
@ 2015-05-20 20:57   ` Dan Williams
  -1 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2015-05-20 20:57 UTC (permalink / raw)
  To: axboe
  Cc: linux-nvdimm, neilb, gregkh, linux-kernel, mingo, linux-acpi,
	jmoyer, hch

The on-media label format [1] consists of two index blocks followed by
an array of labels.  None of these structures are ever updated in place.
A sequence number tracks the current active index and the next one to
write, while labels are written to free slots.

    +------------+
    |            |
    |  nsindex0  |
    |            |
    +------------+
    |            |
    |  nsindex1  |
    |            |
    +------------+
    |   label0   |
    +------------+
    |   label1   |
    +------------+
    |            |
     ....nslot...
    |            |
    +------------+
    |   labelN   |
    +------------+
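
The choice between nsindex0 and nsindex1 can be sketched as follows.
This is a simplified userspace illustration rather than the patch code:
inc_seq() is an assumed helper standing in for nd_inc_seq(), which is
not shown in this patch, and it assumes the 2-bit sequence number
cycles 1 -> 2 -> 3 -> 1 with 0 reserved as invalid.  active_index() is
a hypothetical wrapper for the case where both blocks already passed
validation.

```c
#include <assert.h>
#include <stdint.h>

#define SEQ_MASK 0x3u

/* Assumed cycle 1 -> 2 -> 3 -> 1; 0 is never a valid sequence number. */
static uint32_t inc_seq(uint32_t seq)
{
	static const uint32_t next[] = { 0, 2, 3, 1 };

	return next[seq & SEQ_MASK];
}

/* Return the newer of two sequence numbers, following best_seq(). */
static uint32_t best_seq(uint32_t a, uint32_t b)
{
	a &= SEQ_MASK;
	b &= SEQ_MASK;

	if (a == 0 || a == b)
		return b;
	if (b == 0)
		return a;
	if (inc_seq(a) == b)
		return b;
	return a;
}

/* Pick the active index block (0 or 1); both blocks must be valid. */
static int active_index(uint32_t seq0, uint32_t seq1)
{
	assert((seq0 & SEQ_MASK) && (seq1 & SEQ_MASK));
	return best_seq(seq0, seq1) == (seq1 & SEQ_MASK) ? 1 : 0;
}
```

Because the sequence wraps, "newer" means "one step ahead in the
cycle", not "numerically larger": with seq0 = 3 and seq1 = 1, block 1
is the active one.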

After reading valid labels, store the dpa ranges they claim into
per-dimm resource trees.
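
Part of what makes an index "valid" is that its slot count fits the
label area.  The arithmetic can be sketched in userspace as below.
index_size() mirrors the calculation in sizeof_namespace_index();
max_slots() is a hypothetical helper, and LABEL_SIZE = 128 is an
assumption derived by summing the field sizes of struct
nd_namespace_label as defined in this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
	NSINDEX_ALIGN = 256,
	LABEL_SIZE = 128,	/* assumed sizeof(struct nd_namespace_label) */
};

/* Size of one index block for a given config area size, in bytes. */
static size_t index_size(uint32_t config_size)
{
	uint32_t span = config_size / 129;

	span /= NSINDEX_ALIGN * 2;
	return (size_t)span * NSINDEX_ALIGN;
}

/* Largest nslot that still fits after both index blocks are placed. */
static uint32_t max_slots(uint32_t config_size)
{
	size_t idx = index_size(config_size);

	assert(idx > 0);	/* config area too small to hold an index */
	return (config_size - 2 * idx) / LABEL_SIZE;
}
```

For a 128K config area this gives two 256-byte index blocks and room
for at most 1020 label slots.  Anything smaller than 512 * 129 bytes
yields an index size of zero, which is why ND_LABEL_MIN_SIZE is set to
that value.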

[1]: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf

Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/block/nd/Makefile    |    1 
 drivers/block/nd/dimm.c      |   23 +++
 drivers/block/nd/dimm_devs.c |   30 ++++
 drivers/block/nd/label.c     |  287 ++++++++++++++++++++++++++++++++++++++++++
 drivers/block/nd/label.h     |  129 +++++++++++++++++++
 drivers/block/nd/nd.h        |   49 +++++++
 include/uapi/linux/ndctl.h   |    1 
 7 files changed, 518 insertions(+), 2 deletions(-)
 create mode 100644 drivers/block/nd/label.c
 create mode 100644 drivers/block/nd/label.h

diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index 6f539f01fa82..8d14510559e1 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -10,3 +10,4 @@ libnd-y += dimm.o
 libnd-y += region_devs.o
 libnd-y += region.o
 libnd-y += namespace_devs.o
+libnd-y += label.o
diff --git a/drivers/block/nd/dimm.c b/drivers/block/nd/dimm.c
index c4df1a32a68b..e2f964308672 100644
--- a/drivers/block/nd/dimm.c
+++ b/drivers/block/nd/dimm.c
@@ -18,6 +18,7 @@
 #include <linux/slab.h>
 #include <linux/mm.h>
 #include <linux/nd.h>
+#include "label.h"
 #include "nd.h"
 
 static void free_data(struct nd_dimm_drvdata *ndd)
@@ -42,6 +43,11 @@ static int nd_dimm_probe(struct device *dev)
 		return -ENOMEM;
 
 	dev_set_drvdata(dev, ndd);
+	ndd->dpa.name = dev_name(dev);
+	ndd->ns_current = -1;
+	ndd->ns_next = -1;
+	ndd->dpa.start = 0;
+	ndd->dpa.end = -1;
 	ndd->dev = dev;
 
 	rc = nd_dimm_init_nsarea(ndd);
@@ -54,6 +60,17 @@ static int nd_dimm_probe(struct device *dev)
 
 	dev_dbg(dev, "config data size: %d\n", ndd->nsarea.config_size);
 
+	nd_bus_lock(dev);
+	ndd->ns_current = nd_label_validate(ndd);
+	ndd->ns_next = nd_label_next_nsindex(ndd->ns_current);
+	nd_label_copy(ndd, to_next_namespace_index(ndd),
+			to_current_namespace_index(ndd));
+	rc = nd_label_reserve_dpa(ndd);
+	nd_bus_unlock(dev);
+
+	if (rc)
+		goto err;
+
 	return 0;
 
  err:
@@ -64,7 +81,13 @@ static int nd_dimm_probe(struct device *dev)
 static int nd_dimm_remove(struct device *dev)
 {
 	struct nd_dimm_drvdata *ndd = dev_get_drvdata(dev);
+	struct resource *res, *_r;
 
+	nd_bus_lock(dev);
+	dev_set_drvdata(dev, NULL);
+	for_each_dpa_resource_safe(ndd, res, _r)
+		nd_dimm_free_dpa(ndd, res);
+	nd_bus_unlock(dev);
 	free_data(ndd);
 
 	return 0;
diff --git a/drivers/block/nd/dimm_devs.c b/drivers/block/nd/dimm_devs.c
index 8981adc59ba4..013531b8adfa 100644
--- a/drivers/block/nd/dimm_devs.c
+++ b/drivers/block/nd/dimm_devs.c
@@ -92,8 +92,12 @@ int nd_dimm_init_config_data(struct nd_dimm_drvdata *ndd)
 	if (ndd->data)
 		return 0;
 
-	if (ndd->nsarea.status || ndd->nsarea.max_xfer == 0)
+	if (ndd->nsarea.status || ndd->nsarea.max_xfer == 0
+			|| ndd->nsarea.config_size < ND_LABEL_MIN_SIZE) {
+		dev_dbg(ndd->dev, "failed to init config data area: (%d:%d)\n",
+				ndd->nsarea.max_xfer, ndd->nsarea.config_size);
 		return -ENXIO;
+	}
 
 	ndd->data = kmalloc(ndd->nsarea.config_size, GFP_KERNEL);
 	if (!ndd->data)
@@ -243,6 +247,30 @@ struct nd_dimm *nd_dimm_create(struct nd_bus *nd_bus, void *provider_data,
 }
 EXPORT_SYMBOL_GPL(nd_dimm_create);
 
+void nd_dimm_free_dpa(struct nd_dimm_drvdata *ndd, struct resource *res)
+{
+	WARN_ON_ONCE(!is_nd_bus_locked(ndd->dev));
+	kfree(res->name);
+	__release_region(&ndd->dpa, res->start, resource_size(res));
+}
+
+struct resource *nd_dimm_allocate_dpa(struct nd_dimm_drvdata *ndd,
+		struct nd_label_id *label_id, resource_size_t start,
+		resource_size_t n)
+{
+	char *name = kmemdup(label_id, sizeof(*label_id), GFP_KERNEL);
+	struct resource *res;
+
+	if (!name)
+		return NULL;
+
+	WARN_ON_ONCE(!is_nd_bus_locked(ndd->dev));
+	res = __request_region(&ndd->dpa, start, n, name, 0);
+	if (!res)
+		kfree(name);
+	return res;
+}
+
 static int count_dimms(struct device *dev, void *c)
 {
 	int *count = c;
diff --git a/drivers/block/nd/label.c b/drivers/block/nd/label.c
new file mode 100644
index 000000000000..da5008e45917
--- /dev/null
+++ b/drivers/block/nd/label.c
@@ -0,0 +1,287 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/device.h>
+#include <linux/ndctl.h>
+#include <linux/io.h>
+#include <linux/nd.h>
+#include "nd-private.h"
+#include "label.h"
+#include "nd.h"
+
+#include <asm-generic/io-64-nonatomic-lo-hi.h>
+
+static u32 best_seq(u32 a, u32 b)
+{
+	a &= NSINDEX_SEQ_MASK;
+	b &= NSINDEX_SEQ_MASK;
+
+	if (a == 0 || a == b)
+		return b;
+	else if (b == 0)
+		return a;
+	else if (nd_inc_seq(a) == b)
+		return b;
+	else
+		return a;
+}
+
+size_t sizeof_namespace_index(struct nd_dimm_drvdata *ndd)
+{
+	u32 index_span;
+
+	if (ndd->nsindex_size)
+		return ndd->nsindex_size;
+
+	/*
+	 * The minimum index space is 512 bytes, with that amount of
+	 * index we can describe ~1400 labels which is less than a byte
+	 * of overhead per label.  Round up to a byte of overhead per
+	 * label and determine the size of the index region.  Yes, this
+	 * starts to waste space at larger config_sizes, but it's
+	 * unlikely we'll ever see anything but 128K.
+	 */
+	index_span = ndd->nsarea.config_size / 129;
+	index_span /= NSINDEX_ALIGN * 2;
+	ndd->nsindex_size = index_span * NSINDEX_ALIGN;
+
+	return ndd->nsindex_size;
+}
+
+int nd_label_validate(struct nd_dimm_drvdata *ndd)
+{
+	/*
+	 * The on-media label format consists of two index blocks followed
+	 * by an array of labels.  None of these structures are ever
+	 * updated in place.  A sequence number tracks the current
+	 * active index and the next one to write, while labels are
+	 * written to free slots.
+	 *
+	 *     +------------+
+	 *     |            |
+	 *     |  nsindex0  |
+	 *     |            |
+	 *     +------------+
+	 *     |            |
+	 *     |  nsindex1  |
+	 *     |            |
+	 *     +------------+
+	 *     |   label0   |
+	 *     +------------+
+	 *     |   label1   |
+	 *     +------------+
+	 *     |            |
+	 *      ....nslot...
+	 *     |            |
+	 *     +------------+
+	 *     |   labelN   |
+	 *     +------------+
+	 */
+	struct nd_namespace_index __iomem *nsindex[] = {
+		to_namespace_index(ndd, 0),
+		to_namespace_index(ndd, 1),
+	};
+	const int num_index = ARRAY_SIZE(nsindex);
+	struct device *dev = ndd->dev;
+	bool valid[] = { false, false };
+	int i, num_valid = 0;
+	u32 seq;
+
+	for (i = 0; i < num_index; i++) {
+		u64 sum_save, sum;
+		u8 sig[NSINDEX_SIG_LEN];
+
+		memcpy_fromio(sig, nsindex[i]->sig, NSINDEX_SIG_LEN);
+		if (memcmp(sig, NSINDEX_SIGNATURE, NSINDEX_SIG_LEN) != 0) {
+			dev_dbg(dev, "%s: nsindex%d signature invalid\n",
+					__func__, i);
+			continue;
+		}
+		sum_save = readq(&nsindex[i]->checksum);
+		writeq(0, &nsindex[i]->checksum);
+		sum = nd_fletcher64((void * __force) nsindex[i],
+				sizeof_namespace_index(ndd), 1);
+		writeq(sum_save, &nsindex[i]->checksum);
+		if (sum != sum_save) {
+			dev_dbg(dev, "%s: nsindex%d checksum invalid\n",
+					__func__, i);
+			continue;
+		}
+		if ((readl(&nsindex[i]->seq) & NSINDEX_SEQ_MASK) == 0) {
+			dev_dbg(dev, "%s: nsindex%d sequence: %#x invalid\n",
+					__func__, i, readl(&nsindex[i]->seq));
+			continue;
+		}
+
+		/* sanity check the index against expected values */
+		if (readq(&nsindex[i]->myoff)
+				!= i * sizeof_namespace_index(ndd)) {
+			dev_dbg(dev, "%s: nsindex%d myoff: %#llx invalid\n",
+					__func__, i, (unsigned long long)
+					readq(&nsindex[i]->myoff));
+			continue;
+		}
+		if (readq(&nsindex[i]->otheroff)
+				!= (!i) * sizeof_namespace_index(ndd)) {
+			dev_dbg(dev, "%s: nsindex%d otheroff: %#llx invalid\n",
+					__func__, i, (unsigned long long)
+					readq(&nsindex[i]->otheroff));
+			continue;
+		}
+		if (readq(&nsindex[i]->mysize) > sizeof_namespace_index(ndd)
+				|| readq(&nsindex[i]->mysize)
+				< sizeof(struct nd_namespace_index)) {
+			dev_dbg(dev, "%s: nsindex%d mysize: %#llx invalid\n",
+					__func__, i, (unsigned long long)
+					readq(&nsindex[i]->mysize));
+			continue;
+		}
+		if (readl(&nsindex[i]->nslot) * sizeof(struct nd_namespace_label)
+				+ 2 * sizeof_namespace_index(ndd)
+				> ndd->nsarea.config_size) {
+			dev_dbg(dev, "%s: nsindex%d nslot: %u invalid, config_size: %#x\n",
+					__func__, i, readl(&nsindex[i]->nslot),
+					ndd->nsarea.config_size);
+			continue;
+		}
+		valid[i] = true;
+		num_valid++;
+	}
+
+	switch (num_valid) {
+	case 0:
+		break;
+	case 1:
+		for (i = 0; i < num_index; i++)
+			if (valid[i])
+				return i;
+		/* can't have num_valid > 0 but valid[] = { false, false } */
+		WARN_ON(1);
+		break;
+	default:
+		/* pick the best index... */
+		seq = best_seq(readl(&nsindex[0]->seq), readl(&nsindex[1]->seq));
+		if (seq == (readl(&nsindex[1]->seq) & NSINDEX_SEQ_MASK))
+			return 1;
+		else
+			return 0;
+		break;
+	}
+
+	return -1;
+}
+
+void nd_label_copy(struct nd_dimm_drvdata *ndd,
+		struct nd_namespace_index __iomem *dst,
+		struct nd_namespace_index __iomem *src)
+{
+	void *s, *d;
+
+	if (dst && src)
+		/* pass */;
+	else
+		return;
+
+	d = (void * __force) dst;
+	s = (void * __force) src;
+	memcpy(d, s, sizeof_namespace_index(ndd));
+}
+
+static struct nd_namespace_label __iomem *nd_label_base(struct nd_dimm_drvdata *ndd)
+{
+	void *base = to_namespace_index(ndd, 0);
+
+	return base + 2 * sizeof_namespace_index(ndd);
+}
+
+#define for_each_clear_bit_le(bit, addr, size) \
+	for ((bit) = find_next_zero_bit_le((addr), (size), 0);  \
+	     (bit) < (size);                                    \
+	     (bit) = find_next_zero_bit_le((addr), (size), (bit) + 1))
+
+/**
+ * preamble_current - common variable initialization for nd_label_* routines
+ * @ndd: dimm driver data for the relevant label set
+ * @nsindex: on return set to the currently active namespace index
+ * @free: on return set to the free label bitmap in the index
+ * @nslot: on return set to the number of slots in the label space
+ */
+static bool preamble_current(struct nd_dimm_drvdata *ndd,
+		struct nd_namespace_index **nsindex,
+		unsigned long **free, u32 *nslot)
+{
+	*nsindex = to_current_namespace_index(ndd);
+	if (*nsindex == NULL)
+		return false;
+
+	*free = (unsigned long __force *) (*nsindex)->free;
+	*nslot = readl(&(*nsindex)->nslot);
+
+	return true;
+}
+
+static char *nd_label_gen_id(struct nd_label_id *label_id, u8 *uuid, u32 flags)
+{
+	if (!label_id || !uuid)
+		return NULL;
+	snprintf(label_id->id, ND_LABEL_ID_SIZE, "%s-%pUb",
+			flags & NSLABEL_FLAG_LOCAL ? "blk" : "pmem", uuid);
+	return label_id->id;
+}
+
+static bool slot_valid(struct nd_namespace_label __iomem *nd_label, u32 slot)
+{
+	/* check that we are written where we expect to be written */
+	if (slot != readl(&nd_label->slot))
+		return false;
+
+	/* check that DPA allocations are page aligned */
+	if ((readq(&nd_label->dpa) | readq(&nd_label->rawsize)) % SZ_4K)
+		return false;
+
+	return true;
+}
+
+int nd_label_reserve_dpa(struct nd_dimm_drvdata *ndd)
+{
+	struct nd_namespace_index __iomem *nsindex;
+	unsigned long *free;
+	u32 nslot, slot;
+
+	if (!preamble_current(ndd, &nsindex, &free, &nslot))
+		return 0; /* no label, nothing to reserve */
+
+	for_each_clear_bit_le(slot, free, nslot) {
+		struct nd_namespace_label __iomem *nd_label;
+		struct nd_region *nd_region = NULL;
+		u8 label_uuid[NSLABEL_UUID_LEN];
+		struct nd_label_id label_id;
+		struct resource *res;
+		u32 flags;
+
+		nd_label = nd_label_base(ndd) + slot;
+
+		if (!slot_valid(nd_label, slot))
+			continue;
+
+		memcpy_fromio(label_uuid, nd_label->uuid, NSLABEL_UUID_LEN);
+		flags = readl(&nd_label->flags);
+		nd_label_gen_id(&label_id, label_uuid, flags);
+		res = nd_dimm_allocate_dpa(ndd, &label_id, readq(&nd_label->dpa),
+				readq(&nd_label->rawsize));
+		nd_dbg_dpa(nd_region, ndd, res, "reserve\n");
+		if (!res)
+			return -EBUSY;
+	}
+
+	return 0;
+}
diff --git a/drivers/block/nd/label.h b/drivers/block/nd/label.h
new file mode 100644
index 000000000000..79ed885a43c0
--- /dev/null
+++ b/drivers/block/nd/label.h
@@ -0,0 +1,129 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#ifndef __LABEL_H__
+#define __LABEL_H__
+
+#include <linux/ndctl.h>
+#include <linux/sizes.h>
+#include <linux/io.h>
+
+enum {
+	NSINDEX_SIG_LEN = 16,
+	NSINDEX_ALIGN = 256,
+	NSINDEX_SEQ_MASK = 0x3,
+	NSLABEL_UUID_LEN = 16,
+	NSLABEL_NAME_LEN = 64,
+	NSLABEL_FLAG_ROLABEL = 0x1,  /* read-only label */
+	NSLABEL_FLAG_LOCAL = 0x2,    /* DIMM-local namespace */
+	NSLABEL_FLAG_BTT = 0x4,      /* namespace contains a BTT */
+	NSLABEL_FLAG_UPDATING = 0x8, /* label being updated */
+	BTT_ALIGN = 4096,            /* all btt structures */
+	BTTINFO_SIG_LEN = 16,
+	BTTINFO_UUID_LEN = 16,
+	BTTINFO_FLAG_ERROR = 0x1,    /* error state (read-only) */
+	BTTINFO_MAJOR_VERSION = 1,
+	ND_LABEL_MIN_SIZE = 512 * 129, /* see sizeof_namespace_index() */
+	ND_LABEL_ID_SIZE = 50,
+};
+
+static const char NSINDEX_SIGNATURE[] = "NAMESPACE_INDEX\0";
+
+/**
+ * struct nd_namespace_index - label set superblock
+ * @sig: NAMESPACE_INDEX\0
+ * @flags: placeholder
+ * @seq: sequence number for this index
+ * @myoff: offset of this index in label area
+ * @mysize: size of this index struct
+ * @otheroff: offset of other index
+ * @labeloff: offset of first label slot
+ * @nslot: total number of label slots
+ * @major: label area major version
+ * @minor: label area minor version
+ * @checksum: fletcher64 of all fields
+ * @free[0]: bitmap, nlabel bits
+ *
+ * The size of free[] is rounded up so the total struct size is a
+ * multiple of NSINDEX_ALIGN bytes.  Any bits this allocates beyond
+ * nlabel bits must be zero.
+ */
+struct nd_namespace_index {
+	u8 sig[NSINDEX_SIG_LEN];
+	__le32 flags;
+	__le32 seq;
+	__le64 myoff;
+	__le64 mysize;
+	__le64 otheroff;
+	__le64 labeloff;
+	__le32 nslot;
+	__le16 major;
+	__le16 minor;
+	__le64 checksum;
+	u8 free[0];
+};
+
+/**
+ * struct nd_namespace_label - namespace superblock
+ * @uuid: UUID per RFC 4122
+ * @name: optional name (NULL-terminated)
+ * @flags: see NSLABEL_FLAG_*
+ * @nlabel: num labels to describe this ns
+ * @position: labels position in set
+ * @isetcookie: interleave set cookie
+ * @lbasize: LBA size in bytes or 0 for pmem
+ * @dpa: DPA of NVM range on this DIMM
+ * @rawsize: size of namespace
+ * @slot: slot of this label in label area
+ * @unused: must be zero
+ */
+struct nd_namespace_label {
+	u8 uuid[NSLABEL_UUID_LEN];
+	u8 name[NSLABEL_NAME_LEN];
+	__le32 flags;
+	__le16 nlabel;
+	__le16 position;
+	__le64 isetcookie;
+	__le64 lbasize;
+	__le64 dpa;
+	__le64 rawsize;
+	__le32 slot;
+	__le32 unused;
+};
+
+/**
+ * struct nd_label_id - identifier string for dpa allocation
+ * @id: "{blk|pmem}-<namespace uuid>"
+ */
+struct nd_label_id {
+	char id[ND_LABEL_ID_SIZE];
+};
+
+/*
+ * If the 'best' index is invalid, so is the 'next' index.  Otherwise,
+ * the next index is MOD(index+1, 2)
+ */
+static inline int nd_label_next_nsindex(int index)
+{
+	if (index < 0)
+		return -1;
+
+	return (index + 1) % 2;
+}
+
+struct nd_dimm_drvdata;
+int nd_label_validate(struct nd_dimm_drvdata *ndd);
+void nd_label_copy(struct nd_dimm_drvdata *ndd,
+		struct nd_namespace_index *dst,
+		struct nd_namespace_index *src);
+size_t sizeof_namespace_index(struct nd_dimm_drvdata *ndd);
+#endif /* __LABEL_H__ */
diff --git a/drivers/block/nd/nd.h b/drivers/block/nd/nd.h
index 905183d45799..63540ffe845d 100644
--- a/drivers/block/nd/nd.h
+++ b/drivers/block/nd/nd.h
@@ -16,11 +16,15 @@
 #include <linux/libnd.h>
 #include <linux/mutex.h>
 #include <linux/ndctl.h>
+#include "label.h"
 
 struct nd_dimm_drvdata {
 	struct device *dev;
+	int nsindex_size;
 	struct nd_cmd_get_config_size nsarea;
 	void *data;
+	int ns_current, ns_next;
+	struct resource dpa;
 };
 
 struct nd_region_namespaces {
@@ -28,6 +32,37 @@ struct nd_region_namespaces {
 	int active;
 };
 
+static inline struct nd_namespace_index __iomem *to_namespace_index(
+		struct nd_dimm_drvdata *ndd, int i)
+{
+	if (i < 0)
+		return NULL;
+
+	return ((void __iomem *) ndd->data + sizeof_namespace_index(ndd) * i);
+}
+
+static inline struct nd_namespace_index __iomem *to_current_namespace_index(
+		struct nd_dimm_drvdata *ndd)
+{
+	return to_namespace_index(ndd, ndd->ns_current);
+}
+
+static inline struct nd_namespace_index __iomem *to_next_namespace_index(
+		struct nd_dimm_drvdata *ndd)
+{
+	return to_namespace_index(ndd, ndd->ns_next);
+}
+
+#define nd_dbg_dpa(r, d, res, fmt, arg...) \
+	dev_dbg((r) ? &(r)->dev : (d)->dev, "%s: %.13s: %#llx @ %#llx " fmt, \
+		(r) ? dev_name((d)->dev) : "", res ? res->name : "null", \
+		(unsigned long long) (res ? resource_size(res) : 0), \
+		(unsigned long long) (res ? res->start : 0), ##arg)
+
+#define for_each_dpa_resource_safe(ndd, res, next) \
+	for (res = (ndd)->dpa.child, next = res ? res->sibling : NULL; \
+			res; res = next, next = next ? next->sibling : NULL)
+
 struct nd_region {
 	struct device dev;
 	u16 ndr_mappings;
@@ -39,6 +74,15 @@ struct nd_region {
 	struct nd_mapping mapping[0];
 };
 
+/*
+ * Lookup next in the repeating sequence of 01, 10, and 11.
+ */
+static inline unsigned nd_inc_seq(unsigned seq)
+{
+	static const unsigned next[] = { 0, 2, 3, 1 };
+
+	return next[seq & 3];
+}
 enum nd_async_mode {
 	ND_SYNC,
 	ND_ASYNC,
@@ -54,4 +98,9 @@ int nd_region_register_namespaces(struct nd_region *nd_region, int *err);
 void nd_bus_lock(struct device *dev);
 void nd_bus_unlock(struct device *dev);
 bool is_nd_bus_locked(struct device *dev);
+int nd_label_reserve_dpa(struct nd_dimm_drvdata *ndd);
+void nd_dimm_free_dpa(struct nd_dimm_drvdata *ndd, struct resource *res);
+struct resource *nd_dimm_allocate_dpa(struct nd_dimm_drvdata *ndd,
+		struct nd_label_id *label_id, resource_size_t start,
+		resource_size_t n);
 #endif /* __ND_H__ */
diff --git a/include/uapi/linux/ndctl.h b/include/uapi/linux/ndctl.h
index 5ffa319f3408..624a19d9e6e4 100644
--- a/include/uapi/linux/ndctl.h
+++ b/include/uapi/linux/ndctl.h
@@ -175,7 +175,6 @@ static inline const char *nd_dimm_cmd_name(unsigned cmd)
 #define ND_IOCTL_ARS_QUERY		_IOWR(ND_IOCTL, ND_CMD_ARS_QUERY,\
 					struct nd_cmd_ars_query)
 
-
 #define ND_DEVICE_DIMM 1            /* nd_dimm: container for "config data" */
 #define ND_DEVICE_REGION_PMEM 2     /* nd_region: (parent of pmem namespaces) */
 #define ND_DEVICE_REGION_BLK 3      /* nd_region: (parent of blk namespaces) */



* [PATCH v3 12/21] libnd: namespace indices: read and validate
@ 2015-05-20 20:57   ` Dan Williams
  0 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2015-05-20 20:57 UTC (permalink / raw)
  To: axboe
  Cc: linux-nvdimm, neilb, gregkh, linux-kernel, mingo, linux-acpi,
	jmoyer, hch

The on-media label format [1] consists of two index blocks followed by
an array of labels.  None of these structures is ever updated in place.
A sequence number tracks the currently active index and the next one to
write, while labels are written to free slots.
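
As a userspace illustration of the sequence-number rule, the sketch below
copies nd_inc_seq() and best_seq() from this patch and adds a small
pick_current() helper.  pick_current() is a hypothetical name; it mirrors
the "two valid indices" case of nd_label_validate() and assumes both
indices already passed validation (so neither sequence number is 0).

```c
#include <assert.h>
#include <stdint.h>

#define NSINDEX_SEQ_MASK 0x3u

/* Advance through the repeating sequence 01 -> 10 -> 11 -> 01; 00 is invalid. */
static unsigned nd_inc_seq(unsigned seq)
{
	static const unsigned next[] = { 0, 2, 3, 1 };

	return next[seq & 3];
}

/* Return whichever of the two sequence numbers is the more recent one. */
static uint32_t best_seq(uint32_t a, uint32_t b)
{
	a &= NSINDEX_SEQ_MASK;
	b &= NSINDEX_SEQ_MASK;

	if (a == 0 || a == b)
		return b;
	else if (b == 0)
		return a;
	else if (nd_inc_seq(a) == b)
		return b;
	else
		return a;
}

/*
 * Hypothetical helper: pick the active index block (0 or 1) when both
 * blocks validated.  Ties go to index 1, as in nd_label_validate().
 */
static int pick_current(uint32_t seq0, uint32_t seq1)
{
	/* validated indices never carry sequence number 0 */
	assert((seq0 & NSINDEX_SEQ_MASK) && (seq1 & NSINDEX_SEQ_MASK));

	return best_seq(seq0, seq1) == (seq1 & NSINDEX_SEQ_MASK) ? 1 : 0;
}
```

Because the sequence wraps from 11 back to 01, "newer" cannot be decided
by a plain numeric comparison, hence the lookup table.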

    +------------+
    |            |
    |  nsindex0  |
    |            |
    +------------+
    |            |
    |  nsindex1  |
    |            |
    +------------+
    |   label0   |
    +------------+
    |   label1   |
    +------------+
    |            |
     ....nslot...
    |            |
    +------------+
    |   labelN   |
    +------------+
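
A worked example of how the layout above is sized, as a standalone
sketch of the arithmetic in sizeof_namespace_index() and the nslot
sanity check in nd_label_validate().  index_size(), nslot_fits() and
LABEL_SIZE are illustrative names, not part of the patch; LABEL_SIZE is
the 128 bytes that the fields of struct nd_namespace_label add up to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
	NSINDEX_ALIGN = 256,
	ND_LABEL_MIN_SIZE = 512 * 129,
	LABEL_SIZE = 128, /* assumed sizeof(struct nd_namespace_label) */
};

/* Size of one index block for a given config area size, in bytes. */
static size_t index_size(uint32_t config_size)
{
	/* budget 1 byte of index per 128-byte label: 129 bytes per slot */
	uint32_t index_span = config_size / 129;

	/* split between the two index blocks, in NSINDEX_ALIGN units */
	index_span /= NSINDEX_ALIGN * 2;

	return (size_t)index_span * NSINDEX_ALIGN;
}

/* Do nslot labels plus both index blocks fit in the config area? */
static int nslot_fits(uint32_t nslot, uint32_t config_size)
{
	uint64_t need;

	/* smaller areas are rejected before the label code ever runs */
	assert(config_size >= ND_LABEL_MIN_SIZE);

	need = (uint64_t)nslot * LABEL_SIZE
		+ 2 * (uint64_t)index_size(config_size);

	return need <= config_size;
}
```

For the common 128K config area this yields two 256-byte index blocks
and room for 1020 label slots.  Below ND_LABEL_MIN_SIZE the integer
division produces an index size of 0, which is why that minimum exists.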

After reading valid labels, store the dpa ranges they claim into
per-dimm resource trees.
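
For illustration, here is a userspace sketch of the two per-label steps
that precede a reservation: the sanity check done by slot_valid() and
the "{blk|pmem}-<uuid>" resource name built by nd_label_gen_id().
label_gen_id() and claim_valid() are stand-in names, and the explicit
hex formatting is an assumption that replaces the kernel-only "%pUb"
printk specifier.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

enum {
	NSLABEL_UUID_LEN = 16,
	NSLABEL_FLAG_LOCAL = 0x2,
	ND_LABEL_ID_SIZE = 50,
	SZ_4K = 4096,
};

struct nd_label_id {
	char id[ND_LABEL_ID_SIZE];
};

/* Build "{blk|pmem}-<uuid>"; DIMM-local labels are BLK, the rest PMEM. */
static char *label_gen_id(struct nd_label_id *label_id, const uint8_t *u,
		uint32_t flags)
{
	int n;

	if (!label_id || !u)
		return NULL;

	n = snprintf(label_id->id, sizeof(label_id->id),
		"%s-%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x-"
		"%02x%02x%02x%02x%02x%02x",
		(flags & NSLABEL_FLAG_LOCAL) ? "blk" : "pmem",
		u[0], u[1], u[2], u[3], u[4], u[5], u[6], u[7],
		u[8], u[9], u[10], u[11], u[12], u[13], u[14], u[15]);
	/* the longest id ("pmem-" + 36 characters) must not be truncated */
	assert(n > 0 && n < ND_LABEL_ID_SIZE);

	return label_id->id;
}

/* A label is only honored if it sits in its own slot and is 4K aligned. */
static int claim_valid(uint32_t expected_slot, uint32_t slot, uint64_t dpa,
		uint64_t rawsize)
{
	if (slot != expected_slot)
		return 0;
	if ((dpa | rawsize) % SZ_4K)
		return 0;
	return 1;
}
```

In the patch itself the resulting string becomes the name of the
struct resource requested under the per-dimm dpa root.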

[1]: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf

Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/block/nd/Makefile    |    1 
 drivers/block/nd/dimm.c      |   23 +++
 drivers/block/nd/dimm_devs.c |   30 ++++
 drivers/block/nd/label.c     |  287 ++++++++++++++++++++++++++++++++++++++++++
 drivers/block/nd/label.h     |  129 +++++++++++++++++++
 drivers/block/nd/nd.h        |   49 +++++++
 include/uapi/linux/ndctl.h   |    1 
 7 files changed, 518 insertions(+), 2 deletions(-)
 create mode 100644 drivers/block/nd/label.c
 create mode 100644 drivers/block/nd/label.h

diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index 6f539f01fa82..8d14510559e1 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -10,3 +10,4 @@ libnd-y += dimm.o
 libnd-y += region_devs.o
 libnd-y += region.o
 libnd-y += namespace_devs.o
+libnd-y += label.o
diff --git a/drivers/block/nd/dimm.c b/drivers/block/nd/dimm.c
index c4df1a32a68b..e2f964308672 100644
--- a/drivers/block/nd/dimm.c
+++ b/drivers/block/nd/dimm.c
@@ -18,6 +18,7 @@
 #include <linux/slab.h>
 #include <linux/mm.h>
 #include <linux/nd.h>
+#include "label.h"
 #include "nd.h"
 
 static void free_data(struct nd_dimm_drvdata *ndd)
@@ -42,6 +43,11 @@ static int nd_dimm_probe(struct device *dev)
 		return -ENOMEM;
 
 	dev_set_drvdata(dev, ndd);
+	ndd->dpa.name = dev_name(dev);
+	ndd->ns_current = -1;
+	ndd->ns_next = -1;
+	ndd->dpa.start = 0;
+	ndd->dpa.end = -1;
 	ndd->dev = dev;
 
 	rc = nd_dimm_init_nsarea(ndd);
@@ -54,6 +60,17 @@ static int nd_dimm_probe(struct device *dev)
 
 	dev_dbg(dev, "config data size: %d\n", ndd->nsarea.config_size);
 
+	nd_bus_lock(dev);
+	ndd->ns_current = nd_label_validate(ndd);
+	ndd->ns_next = nd_label_next_nsindex(ndd->ns_current);
+	nd_label_copy(ndd, to_next_namespace_index(ndd),
+			to_current_namespace_index(ndd));
+	rc = nd_label_reserve_dpa(ndd);
+	nd_bus_unlock(dev);
+
+	if (rc)
+		goto err;
+
 	return 0;
 
  err:
@@ -64,7 +81,13 @@ static int nd_dimm_probe(struct device *dev)
 static int nd_dimm_remove(struct device *dev)
 {
 	struct nd_dimm_drvdata *ndd = dev_get_drvdata(dev);
+	struct resource *res, *_r;
 
+	nd_bus_lock(dev);
+	dev_set_drvdata(dev, NULL);
+	for_each_dpa_resource_safe(ndd, res, _r)
+		nd_dimm_free_dpa(ndd, res);
+	nd_bus_unlock(dev);
 	free_data(ndd);
 
 	return 0;
diff --git a/drivers/block/nd/dimm_devs.c b/drivers/block/nd/dimm_devs.c
index 8981adc59ba4..013531b8adfa 100644
--- a/drivers/block/nd/dimm_devs.c
+++ b/drivers/block/nd/dimm_devs.c
@@ -92,8 +92,12 @@ int nd_dimm_init_config_data(struct nd_dimm_drvdata *ndd)
 	if (ndd->data)
 		return 0;
 
-	if (ndd->nsarea.status || ndd->nsarea.max_xfer == 0)
+	if (ndd->nsarea.status || ndd->nsarea.max_xfer == 0
+			|| ndd->nsarea.config_size < ND_LABEL_MIN_SIZE) {
+		dev_dbg(ndd->dev, "failed to init config data area: (%d:%d)\n",
+				ndd->nsarea.max_xfer, ndd->nsarea.config_size);
 		return -ENXIO;
+	}
 
 	ndd->data = kmalloc(ndd->nsarea.config_size, GFP_KERNEL);
 	if (!ndd->data)
@@ -243,6 +247,30 @@ struct nd_dimm *nd_dimm_create(struct nd_bus *nd_bus, void *provider_data,
 }
 EXPORT_SYMBOL_GPL(nd_dimm_create);
 
+void nd_dimm_free_dpa(struct nd_dimm_drvdata *ndd, struct resource *res)
+{
+	WARN_ON_ONCE(!is_nd_bus_locked(ndd->dev));
+	kfree(res->name);
+	__release_region(&ndd->dpa, res->start, resource_size(res));
+}
+
+struct resource *nd_dimm_allocate_dpa(struct nd_dimm_drvdata *ndd,
+		struct nd_label_id *label_id, resource_size_t start,
+		resource_size_t n)
+{
+	char *name = kmemdup(label_id, sizeof(*label_id), GFP_KERNEL);
+	struct resource *res;
+
+	if (!name)
+		return NULL;
+
+	WARN_ON_ONCE(!is_nd_bus_locked(ndd->dev));
+	res = __request_region(&ndd->dpa, start, n, name, 0);
+	if (!res)
+		kfree(name);
+	return res;
+}
+
 static int count_dimms(struct device *dev, void *c)
 {
 	int *count = c;
diff --git a/drivers/block/nd/label.c b/drivers/block/nd/label.c
new file mode 100644
index 000000000000..da5008e45917
--- /dev/null
+++ b/drivers/block/nd/label.c
@@ -0,0 +1,287 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/device.h>
+#include <linux/ndctl.h>
+#include <linux/io.h>
+#include <linux/nd.h>
+#include "nd-private.h"
+#include "label.h"
+#include "nd.h"
+
+#include <asm-generic/io-64-nonatomic-lo-hi.h>
+
+static u32 best_seq(u32 a, u32 b)
+{
+	a &= NSINDEX_SEQ_MASK;
+	b &= NSINDEX_SEQ_MASK;
+
+	if (a == 0 || a == b)
+		return b;
+	else if (b == 0)
+		return a;
+	else if (nd_inc_seq(a) == b)
+		return b;
+	else
+		return a;
+}
+
+size_t sizeof_namespace_index(struct nd_dimm_drvdata *ndd)
+{
+	u32 index_span;
+
+	if (ndd->nsindex_size)
+		return ndd->nsindex_size;
+
+	/*
+	 * The minimum index space is 512 bytes.  With that amount of
+	 * index we can describe ~1400 labels, which is less than a byte
+	 * of overhead per label.  Round up to a byte of overhead per
+	 * label and determine the size of the index region.  Yes, this
+	 * starts to waste space at larger config_sizes, but it's
+	 * unlikely we'll ever see anything but 128K.
+	 */
+	index_span = ndd->nsarea.config_size / 129;
+	index_span /= NSINDEX_ALIGN * 2;
+	ndd->nsindex_size = index_span * NSINDEX_ALIGN;
+
+	return ndd->nsindex_size;
+}
+
+int nd_label_validate(struct nd_dimm_drvdata *ndd)
+{
+	/*
+	 * The on-media label format consists of two index blocks followed
+	 * by an array of labels.  None of these structures are ever
+	 * updated in place.  A sequence number tracks the current
+	 * active index and the next one to write, while labels are
+	 * written to free slots.
+	 *
+	 *     +------------+
+	 *     |            |
+	 *     |  nsindex0  |
+	 *     |            |
+	 *     +------------+
+	 *     |            |
+	 *     |  nsindex1  |
+	 *     |            |
+	 *     +------------+
+	 *     |   label0   |
+	 *     +------------+
+	 *     |   label1   |
+	 *     +------------+
+	 *     |            |
+	 *      ....nslot...
+	 *     |            |
+	 *     +------------+
+	 *     |   labelN   |
+	 *     +------------+
+	 */
+	struct nd_namespace_index __iomem *nsindex[] = {
+		to_namespace_index(ndd, 0),
+		to_namespace_index(ndd, 1),
+	};
+	const int num_index = ARRAY_SIZE(nsindex);
+	struct device *dev = ndd->dev;
+	bool valid[] = { false, false };
+	int i, num_valid = 0;
+	u32 seq;
+
+	for (i = 0; i < num_index; i++) {
+		u64 sum_save, sum;
+		u8 sig[NSINDEX_SIG_LEN];
+
+		memcpy_fromio(sig, nsindex[i]->sig, NSINDEX_SIG_LEN);
+		if (memcmp(sig, NSINDEX_SIGNATURE, NSINDEX_SIG_LEN) != 0) {
+			dev_dbg(dev, "%s: nsindex%d signature invalid\n",
+					__func__, i);
+			continue;
+		}
+		sum_save = readq(&nsindex[i]->checksum);
+		writeq(0, &nsindex[i]->checksum);
+		sum = nd_fletcher64((void * __force) nsindex[i],
+				sizeof_namespace_index(ndd), 1);
+		writeq(sum_save, &nsindex[i]->checksum);
+		if (sum != sum_save) {
+			dev_dbg(dev, "%s: nsindex%d checksum invalid\n",
+					__func__, i);
+			continue;
+		}
+		if ((readl(&nsindex[i]->seq) & NSINDEX_SEQ_MASK) == 0) {
+			dev_dbg(dev, "%s: nsindex%d sequence: %#x invalid\n",
+					__func__, i, readl(&nsindex[i]->seq));
+			continue;
+		}
+
+		/* sanity check the index against expected values */
+		if (readq(&nsindex[i]->myoff)
+				!= i * sizeof_namespace_index(ndd)) {
+			dev_dbg(dev, "%s: nsindex%d myoff: %#llx invalid\n",
+					__func__, i, (unsigned long long)
+					readq(&nsindex[i]->myoff));
+			continue;
+		}
+		if (readq(&nsindex[i]->otheroff)
+				!= (!i) * sizeof_namespace_index(ndd)) {
+			dev_dbg(dev, "%s: nsindex%d otheroff: %#llx invalid\n",
+					__func__, i, (unsigned long long)
+					readq(&nsindex[i]->otheroff));
+			continue;
+		}
+		if (readq(&nsindex[i]->mysize) > sizeof_namespace_index(ndd)
+				|| readq(&nsindex[i]->mysize)
+				< sizeof(struct nd_namespace_index)) {
+			dev_dbg(dev, "%s: nsindex%d mysize: %#llx invalid\n",
+					__func__, i, (unsigned long long)
+					readq(&nsindex[i]->mysize));
+			continue;
+		}
+		if (readl(&nsindex[i]->nslot) * sizeof(struct nd_namespace_label)
+				+ 2 * sizeof_namespace_index(ndd)
+				> ndd->nsarea.config_size) {
+			dev_dbg(dev, "%s: nsindex%d nslot: %u invalid, config_size: %#x\n",
+					__func__, i, readl(&nsindex[i]->nslot),
+					ndd->nsarea.config_size);
+			continue;
+		}
+		valid[i] = true;
+		num_valid++;
+	}
+
+	switch (num_valid) {
+	case 0:
+		break;
+	case 1:
+		for (i = 0; i < num_index; i++)
+			if (valid[i])
+				return i;
+		/* can't have num_valid > 0 but valid[] = { false, false } */
+		WARN_ON(1);
+		break;
+	default:
+		/* pick the best index... */
+		seq = best_seq(readl(&nsindex[0]->seq), readl(&nsindex[1]->seq));
+		if (seq == (readl(&nsindex[1]->seq) & NSINDEX_SEQ_MASK))
+			return 1;
+		else
+			return 0;
+		break;
+	}
+
+	return -1;
+}
+
+void nd_label_copy(struct nd_dimm_drvdata *ndd,
+		struct nd_namespace_index __iomem *dst,
+		struct nd_namespace_index __iomem *src)
+{
+	void *s, *d;
+
+	if (dst && src)
+		/* pass */;
+	else
+		return;
+
+	d = (void * __force) dst;
+	s = (void * __force) src;
+	memcpy(d, s, sizeof_namespace_index(ndd));
+}
+
+static struct nd_namespace_label __iomem *nd_label_base(struct nd_dimm_drvdata *ndd)
+{
+	void *base = to_namespace_index(ndd, 0);
+
+	return base + 2 * sizeof_namespace_index(ndd);
+}
+
+#define for_each_clear_bit_le(bit, addr, size) \
+	for ((bit) = find_next_zero_bit_le((addr), (size), 0);  \
+	     (bit) < (size);                                    \
+	     (bit) = find_next_zero_bit_le((addr), (size), (bit) + 1))
+
+/**
+ * preamble_current - common variable initialization for nd_label_* routines
+ * @ndd: dimm driver data for the relevant label set
+ * @nsindex: on return set to the currently active namespace index
+ * @free: on return set to the free label bitmap in the index
+ * @nslot: on return set to the number of slots in the label space
+ */
+static bool preamble_current(struct nd_dimm_drvdata *ndd,
+		struct nd_namespace_index **nsindex,
+		unsigned long **free, u32 *nslot)
+{
+	*nsindex = to_current_namespace_index(ndd);
+	if (*nsindex == NULL)
+		return false;
+
+	*free = (unsigned long __force *) (*nsindex)->free;
+	*nslot = readl(&(*nsindex)->nslot);
+
+	return true;
+}
+
+static char *nd_label_gen_id(struct nd_label_id *label_id, u8 *uuid, u32 flags)
+{
+	if (!label_id || !uuid)
+		return NULL;
+	snprintf(label_id->id, ND_LABEL_ID_SIZE, "%s-%pUb",
+			flags & NSLABEL_FLAG_LOCAL ? "blk" : "pmem", uuid);
+	return label_id->id;
+}
+
+static bool slot_valid(struct nd_namespace_label __iomem *nd_label, u32 slot)
+{
+	/* check that we are written where we expect to be written */
+	if (slot != readl(&nd_label->slot))
+		return false;
+
+	/* check that DPA allocations are page aligned */
+	if ((readq(&nd_label->dpa) | readq(&nd_label->rawsize)) % SZ_4K)
+		return false;
+
+	return true;
+}
+
+int nd_label_reserve_dpa(struct nd_dimm_drvdata *ndd)
+{
+	struct nd_namespace_index __iomem *nsindex;
+	unsigned long *free;
+	u32 nslot, slot;
+
+	if (!preamble_current(ndd, &nsindex, &free, &nslot))
+		return 0; /* no label, nothing to reserve */
+
+	for_each_clear_bit_le(slot, free, nslot) {
+		struct nd_namespace_label __iomem *nd_label;
+		struct nd_region *nd_region = NULL;
+		u8 label_uuid[NSLABEL_UUID_LEN];
+		struct nd_label_id label_id;
+		struct resource *res;
+		u32 flags;
+
+		nd_label = nd_label_base(ndd) + slot;
+
+		if (!slot_valid(nd_label, slot))
+			continue;
+
+		memcpy_fromio(label_uuid, nd_label->uuid, NSLABEL_UUID_LEN);
+		flags = readl(&nd_label->flags);
+		nd_label_gen_id(&label_id, label_uuid, flags);
+		res = nd_dimm_allocate_dpa(ndd, &label_id, readq(&nd_label->dpa),
+				readq(&nd_label->rawsize));
+		nd_dbg_dpa(nd_region, ndd, res, "reserve\n");
+		if (!res)
+			return -EBUSY;
+	}
+
+	return 0;
+}
diff --git a/drivers/block/nd/label.h b/drivers/block/nd/label.h
new file mode 100644
index 000000000000..79ed885a43c0
--- /dev/null
+++ b/drivers/block/nd/label.h
@@ -0,0 +1,129 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#ifndef __LABEL_H__
+#define __LABEL_H__
+
+#include <linux/ndctl.h>
+#include <linux/sizes.h>
+#include <linux/io.h>
+
+enum {
+	NSINDEX_SIG_LEN = 16,
+	NSINDEX_ALIGN = 256,
+	NSINDEX_SEQ_MASK = 0x3,
+	NSLABEL_UUID_LEN = 16,
+	NSLABEL_NAME_LEN = 64,
+	NSLABEL_FLAG_ROLABEL = 0x1,  /* read-only label */
+	NSLABEL_FLAG_LOCAL = 0x2,    /* DIMM-local namespace */
+	NSLABEL_FLAG_BTT = 0x4,      /* namespace contains a BTT */
+	NSLABEL_FLAG_UPDATING = 0x8, /* label being updated */
+	BTT_ALIGN = 4096,            /* all btt structures */
+	BTTINFO_SIG_LEN = 16,
+	BTTINFO_UUID_LEN = 16,
+	BTTINFO_FLAG_ERROR = 0x1,    /* error state (read-only) */
+	BTTINFO_MAJOR_VERSION = 1,
+	ND_LABEL_MIN_SIZE = 512 * 129, /* see sizeof_namespace_index() */
+	ND_LABEL_ID_SIZE = 50,
+};
+
+static const char NSINDEX_SIGNATURE[] = "NAMESPACE_INDEX\0";
+
+/**
+ * struct nd_namespace_index - label set superblock
+ * @sig: NAMESPACE_INDEX\0
+ * @flags: placeholder
+ * @seq: sequence number for this index
+ * @myoff: offset of this index in label area
+ * @mysize: size of this index struct
+ * @otheroff: offset of other index
+ * @labeloff: offset of first label slot
+ * @nslot: total number of label slots
+ * @major: label area major version
+ * @minor: label area minor version
+ * @checksum: fletcher64 of all fields
+ * @free[0]: bitmap, nlabel bits
+ *
+ * The size of free[] is rounded up so the total struct size is a
+ * multiple of NSINDEX_ALIGN bytes.  Any bits this allocates beyond
+ * nlabel bits must be zero.
+ */
+struct nd_namespace_index {
+	u8 sig[NSINDEX_SIG_LEN];
+	__le32 flags;
+	__le32 seq;
+	__le64 myoff;
+	__le64 mysize;
+	__le64 otheroff;
+	__le64 labeloff;
+	__le32 nslot;
+	__le16 major;
+	__le16 minor;
+	__le64 checksum;
+	u8 free[0];
+};
+
+/**
+ * struct nd_namespace_label - namespace superblock
+ * @uuid: UUID per RFC 4122
+ * @name: optional name (NULL-terminated)
+ * @flags: see NSLABEL_FLAG_*
+ * @nlabel: num labels to describe this ns
+ * @position: labels position in set
+ * @isetcookie: interleave set cookie
+ * @lbasize: LBA size in bytes or 0 for pmem
+ * @dpa: DPA of NVM range on this DIMM
+ * @rawsize: size of namespace
+ * @slot: slot of this label in label area
+ * @unused: must be zero
+ */
+struct nd_namespace_label {
+	u8 uuid[NSLABEL_UUID_LEN];
+	u8 name[NSLABEL_NAME_LEN];
+	__le32 flags;
+	__le16 nlabel;
+	__le16 position;
+	__le64 isetcookie;
+	__le64 lbasize;
+	__le64 dpa;
+	__le64 rawsize;
+	__le32 slot;
+	__le32 unused;
+};
+
+/**
+ * struct nd_label_id - identifier string for dpa allocation
+ * @id: "{blk|pmem}-<namespace uuid>"
+ */
+struct nd_label_id {
+	char id[ND_LABEL_ID_SIZE];
+};
+
+/*
+ * If the 'best' index is invalid, so is the 'next' index.  Otherwise,
+ * the next index is MOD(index+1, 2)
+ */
+static inline int nd_label_next_nsindex(int index)
+{
+	if (index < 0)
+		return -1;
+
+	return (index + 1) % 2;
+}
+
+struct nd_dimm_drvdata;
+int nd_label_validate(struct nd_dimm_drvdata *ndd);
+void nd_label_copy(struct nd_dimm_drvdata *ndd,
+		struct nd_namespace_index *dst,
+		struct nd_namespace_index *src);
+size_t sizeof_namespace_index(struct nd_dimm_drvdata *ndd);
+#endif /* __LABEL_H__ */
diff --git a/drivers/block/nd/nd.h b/drivers/block/nd/nd.h
index 905183d45799..63540ffe845d 100644
--- a/drivers/block/nd/nd.h
+++ b/drivers/block/nd/nd.h
@@ -16,11 +16,15 @@
 #include <linux/libnd.h>
 #include <linux/mutex.h>
 #include <linux/ndctl.h>
+#include "label.h"
 
 struct nd_dimm_drvdata {
 	struct device *dev;
+	int nsindex_size;
 	struct nd_cmd_get_config_size nsarea;
 	void *data;
+	int ns_current, ns_next;
+	struct resource dpa;
 };
 
 struct nd_region_namespaces {
@@ -28,6 +32,37 @@ struct nd_region_namespaces {
 	int active;
 };
 
+static inline struct nd_namespace_index __iomem *to_namespace_index(
+		struct nd_dimm_drvdata *ndd, int i)
+{
+	if (i < 0)
+		return NULL;
+
+	return ((void __iomem *) ndd->data + sizeof_namespace_index(ndd) * i);
+}
+
+static inline struct nd_namespace_index __iomem *to_current_namespace_index(
+		struct nd_dimm_drvdata *ndd)
+{
+	return to_namespace_index(ndd, ndd->ns_current);
+}
+
+static inline struct nd_namespace_index __iomem *to_next_namespace_index(
+		struct nd_dimm_drvdata *ndd)
+{
+	return to_namespace_index(ndd, ndd->ns_next);
+}
+
+#define nd_dbg_dpa(r, d, res, fmt, arg...) \
+	dev_dbg((r) ? &(r)->dev : (d)->dev, "%s: %.13s: %#llx @ %#llx " fmt, \
+		(r) ? dev_name((d)->dev) : "", res ? res->name : "null", \
+		(unsigned long long) (res ? resource_size(res) : 0), \
+		(unsigned long long) (res ? res->start : 0), ##arg)
+
+#define for_each_dpa_resource_safe(ndd, res, next) \
+	for (res = (ndd)->dpa.child, next = res ? res->sibling : NULL; \
+			res; res = next, next = next ? next->sibling : NULL)
+
 struct nd_region {
 	struct device dev;
 	u16 ndr_mappings;
@@ -39,6 +74,15 @@ struct nd_region {
 	struct nd_mapping mapping[0];
 };
 
+/*
+ * Lookup next in the repeating sequence of 01, 10, and 11.
+ */
+static inline unsigned nd_inc_seq(unsigned seq)
+{
+	static const unsigned next[] = { 0, 2, 3, 1 };
+
+	return next[seq & 3];
+}
 enum nd_async_mode {
 	ND_SYNC,
 	ND_ASYNC,
@@ -54,4 +98,9 @@ int nd_region_register_namespaces(struct nd_region *nd_region, int *err);
 void nd_bus_lock(struct device *dev);
 void nd_bus_unlock(struct device *dev);
 bool is_nd_bus_locked(struct device *dev);
+int nd_label_reserve_dpa(struct nd_dimm_drvdata *ndd);
+void nd_dimm_free_dpa(struct nd_dimm_drvdata *ndd, struct resource *res);
+struct resource *nd_dimm_allocate_dpa(struct nd_dimm_drvdata *ndd,
+		struct nd_label_id *label_id, resource_size_t start,
+		resource_size_t n);
 #endif /* __ND_H__ */
diff --git a/include/uapi/linux/ndctl.h b/include/uapi/linux/ndctl.h
index 5ffa319f3408..624a19d9e6e4 100644
--- a/include/uapi/linux/ndctl.h
+++ b/include/uapi/linux/ndctl.h
@@ -175,7 +175,6 @@ static inline const char *nd_dimm_cmd_name(unsigned cmd)
 #define ND_IOCTL_ARS_QUERY		_IOWR(ND_IOCTL, ND_CMD_ARS_QUERY,\
 					struct nd_cmd_ars_query)
 
-
 #define ND_DEVICE_DIMM 1            /* nd_dimm: container for "config data" */
 #define ND_DEVICE_REGION_PMEM 2     /* nd_region: (parent of pmem namespaces) */
 #define ND_DEVICE_REGION_BLK 3      /* nd_region: (parent of blk namespaces) */



* [PATCH v3 13/21] libnd: pmem label sets and namespace instantiation.
  2015-05-20 20:56 ` Dan Williams
@ 2015-05-20 20:57   ` Dan Williams
  -1 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2015-05-20 20:57 UTC (permalink / raw)
  To: axboe
  Cc: linux-nvdimm, neilb, gregkh, linux-kernel, mingo, linux-acpi,
	jmoyer, hch

A complete label set consists of one PMEM label per DIMM in an
interleave set, where all the UUIDs match and the interleave set cookie
of each label matches that of the hosting interleave set.
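
A minimal sketch of that completeness rule, assuming a reduced
per-DIMM label view.  struct pmem_label and label_set_complete() are
hypothetical names for illustration only; the real checks live in
namespace_devs.c and may cover more than is shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { NSLABEL_UUID_LEN = 16 };

/* Hypothetical, reduced view of the PMEM label found on one DIMM. */
struct pmem_label {
	uint8_t uuid[NSLABEL_UUID_LEN];
	uint64_t isetcookie;
};

/*
 * labels[] holds one entry per DIMM of the interleave set.  The set is
 * complete only if every label carries the same UUID and every label's
 * cookie equals the cookie of the hosting interleave set.
 */
static bool label_set_complete(const struct pmem_label *labels,
		size_t ndimms, uint64_t iset_cookie)
{
	size_t i;

	assert(labels != NULL);
	if (ndimms == 0)
		return false;

	for (i = 0; i < ndimms; i++) {
		if (labels[i].isetcookie != iset_cookie)
			return false;
		if (memcmp(labels[i].uuid, labels[0].uuid,
					NSLABEL_UUID_LEN) != 0)
			return false;
	}

	return true;
}
```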

Present sysfs attributes for manipulating a PMEM namespace's 'alt_name',
'uuid', and 'size'.  A later patch will make these settings persistent
by writing back the label.
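
To show what the 'uuid' attribute accepts, here is a userspace port of
the parsing loop in nd_uuid_parse().  It is a sketch under stated
assumptions: hex_val() stands in for the kernel's hex_to_bin(), the
dev_dbg() reporting is dropped, and, unlike the patch, a terminating
'\0' is not skipped as a separator so the loop never reads past the
end of a short string.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-in for hex_to_bin(); caller guarantees a hex digit. */
static int hex_val(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	return tolower((unsigned char)c) - 'a' + 10;
}

/* Separators that may appear between any two bytes of the uuid. */
static int is_uuid_sep(char sep)
{
	return sep == '\n' || sep == '-' || sep == ':';
}

/* Parse 16 hex-encoded bytes; uuid_out is only written on success. */
static int uuid_parse(uint8_t *uuid_out, const char *buf)
{
	const char *str = buf;
	uint8_t uuid[16];
	int i;

	assert(uuid_out && buf);

	for (i = 0; i < 16; i++) {
		if (!isxdigit((unsigned char)str[0])
				|| !isxdigit((unsigned char)str[1]))
			return -EINVAL;

		uuid[i] = (uint8_t)((hex_val(str[0]) << 4) | hex_val(str[1]));
		str += 2;
		if (is_uuid_sep(*str))
			str++;
	}

	memcpy(uuid_out, uuid, sizeof(uuid));
	return 0;
}
```

Since a separator is optional after every byte, both the canonical
8-4-4-4-12 form and an undelimited 32-digit string are accepted.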

Note that PMEM allocations grow forwards from the start of an interleave
set (lowest dimm-physical-address (DPA)).  BLK-namespaces that alias
with a PMEM interleave set will grow allocations backward from the
highest DPA.
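
The two growth directions can be sketched with a simplified model of
one DIMM's share of an interleave set.  struct mapping and the helper
names below are hypothetical; the final calculation follows the tail of
nd_pmem_available_dpa() (available = blk_start - start, minus whatever
PMEM already occupies), without the resource-tree walk.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified view of one DIMM's part of an interleave set. */
struct mapping {
	uint64_t start; /* lowest DPA of the mapping */
	uint64_t size;  /* number of bytes in the mapping */
};

/* PMEM grows up from the lowest DPA: exclusive end of pmem_size bytes. */
static uint64_t pmem_end(const struct mapping *m, uint64_t pmem_size)
{
	return m->start + pmem_size;
}

/* BLK grows down from the highest DPA: start of blk_size bytes at the top. */
static uint64_t blk_start(const struct mapping *m, uint64_t blk_size)
{
	return m->start + m->size - blk_size;
}

/* PMEM capacity left once BLK holds the top and PMEM holds the bottom. */
static uint64_t pmem_available(const struct mapping *m, uint64_t pmem_busy,
		uint64_t blk_size)
{
	uint64_t available;

	assert(blk_size <= m->size);
	available = blk_start(m, blk_size) - m->start;

	return pmem_busy < available ? available - pmem_busy : 0;
}
```

Allocating from opposite ends lets both namespace types share the
aliased range until the two allocations meet.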

Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/block/nd/bus.c            |    6 
 drivers/block/nd/core.c           |   64 ++
 drivers/block/nd/dimm_devs.c      |  103 ++++
 drivers/block/nd/label.c          |   54 ++
 drivers/block/nd/label.h          |    3 
 drivers/block/nd/namespace_devs.c | 1024 +++++++++++++++++++++++++++++++++++++
 drivers/block/nd/nd-private.h     |   11 
 drivers/block/nd/nd.h             |   32 +
 drivers/block/nd/pmem.c           |   20 +
 drivers/block/nd/region.c         |    3 
 drivers/block/nd/region_devs.c    |  145 +++++
 include/linux/libnd.h             |    2 
 include/linux/nd.h                |   24 +
 include/uapi/linux/ndctl.h        |    4 
 14 files changed, 1484 insertions(+), 11 deletions(-)

diff --git a/drivers/block/nd/bus.c b/drivers/block/nd/bus.c
index 63b5182cf766..65af6bcc5472 100644
--- a/drivers/block/nd/bus.c
+++ b/drivers/block/nd/bus.c
@@ -364,8 +364,10 @@ u32 nd_cmd_out_size(struct nd_dimm *nd_dimm, int cmd,
 }
 EXPORT_SYMBOL_GPL(nd_cmd_out_size);
 
-static void wait_nd_bus_probe_idle(struct nd_bus *nd_bus)
+void wait_nd_bus_probe_idle(struct device *dev)
 {
+	struct nd_bus *nd_bus = walk_to_nd_bus(dev);
+
 	do {
 		if (nd_bus->probe_active == 0)
 			break;
@@ -384,7 +386,7 @@ static int nd_cmd_clear_to_send(struct nd_dimm *nd_dimm, unsigned int cmd)
 		return 0;
 
 	nd_bus = walk_to_nd_bus(&nd_dimm->dev);
-	wait_nd_bus_probe_idle(nd_bus);
+	wait_nd_bus_probe_idle(&nd_bus->dev);
 
 	if (atomic_read(&nd_dimm->busy))
 		return -EBUSY;
diff --git a/drivers/block/nd/core.c b/drivers/block/nd/core.c
index 38fb8f4c9a2c..0bf69abb47fc 100644
--- a/drivers/block/nd/core.c
+++ b/drivers/block/nd/core.c
@@ -14,6 +14,7 @@
 #include <linux/module.h>
 #include <linux/device.h>
 #include <linux/libnd.h>
+#include <linux/ctype.h>
 #include <linux/ndctl.h>
 #include <linux/mutex.h>
 #include <linux/slab.h>
@@ -107,6 +108,69 @@ struct nd_bus *walk_to_nd_bus(struct device *nd_dev)
 	return NULL;
 }
 
+static bool is_uuid_sep(char sep)
+{
+	if (sep == '\n' || sep == '-' || sep == ':' || sep == '\0')
+		return true;
+	return false;
+}
+
+static int nd_uuid_parse(struct device *dev, u8 *uuid_out, const char *buf,
+		size_t len)
+{
+	const char *str = buf;
+	u8 uuid[16];
+	int i;
+
+	for (i = 0; i < 16; i++) {
+		if (!isxdigit(str[0]) || !isxdigit(str[1])) {
+			dev_dbg(dev, "%s: pos: %d buf[%zd]: %c buf[%zd]: %c\n",
+					__func__, i, str - buf, str[0],
+					str + 1 - buf, str[1]);
+			return -EINVAL;
+		}
+
+		uuid[i] = (hex_to_bin(str[0]) << 4) | hex_to_bin(str[1]);
+		str += 2;
+		if (is_uuid_sep(*str))
+			str++;
+	}
+
+	memcpy(uuid_out, uuid, sizeof(uuid));
+	return 0;
+}
+
+/**
+ * nd_uuid_store: common implementation for writing 'uuid' sysfs attributes
+ * @dev: container device for the uuid property
+ * @uuid_out: uuid buffer to replace
+ * @buf: raw sysfs buffer to parse
+ *
+ * Enforce that uuids can only be changed while the device is disabled
+ * (driver detached)
+ * LOCKING: expects device_lock() is held on entry
+ */
+int nd_uuid_store(struct device *dev, u8 **uuid_out, const char *buf,
+		size_t len)
+{
+	u8 uuid[16];
+	int rc;
+
+	if (dev->driver)
+		return -EBUSY;
+
+	rc = nd_uuid_parse(dev, uuid, buf, len);
+	if (rc)
+		return rc;
+
+	kfree(*uuid_out);
+	*uuid_out = kmemdup(uuid, sizeof(uuid), GFP_KERNEL);
+	if (!(*uuid_out))
+		return -ENOMEM;
+
+	return 0;
+}
+
 static ssize_t commands_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
diff --git a/drivers/block/nd/dimm_devs.c b/drivers/block/nd/dimm_devs.c
index 013531b8adfa..b242d3ae6d12 100644
--- a/drivers/block/nd/dimm_devs.c
+++ b/drivers/block/nd/dimm_devs.c
@@ -159,6 +159,14 @@ struct nd_dimm *to_nd_dimm(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(to_nd_dimm);
 
+struct nd_dimm_drvdata *to_ndd(struct nd_mapping *nd_mapping)
+{
+	struct nd_dimm *nd_dimm = nd_mapping->nd_dimm;
+
+	return dev_get_drvdata(&nd_dimm->dev);
+}
+EXPORT_SYMBOL(to_ndd);
+
 const char *nd_dimm_name(struct nd_dimm *nd_dimm)
 {
 	return dev_name(&nd_dimm->dev);
@@ -247,6 +255,83 @@ struct nd_dimm *nd_dimm_create(struct nd_bus *nd_bus, void *provider_data,
 }
 EXPORT_SYMBOL_GPL(nd_dimm_create);
 
+/**
+ * nd_pmem_available_dpa - for the given dimm+region account unallocated dpa
+ * @nd_mapping: container of dpa-resource-root + labels
+ * @nd_region: constrain available space check to this reference region
+ * @overlap: calculate available space assuming this level of overlap
+ *
+ * Validate that a PMEM label, if present, aligns with the start of an
+ * interleave set and truncate the available size at the lowest BLK
+ * overlap point.
+ *
+ * The expectation is that this routine is called multiple times as it
+ * probes for the largest BLK encroachment for any single member DIMM of
+ * the interleave set.  Once that value is determined the PMEM-limit for
+ * the set can be established.
+ */
+resource_size_t nd_pmem_available_dpa(struct nd_region *nd_region,
+		struct nd_mapping *nd_mapping, resource_size_t *overlap)
+{
+	resource_size_t map_end, busy = 0, available, blk_start;
+	struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+	struct resource *res;
+	const char *reason;
+
+	if (!ndd)
+		return 0;
+
+	map_end = nd_mapping->start + nd_mapping->size - 1;
+	blk_start = max(nd_mapping->start, map_end + 1 - *overlap);
+	for_each_dpa_resource(ndd, res)
+		if (res->start >= nd_mapping->start && res->start < map_end) {
+			if (strncmp(res->name, "blk", 3) == 0)
+				blk_start = min(blk_start, res->start);
+			else if (res->start != nd_mapping->start) {
+				reason = "misaligned to iset";
+				goto err;
+			} else {
+				if (busy) {
+					reason = "duplicate overlapping PMEM reservations?";
+					goto err;
+				}
+				busy += resource_size(res);
+				continue;
+			}
+		} else if (res->end >= nd_mapping->start && res->end <= map_end) {
+			if (strncmp(res->name, "blk", 3) == 0) {
+				/*
+				 * If a BLK allocation overlaps the start of
+				 * PMEM the entire interleave set may now only
+				 * be used for BLK.
+				 */
+				blk_start = nd_mapping->start;
+			} else {
+				reason = "misaligned to iset";
+				goto err;
+			}
+		} else if (nd_mapping->start > res->start
+				&& nd_mapping->start < res->end) {
+			/* total eclipse of the mapping */
+			busy += nd_mapping->size;
+			blk_start = nd_mapping->start;
+		}
+
+	*overlap = map_end + 1 - blk_start;
+	available = blk_start - nd_mapping->start;
+	if (busy < available)
+		return available - busy;
+	return 0;
+
+ err:
+	/*
+	 * Something is wrong, PMEM must align with the start of the
+	 * interleave set, and there can only be one allocation per set.
+	 */
+	nd_dbg_dpa(nd_region, ndd, res, "%s\n", reason);
+	return 0;
+}
+
 void nd_dimm_free_dpa(struct nd_dimm_drvdata *ndd, struct resource *res)
 {
 	WARN_ON_ONCE(!is_nd_bus_locked(ndd->dev));
@@ -271,6 +356,24 @@ struct resource *nd_dimm_allocate_dpa(struct nd_dimm_drvdata *ndd,
 	return res;
 }
 
+/**
+ * nd_dimm_allocated_dpa - sum up the dpa currently allocated to this label_id
+ * @ndd: container of dpa-resource-root + labels
+ * @label_id: dpa resource name of the form {pmem|blk}-<human readable uuid>
+ */
+resource_size_t nd_dimm_allocated_dpa(struct nd_dimm_drvdata *ndd,
+		struct nd_label_id *label_id)
+{
+	resource_size_t allocated = 0;
+	struct resource *res;
+
+	for_each_dpa_resource(ndd, res)
+		if (strcmp(res->name, label_id->id) == 0)
+			allocated += resource_size(res);
+
+	return allocated;
+}
+
 static int count_dimms(struct device *dev, void *c)
 {
 	int *count = c;
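
The accounting in nd_pmem_available_dpa() above can be sketched as a small
user-space model. This is only an illustration under stated assumptions:
struct dpa_res and model_pmem_available() are hypothetical names, plain
arrays stand in for the kernel resource tree, and the error paths simply
return 0 without logging.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a DPA resource; not the kernel's struct resource. */
struct dpa_res {
	const char *name;	/* "pmem-<uuid>" or "blk-<uuid>" */
	uint64_t start;
	uint64_t end;		/* inclusive */
};

/*
 * PMEM may only occupy [map_start, blk_start), where blk_start is the
 * lowest BLK allocation inside the mapping or the boundary implied by
 * the caller-supplied *overlap, whichever is lower.
 */
static uint64_t model_pmem_available(uint64_t map_start, uint64_t map_size,
		const struct dpa_res *res, size_t nres, uint64_t *overlap)
{
	uint64_t map_end = map_start + map_size - 1;
	uint64_t blk_start, busy = 0, available;
	size_t i;

	/* clamp the starting boundary to the mapping */
	if (*overlap > map_size)
		blk_start = map_start;
	else
		blk_start = map_end + 1 - *overlap;

	for (i = 0; i < nres; i++) {
		const struct dpa_res *r = &res[i];
		int is_blk = strncmp(r->name, "blk", 3) == 0;

		if (r->start >= map_start && r->start < map_end) {
			if (is_blk) {
				if (r->start < blk_start)
					blk_start = r->start;
			} else if (r->start != map_start || busy) {
				return 0;	/* misaligned or duplicate PMEM */
			} else {
				busy += r->end - r->start + 1;
			}
		} else if (r->end >= map_start && r->end <= map_end) {
			if (!is_blk)
				return 0;	/* misaligned PMEM */
			blk_start = map_start;	/* BLK covers the start */
		} else if (map_start > r->start && map_start < r->end) {
			busy += map_size;	/* mapping fully eclipsed */
			blk_start = map_start;
		}
	}

	*overlap = map_end + 1 - blk_start;
	available = blk_start - map_start;
	return busy < available ? available - busy : 0;
}
```

Calling the model once per member DIMM with the same overlap variable
mirrors the "called multiple times" expectation in the kernel-doc: each
call can only raise *overlap to the largest BLK encroachment seen so far.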
diff --git a/drivers/block/nd/label.c b/drivers/block/nd/label.c
index da5008e45917..ecd196b42d57 100644
--- a/drivers/block/nd/label.c
+++ b/drivers/block/nd/label.c
@@ -229,7 +229,7 @@ static bool preamble_current(struct nd_dimm_drvdata *ndd,
 	return true;
 }
 
-static char *nd_label_gen_id(struct nd_label_id *label_id, u8 *uuid, u32 flags)
+char *nd_label_gen_id(struct nd_label_id *label_id, u8 *uuid, u32 flags)
 {
 	if (!label_id || !uuid)
 		return NULL;
@@ -285,3 +285,55 @@ int nd_label_reserve_dpa(struct nd_dimm_drvdata *ndd)
 
 	return 0;
 }
+
+int nd_label_active_count(struct nd_dimm_drvdata *ndd)
+{
+	struct nd_namespace_index __iomem *nsindex;
+	unsigned long *free;
+	u32 nslot, slot;
+	int count = 0;
+
+	if (!preamble_current(ndd, &nsindex, &free, &nslot))
+		return 0;
+
+	for_each_clear_bit_le(slot, free, nslot) {
+		struct nd_namespace_label __iomem *nd_label;
+
+		nd_label = nd_label_base(ndd) + slot;
+
+		if (!slot_valid(nd_label, slot)) {
+			dev_dbg(ndd->dev,
+				"%s: slot%d invalid slot: %d dpa: %lx rawsize: %lx\n",
+					__func__, slot, readl(&nd_label->slot),
+					(unsigned long) readq(&nd_label->dpa),
+					(unsigned long) readq(&nd_label->rawsize));
+			continue;
+		}
+		count++;
+	}
+	return count;
+}
+
+struct nd_namespace_label __iomem *nd_label_active(
+		struct nd_dimm_drvdata *ndd, int n)
+{
+	struct nd_namespace_index __iomem *nsindex;
+	unsigned long *free;
+	u32 nslot, slot;
+
+	if (!preamble_current(ndd, &nsindex, &free, &nslot))
+		return NULL;
+
+	for_each_clear_bit_le(slot, free, nslot) {
+		struct nd_namespace_label __iomem *nd_label;
+
+		nd_label = nd_label_base(ndd) + slot;
+		if (!slot_valid(nd_label, slot))
+			continue;
+
+		if (n-- == 0)
+			return nd_label_base(ndd) + slot;
+	}
+
+	return NULL;
+}
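
For reviewers unfamiliar with the label index, the two helpers above walk
the "free" bitmap and treat every clear bit as an in-use slot. A rough
user-space sketch follows, assuming a byte-wise little-endian bitmap;
slot_in_use(), model_active_count() and model_active_slot() are
hypothetical names, and the slot_valid() check from the patch is left out.

```c
#include <stdint.h>

/* Return 1 when bit 'slot' is clear (slot in use) in a little-endian bitmap. */
static int slot_in_use(const uint8_t *free_map, uint32_t slot)
{
	return !(free_map[slot / 8] & (1u << (slot % 8)));
}

/* Count in-use slots, as nd_label_active_count() does for valid labels. */
static int model_active_count(const uint8_t *free_map, uint32_t nslot)
{
	uint32_t slot;
	int count = 0;

	for (slot = 0; slot < nslot; slot++)
		if (slot_in_use(free_map, slot))
			count++;
	return count;
}

/* Return the slot index of the n-th in-use slot, or -1 if there is none. */
static int model_active_slot(const uint8_t *free_map, uint32_t nslot, int n)
{
	uint32_t slot;

	for (slot = 0; slot < nslot; slot++) {
		if (!slot_in_use(free_map, slot))
			continue;
		if (n-- == 0)
			return (int)slot;
	}
	return -1;
}
```

init_active_labels() later in this patch uses the same pairing: it sizes
an array from the count and then fetches each active label by ordinal.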
diff --git a/drivers/block/nd/label.h b/drivers/block/nd/label.h
index 79ed885a43c0..4436624f4146 100644
--- a/drivers/block/nd/label.h
+++ b/drivers/block/nd/label.h
@@ -126,4 +126,7 @@ void nd_label_copy(struct nd_dimm_drvdata *ndd,
 		struct nd_namespace_index *dst,
 		struct nd_namespace_index *src);
 size_t sizeof_namespace_index(struct nd_dimm_drvdata *ndd);
+int nd_label_active_count(struct nd_dimm_drvdata *ndd);
+struct nd_namespace_label __iomem *nd_label_active(
+		struct nd_dimm_drvdata *ndd, int n);
 #endif /* __LABEL_H__ */
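
The namespace_devs.c changes that follow place the first allocation for a
label differently for BLK and PMEM (see init_dpa_allocation()). A minimal
sketch of that placement rule, with model_first_dpa() as a hypothetical
helper name and the assumption that n does not exceed the mapping size:

```c
#include <stdint.h>
#include <string.h>

/*
 * First-allocation placement: BLK is carved from the top of the mapping
 * so it stays out of the way of PMEM, which must begin at the start of
 * the interleave set.  Assumes n <= map_size.
 */
static uint64_t model_first_dpa(const char *label_id, uint64_t map_start,
		uint64_t map_size, uint64_t n)
{
	int is_blk = strncmp(label_id, "blk", 3) == 0;

	if (is_blk)
		return map_start + map_size - n;
	return map_start;
}
```

Keeping BLK at high DPA leaves the largest possible contiguous range at
the start of the mapping for a later PMEM namespace.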
diff --git a/drivers/block/nd/namespace_devs.c b/drivers/block/nd/namespace_devs.c
index 8fbdf68c64d8..d0417575b18c 100644
--- a/drivers/block/nd/namespace_devs.c
+++ b/drivers/block/nd/namespace_devs.c
@@ -14,8 +14,11 @@
 #include <linux/device.h>
 #include <linux/slab.h>
 #include <linux/nd.h>
+#include "nd-private.h"
 #include "nd.h"
 
+#include <asm-generic/io-64-nonatomic-lo-hi.h>
+
 static void namespace_io_release(struct device *dev)
 {
 	struct nd_namespace_io *nsio = to_nd_namespace_io(dev);
@@ -23,11 +26,50 @@ static void namespace_io_release(struct device *dev)
 	kfree(nsio);
 }
 
+static void namespace_pmem_release(struct device *dev)
+{
+	struct nd_namespace_pmem *nspm = to_nd_namespace_pmem(dev);
+
+	kfree(nspm->alt_name);
+	kfree(nspm->uuid);
+	kfree(nspm);
+}
+
+static void namespace_blk_release(struct device *dev)
+{
+	/* TODO: blk namespace support */
+}
+
 static struct device_type namespace_io_device_type = {
 	.name = "nd_namespace_io",
 	.release = namespace_io_release,
 };
 
+static struct device_type namespace_pmem_device_type = {
+	.name = "nd_namespace_pmem",
+	.release = namespace_pmem_release,
+};
+
+static struct device_type namespace_blk_device_type = {
+	.name = "nd_namespace_blk",
+	.release = namespace_blk_release,
+};
+
+static bool is_namespace_pmem(struct device *dev)
+{
+	return dev ? dev->type == &namespace_pmem_device_type : false;
+}
+
+static bool is_namespace_blk(struct device *dev)
+{
+	return dev ? dev->type == &namespace_blk_device_type : false;
+}
+
+static bool is_namespace_io(struct device *dev)
+{
+	return dev ? dev->type == &namespace_io_device_type : false;
+}
+
 static ssize_t nstype_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
@@ -37,13 +79,674 @@ static ssize_t nstype_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(nstype);
 
+static ssize_t __alt_name_store(struct device *dev, const char *buf,
+		const size_t len)
+{
+	char *input, *pos, *alt_name, **ns_altname;
+	ssize_t rc;
+
+	if (is_namespace_pmem(dev)) {
+		struct nd_namespace_pmem *nspm = to_nd_namespace_pmem(dev);
+
+		ns_altname = &nspm->alt_name;
+	} else if (is_namespace_blk(dev)) {
+		/* TODO: blk namespace support */
+		return -ENXIO;
+	} else
+		return -ENXIO;
+
+	if (dev->driver)
+		return -EBUSY;
+
+	input = kmemdup(buf, len + 1, GFP_KERNEL);
+	if (!input)
+		return -ENOMEM;
+
+	input[len] = '\0';
+	pos = strim(input);
+	if (strlen(pos) + 1 > NSLABEL_NAME_LEN) {
+		rc = -EINVAL;
+		goto out;
+	}
+
+	alt_name = kzalloc(NSLABEL_NAME_LEN, GFP_KERNEL);
+	if (!alt_name) {
+		rc = -ENOMEM;
+		goto out;
+	}
+	kfree(*ns_altname);
+	*ns_altname = alt_name;
+	sprintf(*ns_altname, "%s", pos);
+	rc = len;
+
+out:
+	kfree(input);
+	return rc;
+}
+
+static ssize_t alt_name_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	ssize_t rc;
+
+	device_lock(dev);
+	nd_bus_lock(dev);
+	wait_nd_bus_probe_idle(dev);
+	rc = __alt_name_store(dev, buf, len);
+	dev_dbg(dev, "%s: %s (%zd)\n", __func__, rc < 0 ? "fail" : "success", rc);
+	nd_bus_unlock(dev);
+	device_unlock(dev);
+
+	return rc;
+}
+
+static ssize_t alt_name_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	char *ns_altname;
+
+	if (is_namespace_pmem(dev)) {
+		struct nd_namespace_pmem *nspm = to_nd_namespace_pmem(dev);
+
+		ns_altname = nspm->alt_name;
+	} else if (is_namespace_blk(dev)) {
+		/* TODO: blk namespace support */
+		return -ENXIO;
+	} else
+		return -ENXIO;
+
+	return sprintf(buf, "%s\n", ns_altname ? ns_altname : "");
+}
+static DEVICE_ATTR_RW(alt_name);
+
+static int scan_free(struct nd_region *nd_region,
+		struct nd_mapping *nd_mapping, struct nd_label_id *label_id,
+		resource_size_t n)
+{
+	bool is_blk = strncmp(label_id->id, "blk", 3) == 0;
+	struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+	int rc = 0;
+
+	while (n) {
+		struct resource *res, *last;
+		resource_size_t new_start;
+
+		last = NULL;
+		for_each_dpa_resource(ndd, res)
+			if (strcmp(res->name, label_id->id) == 0)
+				last = res;
+		res = last;
+		if (!res)
+			return 0;
+
+		if (n >= resource_size(res)) {
+			n -= resource_size(res);
+			nd_dbg_dpa(nd_region, ndd, res, "delete %d\n", rc);
+			nd_dimm_free_dpa(ndd, res);
+			/* retry with last resource deleted */
+			continue;
+		}
+
+		/*
+		 * Keep BLK allocations relegated to high DPA as much as
+		 * possible
+		 */
+		if (is_blk)
+			new_start = res->start + n;
+		else
+			new_start = res->start;
+
+		rc = adjust_resource(res, new_start, resource_size(res) - n);
+		nd_dbg_dpa(nd_region, ndd, res, "shrink %d\n", rc);
+		break;
+	}
+
+	return rc;
+}
+
+/**
+ * shrink_dpa_allocation - for each dimm in region free n bytes for label_id
+ * @nd_region: the set of dimms to reclaim @n bytes from
+ * @label_id: unique identifier for the namespace consuming this dpa range
+ * @n: number of bytes per-dimm to release
+ *
+ * Assumes resources are ordered.  Starting from the last allocation, try
+ * to adjust_resource() it down by @n; if @n is at least the size of that
+ * allocation, delete it and continue with the 'new' last allocation in
+ * the label set.
+ */
+static int shrink_dpa_allocation(struct nd_region *nd_region,
+		struct nd_label_id *label_id, resource_size_t n)
+{
+	int i;
+
+	for (i = 0; i < nd_region->ndr_mappings; i++) {
+		struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+		int rc;
+
+		rc = scan_free(nd_region, nd_mapping, label_id, n);
+		if (rc)
+			return rc;
+	}
+
+	return 0;
+}
+
+static resource_size_t init_dpa_allocation(struct nd_label_id *label_id,
+		struct nd_region *nd_region, struct nd_mapping *nd_mapping,
+		resource_size_t n)
+{
+	bool is_blk = strncmp(label_id->id, "blk", 3) == 0;
+	struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+	resource_size_t first_dpa;
+	struct resource *res;
+	int rc = 0;
+
+	/* allocate blk from highest dpa first */
+	if (is_blk)
+		first_dpa = nd_mapping->start + nd_mapping->size - n;
+	else
+		first_dpa = nd_mapping->start;
+
+	/* first resource allocation for this label-id or dimm */
+	res = nd_dimm_allocate_dpa(ndd, label_id, first_dpa, n);
+	if (!res)
+		rc = -EBUSY;
+
+	nd_dbg_dpa(nd_region, ndd, res, "init %d\n", rc);
+	return rc ? n : 0;
+}
+
+static bool space_valid(bool is_pmem, struct nd_label_id *label_id,
+		struct resource *res)
+{
+	/*
+	 * For BLK-space any space is valid, for PMEM-space, it must be
+	 * contiguous with an existing allocation.
+	 */
+	if (!is_pmem)
+		return true;
+	if (!res || strcmp(res->name, label_id->id) == 0)
+		return true;
+	return false;
+}
+
+enum alloc_loc {
+	ALLOC_ERR = 0, ALLOC_BEFORE, ALLOC_MID, ALLOC_AFTER,
+};
+
+static resource_size_t scan_allocate(struct nd_region *nd_region,
+		struct nd_mapping *nd_mapping, struct nd_label_id *label_id,
+		resource_size_t n)
+{
+	resource_size_t mapping_end = nd_mapping->start + nd_mapping->size - 1;
+	bool is_pmem = strncmp(label_id->id, "pmem", 4) == 0;
+	struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+	const resource_size_t to_allocate = n;
+	struct resource *res;
+	int first;
+
+ retry:
+	first = 0;
+	for_each_dpa_resource(ndd, res) {
+		resource_size_t allocate, available = 0, free_start, free_end;
+		struct resource *next = res->sibling, *new_res = NULL;
+		enum alloc_loc loc = ALLOC_ERR;
+		const char *action;
+		int rc = 0;
+
+		/* ignore resources outside this nd_mapping */
+		if (res->start > mapping_end)
+			continue;
+		if (res->end < nd_mapping->start)
+			continue;
+
+		/* space at the beginning of the mapping */
+		if (!first++ && res->start > nd_mapping->start) {
+			free_start = nd_mapping->start;
+			available = res->start - free_start;
+			if (space_valid(is_pmem, label_id, NULL))
+				loc = ALLOC_BEFORE;
+		}
+
+		/* space between allocations */
+		if (!loc && next) {
+			free_start = res->start + resource_size(res);
+			free_end = min(mapping_end, next->start - 1);
+			if (space_valid(is_pmem, label_id, res)
+					&& free_start < free_end) {
+				available = free_end + 1 - free_start;
+				loc = ALLOC_MID;
+			}
+		}
+
+		/* space at the end of the mapping */
+		if (!loc && !next) {
+			free_start = res->start + resource_size(res);
+			free_end = mapping_end;
+			if (space_valid(is_pmem, label_id, res)
+					&& free_start < free_end) {
+				available = free_end + 1 - free_start;
+				loc = ALLOC_AFTER;
+			}
+		}
+
+		if (!loc || !available)
+			continue;
+		allocate = min(available, n);
+		switch (loc) {
+		case ALLOC_BEFORE:
+			if (strcmp(res->name, label_id->id) == 0) {
+				/* adjust current resource up */
+				if (is_pmem)
+					return n;
+				rc = adjust_resource(res, res->start - allocate,
+						resource_size(res) + allocate);
+				action = "cur grow up";
+			} else
+				action = "allocate";
+			break;
+		case ALLOC_MID:
+			if (strcmp(next->name, label_id->id) == 0) {
+				/* adjust next resource up */
+				if (is_pmem)
+					return n;
+				rc = adjust_resource(next, next->start
+						- allocate, resource_size(next)
+						+ allocate);
+				new_res = next;
+				action = "next grow up";
+			} else if (strcmp(res->name, label_id->id) == 0) {
+				action = "grow down";
+			} else
+				action = "allocate";
+			break;
+		case ALLOC_AFTER:
+			if (strcmp(res->name, label_id->id) == 0)
+				action = "grow down";
+			else
+				action = "allocate";
+			break;
+		default:
+			return n;
+		}
+
+		if (strcmp(action, "allocate") == 0) {
+			/* BLK: allocate from the top of the free range */
+			if (!is_pmem)
+				free_start += available - allocate;
+			else if (free_start != nd_mapping->start)
+				return n;
+
+			new_res = nd_dimm_allocate_dpa(ndd, label_id,
+					free_start, allocate);
+			if (!new_res)
+				rc = -EBUSY;
+		} else if (strcmp(action, "grow down") == 0) {
+			/* adjust current resource down */
+			rc = adjust_resource(res, res->start, resource_size(res)
+					+ allocate);
+		}
+
+		if (!new_res)
+			new_res = res;
+
+		nd_dbg_dpa(nd_region, ndd, new_res, "%s(%d) %d\n",
+				action, loc, rc);
+
+		if (rc)
+			return n;
+
+		n -= allocate;
+		if (n) {
+			/*
+			 * Retry scan with newly inserted resources.
+			 * For example, if we did an ALLOC_BEFORE
+			 * insertion there may also have been space
+			 * available for an ALLOC_AFTER insertion, so we
+			 * need to check this same resource again
+			 */
+			goto retry;
+		} else
+			return 0;
+	}
+
+	if (is_pmem && n == to_allocate)
+		return init_dpa_allocation(label_id, nd_region, nd_mapping, n);
+	return n;
+}
+
+/**
+ * grow_dpa_allocation - for each dimm allocate n bytes for @label_id
+ * @nd_region: the set of dimms to allocate @n more bytes from
+ * @label_id: unique identifier for the namespace consuming this dpa range
+ * @n: number of bytes per-dimm to add to the existing allocation
+ *
+ * Assumes resources are ordered.  For BLK regions, first consume
+ * BLK-only available DPA free space, then consume PMEM-aliased DPA
+ * space starting at the highest DPA.  For PMEM regions start
+ * allocations from the start of an interleave set and end at the first
+ * BLK allocation or the end of the interleave set, whichever comes
+ * first.
+ */
+static int grow_dpa_allocation(struct nd_region *nd_region,
+		struct nd_label_id *label_id, resource_size_t n)
+{
+	int i;
+
+	for (i = 0; i < nd_region->ndr_mappings; i++) {
+		struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+		resource_size_t rem;
+
+		rem = scan_allocate(nd_region, nd_mapping, label_id, n);
+		if (rem)
+			return -ENXIO;
+	}
+
+	return 0;
+}
+
+static void nd_namespace_pmem_set_size(struct nd_region *nd_region,
+		struct nd_namespace_pmem *nspm, resource_size_t size)
+{
+	struct resource *res = &nspm->nsio.res;
+
+	res->start = nd_region->ndr_start;
+	res->end = nd_region->ndr_start + size - 1;
+}
+
+static ssize_t __size_store(struct device *dev, unsigned long long val)
+{
+	resource_size_t allocated = 0, available = 0;
+	struct nd_region *nd_region = to_nd_region(dev->parent);
+	struct nd_mapping *nd_mapping;
+	struct nd_dimm_drvdata *ndd;
+	struct nd_label_id label_id;
+	u32 flags = 0, remainder;
+	u8 *uuid = NULL;
+	int rc, i;
+
+	if (dev->driver)
+		return -EBUSY;
+
+	if (is_namespace_pmem(dev)) {
+		struct nd_namespace_pmem *nspm = to_nd_namespace_pmem(dev);
+
+		uuid = nspm->uuid;
+	} else if (is_namespace_blk(dev)) {
+		/* TODO: blk namespace support */
+		return -ENXIO;
+	}
+
+	/*
+	 * We need a uuid for the allocation-label and dimm(s) on which
+	 * to store the label.
+	 */
+	if (!uuid || nd_region->ndr_mappings == 0)
+		return -ENXIO;
+
+	div_u64_rem(val, SZ_4K * nd_region->ndr_mappings, &remainder);
+	if (remainder) {
+		dev_dbg(dev, "%llu is not %dK aligned\n", val,
+				(SZ_4K * nd_region->ndr_mappings) / SZ_1K);
+		return -EINVAL;
+	}
+
+	nd_label_gen_id(&label_id, uuid, flags);
+	for (i = 0; i < nd_region->ndr_mappings; i++) {
+		nd_mapping = &nd_region->mapping[i];
+		ndd = to_ndd(nd_mapping);
+
+		/*
+		 * All dimms in an interleave set, or the base dimm for a blk
+		 * region, need to be enabled for the size to be changed.
+		 */
+		if (!ndd)
+			return -ENXIO;
+
+		allocated += nd_dimm_allocated_dpa(ndd, &label_id);
+	}
+	available = nd_region_available_dpa(nd_region);
+
+	if (val > available + allocated)
+		return -ENOSPC;
+
+	if (val == allocated)
+		return 0;
+
+	val = div_u64(val, nd_region->ndr_mappings);
+	allocated = div_u64(allocated, nd_region->ndr_mappings);
+	if (val < allocated)
+		rc = shrink_dpa_allocation(nd_region, &label_id, allocated - val);
+	else
+		rc = grow_dpa_allocation(nd_region, &label_id, val - allocated);
+
+	if (rc)
+		return rc;
+
+	if (is_namespace_pmem(dev)) {
+		struct nd_namespace_pmem *nspm = to_nd_namespace_pmem(dev);
+
+		nd_namespace_pmem_set_size(nd_region, nspm,
+				val * nd_region->ndr_mappings);
+	}
+
+	return rc;
+}
+
+static ssize_t size_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	unsigned long long val;
+	u8 **uuid = NULL;
+	int rc;
+
+	rc = kstrtoull(buf, 0, &val);
+	if (rc)
+		return rc;
+
+	device_lock(dev);
+	nd_bus_lock(dev);
+	wait_nd_bus_probe_idle(dev);
+	rc = __size_store(dev, val);
+
+	if (is_namespace_pmem(dev)) {
+		struct nd_namespace_pmem *nspm = to_nd_namespace_pmem(dev);
+
+		uuid = &nspm->uuid;
+	} else if (is_namespace_blk(dev)) {
+		/* TODO: blk namespace support */
+		rc = -ENXIO;
+	}
+
+	if (rc == 0 && val == 0 && uuid) {
+		/* setting size zero == 'delete namespace' */
+		kfree(*uuid);
+		*uuid = NULL;
+	}
+
+	dev_dbg(dev, "%s: %llx %s (%d)\n", __func__, val, rc < 0
+			? "fail" : "success", rc);
+
+	nd_bus_unlock(dev);
+	device_unlock(dev);
+
+	return rc ? rc : len;
+}
+
+static ssize_t size_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	if (is_namespace_pmem(dev)) {
+		struct nd_namespace_pmem *nspm = to_nd_namespace_pmem(dev);
+
+		return sprintf(buf, "%llu\n", (unsigned long long)
+				resource_size(&nspm->nsio.res));
+	} else if (is_namespace_blk(dev)) {
+		/* TODO: blk namespace support */
+		return -ENXIO;
+	} else if (is_namespace_io(dev)) {
+		struct nd_namespace_io *nsio = to_nd_namespace_io(dev);
+
+		return sprintf(buf, "%llu\n", (unsigned long long)
+				resource_size(&nsio->res));
+	} else
+		return -ENXIO;
+}
+static DEVICE_ATTR(size, S_IRUGO, size_show, size_store);
+
+static ssize_t uuid_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	u8 *uuid;
+
+	if (is_namespace_pmem(dev)) {
+		struct nd_namespace_pmem *nspm = to_nd_namespace_pmem(dev);
+
+		uuid = nspm->uuid;
+	} else if (is_namespace_blk(dev)) {
+		/* TODO: blk namespace support */
+		return -ENXIO;
+	} else
+		return -ENXIO;
+
+	if (uuid)
+		return sprintf(buf, "%pUb\n", uuid);
+	return sprintf(buf, "\n");
+}
+
+/**
+ * namespace_update_uuid - check for a unique uuid and whether we're "renaming"
+ * @nd_region: parent region so we can update all dimms in the set
+ * @dev: namespace type for generating label_id
+ * @new_uuid: incoming uuid
+ * @old_uuid: reference to the uuid storage location in the namespace object
+ */
+static int namespace_update_uuid(struct nd_region *nd_region,
+		struct device *dev, u8 *new_uuid, u8 **old_uuid)
+{
+	u32 flags = is_namespace_blk(dev) ? NSLABEL_FLAG_LOCAL : 0;
+	struct nd_label_id old_label_id;
+	struct nd_label_id new_label_id;
+	int i, rc;
+
+	rc = nd_is_uuid_unique(dev, new_uuid) ? 0 : -EINVAL;
+	if (rc) {
+		kfree(new_uuid);
+		return rc;
+	}
+
+	if (*old_uuid == NULL)
+		goto out;
+
+	nd_label_gen_id(&old_label_id, *old_uuid, flags);
+	nd_label_gen_id(&new_label_id, new_uuid, flags);
+	for (i = 0; i < nd_region->ndr_mappings; i++) {
+		struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+		struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+		struct resource *res;
+
+		for_each_dpa_resource(ndd, res)
+			if (strcmp(res->name, old_label_id.id) == 0)
+				sprintf((void *) res->name, "%s",
+						new_label_id.id);
+	}
+	kfree(*old_uuid);
+ out:
+	*old_uuid = new_uuid;
+	return 0;
+}
+
+static ssize_t uuid_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	struct nd_region *nd_region = to_nd_region(dev->parent);
+	u8 *uuid = NULL;
+	u8 **ns_uuid;
+	ssize_t rc;
+
+	if (is_namespace_pmem(dev)) {
+		struct nd_namespace_pmem *nspm = to_nd_namespace_pmem(dev);
+
+		ns_uuid = &nspm->uuid;
+	} else if (is_namespace_blk(dev)) {
+		/* TODO: blk namespace support */
+		return -ENXIO;
+	} else
+		return -ENXIO;
+
+	device_lock(dev);
+	nd_bus_lock(dev);
+	wait_nd_bus_probe_idle(dev);
+	rc = nd_uuid_store(dev, &uuid, buf, len);
+	if (rc >= 0)
+		rc = namespace_update_uuid(nd_region, dev, uuid, ns_uuid);
+	dev_dbg(dev, "%s: result: %zd wrote: %s%s", __func__,
+			rc, buf, buf[len - 1] == '\n' ? "" : "\n");
+	nd_bus_unlock(dev);
+	device_unlock(dev);
+
+	return rc ? rc : len;
+}
+static DEVICE_ATTR_RW(uuid);
+
+static ssize_t resource_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct resource *res;
+
+	if (is_namespace_pmem(dev)) {
+		struct nd_namespace_pmem *nspm = to_nd_namespace_pmem(dev);
+
+		res = &nspm->nsio.res;
+	} else if (is_namespace_io(dev)) {
+		struct nd_namespace_io *nsio = to_nd_namespace_io(dev);
+
+		res = &nsio->res;
+	} else
+		return -ENXIO;
+
+	/* no address to convey if the namespace has no allocation */
+	if (resource_size(res) == 0)
+		return -ENXIO;
+	return sprintf(buf, "%#llx\n", (unsigned long long) res->start);
+}
+static DEVICE_ATTR_RO(resource);
+
 static struct attribute *nd_namespace_attributes[] = {
 	&dev_attr_nstype.attr,
+	&dev_attr_size.attr,
+	&dev_attr_uuid.attr,
+	&dev_attr_resource.attr,
+	&dev_attr_alt_name.attr,
 	NULL,
 };
 
+static umode_t nd_namespace_attr_visible(struct kobject *kobj, struct attribute *a, int n)
+{
+	struct device *dev = container_of(kobj, struct device, kobj);
+
+	if (a == &dev_attr_resource.attr) {
+		if (is_namespace_blk(dev))
+			return 0;
+		return a->mode;
+	}
+
+	if (is_namespace_pmem(dev) || is_namespace_blk(dev)) {
+		if (a == &dev_attr_size.attr)
+			return S_IWUSR | S_IRUGO;
+		return a->mode;
+	}
+
+	if (a == &dev_attr_nstype.attr || a == &dev_attr_size.attr)
+		return a->mode;
+
+	return 0;
+}
+
 static struct attribute_group nd_namespace_attribute_group = {
 	.attrs = nd_namespace_attributes,
+	.is_visible = nd_namespace_attr_visible,
 };
 
 static const struct attribute_group *nd_namespace_attribute_groups[] = {
@@ -80,23 +783,326 @@ static struct device **create_namespace_io(struct nd_region *nd_region)
 	return devs;
 }
 
+static bool has_uuid_at_pos(struct nd_region *nd_region, u8 *uuid, u64 cookie, u16 pos)
+{
+	struct nd_namespace_label __iomem *found = NULL;
+	int i;
+
+	for (i = 0; i < nd_region->ndr_mappings; i++) {
+		struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+		struct nd_namespace_label __iomem *nd_label;
+		u8 label_uuid[NSLABEL_UUID_LEN];
+		u8 *found_uuid = NULL;
+		int l;
+
+		for_each_label(l, nd_label, nd_mapping->labels) {
+			u64 isetcookie = readq(&nd_label->isetcookie);
+			u16 position = readw(&nd_label->position);
+			u16 nlabel = readw(&nd_label->nlabel);
+
+			if (isetcookie != cookie)
+				continue;
+
+			memcpy_fromio(label_uuid, nd_label->uuid,
+					NSLABEL_UUID_LEN);
+			if (memcmp(label_uuid, uuid, NSLABEL_UUID_LEN) != 0)
+				continue;
+
+			if (found_uuid) {
+				dev_dbg(to_ndd(nd_mapping)->dev,
+						"%s duplicate entry for uuid\n",
+						__func__);
+				return false;
+			}
+			found_uuid = label_uuid;
+			if (nlabel != nd_region->ndr_mappings)
+				continue;
+			if (position != pos)
+				continue;
+			found = nd_label;
+			break;
+		}
+		if (found)
+			break;
+	}
+	return found != NULL;
+}
+
+static int select_pmem_uuid(struct nd_region *nd_region, u8 *pmem_uuid)
+{
+	struct nd_namespace_label __iomem *select = NULL;
+	int i;
+
+	if (!pmem_uuid)
+		return -ENODEV;
+
+	for (i = 0; i < nd_region->ndr_mappings; i++) {
+		struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+		struct nd_namespace_label __iomem *nd_label;
+		u64 hw_start, hw_end, pmem_start, pmem_end;
+		int l;
+
+		for_each_label(l, nd_label, nd_mapping->labels) {
+			u8 label_uuid[NSLABEL_UUID_LEN];
+
+			memcpy_fromio(label_uuid, nd_label->uuid,
+					NSLABEL_UUID_LEN);
+			if (memcmp(label_uuid, pmem_uuid, NSLABEL_UUID_LEN) == 0)
+				break;
+		}
+
+		if (!nd_label) {
+			WARN_ON(1);
+			return -EINVAL;
+		}
+
+		select = nd_label;
+		/*
+		 * Check that this label is compliant with the dpa
+		 * range published in NFIT
+		 */
+		hw_start = nd_mapping->start;
+		hw_end = hw_start + nd_mapping->size;
+		pmem_start = readq(&select->dpa);
+		pmem_end = pmem_start + readq(&select->rawsize);
+		if (pmem_start == hw_start && pmem_end <= hw_end)
+			/* pass */;
+		else
+			return -EINVAL;
+
+		nd_set_label(nd_mapping->labels, select, 0);
+		nd_set_label(nd_mapping->labels, (void __iomem *) NULL, 1);
+	}
+	return 0;
+}
+
+/**
+ * find_pmem_label_set - validate interleave set labelling, retrieve label0
+ * @nd_region: region with mappings to validate
+ */
+static int find_pmem_label_set(struct nd_region *nd_region,
+		struct nd_namespace_pmem *nspm)
+{
+	u64 cookie = nd_region_interleave_set_cookie(nd_region);
+	struct nd_namespace_label __iomem *nd_label;
+	u8 select_uuid[NSLABEL_UUID_LEN];
+	resource_size_t size = 0;
+	u8 *pmem_uuid = NULL;
+	int rc = -ENODEV, l;
+	u16 i;
+
+	if (cookie == 0)
+		return -ENXIO;
+
+	/*
+	 * Find a complete set of labels by uuid.  By definition we can start
+	 * with any mapping as the reference label
+	 */
+	for_each_label(l, nd_label, nd_region->mapping[0].labels) {
+		u64 isetcookie = readq(&nd_label->isetcookie);
+		u8 label_uuid[NSLABEL_UUID_LEN];
+
+		if (isetcookie != cookie)
+			continue;
+
+		memcpy_fromio(label_uuid, nd_label->uuid,
+				NSLABEL_UUID_LEN);
+		for (i = 0; i < nd_region->ndr_mappings; i++)
+			if (!has_uuid_at_pos(nd_region, label_uuid, cookie, i))
+				break;
+		if (i < nd_region->ndr_mappings) {
+			/*
+			 * Give up if we don't find an instance of a
+			 * uuid at each position (from 0 to
+			 * nd_region->ndr_mappings - 1), or if we find a
+			 * dimm with two instances of the same uuid.
+			 */
+			rc = -EINVAL;
+			goto err;
+		} else if (pmem_uuid) {
+			/*
+			 * If there is more than one valid uuid set, we
+			 * need userspace to clean this up.
+			 */
+			rc = -EBUSY;
+			goto err;
+		}
+		memcpy(select_uuid, label_uuid, NSLABEL_UUID_LEN);
+		pmem_uuid = select_uuid;
+	}
+
+	/*
+	 * Fix up each mapping's 'labels' to have the validated pmem label for
+	 * that position at labels[0], and NULL at labels[1].  In the process,
+	 * check that the namespace aligns with interleave-set.  We know
+	 * that it does not overlap with any blk namespaces by virtue of
+	 * the dimm being enabled (i.e. nd_label_reserve_dpa()
+	 * succeeded).
+	 */
+	rc = select_pmem_uuid(nd_region, pmem_uuid);
+	if (rc)
+		goto err;
+
+	/* Calculate total size and populate namespace properties from label0 */
+	for (i = 0; i < nd_region->ndr_mappings; i++) {
+		struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+		struct nd_namespace_label __iomem *label0;
+
+		label0 = nd_get_label(nd_mapping->labels, 0);
+		size += readq(&label0->rawsize);
+		if (readw(&label0->position) != 0)
+			continue;
+		WARN_ON(nspm->alt_name || nspm->uuid);
+		nspm->alt_name = kmemdup((void __force *) label0->name,
+				NSLABEL_NAME_LEN, GFP_KERNEL);
+		nspm->uuid = kmemdup((void __force *) label0->uuid,
+				NSLABEL_UUID_LEN, GFP_KERNEL);
+	}
+
+	if (!nspm->alt_name || !nspm->uuid) {
+		rc = -ENOMEM;
+		goto err;
+	}
+
+	nd_namespace_pmem_set_size(nd_region, nspm, size);
+
+	return 0;
+ err:
+	switch (rc) {
+	case -EINVAL:
+		dev_dbg(&nd_region->dev, "%s: invalid label(s)\n", __func__);
+		break;
+	case -ENODEV:
+		dev_dbg(&nd_region->dev, "%s: label not found\n", __func__);
+		break;
+	default:
+		dev_dbg(&nd_region->dev, "%s: unexpected err: %d\n", __func__, rc);
+		break;
+	}
+	return rc;
+}
+
+static struct device **create_namespace_pmem(struct nd_region *nd_region)
+{
+	struct nd_namespace_pmem *nspm;
+	struct device *dev, **devs;
+	struct resource *res;
+	int rc;
+
+	nspm = kzalloc(sizeof(*nspm), GFP_KERNEL);
+	if (!nspm)
+		return NULL;
+
+	dev = &nspm->nsio.dev;
+	dev->type = &namespace_pmem_device_type;
+	res = &nspm->nsio.res;
+	res->name = dev_name(&nd_region->dev);
+	res->flags = IORESOURCE_MEM;
+	rc = find_pmem_label_set(nd_region, nspm);
+	if (rc == -ENODEV) {
+		int i;
+
+		/* Pass, try to permit namespace creation... */
+		for (i = 0; i < nd_region->ndr_mappings; i++) {
+			struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+
+			kfree(nd_mapping->labels);
+			nd_mapping->labels = NULL;
+		}
+
+		/* Publish a zero-sized namespace for userspace to configure. */
+		nd_namespace_pmem_set_size(nd_region, nspm, 0);
+
+		rc = 0;
+	} else if (rc)
+		goto err;
+
+	devs = kcalloc(2, sizeof(struct device *), GFP_KERNEL);
+	if (!devs)
+		goto err;
+
+	devs[0] = dev;
+	return devs;
+
+ err:
+	namespace_pmem_release(&nspm->nsio.dev);
+	return NULL;
+}
+
+static int init_active_labels(struct nd_region *nd_region)
+{
+	int i;
+
+	for (i = 0; i < nd_region->ndr_mappings; i++) {
+		struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+		struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+		int count, j;
+
+		/*
+		 * If the dimm is disabled then prevent the region from
+		 * being activated if it aliases DPA.
+		 */
+		if (!ndd) {
+			struct nd_dimm *nd_dimm = nd_mapping->nd_dimm;
+
+			if ((nd_dimm->flags & NDD_ALIASING) == 0)
+				return 0;
+			dev_dbg(&nd_region->dev, "%s: is disabled, failing probe\n",
+					dev_name(&nd_mapping->nd_dimm->dev));
+			return -ENXIO;
+		}
+
+		count = nd_label_active_count(ndd);
+		dev_dbg(ndd->dev, "%s: %d\n", __func__, count);
+		if (!count)
+			continue;
+		nd_mapping->labels = kcalloc(count + 1,
+				sizeof(struct nd_namespace_label *), GFP_KERNEL);
+		if (!nd_mapping->labels)
+			return -ENOMEM;
+		for (j = 0; j < count; j++) {
+			struct nd_namespace_label __iomem *label;
+
+			label = nd_label_active(ndd, j);
+			nd_set_label(nd_mapping->labels, label, j);
+		}
+	}
+
+	return 0;
+}
+
 int nd_region_register_namespaces(struct nd_region *nd_region, int *err)
 {
 	struct device **devs = NULL;
-	int i;
+	int i, rc = 0, type;
 
 	*err = 0;
-	switch (nd_region_to_namespace_type(nd_region)) {
+	nd_bus_lock(&nd_region->dev);
+	rc = init_active_labels(nd_region);
+	if (rc) {
+		nd_bus_unlock(&nd_region->dev);
+		return rc;
+	}
+
+	type = nd_region_to_namespace_type(nd_region);
+	switch (type) {
 	case ND_DEVICE_NAMESPACE_IO:
 		devs = create_namespace_io(nd_region);
 		break;
+	case ND_DEVICE_NAMESPACE_PMEM:
+		devs = create_namespace_pmem(nd_region);
+		break;
 	default:
 		break;
 	}
+	nd_bus_unlock(&nd_region->dev);
 
-	if (!devs)
-		return -ENODEV;
+	if (!devs) {
+		rc = -ENODEV;
+		goto err;
+	}
 
+	nd_region->ns_seed = devs[0];
 	for (i = 0; devs[i]; i++) {
 		struct device *dev = devs[i];
 
@@ -108,4 +1114,14 @@ int nd_region_register_namespaces(struct nd_region *nd_region, int *err)
 	kfree(devs);
 
 	return i;
+
+ err:
+	for (i = 0; i < nd_region->ndr_mappings; i++) {
+		struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+
+		kfree(nd_mapping->labels);
+		nd_mapping->labels = NULL;
+	}
+
+	return rc;
 }
diff --git a/drivers/block/nd/nd-private.h b/drivers/block/nd/nd-private.h
index 67f28011dfa5..814843454417 100644
--- a/drivers/block/nd/nd-private.h
+++ b/drivers/block/nd/nd-private.h
@@ -60,4 +60,15 @@ int nd_bus_register_dimms(struct nd_bus *nd_bus);
 int nd_bus_register_regions(struct nd_bus *nd_bus);
 int nd_bus_init_interleave_sets(struct nd_bus *nd_bus);
 int nd_match_dimm(struct device *dev, void *data);
+struct nd_label_id;
+char *nd_label_gen_id(struct nd_label_id *label_id, u8 *uuid, u32 flags);
+bool nd_is_uuid_unique(struct device *dev, u8 *uuid);
+struct nd_region;
+struct nd_dimm_drvdata;
+struct nd_mapping;
+resource_size_t nd_pmem_available_dpa(struct nd_region *nd_region,
+		struct nd_mapping *nd_mapping, resource_size_t *overlap);
+resource_size_t nd_region_available_dpa(struct nd_region *nd_region);
+resource_size_t nd_dimm_allocated_dpa(struct nd_dimm_drvdata *ndd,
+		struct nd_label_id *label_id);
 #endif /* __ND_PRIVATE_H__ */
diff --git a/drivers/block/nd/nd.h b/drivers/block/nd/nd.h
index 63540ffe845d..d9d221a7006e 100644
--- a/drivers/block/nd/nd.h
+++ b/drivers/block/nd/nd.h
@@ -16,6 +16,7 @@
 #include <linux/libnd.h>
 #include <linux/mutex.h>
 #include <linux/ndctl.h>
+#include <linux/types.h>
 #include "label.h"
 
 struct nd_dimm_drvdata {
@@ -59,12 +60,37 @@ static inline struct nd_namespace_index __iomem *to_next_namespace_index(
 		(unsigned long long) (res ? resource_size(res) : 0), \
 		(unsigned long long) (res ? res->start : 0), ##arg)
 
+/* sparse helpers */
+static inline void nd_set_label(struct nd_namespace_label **labels,
+		struct nd_namespace_label __iomem *label, int idx)
+{
+	labels[idx] = (void __force *) label;
+}
+
+static inline struct nd_namespace_label __iomem *nd_get_label(
+		struct nd_namespace_label **labels, int idx)
+{
+	struct nd_namespace_label __iomem *label = NULL;
+
+	if (labels)
+		label = (struct nd_namespace_label __iomem *) labels[idx];
+
+	return label;
+}
+
+#define for_each_label(l, label, labels) \
+	for (l = 0; (label = nd_get_label(labels, l)); l++)
+
+#define for_each_dpa_resource(ndd, res) \
+	for (res = (ndd)->dpa.child; res; res = res->sibling)
+
 #define for_each_dpa_resource_safe(ndd, res, next) \
 	for (res = (ndd)->dpa.child, next = res ? res->sibling : NULL; \
 			res; res = next, next = next ? next->sibling : NULL)
 
 struct nd_region {
 	struct device dev;
+	struct device *ns_seed;
 	u16 ndr_mappings;
 	u64 ndr_size;
 	u64 ndr_start;
@@ -88,13 +114,19 @@ enum nd_async_mode {
 	ND_ASYNC,
 };
 
+void wait_nd_bus_probe_idle(struct device *dev);
 void nd_device_register(struct device *dev);
 void nd_device_unregister(struct device *dev, enum nd_async_mode mode);
+int nd_uuid_store(struct device *dev, u8 **uuid_out, const char *buf,
+		size_t len);
+struct nd_dimm;
+struct nd_dimm_drvdata *to_ndd(struct nd_mapping *nd_mapping);
 int nd_dimm_init_nsarea(struct nd_dimm_drvdata *ndd);
 int nd_dimm_init_config_data(struct nd_dimm_drvdata *ndd);
 struct nd_region *to_nd_region(struct device *dev);
 int nd_region_to_namespace_type(struct nd_region *nd_region);
 int nd_region_register_namespaces(struct nd_region *nd_region, int *err);
+u64 nd_region_interleave_set_cookie(struct nd_region *nd_region);
 void nd_bus_lock(struct device *dev);
 void nd_bus_unlock(struct device *dev);
 bool is_nd_bus_locked(struct device *dev);
diff --git a/drivers/block/nd/pmem.c b/drivers/block/nd/pmem.c
index fc34677d0f48..bf380393da92 100644
--- a/drivers/block/nd/pmem.c
+++ b/drivers/block/nd/pmem.c
@@ -201,6 +201,23 @@ static int nd_pmem_probe(struct device *dev)
 	struct nd_namespace_io *nsio = to_nd_namespace_io(dev);
 	struct pmem_device *pmem;
 
+	if (resource_size(&nsio->res) < ND_MIN_NAMESPACE_SIZE) {
+		resource_size_t size = resource_size(&nsio->res);
+
+		dev_dbg(dev, "%s: size: %pa, too small must be at least %#x\n",
+				__func__, &size, ND_MIN_NAMESPACE_SIZE);
+		return -ENODEV;
+	}
+
+	if (nd_region_to_namespace_type(nd_region) == ND_DEVICE_NAMESPACE_PMEM) {
+		struct nd_namespace_pmem *nspm = to_nd_namespace_pmem(dev);
+
+		if (!nspm->uuid) {
+			dev_dbg(dev, "%s: uuid not set\n", __func__);
+			return -ENODEV;
+		}
+	}
+
 	pmem = pmem_alloc(dev, &nsio->res, nd_region->id);
 	if (IS_ERR(pmem))
 		return PTR_ERR(pmem);
@@ -220,13 +237,14 @@ static int nd_pmem_remove(struct device *dev)
 
 MODULE_ALIAS("pmem");
 MODULE_ALIAS_ND_DEVICE(ND_DEVICE_NAMESPACE_IO);
+MODULE_ALIAS_ND_DEVICE(ND_DEVICE_NAMESPACE_PMEM);
 static struct nd_device_driver nd_pmem_driver = {
 	.probe = nd_pmem_probe,
 	.remove = nd_pmem_remove,
 	.drv = {
 		.name = "pmem",
 	},
-	.type = ND_DRIVER_NAMESPACE_IO,
+	.type = ND_DRIVER_NAMESPACE_IO | ND_DRIVER_NAMESPACE_PMEM,
 };
 
 static int __init pmem_init(void)
diff --git a/drivers/block/nd/region.c b/drivers/block/nd/region.c
index 7e58b2a700c2..31bb33962e14 100644
--- a/drivers/block/nd/region.c
+++ b/drivers/block/nd/region.c
@@ -61,8 +61,11 @@ static int child_unregister(struct device *dev, void *data)
 
 static int nd_region_remove(struct device *dev)
 {
+	struct nd_region *nd_region = to_nd_region(dev);
+
 	/* flush attribute readers and disable */
 	nd_bus_lock(dev);
+	nd_region->ns_seed = NULL;
 	dev_set_drvdata(dev, NULL);
 	nd_bus_unlock(dev);
 
diff --git a/drivers/block/nd/region_devs.c b/drivers/block/nd/region_devs.c
index 221e6342b6ca..6b43a5c901cd 100644
--- a/drivers/block/nd/region_devs.c
+++ b/drivers/block/nd/region_devs.c
@@ -15,6 +15,7 @@
 #include <linux/slab.h>
 #include <linux/sort.h>
 #include <linux/io.h>
+#include <linux/nd.h>
 #include "nd-private.h"
 #include "nd.h"
 
@@ -99,6 +100,58 @@ int nd_region_to_namespace_type(struct nd_region *nd_region)
 
 	return 0;
 }
+EXPORT_SYMBOL(nd_region_to_namespace_type);
+
+static int is_uuid_busy(struct device *dev, void *data)
+{
+	struct nd_region *nd_region = to_nd_region(dev->parent);
+	u8 *uuid = data;
+
+	switch (nd_region_to_namespace_type(nd_region)) {
+	case ND_DEVICE_NAMESPACE_PMEM: {
+		struct nd_namespace_pmem *nspm = to_nd_namespace_pmem(dev);
+
+		if (!nspm->uuid)
+			break;
+		if (memcmp(uuid, nspm->uuid, NSLABEL_UUID_LEN) == 0)
+			return -EBUSY;
+		break;
+	}
+	case ND_DEVICE_NAMESPACE_BLK: {
+		/* TODO: blk namespace support */
+		break;
+	}
+	default:
+		break;
+	}
+
+	return 0;
+}
+
+static int is_namespace_uuid_busy(struct device *dev, void *data)
+{
+	if (is_nd_pmem(dev) || is_nd_blk(dev))
+		return device_for_each_child(dev, data, is_uuid_busy);
+	return 0;
+}
+
+/**
+ * nd_is_uuid_unique - verify that no other namespace has @uuid
+ * @dev: any device on a nd_bus
+ * @uuid: uuid to check
+ */
+bool nd_is_uuid_unique(struct device *dev, u8 *uuid)
+{
+	struct nd_bus *nd_bus = walk_to_nd_bus(dev);
+
+	if (!nd_bus)
+		return false;
+	WARN_ON_ONCE(!is_nd_bus_locked(&nd_bus->dev));
+	if (device_for_each_child(&nd_bus->dev, uuid,
+				is_namespace_uuid_busy) != 0)
+		return false;
+	return true;
+}
 
 static ssize_t size_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
@@ -151,6 +204,60 @@ static ssize_t set_cookie_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(set_cookie);
 
+resource_size_t nd_region_available_dpa(struct nd_region *nd_region)
+{
+	resource_size_t blk_max_overlap = 0, available, overlap;
+	int i;
+
+	WARN_ON(!is_nd_bus_locked(&nd_region->dev));
+
+ retry:
+	available = 0;
+	overlap = blk_max_overlap;
+	for (i = 0; i < nd_region->ndr_mappings; i++) {
+		struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+		struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+
+		/* if a dimm is disabled the available capacity is zero */
+		if (!ndd)
+			return 0;
+
+		if (is_nd_pmem(&nd_region->dev)) {
+			available += nd_pmem_available_dpa(nd_region,
+					nd_mapping, &overlap);
+			if (overlap > blk_max_overlap) {
+				blk_max_overlap = overlap;
+				goto retry;
+			}
+		} else if (is_nd_blk(&nd_region->dev)) {
+			/* TODO: BLK Namespace support */
+		}
+	}
+
+	return available;
+}
+
+static ssize_t available_size_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_region *nd_region = to_nd_region(dev);
+	unsigned long long available = 0;
+
+	/*
+	 * Flush in-flight updates and grab a snapshot of the available
+	 * size.  Of course, this value is potentially invalidated the
+	 * moment nd_bus_lock() is dropped, but that's userspace's
+	 * problem to not race itself.
+	 */
+	nd_bus_lock(dev);
+	wait_nd_bus_probe_idle(dev);
+	available = nd_region_available_dpa(nd_region);
+	nd_bus_unlock(dev);
+
+	return sprintf(buf, "%llu\n", available);
+}
+static DEVICE_ATTR_RO(available_size);
+
 static ssize_t init_namespaces_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
@@ -168,11 +275,29 @@ static ssize_t init_namespaces_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(init_namespaces);
 
+static ssize_t namespace_seed_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_region *nd_region = to_nd_region(dev);
+	ssize_t rc;
+
+	nd_bus_lock(dev);
+	if (nd_region->ns_seed)
+		rc = sprintf(buf, "%s\n", dev_name(nd_region->ns_seed));
+	else
+		rc = sprintf(buf, "\n");
+	nd_bus_unlock(dev);
+	return rc;
+}
+static DEVICE_ATTR_RO(namespace_seed);
+
 static struct attribute *nd_region_attributes[] = {
 	&dev_attr_size.attr,
 	&dev_attr_nstype.attr,
 	&dev_attr_mappings.attr,
 	&dev_attr_set_cookie.attr,
+	&dev_attr_available_size.attr,
+	&dev_attr_namespace_seed.attr,
 	&dev_attr_init_namespaces.attr,
 	NULL,
 };
@@ -182,12 +307,17 @@ static umode_t nd_region_visible(struct kobject *kobj, struct attribute *a, int
 	struct device *dev = container_of(kobj, typeof(*dev), kobj);
 	struct nd_region *nd_region = to_nd_region(dev);
 	struct nd_interleave_set *nd_set = nd_region->nd_set;
+	int type = nd_region_to_namespace_type(nd_region);
 
-	if (a != &dev_attr_set_cookie.attr)
+	if (a != &dev_attr_set_cookie.attr && a != &dev_attr_available_size.attr)
 		return a->mode;
 
-	if (is_nd_pmem(dev) && nd_set)
-			return a->mode;
+	if ((type == ND_DEVICE_NAMESPACE_PMEM
+				|| type == ND_DEVICE_NAMESPACE_BLK)
+			&& a == &dev_attr_available_size.attr)
+		return a->mode;
+	else if (is_nd_pmem(dev) && nd_set)
+		return a->mode;
 
 	return 0;
 }
@@ -198,6 +328,15 @@ struct attribute_group nd_region_attribute_group = {
 };
 EXPORT_SYMBOL_GPL(nd_region_attribute_group);
 
+u64 nd_region_interleave_set_cookie(struct nd_region *nd_region)
+{
+	struct nd_interleave_set *nd_set = nd_region->nd_set;
+
+	if (nd_set)
+		return nd_set->cookie;
+	return 0;
+}
+
 /*
  * Upon successful probe/remove, take/release a reference on the
  * associated interleave set (if present)
diff --git a/include/linux/libnd.h b/include/linux/libnd.h
index 52f669faacfd..3190a561ea59 100644
--- a/include/linux/libnd.h
+++ b/include/linux/libnd.h
@@ -40,8 +40,10 @@ typedef int (*ndctl_fn)(struct nd_bus_descriptor *nd_desc,
 		struct nd_dimm *nd_dimm, unsigned int cmd, void *buf,
 		unsigned int buf_len);
 
+struct nd_namespace_label;
 struct nd_mapping {
 	struct nd_dimm *nd_dimm;
+	struct nd_namespace_label **labels;
 	u64 start;
 	u64 size;
 };
diff --git a/include/linux/nd.h b/include/linux/nd.h
index da70e9962197..255c38a83083 100644
--- a/include/linux/nd.h
+++ b/include/linux/nd.h
@@ -28,16 +28,40 @@ static inline struct nd_device_driver *to_nd_device_driver(
 	return container_of(drv, struct nd_device_driver, drv);
 };
 
+/**
+ * struct nd_namespace_io - infrastructure for loading an nd_pmem instance
+ * @dev: namespace device created by the nd region driver
+ * @res: struct resource conversion of a NFIT SPA table
+ */
 struct nd_namespace_io {
 	struct device dev;
 	struct resource res;
 };
 
+/**
+ * struct nd_namespace_pmem - namespace device for dimm-backed interleaved memory
+ * @nsio: device and system physical address range to drive
+ * @alt_name: namespace name supplied in the dimm label
+ * @uuid: namespace uuid supplied in the dimm label
+ */
+struct nd_namespace_pmem {
+	struct nd_namespace_io nsio;
+	char *alt_name;
+	u8 *uuid;
+};
+
 static inline struct nd_namespace_io *to_nd_namespace_io(struct device *dev)
 {
 	return container_of(dev, struct nd_namespace_io, dev);
 }
 
+static inline struct nd_namespace_pmem *to_nd_namespace_pmem(struct device *dev)
+{
+	struct nd_namespace_io *nsio = to_nd_namespace_io(dev);
+
+	return container_of(nsio, struct nd_namespace_pmem, nsio);
+}
+
 #define MODULE_ALIAS_ND_DEVICE(type) \
 	MODULE_ALIAS("nd:t" __stringify(type) "*")
 #define ND_DEVICE_MODALIAS_FMT "nd:t%d"
diff --git a/include/uapi/linux/ndctl.h b/include/uapi/linux/ndctl.h
index 624a19d9e6e4..0b4dcabb248a 100644
--- a/include/uapi/linux/ndctl.h
+++ b/include/uapi/linux/ndctl.h
@@ -190,4 +190,8 @@ enum nd_driver_flags {
 	ND_DRIVER_NAMESPACE_PMEM  = 1 << ND_DEVICE_NAMESPACE_PMEM,
 	ND_DRIVER_NAMESPACE_BLK   = 1 << ND_DEVICE_NAMESPACE_BLK,
 };
+
+enum {
+	ND_MIN_NAMESPACE_SIZE = 0x00400000,
+};
 #endif /* __NDCTL_H__ */



* [PATCH v3 13/21] libnd: pmem label sets and namespace instantiation.
@ 2015-05-20 20:57   ` Dan Williams
From: Dan Williams @ 2015-05-20 20:57 UTC (permalink / raw)
  To: axboe
  Cc: linux-nvdimm, neilb, gregkh, linux-kernel, mingo, linux-acpi,
	jmoyer, hch

A complete label set is a PMEM-label per-dimm per-interleave-set where
all the UUIDs match and the interleave set cookie matches the hosting
interleave set.
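
As a rough userspace illustration of the completeness rule above (not the
code in this patch), the check could be sketched as below.  The struct
pmem_label type, its 'cookie' field, and label_set_complete() are
hypothetical names for illustration only, and the sketch assumes exactly
one candidate label has already been picked per dimm.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define UUID_LEN 16

/* Hypothetical in-memory view of the PMEM label found on one dimm. */
struct pmem_label {
	uint8_t uuid[UUID_LEN];
	uint64_t cookie;	/* interleave set cookie stored in the label */
};

/*
 * Return true when every dimm contributes a label, all labels carry the
 * same UUID, and every label's cookie matches the hosting interleave set.
 */
static bool label_set_complete(const struct pmem_label *labels,
		size_t ndimms, uint64_t set_cookie)
{
	size_t i;

	assert(labels != NULL);
	if (ndimms == 0)
		return false;

	for (i = 0; i < ndimms; i++) {
		if (labels[i].cookie != set_cookie)
			return false;
		if (memcmp(labels[i].uuid, labels[0].uuid, UUID_LEN) != 0)
			return false;
	}
	return true;
}
```

A single mismatched UUID or cookie on any dimm makes the whole set
incomplete in this sketch.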

Present sysfs attributes for manipulating a PMEM-namespace's 'alt_name',
'uuid', and 'size'.  A later patch will make these settings persistent
by writing back the label.
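
For the 'uuid' attribute, the parser added to core.c in this patch takes
16 pairs of hex digits and skips an optional separator after each pair.
A simplified userspace sketch of that parsing follows.  The names
uuid_parse_sketch() and hex_val() are hypothetical, and unlike the kernel
helper this version does not treat '\0' as a skippable separator.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <string.h>

/* Convert one hex digit to its value; caller has checked isxdigit(). */
static int hex_val(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	return tolower((unsigned char) c) - 'a' + 10;
}

/*
 * Parse 16 bytes written as hex pairs, each optionally followed by one
 * '-', ':' or '\n'.  Returns 0 on success and -1 on malformed input;
 * 'out' is only written on success.
 */
static int uuid_parse_sketch(const char *str, uint8_t out[16])
{
	uint8_t tmp[16];
	int i;

	assert(str != NULL && out != NULL);
	for (i = 0; i < 16; i++) {
		if (!isxdigit((unsigned char) str[0])
				|| !isxdigit((unsigned char) str[1]))
			return -1;
		tmp[i] = (uint8_t) ((hex_val(str[0]) << 4) | hex_val(str[1]));
		str += 2;
		if (*str == '-' || *str == ':' || *str == '\n')
			str++;
	}

	memcpy(out, tmp, sizeof(tmp));
	return 0;
}
```

Parsing into a temporary buffer first means a malformed write leaves the
previous value untouched, which is the same approach the patch takes.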

Note that PMEM allocations grow forwards from the start of an interleave
set (lowest dimm-physical-address (DPA)).  BLK-namespaces that alias
with a PMEM interleave set will grow allocations backward from the
highest DPA.
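
The placement rule above can be sketched in isolation as follows.  This
is a minimal illustration modeled on init_dpa_allocation() in this patch,
assuming an empty mapping; struct dpa_window and first_dpa() are
hypothetical names and not part of the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the DPA range a region maps on one dimm. */
struct dpa_window {
	uint64_t start;	/* lowest DPA of the mapping */
	uint64_t size;	/* length of the mapping in bytes */
};

/*
 * Pick the start of the first allocation of n bytes in an empty window:
 * PMEM begins at the lowest DPA, BLK ends at the highest DPA.
 */
static uint64_t first_dpa(const struct dpa_window *w, uint64_t n, bool is_blk)
{
	assert(n <= w->size);
	if (is_blk)
		return w->start + w->size - n;
	return w->start;
}
```

With a window of start 0x1000 and size 0x4000, a 0x1000 byte PMEM
allocation starts at 0x1000 while a BLK allocation of the same size
starts at 0x4000, so the two only collide once the free space between
them is exhausted.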

Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/block/nd/bus.c            |    6 
 drivers/block/nd/core.c           |   64 ++
 drivers/block/nd/dimm_devs.c      |  103 ++++
 drivers/block/nd/label.c          |   54 ++
 drivers/block/nd/label.h          |    3 
 drivers/block/nd/namespace_devs.c | 1024 +++++++++++++++++++++++++++++++++++++
 drivers/block/nd/nd-private.h     |   11 
 drivers/block/nd/nd.h             |   32 +
 drivers/block/nd/pmem.c           |   20 +
 drivers/block/nd/region.c         |    3 
 drivers/block/nd/region_devs.c    |  145 +++++
 include/linux/libnd.h             |    2 
 include/linux/nd.h                |   24 +
 include/uapi/linux/ndctl.h        |    4 
 14 files changed, 1484 insertions(+), 11 deletions(-)

diff --git a/drivers/block/nd/bus.c b/drivers/block/nd/bus.c
index 63b5182cf766..65af6bcc5472 100644
--- a/drivers/block/nd/bus.c
+++ b/drivers/block/nd/bus.c
@@ -364,8 +364,10 @@ u32 nd_cmd_out_size(struct nd_dimm *nd_dimm, int cmd,
 }
 EXPORT_SYMBOL_GPL(nd_cmd_out_size);
 
-static void wait_nd_bus_probe_idle(struct nd_bus *nd_bus)
+void wait_nd_bus_probe_idle(struct device *dev)
 {
+	struct nd_bus *nd_bus = walk_to_nd_bus(dev);
+
 	do {
 		if (nd_bus->probe_active == 0)
 			break;
@@ -384,7 +386,7 @@ static int nd_cmd_clear_to_send(struct nd_dimm *nd_dimm, unsigned int cmd)
 		return 0;
 
 	nd_bus = walk_to_nd_bus(&nd_dimm->dev);
-	wait_nd_bus_probe_idle(nd_bus);
+	wait_nd_bus_probe_idle(&nd_bus->dev);
 
 	if (atomic_read(&nd_dimm->busy))
 		return -EBUSY;
diff --git a/drivers/block/nd/core.c b/drivers/block/nd/core.c
index 38fb8f4c9a2c..0bf69abb47fc 100644
--- a/drivers/block/nd/core.c
+++ b/drivers/block/nd/core.c
@@ -14,6 +14,7 @@
 #include <linux/module.h>
 #include <linux/device.h>
 #include <linux/libnd.h>
+#include <linux/ctype.h>
 #include <linux/ndctl.h>
 #include <linux/mutex.h>
 #include <linux/slab.h>
@@ -107,6 +108,69 @@ struct nd_bus *walk_to_nd_bus(struct device *nd_dev)
 	return NULL;
 }
 
+static bool is_uuid_sep(char sep)
+{
+	if (sep == '\n' || sep == '-' || sep == ':' || sep == '\0')
+		return true;
+	return false;
+}
+
+static int nd_uuid_parse(struct device *dev, u8 *uuid_out, const char *buf,
+		size_t len)
+{
+	const char *str = buf;
+	u8 uuid[16];
+	int i;
+
+	for (i = 0; i < 16; i++) {
+		if (!isxdigit(str[0]) || !isxdigit(str[1])) {
+			dev_dbg(dev, "%s: pos: %d buf[%zd]: %c buf[%zd]: %c\n",
+					__func__, i, str - buf, str[0],
+					str + 1 - buf, str[1]);
+			return -EINVAL;
+		}
+
+		uuid[i] = (hex_to_bin(str[0]) << 4) | hex_to_bin(str[1]);
+		str += 2;
+		if (is_uuid_sep(*str))
+			str++;
+	}
+
+	memcpy(uuid_out, uuid, sizeof(uuid));
+	return 0;
+}
+
+/**
+ * nd_uuid_store - common implementation for writing 'uuid' sysfs attributes
+ * @dev: container device for the uuid property
+ * @uuid_out: uuid buffer to replace
+ * @buf: raw sysfs buffer to parse
+ *
+ * Enforce that uuids can only be changed while the device is disabled
+ * (driver detached)
+ * LOCKING: expects device_lock() is held on entry
+ */
+int nd_uuid_store(struct device *dev, u8 **uuid_out, const char *buf,
+		size_t len)
+{
+	u8 uuid[16];
+	int rc;
+
+	if (dev->driver)
+		return -EBUSY;
+
+	rc = nd_uuid_parse(dev, uuid, buf, len);
+	if (rc)
+		return rc;
+
+	kfree(*uuid_out);
+	*uuid_out = kmemdup(uuid, sizeof(uuid), GFP_KERNEL);
+	if (!(*uuid_out))
+		return -ENOMEM;
+
+	return 0;
+}
+
 static ssize_t commands_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
diff --git a/drivers/block/nd/dimm_devs.c b/drivers/block/nd/dimm_devs.c
index 013531b8adfa..b242d3ae6d12 100644
--- a/drivers/block/nd/dimm_devs.c
+++ b/drivers/block/nd/dimm_devs.c
@@ -159,6 +159,14 @@ struct nd_dimm *to_nd_dimm(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(to_nd_dimm);
 
+struct nd_dimm_drvdata *to_ndd(struct nd_mapping *nd_mapping)
+{
+	struct nd_dimm *nd_dimm = nd_mapping->nd_dimm;
+
+	return dev_get_drvdata(&nd_dimm->dev);
+}
+EXPORT_SYMBOL(to_ndd);
+
 const char *nd_dimm_name(struct nd_dimm *nd_dimm)
 {
 	return dev_name(&nd_dimm->dev);
@@ -247,6 +255,83 @@ struct nd_dimm *nd_dimm_create(struct nd_bus *nd_bus, void *provider_data,
 }
 EXPORT_SYMBOL_GPL(nd_dimm_create);
 
+/**
+ * nd_pmem_available_dpa - for the given dimm+region account unallocated dpa
+ * @nd_mapping: container of dpa-resource-root + labels
+ * @nd_region: constrain available space check to this reference region
+ * @overlap: calculate available space assuming this level of overlap
+ *
+ * Validate that a PMEM label, if present, aligns with the start of an
+ * interleave set and truncate the available size at the lowest BLK
+ * overlap point.
+ *
+ * The expectation is that this routine is called multiple times as it
+ * probes for the largest BLK encroachment for any single member DIMM of
+ * the interleave set.  Once that value is determined the PMEM-limit for
+ * the set can be established.
+ */
+resource_size_t nd_pmem_available_dpa(struct nd_region *nd_region,
+		struct nd_mapping *nd_mapping, resource_size_t *overlap)
+{
+	resource_size_t map_end, busy = 0, available, blk_start;
+	struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+	struct resource *res;
+	const char *reason;
+
+	if (!ndd)
+		return 0;
+
+	map_end = nd_mapping->start + nd_mapping->size - 1;
+	blk_start = max(nd_mapping->start, map_end + 1 - *overlap);
+	for_each_dpa_resource(ndd, res)
+		if (res->start >= nd_mapping->start && res->start < map_end) {
+			if (strncmp(res->name, "blk", 3) == 0)
+				blk_start = min(blk_start, res->start);
+			else if (res->start != nd_mapping->start) {
+				reason = "misaligned to iset";
+				goto err;
+			} else {
+				if (busy) {
+					reason = "duplicate overlapping PMEM reservations?";
+					goto err;
+				}
+				busy += resource_size(res);
+				continue;
+			}
+		} else if (res->end >= nd_mapping->start && res->end <= map_end) {
+			if (strncmp(res->name, "blk", 3) == 0) {
+				/*
+				 * If a BLK allocation overlaps the start of
+				 * PMEM the entire interleave set may now only
+				 * be used for BLK.
+				 */
+				blk_start = nd_mapping->start;
+			} else {
+				reason = "misaligned to iset";
+				goto err;
+			}
+		} else if (nd_mapping->start > res->start
+				&& nd_mapping->start < res->end) {
+			/* total eclipse of the mapping */
+			busy += nd_mapping->size;
+			blk_start = nd_mapping->start;
+		}
+
+	*overlap = map_end + 1 - blk_start;
+	available = blk_start - nd_mapping->start;
+	if (busy < available)
+		return available - busy;
+	return 0;
+
+ err:
+	/*
+	 * Something is wrong, PMEM must align with the start of the
+	 * interleave set, and there can only be one allocation per set.
+	 */
+	nd_dbg_dpa(nd_region, ndd, res, "%s\n", reason);
+	return 0;
+}
+
 void nd_dimm_free_dpa(struct nd_dimm_drvdata *ndd, struct resource *res)
 {
 	WARN_ON_ONCE(!is_nd_bus_locked(ndd->dev));
@@ -271,6 +356,24 @@ struct resource *nd_dimm_allocate_dpa(struct nd_dimm_drvdata *ndd,
 	return res;
 }
 
+/**
+ * nd_dimm_allocated_dpa - sum up the dpa currently allocated to this label_id
+ * @ndd: dimm driver data, container of dpa-resource-root + labels
+ * @label_id: dpa resource name of the form {pmem|blk}-<human readable uuid>
+ */
+resource_size_t nd_dimm_allocated_dpa(struct nd_dimm_drvdata *ndd,
+		struct nd_label_id *label_id)
+{
+	resource_size_t allocated = 0;
+	struct resource *res;
+
+	for_each_dpa_resource(ndd, res)
+		if (strcmp(res->name, label_id->id) == 0)
+			allocated += resource_size(res);
+
+	return allocated;
+}
+
 static int count_dimms(struct device *dev, void *c)
 {
 	int *count = c;
diff --git a/drivers/block/nd/label.c b/drivers/block/nd/label.c
index da5008e45917..ecd196b42d57 100644
--- a/drivers/block/nd/label.c
+++ b/drivers/block/nd/label.c
@@ -229,7 +229,7 @@ static bool preamble_current(struct nd_dimm_drvdata *ndd,
 	return true;
 }
 
-static char *nd_label_gen_id(struct nd_label_id *label_id, u8 *uuid, u32 flags)
+char *nd_label_gen_id(struct nd_label_id *label_id, u8 *uuid, u32 flags)
 {
 	if (!label_id || !uuid)
 		return NULL;
@@ -285,3 +285,55 @@ int nd_label_reserve_dpa(struct nd_dimm_drvdata *ndd)
 
 	return 0;
 }
+
+int nd_label_active_count(struct nd_dimm_drvdata *ndd)
+{
+	struct nd_namespace_index __iomem *nsindex;
+	unsigned long *free;
+	u32 nslot, slot;
+	int count = 0;
+
+	if (!preamble_current(ndd, &nsindex, &free, &nslot))
+		return 0;
+
+	for_each_clear_bit_le(slot, free, nslot) {
+		struct nd_namespace_label __iomem *nd_label;
+
+		nd_label = nd_label_base(ndd) + slot;
+
+		if (!slot_valid(nd_label, slot)) {
+			dev_dbg(ndd->dev,
+				"%s: slot%d invalid slot: %d dpa: %lx rawsize: %lx\n",
+					__func__, slot, readl(&nd_label->slot),
+					(unsigned long) readq(&nd_label->dpa),
+					(unsigned long) readq(&nd_label->rawsize));
+			continue;
+		}
+		count++;
+	}
+	return count;
+}
+
+struct nd_namespace_label __iomem *nd_label_active(
+		struct nd_dimm_drvdata *ndd, int n)
+{
+	struct nd_namespace_index __iomem *nsindex;
+	unsigned long *free;
+	u32 nslot, slot;
+
+	if (!preamble_current(ndd, &nsindex, &free, &nslot))
+		return NULL;
+
+	for_each_clear_bit_le(slot, free, nslot) {
+		struct nd_namespace_label __iomem *nd_label;
+
+		nd_label = nd_label_base(ndd) + slot;
+		if (!slot_valid(nd_label, slot))
+			continue;
+
+		if (n-- == 0)
+			return nd_label_base(ndd) + slot;
+	}
+
+	return NULL;
+}
diff --git a/drivers/block/nd/label.h b/drivers/block/nd/label.h
index 79ed885a43c0..4436624f4146 100644
--- a/drivers/block/nd/label.h
+++ b/drivers/block/nd/label.h
@@ -126,4 +126,7 @@ void nd_label_copy(struct nd_dimm_drvdata *ndd,
 		struct nd_namespace_index *dst,
 		struct nd_namespace_index *src);
 size_t sizeof_namespace_index(struct nd_dimm_drvdata *ndd);
+int nd_label_active_count(struct nd_dimm_drvdata *ndd);
+struct nd_namespace_label __iomem *nd_label_active(
+		struct nd_dimm_drvdata *ndd, int n);
 #endif /* __LABEL_H__ */
diff --git a/drivers/block/nd/namespace_devs.c b/drivers/block/nd/namespace_devs.c
index 8fbdf68c64d8..d0417575b18c 100644
--- a/drivers/block/nd/namespace_devs.c
+++ b/drivers/block/nd/namespace_devs.c
@@ -14,8 +14,11 @@
 #include <linux/device.h>
 #include <linux/slab.h>
 #include <linux/nd.h>
+#include "nd-private.h"
 #include "nd.h"
 
+#include <asm-generic/io-64-nonatomic-lo-hi.h>
+
 static void namespace_io_release(struct device *dev)
 {
 	struct nd_namespace_io *nsio = to_nd_namespace_io(dev);
@@ -23,11 +26,50 @@ static void namespace_io_release(struct device *dev)
 	kfree(nsio);
 }
 
+static void namespace_pmem_release(struct device *dev)
+{
+	struct nd_namespace_pmem *nspm = to_nd_namespace_pmem(dev);
+
+	kfree(nspm->alt_name);
+	kfree(nspm->uuid);
+	kfree(nspm);
+}
+
+static void namespace_blk_release(struct device *dev)
+{
+	/* TODO: blk namespace support */
+}
+
 static struct device_type namespace_io_device_type = {
 	.name = "nd_namespace_io",
 	.release = namespace_io_release,
 };
 
+static struct device_type namespace_pmem_device_type = {
+	.name = "nd_namespace_pmem",
+	.release = namespace_pmem_release,
+};
+
+static struct device_type namespace_blk_device_type = {
+	.name = "nd_namespace_blk",
+	.release = namespace_blk_release,
+};
+
+static bool is_namespace_pmem(struct device *dev)
+{
+	return dev ? dev->type == &namespace_pmem_device_type : false;
+}
+
+static bool is_namespace_blk(struct device *dev)
+{
+	return dev ? dev->type == &namespace_blk_device_type : false;
+}
+
+static bool is_namespace_io(struct device *dev)
+{
+	return dev ? dev->type == &namespace_io_device_type : false;
+}
+
 static ssize_t nstype_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
@@ -37,13 +79,674 @@ static ssize_t nstype_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(nstype);
 
+static ssize_t __alt_name_store(struct device *dev, const char *buf,
+		const size_t len)
+{
+	char *input, *pos, *alt_name, **ns_altname;
+	ssize_t rc;
+
+	if (is_namespace_pmem(dev)) {
+		struct nd_namespace_pmem *nspm = to_nd_namespace_pmem(dev);
+
+		ns_altname = &nspm->alt_name;
+	} else if (is_namespace_blk(dev)) {
+		/* TODO: blk namespace support */
+		return -ENXIO;
+	} else
+		return -ENXIO;
+
+	if (dev->driver)
+		return -EBUSY;
+
+	input = kmemdup(buf, len + 1, GFP_KERNEL);
+	if (!input)
+		return -ENOMEM;
+
+	input[len] = '\0';
+	pos = strim(input);
+	if (strlen(pos) + 1 > NSLABEL_NAME_LEN) {
+		rc = -EINVAL;
+		goto out;
+	}
+
+	alt_name = kzalloc(NSLABEL_NAME_LEN, GFP_KERNEL);
+	if (!alt_name) {
+		rc = -ENOMEM;
+		goto out;
+	}
+	kfree(*ns_altname);
+	*ns_altname = alt_name;
+	sprintf(*ns_altname, "%s", pos);
+	rc = len;
+
+out:
+	kfree(input);
+	return rc;
+}
+
+static ssize_t alt_name_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	ssize_t rc;
+
+	device_lock(dev);
+	nd_bus_lock(dev);
+	wait_nd_bus_probe_idle(dev);
+	rc = __alt_name_store(dev, buf, len);
+	dev_dbg(dev, "%s: %s (%zd)\n", __func__, rc < 0 ? "fail" : "success", rc);
+	nd_bus_unlock(dev);
+	device_unlock(dev);
+
+	return rc;
+}
+
+static ssize_t alt_name_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	char *ns_altname;
+
+	if (is_namespace_pmem(dev)) {
+		struct nd_namespace_pmem *nspm = to_nd_namespace_pmem(dev);
+
+		ns_altname = nspm->alt_name;
+	} else if (is_namespace_blk(dev)) {
+		/* TODO: blk namespace support */
+		return -ENXIO;
+	} else
+		return -ENXIO;
+
+	return sprintf(buf, "%s\n", ns_altname ? ns_altname : "");
+}
+static DEVICE_ATTR_RW(alt_name);
+
+static int scan_free(struct nd_region *nd_region,
+		struct nd_mapping *nd_mapping, struct nd_label_id *label_id,
+		resource_size_t n)
+{
+	bool is_blk = strncmp(label_id->id, "blk", 3) == 0;
+	struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+	int rc = 0;
+
+	while (n) {
+		struct resource *res, *last;
+		resource_size_t new_start;
+
+		last = NULL;
+		for_each_dpa_resource(ndd, res)
+			if (strcmp(res->name, label_id->id) == 0)
+				last = res;
+		res = last;
+		if (!res)
+			return 0;
+
+		if (n >= resource_size(res)) {
+			n -= resource_size(res);
+			nd_dbg_dpa(nd_region, ndd, res, "delete %d\n", rc);
+			nd_dimm_free_dpa(ndd, res);
+			/* retry with last resource deleted */
+			continue;
+		}
+
+		/*
+		 * Keep BLK allocations relegated to high DPA as much as
+		 * possible
+		 */
+		if (is_blk)
+			new_start = res->start + n;
+		else
+			new_start = res->start;
+
+		rc = adjust_resource(res, new_start, resource_size(res) - n);
+		nd_dbg_dpa(nd_region, ndd, res, "shrink %d\n", rc);
+		break;
+	}
+
+	return rc;
+}
+
+/**
+ * shrink_dpa_allocation - for each dimm in region free n bytes for label_id
+ * @nd_region: the set of dimms to reclaim @n bytes from
+ * @label_id: unique identifier for the namespace consuming this dpa range
+ * @n: number of bytes per-dimm to release
+ *
+ * Assumes resources are ordered.  Starting from the end, try to shrink
+ * the last allocation by @n via adjust_resource(); if @n is at least
+ * the size of that allocation, delete it and continue with the 'new'
+ * last allocation in the label set.
+ */
+static int shrink_dpa_allocation(struct nd_region *nd_region,
+		struct nd_label_id *label_id, resource_size_t n)
+{
+	int i;
+
+	for (i = 0; i < nd_region->ndr_mappings; i++) {
+		struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+		int rc;
+
+		rc = scan_free(nd_region, nd_mapping, label_id, n);
+		if (rc)
+			return rc;
+	}
+
+	return 0;
+}
+
+static resource_size_t init_dpa_allocation(struct nd_label_id *label_id,
+		struct nd_region *nd_region, struct nd_mapping *nd_mapping,
+		resource_size_t n)
+{
+	bool is_blk = strncmp(label_id->id, "blk", 3) == 0;
+	struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+	resource_size_t first_dpa;
+	struct resource *res;
+	int rc = 0;
+
+	/* allocate blk from highest dpa first */
+	if (is_blk)
+		first_dpa = nd_mapping->start + nd_mapping->size - n;
+	else
+		first_dpa = nd_mapping->start;
+
+	/* first resource allocation for this label-id or dimm */
+	res = nd_dimm_allocate_dpa(ndd, label_id, first_dpa, n);
+	if (!res)
+		rc = -EBUSY;
+
+	nd_dbg_dpa(nd_region, ndd, res, "init %d\n", rc);
+	return rc ? n : 0;
+}
+
+static bool space_valid(bool is_pmem, struct nd_label_id *label_id,
+		struct resource *res)
+{
+	/*
+	 * For BLK-space any space is valid, for PMEM-space, it must be
+	 * contiguous with an existing allocation.
+	 */
+	if (!is_pmem)
+		return true;
+	if (!res || strcmp(res->name, label_id->id) == 0)
+		return true;
+	return false;
+}
+
+enum alloc_loc {
+	ALLOC_ERR = 0, ALLOC_BEFORE, ALLOC_MID, ALLOC_AFTER,
+};
+
+static resource_size_t scan_allocate(struct nd_region *nd_region,
+		struct nd_mapping *nd_mapping, struct nd_label_id *label_id,
+		resource_size_t n)
+{
+	resource_size_t mapping_end = nd_mapping->start + nd_mapping->size - 1;
+	bool is_pmem = strncmp(label_id->id, "pmem", 4) == 0;
+	struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+	const resource_size_t to_allocate = n;
+	struct resource *res;
+	int first;
+
+ retry:
+	first = 0;
+	for_each_dpa_resource(ndd, res) {
+		resource_size_t allocate, available = 0, free_start, free_end;
+		struct resource *next = res->sibling, *new_res = NULL;
+		enum alloc_loc loc = ALLOC_ERR;
+		const char *action;
+		int rc = 0;
+
+		/* ignore resources outside this nd_mapping */
+		if (res->start > mapping_end)
+			continue;
+		if (res->end < nd_mapping->start)
+			continue;
+
+		/* space at the beginning of the mapping */
+		if (!first++ && res->start > nd_mapping->start) {
+			free_start = nd_mapping->start;
+			available = res->start - free_start;
+			if (space_valid(is_pmem, label_id, NULL))
+				loc = ALLOC_BEFORE;
+		}
+
+		/* space between allocations */
+		if (!loc && next) {
+			free_start = res->start + resource_size(res);
+			free_end = min(mapping_end, next->start - 1);
+			if (space_valid(is_pmem, label_id, res)
+					&& free_start < free_end) {
+				available = free_end + 1 - free_start;
+				loc = ALLOC_MID;
+			}
+		}
+
+		/* space at the end of the mapping */
+		if (!loc && !next) {
+			free_start = res->start + resource_size(res);
+			free_end = mapping_end;
+			if (space_valid(is_pmem, label_id, res)
+					&& free_start < free_end) {
+				available = free_end + 1 - free_start;
+				loc = ALLOC_AFTER;
+			}
+		}
+
+		if (!loc || !available)
+			continue;
+		allocate = min(available, n);
+		switch (loc) {
+		case ALLOC_BEFORE:
+			if (strcmp(res->name, label_id->id) == 0) {
+				/* adjust current resource up */
+				if (is_pmem)
+					return n;
+				rc = adjust_resource(res, res->start - allocate,
+						resource_size(res) + allocate);
+				action = "cur grow up";
+			} else
+				action = "allocate";
+			break;
+		case ALLOC_MID:
+			if (strcmp(next->name, label_id->id) == 0) {
+				/* adjust next resource up */
+				if (is_pmem)
+					return n;
+				rc = adjust_resource(next, next->start
+						- allocate, resource_size(next)
+						+ allocate);
+				new_res = next;
+				action = "next grow up";
+			} else if (strcmp(res->name, label_id->id) == 0) {
+				action = "grow down";
+			} else
+				action = "allocate";
+			break;
+		case ALLOC_AFTER:
+			if (strcmp(res->name, label_id->id) == 0)
+				action = "grow down";
+			else
+				action = "allocate";
+			break;
+		default:
+			return n;
+		}
+
+		if (strcmp(action, "allocate") == 0) {
+			/* BLK allocate bottom up */
+			if (!is_pmem)
+				free_start += available - allocate;
+			else if (free_start != nd_mapping->start)
+				return n;
+
+			new_res = nd_dimm_allocate_dpa(ndd, label_id,
+					free_start, allocate);
+			if (!new_res)
+				rc = -EBUSY;
+		} else if (strcmp(action, "grow down") == 0) {
+			/* adjust current resource down */
+			rc = adjust_resource(res, res->start, resource_size(res)
+					+ allocate);
+		}
+
+		if (!new_res)
+			new_res = res;
+
+		nd_dbg_dpa(nd_region, ndd, new_res, "%s(%d) %d\n",
+				action, loc, rc);
+
+		if (rc)
+			return n;
+
+		n -= allocate;
+		if (n) {
+			/*
+			 * Retry scan with newly inserted resources.
+			 * For example, if we did an ALLOC_BEFORE
+			 * insertion there may also have been space
+			 * available for an ALLOC_AFTER insertion, so we
+			 * need to check this same resource again
+			 */
+			goto retry;
+		} else
+			return 0;
+	}
+
+	if (is_pmem && n == to_allocate)
+		return init_dpa_allocation(label_id, nd_region, nd_mapping, n);
+	return n;
+}
+
+/**
+ * grow_dpa_allocation - for each dimm allocate n bytes for @label_id
+ * @nd_region: the set of dimms to allocate @n more bytes from
+ * @label_id: unique identifier for the namespace consuming this dpa range
+ * @n: number of bytes per-dimm to add to the existing allocation
+ *
+ * Assumes resources are ordered.  For BLK regions, first consume
+ * BLK-only available DPA free space, then consume PMEM-aliased DPA
+ * space starting at the highest DPA.  For PMEM regions start
+ * allocations from the start of an interleave set and end at the first
+ * BLK allocation or the end of the interleave set, whichever comes
+ * first.
+ */
+static int grow_dpa_allocation(struct nd_region *nd_region,
+		struct nd_label_id *label_id, resource_size_t n)
+{
+	int i;
+
+	for (i = 0; i < nd_region->ndr_mappings; i++) {
+		struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+		resource_size_t rem;
+
+		rem = scan_allocate(nd_region, nd_mapping, label_id, n);
+		if (rem)
+			return -ENXIO;
+	}
+
+	return 0;
+}
+
+static void nd_namespace_pmem_set_size(struct nd_region *nd_region,
+		struct nd_namespace_pmem *nspm, resource_size_t size)
+{
+	struct resource *res = &nspm->nsio.res;
+
+	res->start = nd_region->ndr_start;
+	res->end = nd_region->ndr_start + size - 1;
+}
+
+static ssize_t __size_store(struct device *dev, unsigned long long val)
+{
+	resource_size_t allocated = 0, available = 0;
+	struct nd_region *nd_region = to_nd_region(dev->parent);
+	struct nd_mapping *nd_mapping;
+	struct nd_dimm_drvdata *ndd;
+	struct nd_label_id label_id;
+	u32 flags = 0, remainder;
+	u8 *uuid = NULL;
+	int rc, i;
+
+	if (dev->driver)
+		return -EBUSY;
+
+	if (is_namespace_pmem(dev)) {
+		struct nd_namespace_pmem *nspm = to_nd_namespace_pmem(dev);
+
+		uuid = nspm->uuid;
+	} else if (is_namespace_blk(dev)) {
+		/* TODO: blk namespace support */
+		return -ENXIO;
+	}
+
+	/*
+	 * We need a uuid for the allocation-label and dimm(s) on which
+	 * to store the label.
+	 */
+	if (!uuid || nd_region->ndr_mappings == 0)
+		return -ENXIO;
+
+	div_u64_rem(val, SZ_4K * nd_region->ndr_mappings, &remainder);
+	if (remainder) {
+		dev_dbg(dev, "%llu is not %dK aligned\n", val,
+				(SZ_4K * nd_region->ndr_mappings) / SZ_1K);
+		return -EINVAL;
+	}
+
+	nd_label_gen_id(&label_id, uuid, flags);
+	for (i = 0; i < nd_region->ndr_mappings; i++) {
+		nd_mapping = &nd_region->mapping[i];
+		ndd = to_ndd(nd_mapping);
+
+		/*
+		 * All dimms in an interleave set, or the base dimm for a blk
+		 * region, need to be enabled for the size to be changed.
+		 */
+		if (!ndd)
+			return -ENXIO;
+
+		allocated += nd_dimm_allocated_dpa(ndd, &label_id);
+	}
+	available = nd_region_available_dpa(nd_region);
+
+	if (val > available + allocated)
+		return -ENOSPC;
+
+	if (val == allocated)
+		return 0;
+
+	val = div_u64(val, nd_region->ndr_mappings);
+	allocated = div_u64(allocated, nd_region->ndr_mappings);
+	if (val < allocated)
+		rc = shrink_dpa_allocation(nd_region, &label_id, allocated - val);
+	else
+		rc = grow_dpa_allocation(nd_region, &label_id, val - allocated);
+
+	if (rc)
+		return rc;
+
+	if (is_namespace_pmem(dev)) {
+		struct nd_namespace_pmem *nspm = to_nd_namespace_pmem(dev);
+
+		nd_namespace_pmem_set_size(nd_region, nspm,
+				val * nd_region->ndr_mappings);
+	}
+
+	return rc;
+}
+
+static ssize_t size_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	unsigned long long val;
+	u8 **uuid = NULL;
+	int rc;
+
+	rc = kstrtoull(buf, 0, &val);
+	if (rc)
+		return rc;
+
+	device_lock(dev);
+	nd_bus_lock(dev);
+	wait_nd_bus_probe_idle(dev);
+	rc = __size_store(dev, val);
+
+	if (is_namespace_pmem(dev)) {
+		struct nd_namespace_pmem *nspm = to_nd_namespace_pmem(dev);
+
+		uuid = &nspm->uuid;
+	} else if (is_namespace_blk(dev)) {
+		/* TODO: blk namespace support */
+		rc = -ENXIO;
+	}
+
+	if (rc == 0 && val == 0 && uuid) {
+		/* setting size zero == 'delete namespace' */
+		kfree(*uuid);
+		*uuid = NULL;
+	}
+
+	dev_dbg(dev, "%s: %llx %s (%d)\n", __func__, val, rc < 0
+			? "fail" : "success", rc);
+
+	nd_bus_unlock(dev);
+	device_unlock(dev);
+
+	return rc ? rc : len;
+}
+
+static ssize_t size_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	if (is_namespace_pmem(dev)) {
+		struct nd_namespace_pmem *nspm = to_nd_namespace_pmem(dev);
+
+		return sprintf(buf, "%llu\n", (unsigned long long)
+				resource_size(&nspm->nsio.res));
+	} else if (is_namespace_blk(dev)) {
+		/* TODO: blk namespace support */
+		return -ENXIO;
+	} else if (is_namespace_io(dev)) {
+		struct nd_namespace_io *nsio = to_nd_namespace_io(dev);
+
+		return sprintf(buf, "%llu\n", (unsigned long long)
+				resource_size(&nsio->res));
+	} else
+		return -ENXIO;
+}
+static DEVICE_ATTR(size, S_IRUGO, size_show, size_store);
+
+static ssize_t uuid_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	u8 *uuid;
+
+	if (is_namespace_pmem(dev)) {
+		struct nd_namespace_pmem *nspm = to_nd_namespace_pmem(dev);
+
+		uuid = nspm->uuid;
+	} else if (is_namespace_blk(dev)) {
+		/* TODO: blk namespace support */
+		return -ENXIO;
+	} else
+		return -ENXIO;
+
+	if (uuid)
+		return sprintf(buf, "%pUb\n", uuid);
+	return sprintf(buf, "\n");
+}
+
+/**
+ * namespace_update_uuid - check for a unique uuid and whether we're "renaming"
+ * @nd_region: parent region so we can update all dimms in the set
+ * @dev: namespace type for generating label_id
+ * @new_uuid: incoming uuid
+ * @old_uuid: reference to the uuid storage location in the namespace object
+ */
+static int namespace_update_uuid(struct nd_region *nd_region,
+		struct device *dev, u8 *new_uuid, u8 **old_uuid)
+{
+	u32 flags = is_namespace_blk(dev) ? NSLABEL_FLAG_LOCAL : 0;
+	struct nd_label_id old_label_id;
+	struct nd_label_id new_label_id;
+	int i, rc;
+
+	rc = nd_is_uuid_unique(dev, new_uuid) ? 0 : -EINVAL;
+	if (rc) {
+		kfree(new_uuid);
+		return rc;
+	}
+
+	if (*old_uuid == NULL)
+		goto out;
+
+	nd_label_gen_id(&old_label_id, *old_uuid, flags);
+	nd_label_gen_id(&new_label_id, new_uuid, flags);
+	for (i = 0; i < nd_region->ndr_mappings; i++) {
+		struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+		struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+		struct resource *res;
+
+		for_each_dpa_resource(ndd, res)
+			if (strcmp(res->name, old_label_id.id) == 0)
+				sprintf((void *) res->name, "%s",
+						new_label_id.id);
+	}
+	kfree(*old_uuid);
+ out:
+	*old_uuid = new_uuid;
+	return 0;
+}
+
+static ssize_t uuid_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	struct nd_region *nd_region = to_nd_region(dev->parent);
+	u8 *uuid = NULL;
+	u8 **ns_uuid;
+	ssize_t rc;
+
+	if (is_namespace_pmem(dev)) {
+		struct nd_namespace_pmem *nspm = to_nd_namespace_pmem(dev);
+
+		ns_uuid = &nspm->uuid;
+	} else if (is_namespace_blk(dev)) {
+		/* TODO: blk namespace support */
+		return -ENXIO;
+	} else
+		return -ENXIO;
+
+	device_lock(dev);
+	nd_bus_lock(dev);
+	wait_nd_bus_probe_idle(dev);
+	rc = nd_uuid_store(dev, &uuid, buf, len);
+	if (rc >= 0)
+		rc = namespace_update_uuid(nd_region, dev, uuid, ns_uuid);
+	dev_dbg(dev, "%s: result: %zd wrote: %s%s", __func__,
+			rc, buf, buf[len - 1] == '\n' ? "" : "\n");
+	nd_bus_unlock(dev);
+	device_unlock(dev);
+
+	return rc ? rc : len;
+}
+static DEVICE_ATTR_RW(uuid);
+
+static ssize_t resource_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct resource *res;
+
+	if (is_namespace_pmem(dev)) {
+		struct nd_namespace_pmem *nspm = to_nd_namespace_pmem(dev);
+
+		res = &nspm->nsio.res;
+	} else if (is_namespace_io(dev)) {
+		struct nd_namespace_io *nsio = to_nd_namespace_io(dev);
+
+		res = &nsio->res;
+	} else
+		return -ENXIO;
+
+	/* no address to convey if the namespace has no allocation */
+	if (resource_size(res) == 0)
+		return -ENXIO;
+	return sprintf(buf, "%#llx\n", (unsigned long long) res->start);
+}
+static DEVICE_ATTR_RO(resource);
+
 static struct attribute *nd_namespace_attributes[] = {
 	&dev_attr_nstype.attr,
+	&dev_attr_size.attr,
+	&dev_attr_uuid.attr,
+	&dev_attr_resource.attr,
+	&dev_attr_alt_name.attr,
 	NULL,
 };
 
+static umode_t nd_namespace_attr_visible(struct kobject *kobj, struct attribute *a, int n)
+{
+	struct device *dev = container_of(kobj, struct device, kobj);
+
+	if (a == &dev_attr_resource.attr) {
+		if (is_namespace_blk(dev))
+			return 0;
+		return a->mode;
+	}
+
+	if (is_namespace_pmem(dev) || is_namespace_blk(dev)) {
+		if (a == &dev_attr_size.attr)
+			return S_IWUSR | S_IRUGO;
+		return a->mode;
+	}
+
+	if (a == &dev_attr_nstype.attr || a == &dev_attr_size.attr)
+		return a->mode;
+
+	return 0;
+}
+
 static struct attribute_group nd_namespace_attribute_group = {
 	.attrs = nd_namespace_attributes,
+	.is_visible = nd_namespace_attr_visible,
 };
 
 static const struct attribute_group *nd_namespace_attribute_groups[] = {
@@ -80,23 +783,326 @@ static struct device **create_namespace_io(struct nd_region *nd_region)
 	return devs;
 }
 
+static bool has_uuid_at_pos(struct nd_region *nd_region, u8 *uuid, u64 cookie, u16 pos)
+{
+	struct nd_namespace_label __iomem *found = NULL;
+	int i;
+
+	for (i = 0; i < nd_region->ndr_mappings; i++) {
+		struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+		struct nd_namespace_label __iomem *nd_label;
+		u8 label_uuid[NSLABEL_UUID_LEN];
+		u8 *found_uuid = NULL;
+		int l;
+
+		for_each_label(l, nd_label, nd_mapping->labels) {
+			u64 isetcookie = readq(&nd_label->isetcookie);
+			u16 position = readw(&nd_label->position);
+			u16 nlabel = readw(&nd_label->nlabel);
+
+			if (isetcookie != cookie)
+				continue;
+
+			memcpy_fromio(label_uuid, nd_label->uuid,
+					NSLABEL_UUID_LEN);
+			if (memcmp(label_uuid, uuid, NSLABEL_UUID_LEN) != 0)
+				continue;
+
+			if (found_uuid) {
+				dev_dbg(to_ndd(nd_mapping)->dev,
+						"%s duplicate entry for uuid\n",
+						__func__);
+				return false;
+			}
+			found_uuid = label_uuid;
+			if (nlabel != nd_region->ndr_mappings)
+				continue;
+			if (position != pos)
+				continue;
+			found = nd_label;
+			break;
+		}
+		if (found)
+			break;
+	}
+	return found != NULL;
+}
+
+static int select_pmem_uuid(struct nd_region *nd_region, u8 *pmem_uuid)
+{
+	struct nd_namespace_label __iomem *select = NULL;
+	int i;
+
+	if (!pmem_uuid)
+		return -ENODEV;
+
+	for (i = 0; i < nd_region->ndr_mappings; i++) {
+		struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+		struct nd_namespace_label __iomem *nd_label;
+		u64 hw_start, hw_end, pmem_start, pmem_end;
+		int l;
+
+		for_each_label(l, nd_label, nd_mapping->labels) {
+			u8 label_uuid[NSLABEL_UUID_LEN];
+
+			memcpy_fromio(label_uuid, nd_label->uuid,
+					NSLABEL_UUID_LEN);
+			if (memcmp(label_uuid, pmem_uuid, NSLABEL_UUID_LEN) == 0)
+				break;
+		}
+
+		if (!nd_label) {
+			WARN_ON(1);
+			return -EINVAL;
+		}
+
+		select = nd_label;
+		/*
+		 * Check that this label is compliant with the dpa
+		 * range published in NFIT
+		 */
+		hw_start = nd_mapping->start;
+		hw_end = hw_start + nd_mapping->size;
+		pmem_start = readq(&select->dpa);
+		pmem_end = pmem_start + readq(&select->rawsize);
+		if (pmem_start == hw_start && pmem_end <= hw_end)
+			/* pass */;
+		else
+			return -EINVAL;
+
+		nd_set_label(nd_mapping->labels, select, 0);
+		nd_set_label(nd_mapping->labels, (void __iomem *) NULL, 1);
+	}
+	return 0;
+}
+
+/**
+ * find_pmem_label_set - validate interleave set labelling, retrieve label0
+ * @nd_region: region with mappings to validate
+ */
+static int find_pmem_label_set(struct nd_region *nd_region,
+		struct nd_namespace_pmem *nspm)
+{
+	u64 cookie = nd_region_interleave_set_cookie(nd_region);
+	struct nd_namespace_label __iomem *nd_label;
+	u8 select_uuid[NSLABEL_UUID_LEN];
+	resource_size_t size = 0;
+	u8 *pmem_uuid = NULL;
+	int rc = -ENODEV, l;
+	u16 i;
+
+	if (cookie == 0)
+		return -ENXIO;
+
+	/*
+	 * Find a complete set of labels by uuid.  By definition we can start
+	 * with any mapping as the reference label
+	 */
+	for_each_label(l, nd_label, nd_region->mapping[0].labels) {
+		u64 isetcookie = readq(&nd_label->isetcookie);
+		u8 label_uuid[NSLABEL_UUID_LEN];
+
+		if (isetcookie != cookie)
+			continue;
+
+		memcpy_fromio(label_uuid, nd_label->uuid,
+				NSLABEL_UUID_LEN);
+		for (i = 0; i < nd_region->ndr_mappings; i++)
+			if (!has_uuid_at_pos(nd_region, label_uuid, cookie, i))
+				break;
+		if (i < nd_region->ndr_mappings) {
+			/*
+			 * Give up if we don't find an instance of a
+			 * uuid at each position (from 0 to
+			 * nd_region->ndr_mappings - 1), or if we find a
+			 * dimm with two instances of the same uuid.
+			 */
+			rc = -EINVAL;
+			goto err;
+		} else if (pmem_uuid) {
+			/*
+			 * If there is more than one valid uuid set, we
+			 * need userspace to clean this up.
+			 */
+			rc = -EBUSY;
+			goto err;
+		}
+		memcpy(select_uuid, label_uuid, NSLABEL_UUID_LEN);
+		pmem_uuid = select_uuid;
+	}
+
+	/*
+	 * Fix up each mapping's 'labels' to have the validated pmem label for
+	 * that position at labels[0], and NULL at labels[1].  In the process,
+	 * check that the namespace aligns with interleave-set.  We know
+	 * that it does not overlap with any blk namespaces by virtue of
+	 * the dimm being enabled (i.e. nd_label_reserve_dpa()
+	 * succeeded).
+	 */
+	rc = select_pmem_uuid(nd_region, pmem_uuid);
+	if (rc)
+		goto err;
+
+	/* Calculate total size and populate namespace properties from label0 */
+	for (i = 0; i < nd_region->ndr_mappings; i++) {
+		struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+		struct nd_namespace_label __iomem *label0;
+
+		label0 = nd_get_label(nd_mapping->labels, 0);
+		size += readq(&label0->rawsize);
+		if (readw(&label0->position) != 0)
+			continue;
+		WARN_ON(nspm->alt_name || nspm->uuid);
+		nspm->alt_name = kmemdup((void __force *) label0->name,
+				NSLABEL_NAME_LEN, GFP_KERNEL);
+		nspm->uuid = kmemdup((void __force *) label0->uuid,
+				NSLABEL_UUID_LEN, GFP_KERNEL);
+	}
+
+	if (!nspm->alt_name || !nspm->uuid) {
+		rc = -ENOMEM;
+		goto err;
+	}
+
+	nd_namespace_pmem_set_size(nd_region, nspm, size);
+
+	return 0;
+ err:
+	switch (rc) {
+	case -EINVAL:
+		dev_dbg(&nd_region->dev, "%s: invalid label(s)\n", __func__);
+		break;
+	case -ENODEV:
+		dev_dbg(&nd_region->dev, "%s: label not found\n", __func__);
+		break;
+	default:
+		dev_dbg(&nd_region->dev, "%s: unexpected err: %d\n", __func__, rc);
+		break;
+	}
+	return rc;
+}
+
+static struct device **create_namespace_pmem(struct nd_region *nd_region)
+{
+	struct nd_namespace_pmem *nspm;
+	struct device *dev, **devs;
+	struct resource *res;
+	int rc;
+
+	nspm = kzalloc(sizeof(*nspm), GFP_KERNEL);
+	if (!nspm)
+		return NULL;
+
+	dev = &nspm->nsio.dev;
+	dev->type = &namespace_pmem_device_type;
+	res = &nspm->nsio.res;
+	res->name = dev_name(&nd_region->dev);
+	res->flags = IORESOURCE_MEM;
+	rc = find_pmem_label_set(nd_region, nspm);
+	if (rc == -ENODEV) {
+		int i;
+
+		/* Pass, try to permit namespace creation... */
+		for (i = 0; i < nd_region->ndr_mappings; i++) {
+			struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+
+			kfree(nd_mapping->labels);
+			nd_mapping->labels = NULL;
+		}
+
+		/* Publish a zero-sized namespace for userspace to configure. */
+		nd_namespace_pmem_set_size(nd_region, nspm, 0);
+
+		rc = 0;
+	} else if (rc)
+		goto err;
+
+	devs = kcalloc(2, sizeof(struct device *), GFP_KERNEL);
+	if (!devs)
+		goto err;
+
+	devs[0] = dev;
+	return devs;
+
+ err:
+	namespace_pmem_release(&nspm->nsio.dev);
+	return NULL;
+}
+
+static int init_active_labels(struct nd_region *nd_region)
+{
+	int i;
+
+	for (i = 0; i < nd_region->ndr_mappings; i++) {
+		struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+		struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+		int count, j;
+
+		/*
+		 * If the dimm is disabled then prevent the region from
+		 * being activated if it aliases DPA.
+		 */
+		if (!ndd) {
+			struct nd_dimm *nd_dimm = nd_mapping->nd_dimm;
+
+			if ((nd_dimm->flags & NDD_ALIASING) == 0)
+				return 0;
+			dev_dbg(&nd_region->dev, "%s: is disabled, failing probe\n",
+					dev_name(&nd_mapping->nd_dimm->dev));
+			return -ENXIO;
+		}
+
+		count = nd_label_active_count(ndd);
+		dev_dbg(ndd->dev, "%s: %d\n", __func__, count);
+		if (!count)
+			continue;
+		nd_mapping->labels = kcalloc(count + 1,
+				sizeof(struct nd_namespace_label *), GFP_KERNEL);
+		if (!nd_mapping->labels)
+			return -ENOMEM;
+		for (j = 0; j < count; j++) {
+			struct nd_namespace_label __iomem *label;
+
+			label = nd_label_active(ndd, j);
+			nd_set_label(nd_mapping->labels, label, j);
+		}
+	}
+
+	return 0;
+}
+
 int nd_region_register_namespaces(struct nd_region *nd_region, int *err)
 {
 	struct device **devs = NULL;
-	int i;
+	int i, rc = 0, type;
 
 	*err = 0;
-	switch (nd_region_to_namespace_type(nd_region)) {
+	nd_bus_lock(&nd_region->dev);
+	rc = init_active_labels(nd_region);
+	if (rc) {
+		nd_bus_unlock(&nd_region->dev);
+		return rc;
+	}
+
+	type = nd_region_to_namespace_type(nd_region);
+	switch (type) {
 	case ND_DEVICE_NAMESPACE_IO:
 		devs = create_namespace_io(nd_region);
 		break;
+	case ND_DEVICE_NAMESPACE_PMEM:
+		devs = create_namespace_pmem(nd_region);
+		break;
 	default:
 		break;
 	}
+	nd_bus_unlock(&nd_region->dev);
 
-	if (!devs)
-		return -ENODEV;
+	if (!devs) {
+		rc = -ENODEV;
+		goto err;
+	}
 
+	nd_region->ns_seed = devs[0];
 	for (i = 0; devs[i]; i++) {
 		struct device *dev = devs[i];
 
@@ -108,4 +1114,14 @@ int nd_region_register_namespaces(struct nd_region *nd_region, int *err)
 	kfree(devs);
 
 	return i;
+
+ err:
+	for (i = 0; i < nd_region->ndr_mappings; i++) {
+		struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+
+		kfree(nd_mapping->labels);
+		nd_mapping->labels = NULL;
+	}
+
+	return rc;
 }
diff --git a/drivers/block/nd/nd-private.h b/drivers/block/nd/nd-private.h
index 67f28011dfa5..814843454417 100644
--- a/drivers/block/nd/nd-private.h
+++ b/drivers/block/nd/nd-private.h
@@ -60,4 +60,15 @@ int nd_bus_register_dimms(struct nd_bus *nd_bus);
 int nd_bus_register_regions(struct nd_bus *nd_bus);
 int nd_bus_init_interleave_sets(struct nd_bus *nd_bus);
 int nd_match_dimm(struct device *dev, void *data);
+struct nd_label_id;
+char *nd_label_gen_id(struct nd_label_id *label_id, u8 *uuid, u32 flags);
+bool nd_is_uuid_unique(struct device *dev, u8 *uuid);
+struct nd_region;
+struct nd_dimm_drvdata;
+struct nd_mapping;
+resource_size_t nd_pmem_available_dpa(struct nd_region *nd_region,
+		struct nd_mapping *nd_mapping, resource_size_t *overlap);
+resource_size_t nd_region_available_dpa(struct nd_region *nd_region);
+resource_size_t nd_dimm_allocated_dpa(struct nd_dimm_drvdata *ndd,
+		struct nd_label_id *label_id);
 #endif /* __ND_PRIVATE_H__ */
diff --git a/drivers/block/nd/nd.h b/drivers/block/nd/nd.h
index 63540ffe845d..d9d221a7006e 100644
--- a/drivers/block/nd/nd.h
+++ b/drivers/block/nd/nd.h
@@ -16,6 +16,7 @@
 #include <linux/libnd.h>
 #include <linux/mutex.h>
 #include <linux/ndctl.h>
+#include <linux/types.h>
 #include "label.h"
 
 struct nd_dimm_drvdata {
@@ -59,12 +60,37 @@ static inline struct nd_namespace_index __iomem *to_next_namespace_index(
 		(unsigned long long) (res ? resource_size(res) : 0), \
 		(unsigned long long) (res ? res->start : 0), ##arg)
 
+/* sparse helpers */
+static inline void nd_set_label(struct nd_namespace_label **labels,
+		struct nd_namespace_label __iomem *label, int idx)
+{
+	labels[idx] = (void __force *) label;
+}
+
+static inline struct nd_namespace_label __iomem *nd_get_label(
+		struct nd_namespace_label **labels, int idx)
+{
+	struct nd_namespace_label __iomem *label = NULL;
+
+	if (labels)
+		label = (struct nd_namespace_label __iomem *) labels[idx];
+
+	return label;
+}
+
+#define for_each_label(l, label, labels) \
+	for (l = 0; (label = nd_get_label(labels, l)); l++)
+
+#define for_each_dpa_resource(ndd, res) \
+	for (res = (ndd)->dpa.child; res; res = res->sibling)
+
 #define for_each_dpa_resource_safe(ndd, res, next) \
 	for (res = (ndd)->dpa.child, next = res ? res->sibling : NULL; \
 			res; res = next, next = next ? next->sibling : NULL)
 
 struct nd_region {
 	struct device dev;
+	struct device *ns_seed;
 	u16 ndr_mappings;
 	u64 ndr_size;
 	u64 ndr_start;
@@ -88,13 +114,19 @@ enum nd_async_mode {
 	ND_ASYNC,
 };
 
+void wait_nd_bus_probe_idle(struct device *dev);
 void nd_device_register(struct device *dev);
 void nd_device_unregister(struct device *dev, enum nd_async_mode mode);
+int nd_uuid_store(struct device *dev, u8 **uuid_out, const char *buf,
+		size_t len);
+struct nd_dimm;
+struct nd_dimm_drvdata *to_ndd(struct nd_mapping *nd_mapping);
 int nd_dimm_init_nsarea(struct nd_dimm_drvdata *ndd);
 int nd_dimm_init_config_data(struct nd_dimm_drvdata *ndd);
 struct nd_region *to_nd_region(struct device *dev);
 int nd_region_to_namespace_type(struct nd_region *nd_region);
 int nd_region_register_namespaces(struct nd_region *nd_region, int *err);
+u64 nd_region_interleave_set_cookie(struct nd_region *nd_region);
 void nd_bus_lock(struct device *dev);
 void nd_bus_unlock(struct device *dev);
 bool is_nd_bus_locked(struct device *dev);
diff --git a/drivers/block/nd/pmem.c b/drivers/block/nd/pmem.c
index fc34677d0f48..bf380393da92 100644
--- a/drivers/block/nd/pmem.c
+++ b/drivers/block/nd/pmem.c
@@ -201,6 +201,23 @@ static int nd_pmem_probe(struct device *dev)
 	struct nd_namespace_io *nsio = to_nd_namespace_io(dev);
 	struct pmem_device *pmem;
 
+	if (resource_size(&nsio->res) < ND_MIN_NAMESPACE_SIZE) {
+		resource_size_t size = resource_size(&nsio->res);
+
+		dev_dbg(dev, "%s: size: %pa, too small must be at least %#x\n",
+				__func__, &size, ND_MIN_NAMESPACE_SIZE);
+		return -ENODEV;
+	}
+
+	if (nd_region_to_namespace_type(nd_region) == ND_DEVICE_NAMESPACE_PMEM) {
+		struct nd_namespace_pmem *nspm = to_nd_namespace_pmem(dev);
+
+		if (!nspm->uuid) {
+			dev_dbg(dev, "%s: uuid not set\n", __func__);
+			return -ENODEV;
+		}
+	}
+
 	pmem = pmem_alloc(dev, &nsio->res, nd_region->id);
 	if (IS_ERR(pmem))
 		return PTR_ERR(pmem);
@@ -220,13 +237,14 @@ static int nd_pmem_remove(struct device *dev)
 
 MODULE_ALIAS("pmem");
 MODULE_ALIAS_ND_DEVICE(ND_DEVICE_NAMESPACE_IO);
+MODULE_ALIAS_ND_DEVICE(ND_DEVICE_NAMESPACE_PMEM);
 static struct nd_device_driver nd_pmem_driver = {
 	.probe = nd_pmem_probe,
 	.remove = nd_pmem_remove,
 	.drv = {
 		.name = "pmem",
 	},
-	.type = ND_DRIVER_NAMESPACE_IO,
+	.type = ND_DRIVER_NAMESPACE_IO | ND_DRIVER_NAMESPACE_PMEM,
 };
 
 static int __init pmem_init(void)
diff --git a/drivers/block/nd/region.c b/drivers/block/nd/region.c
index 7e58b2a700c2..31bb33962e14 100644
--- a/drivers/block/nd/region.c
+++ b/drivers/block/nd/region.c
@@ -61,8 +61,11 @@ static int child_unregister(struct device *dev, void *data)
 
 static int nd_region_remove(struct device *dev)
 {
+	struct nd_region *nd_region = to_nd_region(dev);
+
 	/* flush attribute readers and disable */
 	nd_bus_lock(dev);
+	nd_region->ns_seed = NULL;
 	dev_set_drvdata(dev, NULL);
 	nd_bus_unlock(dev);
 
diff --git a/drivers/block/nd/region_devs.c b/drivers/block/nd/region_devs.c
index 221e6342b6ca..6b43a5c901cd 100644
--- a/drivers/block/nd/region_devs.c
+++ b/drivers/block/nd/region_devs.c
@@ -15,6 +15,7 @@
 #include <linux/slab.h>
 #include <linux/sort.h>
 #include <linux/io.h>
+#include <linux/nd.h>
 #include "nd-private.h"
 #include "nd.h"
 
@@ -99,6 +100,58 @@ int nd_region_to_namespace_type(struct nd_region *nd_region)
 
 	return 0;
 }
+EXPORT_SYMBOL(nd_region_to_namespace_type);
+
+static int is_uuid_busy(struct device *dev, void *data)
+{
+	struct nd_region *nd_region = to_nd_region(dev->parent);
+	u8 *uuid = data;
+
+	switch (nd_region_to_namespace_type(nd_region)) {
+	case ND_DEVICE_NAMESPACE_PMEM: {
+		struct nd_namespace_pmem *nspm = to_nd_namespace_pmem(dev);
+
+		if (!nspm->uuid)
+			break;
+		if (memcmp(uuid, nspm->uuid, NSLABEL_UUID_LEN) == 0)
+			return -EBUSY;
+		break;
+	}
+	case ND_DEVICE_NAMESPACE_BLK: {
+		/* TODO: blk namespace support */
+		break;
+	}
+	default:
+		break;
+	}
+
+	return 0;
+}
+
+static int is_namespace_uuid_busy(struct device *dev, void *data)
+{
+	if (is_nd_pmem(dev) || is_nd_blk(dev))
+		return device_for_each_child(dev, data, is_uuid_busy);
+	return 0;
+}
+
+/**
+ * nd_is_uuid_unique - verify that no other namespace has @uuid
+ * @dev: any device on a nd_bus
+ * @uuid: uuid to check
+ */
+bool nd_is_uuid_unique(struct device *dev, u8 *uuid)
+{
+	struct nd_bus *nd_bus = walk_to_nd_bus(dev);
+
+	if (!nd_bus)
+		return false;
+	WARN_ON_ONCE(!is_nd_bus_locked(&nd_bus->dev));
+	if (device_for_each_child(&nd_bus->dev, uuid,
+				is_namespace_uuid_busy) != 0)
+		return false;
+	return true;
+}
 
 static ssize_t size_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
@@ -151,6 +204,60 @@ static ssize_t set_cookie_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(set_cookie);
 
+resource_size_t nd_region_available_dpa(struct nd_region *nd_region)
+{
+	resource_size_t blk_max_overlap = 0, available, overlap;
+	int i;
+
+	WARN_ON(!is_nd_bus_locked(&nd_region->dev));
+
+ retry:
+	available = 0;
+	overlap = blk_max_overlap;
+	for (i = 0; i < nd_region->ndr_mappings; i++) {
+		struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+		struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+
+		/* if a dimm is disabled the available capacity is zero */
+		if (!ndd)
+			return 0;
+
+		if (is_nd_pmem(&nd_region->dev)) {
+			available += nd_pmem_available_dpa(nd_region,
+					nd_mapping, &overlap);
+			if (overlap > blk_max_overlap) {
+				blk_max_overlap = overlap;
+				goto retry;
+			}
+		} else if (is_nd_blk(&nd_region->dev)) {
+			/* TODO: BLK Namespace support */
+		}
+	}
+
+	return available;
+}
+
+static ssize_t available_size_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_region *nd_region = to_nd_region(dev);
+	unsigned long long available = 0;
+
+	/*
+	 * Flush in-flight updates and grab a snapshot of the available
+	 * size.  Of course, this value is potentially invalidated the
+	 * moment nd_bus_lock() is dropped, but that's userspace's
+	 * problem to not race itself.
+	 */
+	nd_bus_lock(dev);
+	wait_nd_bus_probe_idle(dev);
+	available = nd_region_available_dpa(nd_region);
+	nd_bus_unlock(dev);
+
+	return sprintf(buf, "%llu\n", available);
+}
+static DEVICE_ATTR_RO(available_size);
+
 static ssize_t init_namespaces_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
@@ -168,11 +275,29 @@ static ssize_t init_namespaces_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(init_namespaces);
 
+static ssize_t namespace_seed_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_region *nd_region = to_nd_region(dev);
+	ssize_t rc;
+
+	nd_bus_lock(dev);
+	if (nd_region->ns_seed)
+		rc = sprintf(buf, "%s\n", dev_name(nd_region->ns_seed));
+	else
+		rc = sprintf(buf, "\n");
+	nd_bus_unlock(dev);
+	return rc;
+}
+static DEVICE_ATTR_RO(namespace_seed);
+
 static struct attribute *nd_region_attributes[] = {
 	&dev_attr_size.attr,
 	&dev_attr_nstype.attr,
 	&dev_attr_mappings.attr,
 	&dev_attr_set_cookie.attr,
+	&dev_attr_available_size.attr,
+	&dev_attr_namespace_seed.attr,
 	&dev_attr_init_namespaces.attr,
 	NULL,
 };
@@ -182,12 +307,17 @@ static umode_t nd_region_visible(struct kobject *kobj, struct attribute *a, int
 	struct device *dev = container_of(kobj, typeof(*dev), kobj);
 	struct nd_region *nd_region = to_nd_region(dev);
 	struct nd_interleave_set *nd_set = nd_region->nd_set;
+	int type = nd_region_to_namespace_type(nd_region);
 
-	if (a != &dev_attr_set_cookie.attr)
+	if (a != &dev_attr_set_cookie.attr && a != &dev_attr_available_size.attr)
 		return a->mode;
 
-	if (is_nd_pmem(dev) && nd_set)
-			return a->mode;
+	if ((type == ND_DEVICE_NAMESPACE_PMEM
+				|| type == ND_DEVICE_NAMESPACE_BLK)
+			&& a == &dev_attr_available_size.attr)
+		return a->mode;
+	else if (is_nd_pmem(dev) && nd_set)
+		return a->mode;
 
 	return 0;
 }
@@ -198,6 +328,15 @@ struct attribute_group nd_region_attribute_group = {
 };
 EXPORT_SYMBOL_GPL(nd_region_attribute_group);
 
+u64 nd_region_interleave_set_cookie(struct nd_region *nd_region)
+{
+	struct nd_interleave_set *nd_set = nd_region->nd_set;
+
+	if (nd_set)
+		return nd_set->cookie;
+	return 0;
+}
+
 /*
  * Upon successful probe/remove, take/release a reference on the
  * associated interleave set (if present)
diff --git a/include/linux/libnd.h b/include/linux/libnd.h
index 52f669faacfd..3190a561ea59 100644
--- a/include/linux/libnd.h
+++ b/include/linux/libnd.h
@@ -40,8 +40,10 @@ typedef int (*ndctl_fn)(struct nd_bus_descriptor *nd_desc,
 		struct nd_dimm *nd_dimm, unsigned int cmd, void *buf,
 		unsigned int buf_len);
 
+struct nd_namespace_label;
 struct nd_mapping {
 	struct nd_dimm *nd_dimm;
+	struct nd_namespace_label **labels;
 	u64 start;
 	u64 size;
 };
diff --git a/include/linux/nd.h b/include/linux/nd.h
index da70e9962197..255c38a83083 100644
--- a/include/linux/nd.h
+++ b/include/linux/nd.h
@@ -28,16 +28,40 @@ static inline struct nd_device_driver *to_nd_device_driver(
 	return container_of(drv, struct nd_device_driver, drv);
 };
 
+/**
+ * struct nd_namespace_io - infrastructure for loading an nd_pmem instance
+ * @dev: namespace device created by the nd region driver
+ * @res: struct resource conversion of a NFIT SPA table
+ */
 struct nd_namespace_io {
 	struct device dev;
 	struct resource res;
 };
 
+/**
+ * struct nd_namespace_pmem - namespace device for dimm-backed interleaved memory
+ * @nsio: device and system physical address range to drive
+ * @alt_name: namespace name supplied in the dimm label
+ * @uuid: namespace uuid supplied in the dimm label
+ */
+struct nd_namespace_pmem {
+	struct nd_namespace_io nsio;
+	char *alt_name;
+	u8 *uuid;
+};
+
 static inline struct nd_namespace_io *to_nd_namespace_io(struct device *dev)
 {
 	return container_of(dev, struct nd_namespace_io, dev);
 }
 
+static inline struct nd_namespace_pmem *to_nd_namespace_pmem(struct device *dev)
+{
+	struct nd_namespace_io *nsio = to_nd_namespace_io(dev);
+
+	return container_of(nsio, struct nd_namespace_pmem, nsio);
+}
+
 #define MODULE_ALIAS_ND_DEVICE(type) \
 	MODULE_ALIAS("nd:t" __stringify(type) "*")
 #define ND_DEVICE_MODALIAS_FMT "nd:t%d"
diff --git a/include/uapi/linux/ndctl.h b/include/uapi/linux/ndctl.h
index 624a19d9e6e4..0b4dcabb248a 100644
--- a/include/uapi/linux/ndctl.h
+++ b/include/uapi/linux/ndctl.h
@@ -190,4 +190,8 @@ enum nd_driver_flags {
 	ND_DRIVER_NAMESPACE_PMEM  = 1 << ND_DEVICE_NAMESPACE_PMEM,
 	ND_DRIVER_NAMESPACE_BLK   = 1 << ND_DEVICE_NAMESPACE_BLK,
 };
+
+enum {
+	ND_MIN_NAMESPACE_SIZE = 0x00400000,
+};
 #endif /* __NDCTL_H__ */
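
The to_nd_namespace_pmem() helper added above relies on nd_namespace_io being embedded in nd_namespace_pmem, and on 'struct device' being embedded in nd_namespace_io in turn. A minimal user-space sketch of that two-step container_of() walk follows. The structure layouts are simplified stand-ins, and the container_of() macro here is a plain offsetof() version rather than the kernel's type-checked one.

```c
#include <stddef.h>

/* Simplified user-space container_of(); the kernel macro adds type checks */
#define container_of(ptr, type, member) \
	((type *) ((char *) (ptr) - offsetof(type, member)))

/* Stand-ins for the kernel structures, reduced to what the walk needs */
struct device {
	int id;
};

struct resource {
	unsigned long long start;
	unsigned long long end;
};

struct nd_namespace_io {
	struct device dev;
	struct resource res;
};

struct nd_namespace_pmem {
	struct nd_namespace_io nsio;
	char *alt_name;
	unsigned char *uuid;
};

/* Step outward twice: device -> nd_namespace_io -> nd_namespace_pmem */
static struct nd_namespace_pmem *to_pmem(struct device *dev)
{
	struct nd_namespace_io *nsio;

	nsio = container_of(dev, struct nd_namespace_io, dev);
	return container_of(nsio, struct nd_namespace_pmem, nsio);
}
```

The conversion is only valid when the device really is embedded in an nd_namespace_pmem, which is why callers in the series check is_namespace_pmem() first.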


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [PATCH v3 14/21] libnd: blk labels and namespace instantiation
  2015-05-20 20:56 ` Dan Williams
@ 2015-05-20 20:57   ` Dan Williams
  -1 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2015-05-20 20:57 UTC (permalink / raw)
  To: axboe
  Cc: linux-nvdimm, neilb, gregkh, linux-kernel, mingo, linux-acpi,
	jmoyer, hch

A blk label set describes a namespace comprised of one or more
discontiguous dpa ranges on a single dimm.  They may alias with one or
more pmem interleave sets that include the given dimm.
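
As a rough illustration of "one or more discontiguous dpa ranges", the sketch below sums the extents on a dimm that carry a given label id, in the spirit of nd_namespace_blk_size() from this patch. It is a user-space model only: 'struct dpa_extent' and the label id strings are hypothetical stand-ins for the kernel's 'struct resource' entries and the ids produced by nd_label_gen_id().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one dpa 'struct resource' on a dimm */
struct dpa_extent {
	const char *name;	/* label id that owns this extent */
	uint64_t start;		/* first dpa byte, inclusive */
	uint64_t end;		/* last dpa byte, inclusive */
};

/* Total size of all extents owned by @label_id, contiguous or not */
static uint64_t blk_namespace_size(const struct dpa_extent *ext, size_t n,
		const char *label_id)
{
	uint64_t size = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (strcmp(ext[i].name, label_id) == 0)
			size += ext[i].end - ext[i].start + 1;
	return size;
}
```

With two 4K extents for one blk label separated by a pmem extent, the namespace size is 8K even though no single range covers it.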

This is the runtime/volatile configuration infrastructure for sysfs
manipulation of 'alt_name', 'uuid', 'size', and 'sector_size'.  A later
patch will make these settings persistent by writing back the label(s).
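
For 'sector_size', the attribute lists every supported size and brackets the current one; a write is accepted only if it matches an entry in the list. The following user-space sketch mirrors nd_sector_size_show() and nd_sector_size_store() under a few assumptions: it uses snprintf() with an explicit buffer length instead of sprintf(), returns -1 rather than kernel errnos, and the 4096 entry in the usage note is hypothetical (this patch only advertises 512).

```c
#include <stddef.h>
#include <stdio.h>

/* Format the zero-terminated @supported list, bracketing @cur */
static int sector_size_show(unsigned long cur,
		const unsigned long *supported, char *buf, size_t buf_len)
{
	size_t len = 0;
	int i, n;

	for (i = 0; supported[i]; i++) {
		if (cur == supported[i])
			n = snprintf(buf + len, buf_len - len, "[%lu] ",
					supported[i]);
		else
			n = snprintf(buf + len, buf_len - len, "%lu ",
					supported[i]);
		if (n < 0 || (size_t) n >= buf_len - len)
			return -1;	/* output would not fit */
		len += (size_t) n;
	}
	n = snprintf(buf + len, buf_len - len, "\n");
	if (n < 0 || (size_t) n >= buf_len - len)
		return -1;
	return (int) (len + (size_t) n);
}

/* Accept @requested only if it appears in @supported */
static int sector_size_store(unsigned long requested,
		const unsigned long *supported, unsigned long *cur)
{
	int i;

	for (i = 0; supported[i]; i++)
		if (requested == supported[i]) {
			*cur = requested;
			return 0;
		}
	return -1;	/* the kernel helper returns -EINVAL here */
}
```

With a supported list of { 512, 4096, 0 } and a current size of 512, the show helper produces "[512] 4096 " followed by a newline. A store of 520 is rejected and leaves the current size untouched.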

Unlike pmem namespaces, multiple blk namespaces can be created per
region.  Once a blk namespace has been created a new seed device
(unconfigured child of a parent blk region) is instantiated.  As long as
a region has 'available_size' != 0 new child namespaces may be created.
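
The seed lifecycle can be pictured with a toy model. This is a sketch under loud assumptions: 'struct toy_blk_region' and its helpers are hypothetical, and the model collapses "set the size" and "driver probe succeeds" into a single step, whereas the patch creates the next seed from nd_region_notify_driver_action() once the current seed has been probed.

```c
#include <stdint.h>

/* Hypothetical model of a blk region; not the kernel's struct nd_region */
struct toy_blk_region {
	uint64_t available_size;	/* bytes not yet claimed */
	int seed_id;			/* id of the unconfigured seed */
	int next_id;			/* next namespace id to hand out */
};

static void toy_region_init(struct toy_blk_region *r, uint64_t size)
{
	r->available_size = size;
	r->seed_id = 0;
	r->next_id = 1;
}

/*
 * Claim @size bytes for namespace @id.  Returns 0 on success or -1 when
 * the region cannot satisfy the request.  Consuming the seed publishes a
 * fresh, unconfigured seed so that another namespace can be created.
 */
static int toy_namespace_configure(struct toy_blk_region *r, int id,
		uint64_t size)
{
	if (size == 0 || size > r->available_size)
		return -1;
	r->available_size -= size;
	if (id == r->seed_id)
		r->seed_id = r->next_id++;
	return 0;
}
```

Each successful configuration of the seed leaves a new seed behind, and requests start failing only once 'available_size' can no longer cover them.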

Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/block/nd/core.c           |   40 +++
 drivers/block/nd/dimm_devs.c      |   35 +++
 drivers/block/nd/namespace_devs.c |  504 ++++++++++++++++++++++++++++++++++---
 drivers/block/nd/nd-private.h     |    8 +
 drivers/block/nd/nd.h             |    5 
 drivers/block/nd/region_devs.c    |   15 +
 include/linux/libnd.h             |    3 
 include/linux/nd.h                |   25 ++
 8 files changed, 590 insertions(+), 45 deletions(-)

diff --git a/drivers/block/nd/core.c b/drivers/block/nd/core.c
index 0bf69abb47fc..b45863343a48 100644
--- a/drivers/block/nd/core.c
+++ b/drivers/block/nd/core.c
@@ -171,6 +171,46 @@ int nd_uuid_store(struct device *dev, u8 **uuid_out, const char *buf,
 	return 0;
 }
 
+ssize_t nd_sector_size_show(unsigned long current_lbasize,
+		const unsigned long *supported, char *buf)
+{
+	ssize_t len = 0;
+	int i;
+
+	for (i = 0; supported[i]; i++)
+		if (current_lbasize == supported[i])
+			len += sprintf(buf + len, "[%lu] ", supported[i]);
+		else
+			len += sprintf(buf + len, "%lu ", supported[i]);
+	len += sprintf(buf + len, "\n");
+	return len;
+}
+
+ssize_t nd_sector_size_store(struct device *dev, const char *buf,
+		unsigned long *current_lbasize, const unsigned long *supported)
+{
+	unsigned long lbasize;
+	int rc, i;
+
+	if (dev->driver)
+		return -EBUSY;
+
+	rc = kstrtoul(buf, 0, &lbasize);
+	if (rc)
+		return rc;
+
+	for (i = 0; supported[i]; i++)
+		if (lbasize == supported[i])
+			break;
+
+	if (supported[i]) {
+		*current_lbasize = lbasize;
+		return 0;
+	} else {
+		return -EINVAL;
+	}
+}
+
 static ssize_t commands_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
diff --git a/drivers/block/nd/dimm_devs.c b/drivers/block/nd/dimm_devs.c
index b242d3ae6d12..4aa5654354ac 100644
--- a/drivers/block/nd/dimm_devs.c
+++ b/drivers/block/nd/dimm_devs.c
@@ -256,6 +256,41 @@ struct nd_dimm *nd_dimm_create(struct nd_bus *nd_bus, void *provider_data,
 EXPORT_SYMBOL_GPL(nd_dimm_create);
 
 /**
+ * nd_blk_available_dpa - account the unused dpa of BLK region
+ * @nd_mapping: container of dpa-resource-root + labels
+ *
+ * Unlike PMEM, BLK namespaces can occupy discontiguous DPA ranges.
+ */
+resource_size_t nd_blk_available_dpa(struct nd_mapping *nd_mapping)
+{
+	struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+	resource_size_t map_end, busy = 0, available;
+	struct resource *res;
+
+	if (!ndd)
+		return 0;
+
+	map_end = nd_mapping->start + nd_mapping->size - 1;
+	for_each_dpa_resource(ndd, res)
+		if (res->start >= nd_mapping->start && res->start < map_end) {
+			resource_size_t end = min(map_end, res->end);
+
+			busy += end - res->start + 1;
+		} else if (res->end >= nd_mapping->start && res->end <= map_end) {
+			busy += res->end - nd_mapping->start + 1;
+		} else if (nd_mapping->start > res->start
+				&& nd_mapping->start < res->end) {
+			/* total eclipse of the BLK region mapping */
+			busy += nd_mapping->size;
+		}
+
+	available = map_end - nd_mapping->start + 1;
+	if (busy < available)
+		return available - busy;
+	return 0;
+}
+
+/**
  * nd_pmem_available_dpa - for the given dimm+region account unallocated dpa
  * @nd_mapping: container of dpa-resource-root + labels
  * @nd_region: constrain available space check to this reference region
diff --git a/drivers/block/nd/namespace_devs.c b/drivers/block/nd/namespace_devs.c
index d0417575b18c..d06b8abf6744 100644
--- a/drivers/block/nd/namespace_devs.c
+++ b/drivers/block/nd/namespace_devs.c
@@ -37,7 +37,15 @@ static void namespace_pmem_release(struct device *dev)
 
 static void namespace_blk_release(struct device *dev)
 {
-	/* TODO: blk namespace support */
+	struct nd_namespace_blk *nsblk = to_nd_namespace_blk(dev);
+	struct nd_region *nd_region = to_nd_region(dev->parent);
+
+	if (nsblk->id >= 0)
+		ida_simple_remove(&nd_region->ns_ida, nsblk->id);
+	kfree(nsblk->alt_name);
+	kfree(nsblk->uuid);
+	kfree(nsblk->res);
+	kfree(nsblk);
 }
 
 static struct device_type namespace_io_device_type = {
@@ -90,8 +98,9 @@ static ssize_t __alt_name_store(struct device *dev, const char *buf,
 
 		ns_altname = &nspm->alt_name;
 	} else if (is_namespace_blk(dev)) {
-		/* TODO: blk namespace support */
-		return -ENXIO;
+		struct nd_namespace_blk *nsblk = to_nd_namespace_blk(dev);
+
+		ns_altname = &nsblk->alt_name;
 	} else
 		return -ENXIO;
 
@@ -124,6 +133,24 @@ out:
 	return rc;
 }
 
+static resource_size_t nd_namespace_blk_size(struct nd_namespace_blk *nsblk)
+{
+	struct nd_region *nd_region = to_nd_region(nsblk->dev.parent);
+	struct nd_mapping *nd_mapping = &nd_region->mapping[0];
+	struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+	struct nd_label_id label_id;
+	resource_size_t size = 0;
+	struct resource *res;
+
+	if (!nsblk->uuid)
+		return 0;
+	nd_label_gen_id(&label_id, nsblk->uuid, NSLABEL_FLAG_LOCAL);
+	for_each_dpa_resource(ndd, res)
+		if (strcmp(res->name, label_id.id) == 0)
+			size += resource_size(res);
+	return size;
+}
+
 static ssize_t alt_name_store(struct device *dev,
 		struct device_attribute *attr, const char *buf, size_t len)
 {
@@ -150,8 +177,9 @@ static ssize_t alt_name_show(struct device *dev,
 
 		ns_altname = nspm->alt_name;
 	} else if (is_namespace_blk(dev)) {
-		/* TODO: blk namespace support */
-		return -ENXIO;
+		struct nd_namespace_blk *nsblk = to_nd_namespace_blk(dev);
+
+		ns_altname = nsblk->alt_name;
 	} else
 		return -ENXIO;
 
@@ -197,6 +225,8 @@ static int scan_free(struct nd_region *nd_region,
 			new_start = res->start;
 
 		rc = adjust_resource(res, new_start, resource_size(res) - n);
+		if (rc == 0)
+			res->flags |= DPA_RESOURCE_ADJUSTED;
 		nd_dbg_dpa(nd_region, ndd, res, "shrink %d\n", rc);
 		break;
 	}
@@ -257,14 +287,15 @@ static resource_size_t init_dpa_allocation(struct nd_label_id *label_id,
 	return rc ? n : 0;
 }
 
-static bool space_valid(bool is_pmem, struct nd_label_id *label_id,
-		struct resource *res)
+static bool space_valid(bool is_pmem, bool is_reserve,
+		struct nd_label_id *label_id, struct resource *res)
 {
 	/*
 	 * For BLK-space any space is valid, for PMEM-space, it must be
-	 * contiguous with an existing allocation.
+	 * contiguous with an existing allocation unless we are
+	 * reserving pmem.
 	 */
-	if (!is_pmem)
+	if (is_reserve || !is_pmem)
 		return true;
 	if (!res || strcmp(res->name, label_id->id) == 0)
 		return true;
@@ -280,6 +311,7 @@ static resource_size_t scan_allocate(struct nd_region *nd_region,
 		resource_size_t n)
 {
 	resource_size_t mapping_end = nd_mapping->start + nd_mapping->size - 1;
+	bool is_reserve = strcmp(label_id->id, "pmem-reserve") == 0;
 	bool is_pmem = strncmp(label_id->id, "pmem", 4) == 0;
 	struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
 	const resource_size_t to_allocate = n;
@@ -305,7 +337,7 @@ static resource_size_t scan_allocate(struct nd_region *nd_region,
 		if (!first++ && res->start > nd_mapping->start) {
 			free_start = nd_mapping->start;
 			available = res->start - free_start;
-			if (space_valid(is_pmem, label_id, NULL))
+			if (space_valid(is_pmem, is_reserve, label_id, NULL))
 				loc = ALLOC_BEFORE;
 		}
 
@@ -313,7 +345,7 @@ static resource_size_t scan_allocate(struct nd_region *nd_region,
 		if (!loc && next) {
 			free_start = res->start + resource_size(res);
 			free_end = min(mapping_end, next->start - 1);
-			if (space_valid(is_pmem, label_id, res)
+			if (space_valid(is_pmem, is_reserve, label_id, res)
 					&& free_start < free_end) {
 				available = free_end + 1 - free_start;
 				loc = ALLOC_MID;
@@ -324,7 +356,7 @@ static resource_size_t scan_allocate(struct nd_region *nd_region,
 		if (!loc && !next) {
 			free_start = res->start + resource_size(res);
 			free_end = mapping_end;
-			if (space_valid(is_pmem, label_id, res)
+			if (space_valid(is_pmem, is_reserve, label_id, res)
 					&& free_start < free_end) {
 				available = free_end + 1 - free_start;
 				loc = ALLOC_AFTER;
@@ -338,7 +370,7 @@ static resource_size_t scan_allocate(struct nd_region *nd_region,
 		case ALLOC_BEFORE:
 			if (strcmp(res->name, label_id->id) == 0) {
 				/* adjust current resource up */
-				if (is_pmem)
+				if (is_pmem && !is_reserve)
 					return n;
 				rc = adjust_resource(res, res->start - allocate,
 						resource_size(res) + allocate);
@@ -349,7 +381,7 @@ static resource_size_t scan_allocate(struct nd_region *nd_region,
 		case ALLOC_MID:
 			if (strcmp(next->name, label_id->id) == 0) {
 				/* adjust next resource up */
-				if (is_pmem)
+				if (is_pmem && !is_reserve)
 					return n;
 				rc = adjust_resource(next, next->start
 						- allocate, resource_size(next)
@@ -375,7 +407,7 @@ static resource_size_t scan_allocate(struct nd_region *nd_region,
 			/* BLK allocate bottom up */
 			if (!is_pmem)
 				free_start += available - allocate;
-			else if (free_start != nd_mapping->start)
+			else if (!is_reserve && free_start != nd_mapping->start)
 				return n;
 
 			new_res = nd_dimm_allocate_dpa(ndd, label_id,
@@ -386,6 +418,8 @@ static resource_size_t scan_allocate(struct nd_region *nd_region,
 			/* adjust current resource down */
 			rc = adjust_resource(res, res->start, resource_size(res)
 					+ allocate);
+			if (rc == 0)
+				res->flags |= DPA_RESOURCE_ADJUSTED;
 		}
 
 		if (!new_res)
@@ -411,11 +445,106 @@ static resource_size_t scan_allocate(struct nd_region *nd_region,
 			return 0;
 	}
 
-	if (is_pmem && n == to_allocate)
+	/*
+	 * If we allocated nothing in the BLK case it may be because we are in
+	 * an initial "pmem-reserve pass".  Only do an initial BLK allocation
+	 * when none of the DPA space is reserved.
+	 */
+	if ((is_pmem || !ndd->dpa.child) && n == to_allocate)
 		return init_dpa_allocation(label_id, nd_region, nd_mapping, n);
 	return n;
 }
 
+static int merge_dpa(struct nd_region *nd_region,
+		struct nd_mapping *nd_mapping, struct nd_label_id *label_id)
+{
+	struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+	struct resource *res;
+
+	if (strncmp("pmem", label_id->id, 4) == 0)
+		return 0;
+ retry:
+	for_each_dpa_resource(ndd, res) {
+		int rc;
+		struct resource *next = res->sibling;
+		resource_size_t end = res->start + resource_size(res);
+
+		if (!next || strcmp(res->name, label_id->id) != 0
+				|| strcmp(next->name, label_id->id) != 0
+				|| end != next->start)
+			continue;
+		end += resource_size(next);
+		nd_dimm_free_dpa(ndd, next);
+		rc = adjust_resource(res, res->start, end - res->start);
+		nd_dbg_dpa(nd_region, ndd, res, "merge %d\n", rc);
+		if (rc)
+			return rc;
+		res->flags |= DPA_RESOURCE_ADJUSTED;
+		goto retry;
+	}
+
+	return 0;
+}
+
+static int __reserve_free_pmem(struct device *dev, void *data)
+{
+	struct nd_dimm *nd_dimm = data;
+	struct nd_region *nd_region;
+	struct nd_label_id label_id;
+	int i;
+
+	if (!is_nd_pmem(dev))
+		return 0;
+
+	nd_region = to_nd_region(dev);
+	if (nd_region->ndr_mappings == 0)
+		return 0;
+
+	memset(&label_id, 0, sizeof(label_id));
+	strcat(label_id.id, "pmem-reserve");
+	for (i = 0; i < nd_region->ndr_mappings; i++) {
+		struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+		resource_size_t n, rem = 0;
+
+		if (nd_mapping->nd_dimm != nd_dimm)
+			continue;
+
+		n = nd_pmem_available_dpa(nd_region, nd_mapping, &rem);
+		if (n == 0)
+			return 0;
+		rem = scan_allocate(nd_region, nd_mapping, &label_id, n);
+		dev_WARN_ONCE(&nd_region->dev, rem,
+				"pmem reserve underrun: %#llx of %#llx bytes\n",
+				(unsigned long long) n - rem,
+				(unsigned long long) n);
+		return rem ? -ENXIO : 0;
+	}
+
+	return 0;
+}
+
+static void release_free_pmem(struct nd_bus *nd_bus, struct nd_mapping *nd_mapping)
+{
+	struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+	struct resource *res, *_res;
+
+	for_each_dpa_resource_safe(ndd, res, _res)
+		if (strcmp(res->name, "pmem-reserve") == 0)
+			nd_dimm_free_dpa(ndd, res);
+}
+
+static int reserve_free_pmem(struct nd_bus *nd_bus,
+		struct nd_mapping *nd_mapping)
+{
+	struct nd_dimm *nd_dimm = nd_mapping->nd_dimm;
+	int rc;
+
+	rc = device_for_each_child(&nd_bus->dev, nd_dimm, __reserve_free_pmem);
+	if (rc)
+		release_free_pmem(nd_bus, nd_mapping);
+	return rc;
+}
+
 /**
  * grow_dpa_allocation - for each dimm allocate n bytes for @label_id
  * @nd_region: the set of dimms to allocate @n more bytes from
@@ -432,13 +561,44 @@ static resource_size_t scan_allocate(struct nd_region *nd_region,
 static int grow_dpa_allocation(struct nd_region *nd_region,
 		struct nd_label_id *label_id, resource_size_t n)
 {
+	struct nd_bus *nd_bus = walk_to_nd_bus(&nd_region->dev);
+	bool is_pmem = strncmp(label_id->id, "pmem", 4) == 0;
 	int i;
 
 	for (i = 0; i < nd_region->ndr_mappings; i++) {
 		struct nd_mapping *nd_mapping = &nd_region->mapping[i];
-		int rc;
+		resource_size_t rem = n;
+		int rc, j;
 
-		rc = scan_allocate(nd_region, nd_mapping, label_id, n);
+		/*
+		 * In the BLK case try once with all unallocated PMEM
+		 * reserved, and once without
+		 */
+		for (j = is_pmem; j < 2; j++) {
+			bool blk_only = j == 0;
+
+			if (blk_only) {
+				rc = reserve_free_pmem(nd_bus, nd_mapping);
+				if (rc)
+					return rc;
+			}
+			rem = scan_allocate(nd_region, nd_mapping, label_id, rem);
+			if (blk_only)
+				release_free_pmem(nd_bus, nd_mapping);
+
+			/* try again and allow encroachments into PMEM */
+			if (rem == 0)
+				break;
+		}
+
+		dev_WARN_ONCE(&nd_region->dev, rem,
+				"allocation underrun: %#llx of %#llx bytes\n",
+				(unsigned long long) n - rem,
+				(unsigned long long) n);
+		if (rem)
+			return -ENXIO;
+
+		rc = merge_dpa(nd_region, nd_mapping, label_id);
 		if (rc)
 			return rc;
 	}
@@ -474,8 +634,10 @@ static ssize_t __size_store(struct device *dev, unsigned long long val)
 
 		uuid = nspm->uuid;
 	} else if (is_namespace_blk(dev)) {
-		/* TODO: blk namespace support */
-		return -ENXIO;
+		struct nd_namespace_blk *nsblk = to_nd_namespace_blk(dev);
+
+		uuid = nsblk->uuid;
+		flags = NSLABEL_FLAG_LOCAL;
 	}
 
 	/*
@@ -529,6 +691,14 @@ static ssize_t __size_store(struct device *dev, unsigned long long val)
 
 		nd_namespace_pmem_set_size(nd_region, nspm,
 				val * nd_region->ndr_mappings);
+	} else if (is_namespace_blk(dev)) {
+		/*
+		 * Try to delete the namespace if we deleted all of its
+		 * allocation and this is not the seed device for the
+		 * region.
+		 */
+		if (val == 0 && nd_region->ns_seed != dev)
+			nd_device_unregister(dev, ND_ASYNC);
 	}
 
 	return rc;
@@ -555,8 +725,9 @@ static ssize_t size_store(struct device *dev,
 
 		uuid = &nspm->uuid;
 	} else if (is_namespace_blk(dev)) {
-		/* TODO: blk namespace support */
-		rc = -ENXIO;
+		struct nd_namespace_blk *nsblk = to_nd_namespace_blk(dev);
+
+		uuid = &nsblk->uuid;
 	}
 
 	if (rc == 0 && val == 0 && uuid) {
@@ -577,21 +748,23 @@ static ssize_t size_store(struct device *dev,
 static ssize_t size_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
+	unsigned long long size = 0;
+
+	nd_bus_lock(dev);
 	if (is_namespace_pmem(dev)) {
 		struct nd_namespace_pmem *nspm = to_nd_namespace_pmem(dev);
 
-		return sprintf(buf, "%llu\n", (unsigned long long)
-				resource_size(&nspm->nsio.res));
+		size = resource_size(&nspm->nsio.res);
 	} else if (is_namespace_blk(dev)) {
-		/* TODO: blk namespace support */
-		return -ENXIO;
+		size = nd_namespace_blk_size(to_nd_namespace_blk(dev));
 	} else if (is_namespace_io(dev)) {
 		struct nd_namespace_io *nsio = to_nd_namespace_io(dev);
 
-		return sprintf(buf, "%llu\n", (unsigned long long)
-				resource_size(&nsio->res));
-	} else
-		return -ENXIO;
+		size = resource_size(&nsio->res);
+	}
+	nd_bus_unlock(dev);
+
+	return sprintf(buf, "%llu\n", size);
 }
 static DEVICE_ATTR(size, S_IRUGO, size_show, size_store);
 
@@ -605,8 +778,9 @@ static ssize_t uuid_show(struct device *dev,
 
 		uuid = nspm->uuid;
 	} else if (is_namespace_blk(dev)) {
-		/* TODO: blk namespace support */
-		return -ENXIO;
+		struct nd_namespace_blk *nsblk = to_nd_namespace_blk(dev);
+
+		uuid = nsblk->uuid;
 	} else
 		return -ENXIO;
 
@@ -670,8 +844,9 @@ static ssize_t uuid_store(struct device *dev,
 
 		ns_uuid = &nspm->uuid;
 	} else if (is_namespace_blk(dev)) {
-		/* TODO: blk namespace support */
-		return -ENXIO;
+		struct nd_namespace_blk *nsblk = to_nd_namespace_blk(dev);
+
+		ns_uuid = &nsblk->uuid;
 	} else
 		return -ENXIO;
 
@@ -713,12 +888,48 @@ static ssize_t resource_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(resource);
 
+static const unsigned long ns_lbasize_supported[] = { 512, 0 };
+
+static ssize_t sector_size_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_namespace_blk *nsblk = to_nd_namespace_blk(dev);
+
+	if (!is_namespace_blk(dev))
+		return -ENXIO;
+
+	return nd_sector_size_show(nsblk->lbasize, ns_lbasize_supported, buf);
+}
+
+static ssize_t sector_size_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	struct nd_namespace_blk *nsblk = to_nd_namespace_blk(dev);
+	ssize_t rc;
+
+	if (!is_namespace_blk(dev))
+		return -ENXIO;
+
+	device_lock(dev);
+	nd_bus_lock(dev);
+	rc = nd_sector_size_store(dev, buf, &nsblk->lbasize,
+			ns_lbasize_supported);
+	dev_dbg(dev, "%s: result: %zd wrote: %s%s", __func__,
+			rc, buf, buf[len - 1] == '\n' ? "" : "\n");
+	nd_bus_unlock(dev);
+	device_unlock(dev);
+
+	return rc ? rc : len;
+}
+static DEVICE_ATTR_RW(sector_size);
+
 static struct attribute *nd_namespace_attributes[] = {
 	&dev_attr_nstype.attr,
 	&dev_attr_size.attr,
 	&dev_attr_uuid.attr,
 	&dev_attr_resource.attr,
 	&dev_attr_alt_name.attr,
+	&dev_attr_sector_size.attr,
 	NULL,
 };
 
@@ -735,6 +946,10 @@ static umode_t nd_namespace_attr_visible(struct kobject *kobj, struct attribute
 	if (is_namespace_pmem(dev) || is_namespace_blk(dev)) {
 		if (a == &dev_attr_size.attr)
 			return S_IWUSR;
+
+		if (is_namespace_pmem(dev) && a == &dev_attr_sector_size.attr)
+			return 0;
+
 		return a->mode;
 	}
 
@@ -1029,6 +1244,173 @@ static struct device **create_namespace_pmem(struct nd_region *nd_region)
 	return NULL;
 }
 
+struct resource *nsblk_add_resource(struct nd_region *nd_region,
+		struct nd_dimm_drvdata *ndd, struct nd_namespace_blk *nsblk,
+		resource_size_t start)
+{
+	struct nd_label_id label_id;
+	struct resource *res;
+
+	nd_label_gen_id(&label_id, nsblk->uuid, NSLABEL_FLAG_LOCAL);
+	nsblk->res = krealloc(nsblk->res,
+			sizeof(void *) * (nsblk->num_resources + 1),
+			GFP_KERNEL);
+	if (!nsblk->res)
+		return NULL;
+	for_each_dpa_resource(ndd, res)
+		if (strcmp(res->name, label_id.id) == 0 && res->start == start) {
+			nsblk->res[nsblk->num_resources++] = res;
+			return res;
+		}
+	return NULL;
+}
+
+static struct device *nd_namespace_blk_create(struct nd_region *nd_region)
+{
+	struct nd_namespace_blk *nsblk;
+	struct device *dev;
+
+	if (!is_nd_blk(&nd_region->dev))
+		return NULL;
+
+	nsblk = kzalloc(sizeof(*nsblk), GFP_KERNEL);
+	if (!nsblk)
+		return NULL;
+
+	dev = &nsblk->dev;
+	dev->type = &namespace_blk_device_type;
+	nsblk->id = ida_simple_get(&nd_region->ns_ida, 0, 0, GFP_KERNEL);
+	if (nsblk->id < 0) {
+		kfree(nsblk);
+		return NULL;
+	}
+	dev_set_name(dev, "namespace%d.%d", nd_region->id, nsblk->id);
+	dev->parent = &nd_region->dev;
+	dev->groups = nd_namespace_attribute_groups;
+
+	return &nsblk->dev;
+}
+
+void nd_region_create_blk_seed(struct nd_region *nd_region)
+{
+	WARN_ON(!is_nd_bus_locked(&nd_region->dev));
+	nd_region->ns_seed = nd_namespace_blk_create(nd_region);
+	/*
+	 * Seed creation failures are not fatal, provisioning is simply
+	 * disabled until memory becomes available
+	 */
+	if (!nd_region->ns_seed)
+		dev_err(&nd_region->dev, "failed to create blk namespace\n");
+	else
+		nd_device_register(nd_region->ns_seed);
+}
+
+static struct device **create_namespace_blk(struct nd_region *nd_region)
+{
+	struct nd_mapping *nd_mapping = &nd_region->mapping[0];
+	struct nd_namespace_label __iomem *nd_label;
+	struct device *dev, **devs = NULL;
+	u8 label_uuid[NSLABEL_UUID_LEN];
+	struct nd_namespace_blk *nsblk;
+	struct nd_dimm_drvdata *ndd;
+	int i, l, count = 0;
+	struct resource *res;
+
+	if (nd_region->ndr_mappings == 0)
+		return NULL;
+
+	ndd = to_ndd(nd_mapping);
+	for_each_label(l, nd_label, nd_mapping->labels) {
+		u32 flags = readl(&nd_label->flags);
+		char name[NSLABEL_NAME_LEN];
+		struct device **__devs;
+
+		if (flags & NSLABEL_FLAG_LOCAL)
+			/* pass */;
+		else
+			continue;
+
+		memcpy_fromio(label_uuid, nd_label->uuid, NSLABEL_UUID_LEN);
+		for (i = 0; i < count; i++) {
+			nsblk = to_nd_namespace_blk(devs[i]);
+			if (memcmp(nsblk->uuid, label_uuid,
+						NSLABEL_UUID_LEN) == 0) {
+				res = nsblk_add_resource(nd_region, ndd, nsblk,
+						readq(&nd_label->dpa));
+				if (!res)
+					goto err;
+				nd_dbg_dpa(nd_region, ndd, res, "%s assign\n",
+					dev_name(&nsblk->dev));
+				break;
+			}
+		}
+		if (i < count)
+			continue;
+		__devs = kcalloc(count + 2, sizeof(dev), GFP_KERNEL);
+		if (!__devs)
+			goto err;
+		memcpy(__devs, devs, sizeof(dev) * count);
+		kfree(devs);
+		devs = __devs;
+
+		nsblk = kzalloc(sizeof(*nsblk), GFP_KERNEL);
+		if (!nsblk)
+			goto err;
+		dev = &nsblk->dev;
+		dev->type = &namespace_blk_device_type;
+		dev_set_name(dev, "namespace%d.%d", nd_region->id, count);
+		devs[count++] = dev;
+		nsblk->id = -1;
+		nsblk->lbasize = readq(&nd_label->lbasize);
+		nsblk->uuid = kmemdup(label_uuid, NSLABEL_UUID_LEN, GFP_KERNEL);
+		if (!nsblk->uuid)
+			goto err;
+		memcpy_fromio(name, nd_label->name, NSLABEL_NAME_LEN);
+		if (name[0])
+			nsblk->alt_name = kmemdup(name, NSLABEL_NAME_LEN,
+					GFP_KERNEL);
+		res = nsblk_add_resource(nd_region, ndd, nsblk,
+				readq(&nd_label->dpa));
+		if (!res)
+			goto err;
+		nd_dbg_dpa(nd_region, ndd, res, "%s assign\n",
+				dev_name(&nsblk->dev));
+	}
+
+	dev_dbg(&nd_region->dev, "%s: discovered %d blk namespace%s\n",
+			__func__, count, count == 1 ? "" : "s");
+
+	if (count == 0) {
+		/* Publish a zero-sized namespace for userspace to configure. */
+		for (i = 0; i < nd_region->ndr_mappings; i++) {
+			struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+
+			kfree(nd_mapping->labels);
+			nd_mapping->labels = NULL;
+		}
+
+		devs = kcalloc(2, sizeof(dev), GFP_KERNEL);
+		if (!devs)
+			goto err;
+		nsblk = kzalloc(sizeof(*nsblk), GFP_KERNEL);
+		if (!nsblk)
+			goto err;
+		dev = &nsblk->dev;
+		dev->type = &namespace_blk_device_type;
+		devs[count++] = dev;
+	}
+
+	return devs;
+
+err:
+	for (i = 0; i < count; i++) {
+		nsblk = to_nd_namespace_blk(devs[i]);
+		namespace_blk_release(&nsblk->dev);
+	}
+	kfree(devs);
+	return NULL;
+}
+
 static int init_active_labels(struct nd_region *nd_region)
 {
 	int i;
@@ -1092,6 +1474,9 @@ int nd_region_register_namespaces(struct nd_region *nd_region, int *err)
 	case ND_DEVICE_NAMESPACE_PMEM:
 		devs = create_namespace_pmem(nd_region);
 		break;
+	case ND_DEVICE_NAMESPACE_BLK:
+		devs = create_namespace_blk(nd_region);
+		break;
 	default:
 		break;
 	}
@@ -1102,26 +1487,59 @@ int nd_region_register_namespaces(struct nd_region *nd_region, int *err)
 		goto err;
 	}
 
-	nd_region->ns_seed = devs[0];
 	for (i = 0; devs[i]; i++) {
 		struct device *dev = devs[i];
+		int id;
+
+		if (type == ND_DEVICE_NAMESPACE_BLK) {
+			struct nd_namespace_blk *nsblk;
 
-		dev_set_name(dev, "namespace%d.%d", nd_region->id, i);
+			nsblk = to_nd_namespace_blk(dev);
+			id = ida_simple_get(&nd_region->ns_ida, 0, 0,
+					GFP_KERNEL);
+			nsblk->id = id;
+		} else
+			id = i;
+
+		if (id < 0)
+			break;
+		dev_set_name(dev, "namespace%d.%d", nd_region->id, id);
 		dev->parent = &nd_region->dev;
 		dev->groups = nd_namespace_attribute_groups;
 		nd_device_register(dev);
 	}
-	kfree(devs);
+	if (i)
+		nd_region->ns_seed = devs[0];
 
-	return i;
+	if (devs[i]) {
+		int j;
+
+		for (j = i; devs[j]; j++) {
+			struct device *dev = devs[j];
+
+			device_initialize(dev);
+			put_device(dev);
+		}
+		*err = j - i;
+		/*
+		 * All of the namespaces we tried to register failed, so
+		 * fail region activation.
+		 */
+		if (*err == 0)
+			rc = -ENODEV;
+	}
+	kfree(devs);
 
  err:
-	for (i = 0; i < nd_region->ndr_mappings; i++) {
-		struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+	if (rc == -ENODEV) {
+		for (i = 0; i < nd_region->ndr_mappings; i++) {
+			struct nd_mapping *nd_mapping = &nd_region->mapping[i];
 
-		kfree(nd_mapping->labels);
-		nd_mapping->labels = NULL;
+			kfree(nd_mapping->labels);
+			nd_mapping->labels = NULL;
+		}
+		return rc;
 	}
 
-	return rc;
+	return i;
 }
diff --git a/drivers/block/nd/nd-private.h b/drivers/block/nd/nd-private.h
index 814843454417..fe852175a3b8 100644
--- a/drivers/block/nd/nd-private.h
+++ b/drivers/block/nd/nd-private.h
@@ -16,6 +16,7 @@
 #include <linux/libnd.h>
 #include <linux/sizes.h>
 #include <linux/mutex.h>
+#include <linux/nd.h>
 
 extern struct list_head nd_bus_list;
 extern struct mutex nd_bus_list_mutex;
@@ -52,6 +53,8 @@ void nd_dimm_exit(void);
 int nd_region_exit(void);
 void nd_region_probe_start(struct nd_bus *nd_bus, struct device *dev);
 void nd_region_probe_end(struct nd_bus *nd_bus, struct device *dev, int rc);
+struct nd_region;
+void nd_region_create_blk_seed(struct nd_region *nd_region);
 void nd_region_notify_remove(struct nd_bus *nd_bus, struct device *dev, int rc);
 int nd_bus_create_ndctl(struct nd_bus *nd_bus);
 void nd_bus_destroy_ndctl(struct nd_bus *nd_bus);
@@ -68,7 +71,12 @@ struct nd_dimm_drvdata;
 struct nd_mapping;
 resource_size_t nd_pmem_available_dpa(struct nd_region *nd_region,
 		struct nd_mapping *nd_mapping, resource_size_t *overlap);
+resource_size_t nd_blk_available_dpa(struct nd_mapping *nd_mapping);
 resource_size_t nd_region_available_dpa(struct nd_region *nd_region);
 resource_size_t nd_dimm_allocated_dpa(struct nd_dimm_drvdata *ndd,
 		struct nd_label_id *label_id);
+struct nd_mapping;
+struct resource *nsblk_add_resource(struct nd_region *nd_region,
+		struct nd_dimm_drvdata *ndd, struct nd_namespace_blk *nsblk,
+		resource_size_t start);
 #endif /* __ND_PRIVATE_H__ */
diff --git a/drivers/block/nd/nd.h b/drivers/block/nd/nd.h
index d9d221a7006e..3876d0c7db87 100644
--- a/drivers/block/nd/nd.h
+++ b/drivers/block/nd/nd.h
@@ -90,6 +90,7 @@ static inline struct nd_namespace_label __iomem *nd_get_label(
 
 struct nd_region {
 	struct device dev;
+	struct ida ns_ida;
 	struct device *ns_seed;
 	u16 ndr_mappings;
 	u64 ndr_size;
@@ -119,6 +120,10 @@ void nd_device_register(struct device *dev);
 void nd_device_unregister(struct device *dev, enum nd_async_mode mode);
 int nd_uuid_store(struct device *dev, u8 **uuid_out, const char *buf,
 		size_t len);
+ssize_t nd_sector_size_show(unsigned long current_lbasize,
+		const unsigned long *supported, char *buf);
+ssize_t nd_sector_size_store(struct device *dev, const char *buf,
+		unsigned long *current_lbasize, const unsigned long *supported);
 struct nd_dimm;
 struct nd_dimm_drvdata *to_ndd(struct nd_mapping *nd_mapping);
 int nd_dimm_init_nsarea(struct nd_dimm_drvdata *ndd);
diff --git a/drivers/block/nd/region_devs.c b/drivers/block/nd/region_devs.c
index 6b43a5c901cd..1ae6bb44c371 100644
--- a/drivers/block/nd/region_devs.c
+++ b/drivers/block/nd/region_devs.c
@@ -118,7 +118,12 @@ static int is_uuid_busy(struct device *dev, void *data)
 		break;
 	}
 	case ND_DEVICE_NAMESPACE_BLK: {
-		/* TODO: blk namespace support */
+		struct nd_namespace_blk *nsblk = to_nd_namespace_blk(dev);
+
+		if (!nsblk->uuid)
+			break;
+		if (memcmp(uuid, nsblk->uuid, NSLABEL_UUID_LEN) == 0)
+			return -EBUSY;
 		break;
 	}
 	default:
@@ -230,7 +235,7 @@ resource_size_t nd_region_available_dpa(struct nd_region *nd_region)
 				goto retry;
 			}
 		} else if (is_nd_blk(&nd_region->dev)) {
-			/* TODO: BLK Namespace support */
+			available += nd_blk_available_dpa(nd_mapping);
 		}
 	}
 
@@ -360,6 +365,11 @@ static void nd_region_notify_driver_action(struct nd_bus *nd_bus,
 			else
 				atomic_dec(&nd_dimm->busy);
 		}
+	} else if (dev->parent && is_nd_blk(dev->parent) && probe && rc == 0) {
+		struct nd_region *nd_region = to_nd_region(dev->parent);
+
+		if (nd_region->ns_seed == dev)
+			nd_region_create_blk_seed(nd_region);
 	}
 }
 
@@ -546,6 +556,7 @@ static noinline struct nd_region *nd_region_create(struct nd_bus *nd_bus,
 	nd_region->ndr_mappings = ndr_desc->num_mappings;
 	nd_region->provider_data = ndr_desc->provider_data;
 	nd_region->nd_set = ndr_desc->nd_set;
+	ida_init(&nd_region->ns_ida);
 	dev = &nd_region->dev;
 	dev_set_name(dev, "region%d", nd_region->id);
 	dev->parent = &nd_bus->dev;
diff --git a/include/linux/libnd.h b/include/linux/libnd.h
index 3190a561ea59..43f58330d14c 100644
--- a/include/linux/libnd.h
+++ b/include/linux/libnd.h
@@ -26,6 +26,9 @@ enum {
 	ND_CMD_MAX_ENVELOPE = 16,
 	ND_CMD_ARS_QUERY_MAX = SZ_4K,
 	ND_MAX_MAPPINGS = 32,
+
+	/* mark newly adjusted resources as requiring a label update */
+	DPA_RESOURCE_ADJUSTED = 1 << 0,
 };
 
 extern struct attribute_group nd_bus_attribute_group;
diff --git a/include/linux/nd.h b/include/linux/nd.h
index 255c38a83083..23276ea91690 100644
--- a/include/linux/nd.h
+++ b/include/linux/nd.h
@@ -50,6 +50,26 @@ struct nd_namespace_pmem {
 	u8 *uuid;
 };
 
+/**
+ * struct nd_namespace_blk - namespace for dimm-bounded persistent memory
+ * @dev: namespace device created by the nd region driver
+ * @alt_name: namespace name supplied in the dimm label
+ * @uuid: namespace uuid supplied in the dimm label
+ * @id: ida allocated id
+ * @lbasize: blk namespaces have a native sector size when btt not present
+ * @num_resources: number of dpa extents to claim
+ * @res: discontiguous dpa extents for given dimm
+ */
+struct nd_namespace_blk {
+	struct device dev;
+	char *alt_name;
+	u8 *uuid;
+	int id;
+	unsigned long lbasize;
+	int num_resources;
+	struct resource **res;
+};
+
 static inline struct nd_namespace_io *to_nd_namespace_io(struct device *dev)
 {
 	return container_of(dev, struct nd_namespace_io, dev);
@@ -62,6 +82,11 @@ static inline struct nd_namespace_pmem *to_nd_namespace_pmem(struct device *dev)
 	return container_of(nsio, struct nd_namespace_pmem, nsio);
 }
 
+static inline struct nd_namespace_blk *to_nd_namespace_blk(struct device *dev)
+{
+	return container_of(dev, struct nd_namespace_blk, dev);
+}
+
 #define MODULE_ALIAS_ND_DEVICE(type) \
 	MODULE_ALIAS("nd:t" __stringify(type) "*")
 #define ND_DEVICE_MODALIAS_FMT "nd:t%d"


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [PATCH v3 14/21] libnd: blk labels and namespace instantiation
@ 2015-05-20 20:57   ` Dan Williams
  0 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2015-05-20 20:57 UTC (permalink / raw)
  To: axboe
  Cc: linux-nvdimm, neilb, gregkh, linux-kernel, mingo, linux-acpi,
	jmoyer, hch

A blk label set describes a namespace composed of one or more
discontiguous dpa ranges on a single dimm.  These ranges may alias with
one or more pmem interleave sets that include the given dimm.

This is the runtime/volatile configuration infrastructure for sysfs
manipulation of 'alt_name', 'uuid', 'size', and 'sector_size'.  A later
patch will make these settings persistent by writing back the label(s).
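
As a rough illustration of the 'sector_size' handling described above,
the following user-space sketch validates a requested size against a
zero-terminated list and renders the list with the current selection in
brackets.  The names show_sizes() and select_size() are hypothetical and
are not part of this patch, and the 4096 entry in the usage note is an
assumption for demonstration only (this patch advertises only 512).

```c
#include <assert.h>
#include <stdio.h>

/*
 * Render a zero-terminated list of sizes into buf, bracketing the entry
 * that matches 'current'.  Returns the number of characters written, or
 * -1 if buf is too small.
 */
static int show_sizes(unsigned long current, const unsigned long *supported,
		char *buf, size_t len)
{
	size_t off = 0;
	int i;

	assert(len > 0);
	buf[0] = '\0';
	for (i = 0; supported[i]; i++) {
		int n;

		if (supported[i] == current)
			n = snprintf(buf + off, len - off, "[%lu] ",
					supported[i]);
		else
			n = snprintf(buf + off, len - off, "%lu ",
					supported[i]);
		if (n < 0 || (size_t) n >= len - off)
			return -1;
		off += (size_t) n;
	}
	return (int) off;
}

/*
 * Accept 'requested' only if it appears in the supported list; on
 * rejection '*current' is left unchanged.  Returns 0 or -1.
 */
static int select_size(unsigned long *current, const unsigned long *supported,
		unsigned long requested)
{
	int i;

	for (i = 0; supported[i]; i++)
		if (supported[i] == requested) {
			*current = requested;
			return 0;
		}
	return -1;
}
```

With a list of { 512, 4096, 0 } and 4096 selected, show_sizes() produces
"512 [4096] ", and a request for an unlisted value such as 520 is
rejected without disturbing the current setting.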

Unlike pmem namespaces, multiple blk namespaces can be created per
region.  Once a blk namespace has been created, a new seed device
(unconfigured child of a parent blk region) is instantiated.  As long as
a region has 'available_size' != 0, new child namespaces may be created.
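
To make the 'available_size' accounting concrete, here is a minimal
user-space sketch that subtracts busy extents from a mapping.  It is an
intersection-based variant written for illustration, not a copy of
nd_blk_available_dpa(); struct extent and blk_available() are
hypothetical names, and the extents are assumed not to overlap each
other.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a busy dpa extent; both bounds inclusive. */
struct extent {
	uint64_t start;
	uint64_t end;
};

/*
 * Return the number of bytes in [map_start, map_start + map_size - 1]
 * that are not covered by any busy extent.  Each extent is clamped to
 * the mapping before it is counted, so partial overlaps and a total
 * eclipse of the mapping are handled by the same arithmetic.
 */
static uint64_t blk_available(uint64_t map_start, uint64_t map_size,
		const struct extent *busy, size_t n)
{
	uint64_t map_end, used = 0;
	size_t i;

	assert(map_size > 0);
	map_end = map_start + map_size - 1;
	for (i = 0; i < n; i++) {
		uint64_t lo = busy[i].start > map_start
			? busy[i].start : map_start;
		uint64_t hi = busy[i].end < map_end
			? busy[i].end : map_end;

		if (lo <= hi)
			used += hi - lo + 1;
	}
	return used < map_size ? map_size - used : 0;
}
```

For a mapping covering [100, 999] with busy extents [0, 99], [150, 199]
and [900, 1999], only 50 + 100 bytes intersect the mapping, so 750 bytes
remain available for new blk namespaces.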

Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/block/nd/core.c           |   40 +++
 drivers/block/nd/dimm_devs.c      |   35 +++
 drivers/block/nd/namespace_devs.c |  504 ++++++++++++++++++++++++++++++++++---
 drivers/block/nd/nd-private.h     |    8 +
 drivers/block/nd/nd.h             |    5 
 drivers/block/nd/region_devs.c    |   15 +
 include/linux/libnd.h             |    3 
 include/linux/nd.h                |   25 ++
 8 files changed, 590 insertions(+), 45 deletions(-)

diff --git a/drivers/block/nd/core.c b/drivers/block/nd/core.c
index 0bf69abb47fc..b45863343a48 100644
--- a/drivers/block/nd/core.c
+++ b/drivers/block/nd/core.c
@@ -171,6 +171,46 @@ int nd_uuid_store(struct device *dev, u8 **uuid_out, const char *buf,
 	return 0;
 }
 
+ssize_t nd_sector_size_show(unsigned long current_lbasize,
+		const unsigned long *supported, char *buf)
+{
+	ssize_t len = 0;
+	int i;
+
+	for (i = 0; supported[i]; i++)
+		if (current_lbasize == supported[i])
+			len += sprintf(buf + len, "[%ld] ", supported[i]);
+		else
+			len += sprintf(buf + len, "%ld ", supported[i]);
+	len += sprintf(buf + len, "\n");
+	return len;
+}
+
+ssize_t nd_sector_size_store(struct device *dev, const char *buf,
+		unsigned long *current_lbasize, const unsigned long *supported)
+{
+	unsigned long lbasize;
+	int rc, i;
+
+	if (dev->driver)
+		return -EBUSY;
+
+	rc = kstrtoul(buf, 0, &lbasize);
+	if (rc)
+		return rc;
+
+	for (i = 0; supported[i]; i++)
+		if (lbasize == supported[i])
+			break;
+
+	if (supported[i]) {
+		*current_lbasize = lbasize;
+		return 0;
+	} else {
+		return -EINVAL;
+	}
+}
+
 static ssize_t commands_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
diff --git a/drivers/block/nd/dimm_devs.c b/drivers/block/nd/dimm_devs.c
index b242d3ae6d12..4aa5654354ac 100644
--- a/drivers/block/nd/dimm_devs.c
+++ b/drivers/block/nd/dimm_devs.c
@@ -256,6 +256,41 @@ struct nd_dimm *nd_dimm_create(struct nd_bus *nd_bus, void *provider_data,
 EXPORT_SYMBOL_GPL(nd_dimm_create);
 
 /**
+ * nd_blk_available_dpa - account the unused dpa of BLK region
+ * @nd_mapping: container of dpa-resource-root + labels
+ *
+ * Unlike PMEM, BLK namespaces can occupy discontiguous DPA ranges.
+ */
+resource_size_t nd_blk_available_dpa(struct nd_mapping *nd_mapping)
+{
+	struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+	resource_size_t map_end, busy = 0, available;
+	struct resource *res;
+
+	if (!ndd)
+		return 0;
+
+	map_end = nd_mapping->start + nd_mapping->size - 1;
+	for_each_dpa_resource(ndd, res)
+		if (res->start >= nd_mapping->start && res->start < map_end) {
+			resource_size_t end = min(map_end, res->end);
+
+			busy += end - res->start + 1;
+		} else if (res->end >= nd_mapping->start && res->end <= map_end) {
+			busy += res->end - nd_mapping->start;
+		} else if (nd_mapping->start > res->start
+				&& nd_mapping->start < res->end) {
+			/* total eclipse of the BLK region mapping */
+			busy += nd_mapping->size;
+		}
+
+	available = map_end - nd_mapping->start + 1;
+	if (busy < available)
+		return available - busy;
+	return 0;
+}
+
+/**
  * nd_pmem_available_dpa - for the given dimm+region account unallocated dpa
  * @nd_mapping: container of dpa-resource-root + labels
  * @nd_region: constrain available space check to this reference region
diff --git a/drivers/block/nd/namespace_devs.c b/drivers/block/nd/namespace_devs.c
index d0417575b18c..d06b8abf6744 100644
--- a/drivers/block/nd/namespace_devs.c
+++ b/drivers/block/nd/namespace_devs.c
@@ -37,7 +37,15 @@ static void namespace_pmem_release(struct device *dev)
 
 static void namespace_blk_release(struct device *dev)
 {
-	/* TODO: blk namespace support */
+	struct nd_namespace_blk *nsblk = to_nd_namespace_blk(dev);
+	struct nd_region *nd_region = to_nd_region(dev->parent);
+
+	if (nsblk->id >= 0)
+		ida_simple_remove(&nd_region->ns_ida, nsblk->id);
+	kfree(nsblk->alt_name);
+	kfree(nsblk->uuid);
+	kfree(nsblk->res);
+	kfree(nsblk);
 }
 
 static struct device_type namespace_io_device_type = {
@@ -90,8 +98,9 @@ static ssize_t __alt_name_store(struct device *dev, const char *buf,
 
 		ns_altname = &nspm->alt_name;
 	} else if (is_namespace_blk(dev)) {
-		/* TODO: blk namespace support */
-		return -ENXIO;
+		struct nd_namespace_blk *nsblk = to_nd_namespace_blk(dev);
+
+		ns_altname = &nsblk->alt_name;
 	} else
 		return -ENXIO;
 
@@ -124,6 +133,24 @@ out:
 	return rc;
 }
 
+static resource_size_t nd_namespace_blk_size(struct nd_namespace_blk *nsblk)
+{
+	struct nd_region *nd_region = to_nd_region(nsblk->dev.parent);
+	struct nd_mapping *nd_mapping = &nd_region->mapping[0];
+	struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+	struct nd_label_id label_id;
+	resource_size_t size = 0;
+	struct resource *res;
+
+	if (!nsblk->uuid)
+		return 0;
+	nd_label_gen_id(&label_id, nsblk->uuid, NSLABEL_FLAG_LOCAL);
+	for_each_dpa_resource(ndd, res)
+		if (strcmp(res->name, label_id.id) == 0)
+			size += resource_size(res);
+	return size;
+}
+
 static ssize_t alt_name_store(struct device *dev,
 		struct device_attribute *attr, const char *buf, size_t len)
 {
@@ -150,8 +177,9 @@ static ssize_t alt_name_show(struct device *dev,
 
 		ns_altname = nspm->alt_name;
 	} else if (is_namespace_blk(dev)) {
-		/* TODO: blk namespace support */
-		return -ENXIO;
+		struct nd_namespace_blk *nsblk = to_nd_namespace_blk(dev);
+
+		ns_altname = nsblk->alt_name;
 	} else
 		return -ENXIO;
 
@@ -197,6 +225,8 @@ static int scan_free(struct nd_region *nd_region,
 			new_start = res->start;
 
 		rc = adjust_resource(res, new_start, resource_size(res) - n);
+		if (rc == 0)
+			res->flags |= DPA_RESOURCE_ADJUSTED;
 		nd_dbg_dpa(nd_region, ndd, res, "shrink %d\n", rc);
 		break;
 	}
@@ -257,14 +287,15 @@ static resource_size_t init_dpa_allocation(struct nd_label_id *label_id,
 	return rc ? n : 0;
 }
 
-static bool space_valid(bool is_pmem, struct nd_label_id *label_id,
-		struct resource *res)
+static bool space_valid(bool is_pmem, bool is_reserve,
+		struct nd_label_id *label_id, struct resource *res)
 {
 	/*
 	 * For BLK-space any space is valid, for PMEM-space, it must be
-	 * contiguous with an existing allocation.
+	 * contiguous with an existing allocation unless we are
+	 * reserving pmem.
 	 */
-	if (!is_pmem)
+	if (is_reserve || !is_pmem)
 		return true;
 	if (!res || strcmp(res->name, label_id->id) == 0)
 		return true;
@@ -280,6 +311,7 @@ static resource_size_t scan_allocate(struct nd_region *nd_region,
 		resource_size_t n)
 {
 	resource_size_t mapping_end = nd_mapping->start + nd_mapping->size - 1;
+	bool is_reserve = strcmp(label_id->id, "pmem-reserve") == 0;
 	bool is_pmem = strncmp(label_id->id, "pmem", 4) == 0;
 	struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
 	const resource_size_t to_allocate = n;
@@ -305,7 +337,7 @@ static resource_size_t scan_allocate(struct nd_region *nd_region,
 		if (!first++ && res->start > nd_mapping->start) {
 			free_start = nd_mapping->start;
 			available = res->start - free_start;
-			if (space_valid(is_pmem, label_id, NULL))
+			if (space_valid(is_pmem, is_reserve, label_id, NULL))
 				loc = ALLOC_BEFORE;
 		}
 
@@ -313,7 +345,7 @@ static resource_size_t scan_allocate(struct nd_region *nd_region,
 		if (!loc && next) {
 			free_start = res->start + resource_size(res);
 			free_end = min(mapping_end, next->start - 1);
-			if (space_valid(is_pmem, label_id, res)
+			if (space_valid(is_pmem, is_reserve, label_id, res)
 					&& free_start < free_end) {
 				available = free_end + 1 - free_start;
 				loc = ALLOC_MID;
@@ -324,7 +356,7 @@ static resource_size_t scan_allocate(struct nd_region *nd_region,
 		if (!loc && !next) {
 			free_start = res->start + resource_size(res);
 			free_end = mapping_end;
-			if (space_valid(is_pmem, label_id, res)
+			if (space_valid(is_pmem, is_reserve, label_id, res)
 					&& free_start < free_end) {
 				available = free_end + 1 - free_start;
 				loc = ALLOC_AFTER;
@@ -338,7 +370,7 @@ static resource_size_t scan_allocate(struct nd_region *nd_region,
 		case ALLOC_BEFORE:
 			if (strcmp(res->name, label_id->id) == 0) {
 				/* adjust current resource up */
-				if (is_pmem)
+				if (is_pmem && !is_reserve)
 					return n;
 				rc = adjust_resource(res, res->start - allocate,
 						resource_size(res) + allocate);
@@ -349,7 +381,7 @@ static resource_size_t scan_allocate(struct nd_region *nd_region,
 		case ALLOC_MID:
 			if (strcmp(next->name, label_id->id) == 0) {
 				/* adjust next resource up */
-				if (is_pmem)
+				if (is_pmem && !is_reserve)
 					return n;
 				rc = adjust_resource(next, next->start
 						- allocate, resource_size(next)
@@ -375,7 +407,7 @@ static resource_size_t scan_allocate(struct nd_region *nd_region,
 			/* BLK allocate bottom up */
 			if (!is_pmem)
 				free_start += available - allocate;
-			else if (free_start != nd_mapping->start)
+			else if (!is_reserve && free_start != nd_mapping->start)
 				return n;
 
 			new_res = nd_dimm_allocate_dpa(ndd, label_id,
@@ -386,6 +418,8 @@ static resource_size_t scan_allocate(struct nd_region *nd_region,
 			/* adjust current resource down */
 			rc = adjust_resource(res, res->start, resource_size(res)
 					+ allocate);
+			if (rc == 0)
+				res->flags |= DPA_RESOURCE_ADJUSTED;
 		}
 
 		if (!new_res)
@@ -411,11 +445,106 @@ static resource_size_t scan_allocate(struct nd_region *nd_region,
 			return 0;
 	}
 
-	if (is_pmem && n == to_allocate)
+	/*
+	 * If we allocated nothing in the BLK case it may be because we are in
+	 * an initial "pmem-reserve pass".  Only do an initial BLK allocation
+	 * when none of the DPA space is reserved.
+	 */
+	if ((is_pmem || !ndd->dpa.child) && n == to_allocate)
 		return init_dpa_allocation(label_id, nd_region, nd_mapping, n);
 	return n;
 }
 
+static int merge_dpa(struct nd_region *nd_region,
+		struct nd_mapping *nd_mapping, struct nd_label_id *label_id)
+{
+	struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+	struct resource *res;
+
+	if (strncmp("pmem", label_id->id, 4) == 0)
+		return 0;
+ retry:
+	for_each_dpa_resource(ndd, res) {
+		int rc;
+		struct resource *next = res->sibling;
+		resource_size_t end = res->start + resource_size(res);
+
+		if (!next || strcmp(res->name, label_id->id) != 0
+				|| strcmp(next->name, label_id->id) != 0
+				|| end != next->start)
+			continue;
+		end += resource_size(next);
+		nd_dimm_free_dpa(ndd, next);
+		rc = adjust_resource(res, res->start, end - res->start);
+		nd_dbg_dpa(nd_region, ndd, res, "merge %d\n", rc);
+		if (rc)
+			return rc;
+		res->flags |= DPA_RESOURCE_ADJUSTED;
+		goto retry;
+	}
+
+	return 0;
+}
+
+static int __reserve_free_pmem(struct device *dev, void *data)
+{
+	struct nd_dimm *nd_dimm = data;
+	struct nd_region *nd_region;
+	struct nd_label_id label_id;
+	int i;
+
+	if (!is_nd_pmem(dev))
+		return 0;
+
+	nd_region = to_nd_region(dev);
+	if (nd_region->ndr_mappings == 0)
+		return 0;
+
+	memset(&label_id, 0, sizeof(label_id));
+	strcat(label_id.id, "pmem-reserve");
+	for (i = 0; i < nd_region->ndr_mappings; i++) {
+		struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+		resource_size_t n, rem = 0;
+
+		if (nd_mapping->nd_dimm != nd_dimm)
+			continue;
+
+		n = nd_pmem_available_dpa(nd_region, nd_mapping, &rem);
+		if (n == 0)
+			return 0;
+		rem = scan_allocate(nd_region, nd_mapping, &label_id, n);
+		dev_WARN_ONCE(&nd_region->dev, rem,
+				"pmem reserve underrun: %#llx of %#llx bytes\n",
+				(unsigned long long) n - rem,
+				(unsigned long long) n);
+		return rem ? -ENXIO : 0;
+	}
+
+	return 0;
+}
+
+static void release_free_pmem(struct nd_bus *nd_bus, struct nd_mapping *nd_mapping)
+{
+	struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+	struct resource *res, *_res;
+
+	for_each_dpa_resource_safe(ndd, res, _res)
+		if (strcmp(res->name, "pmem-reserve") == 0)
+			nd_dimm_free_dpa(ndd, res);
+}
+
+static int reserve_free_pmem(struct nd_bus *nd_bus,
+		struct nd_mapping *nd_mapping)
+{
+	struct nd_dimm *nd_dimm = nd_mapping->nd_dimm;
+	int rc;
+
+	rc = device_for_each_child(&nd_bus->dev, nd_dimm, __reserve_free_pmem);
+	if (rc)
+		release_free_pmem(nd_bus, nd_mapping);
+	return rc;
+}
+
 /**
  * grow_dpa_allocation - for each dimm allocate n bytes for @label_id
  * @nd_region: the set of dimms to allocate @n more bytes from
@@ -432,13 +561,44 @@ static resource_size_t scan_allocate(struct nd_region *nd_region,
 static int grow_dpa_allocation(struct nd_region *nd_region,
 		struct nd_label_id *label_id, resource_size_t n)
 {
+	struct nd_bus *nd_bus = walk_to_nd_bus(&nd_region->dev);
+	bool is_pmem = strncmp(label_id->id, "pmem", 4) == 0;
 	int i;
 
 	for (i = 0; i < nd_region->ndr_mappings; i++) {
 		struct nd_mapping *nd_mapping = &nd_region->mapping[i];
-		int rc;
+		resource_size_t rem = n;
+		int rc, j;
 
-		rc = scan_allocate(nd_region, nd_mapping, label_id, n);
+		/*
+		 * In the BLK case try once with all unallocated PMEM
+		 * reserved, and once without
+		 */
+		for (j = is_pmem; j < 2; j++) {
+			bool blk_only = j == 0;
+
+			if (blk_only) {
+				rc = reserve_free_pmem(nd_bus, nd_mapping);
+				if (rc)
+					return rc;
+			}
+			rem = scan_allocate(nd_region, nd_mapping, label_id, rem);
+			if (blk_only)
+				release_free_pmem(nd_bus, nd_mapping);
+
+			/* try again and allow encroachments into PMEM */
+			if (rem == 0)
+				break;
+		}
+
+		dev_WARN_ONCE(&nd_region->dev, rem,
+				"allocation underrun: %#llx of %#llx bytes\n",
+				(unsigned long long) n - rem,
+				(unsigned long long) n);
+		if (rem)
+			return -ENXIO;
+
+		rc = merge_dpa(nd_region, nd_mapping, label_id);
 		if (rc)
 			return rc;
 	}
@@ -474,8 +634,10 @@ static ssize_t __size_store(struct device *dev, unsigned long long val)
 
 		uuid = nspm->uuid;
 	} else if (is_namespace_blk(dev)) {
-		/* TODO: blk namespace support */
-		return -ENXIO;
+		struct nd_namespace_blk *nsblk = to_nd_namespace_blk(dev);
+
+		uuid = nsblk->uuid;
+		flags = NSLABEL_FLAG_LOCAL;
 	}
 
 	/*
@@ -529,6 +691,14 @@ static ssize_t __size_store(struct device *dev, unsigned long long val)
 
 		nd_namespace_pmem_set_size(nd_region, nspm,
 				val * nd_region->ndr_mappings);
+	} else if (is_namespace_blk(dev)) {
+		/*
+		 * Try to delete the namespace if we deleted all of its
+		 * allocation and this is not the seed device for the
+		 * region.
+		 */
+		if (val == 0 && nd_region->ns_seed != dev)
+			nd_device_unregister(dev, ND_ASYNC);
 	}
 
 	return rc;
@@ -555,8 +725,9 @@ static ssize_t size_store(struct device *dev,
 
 		uuid = &nspm->uuid;
 	} else if (is_namespace_blk(dev)) {
-		/* TODO: blk namespace support */
-		rc = -ENXIO;
+		struct nd_namespace_blk *nsblk = to_nd_namespace_blk(dev);
+
+		uuid = &nsblk->uuid;
 	}
 
 	if (rc == 0 && val == 0 && uuid) {
@@ -577,21 +748,23 @@ static ssize_t size_store(struct device *dev,
 static ssize_t size_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
+	unsigned long long size = 0;
+
+	nd_bus_lock(dev);
 	if (is_namespace_pmem(dev)) {
 		struct nd_namespace_pmem *nspm = to_nd_namespace_pmem(dev);
 
-		return sprintf(buf, "%llu\n", (unsigned long long)
-				resource_size(&nspm->nsio.res));
+		size = resource_size(&nspm->nsio.res);
 	} else if (is_namespace_blk(dev)) {
-		/* TODO: blk namespace support */
-		return -ENXIO;
+		size = nd_namespace_blk_size(to_nd_namespace_blk(dev));
 	} else if (is_namespace_io(dev)) {
 		struct nd_namespace_io *nsio = to_nd_namespace_io(dev);
 
-		return sprintf(buf, "%llu\n", (unsigned long long)
-				resource_size(&nsio->res));
-	} else
-		return -ENXIO;
+		size = resource_size(&nsio->res);
+	}
+	nd_bus_unlock(dev);
+
+	return sprintf(buf, "%llu\n", size);
 }
 static DEVICE_ATTR(size, S_IRUGO, size_show, size_store);
 
@@ -605,8 +778,9 @@ static ssize_t uuid_show(struct device *dev,
 
 		uuid = nspm->uuid;
 	} else if (is_namespace_blk(dev)) {
-		/* TODO: blk namespace support */
-		return -ENXIO;
+		struct nd_namespace_blk *nsblk = to_nd_namespace_blk(dev);
+
+		uuid = nsblk->uuid;
 	} else
 		return -ENXIO;
 
@@ -670,8 +844,9 @@ static ssize_t uuid_store(struct device *dev,
 
 		ns_uuid = &nspm->uuid;
 	} else if (is_namespace_blk(dev)) {
-		/* TODO: blk namespace support */
-		return -ENXIO;
+		struct nd_namespace_blk *nsblk = to_nd_namespace_blk(dev);
+
+		ns_uuid = &nsblk->uuid;
 	} else
 		return -ENXIO;
 
@@ -713,12 +888,48 @@ static ssize_t resource_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(resource);
 
+static const unsigned long ns_lbasize_supported[] = { 512, 0 };
+
+static ssize_t sector_size_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_namespace_blk *nsblk = to_nd_namespace_blk(dev);
+
+	if (!is_namespace_blk(dev))
+		return -ENXIO;
+
+	return nd_sector_size_show(nsblk->lbasize, ns_lbasize_supported, buf);
+}
+
+static ssize_t sector_size_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	struct nd_namespace_blk *nsblk = to_nd_namespace_blk(dev);
+	ssize_t rc;
+
+	if (!is_namespace_blk(dev))
+		return -ENXIO;
+
+	device_lock(dev);
+	nd_bus_lock(dev);
+	rc = nd_sector_size_store(dev, buf, &nsblk->lbasize,
+			ns_lbasize_supported);
+	dev_dbg(dev, "%s: result: %zd wrote: %s%s", __func__,
+			rc, buf, buf[len - 1] == '\n' ? "" : "\n");
+	nd_bus_unlock(dev);
+	device_unlock(dev);
+
+	return rc ? rc : len;
+}
+static DEVICE_ATTR_RW(sector_size);
+
 static struct attribute *nd_namespace_attributes[] = {
 	&dev_attr_nstype.attr,
 	&dev_attr_size.attr,
 	&dev_attr_uuid.attr,
 	&dev_attr_resource.attr,
 	&dev_attr_alt_name.attr,
+	&dev_attr_sector_size.attr,
 	NULL,
 };
 
@@ -735,6 +946,10 @@ static umode_t nd_namespace_attr_visible(struct kobject *kobj, struct attribute
 	if (is_namespace_pmem(dev) || is_namespace_blk(dev)) {
 		if (a == &dev_attr_size.attr)
 			return S_IWUSR;
+
+		if (is_namespace_pmem(dev) && a == &dev_attr_sector_size.attr)
+			return 0;
+
 		return a->mode;
 	}
 
@@ -1029,6 +1244,173 @@ static struct device **create_namespace_pmem(struct nd_region *nd_region)
 	return NULL;
 }
 
+struct resource *nsblk_add_resource(struct nd_region *nd_region,
+		struct nd_dimm_drvdata *ndd, struct nd_namespace_blk *nsblk,
+		resource_size_t start)
+{
+	struct nd_label_id label_id;
+	struct resource *res;
+
+	nd_label_gen_id(&label_id, nsblk->uuid, NSLABEL_FLAG_LOCAL);
+	nsblk->res = krealloc(nsblk->res,
+			sizeof(void *) * (nsblk->num_resources + 1),
+			GFP_KERNEL);
+	if (!nsblk->res)
+		return NULL;
+	for_each_dpa_resource(ndd, res)
+		if (strcmp(res->name, label_id.id) == 0 && res->start == start) {
+			nsblk->res[nsblk->num_resources++] = res;
+			return res;
+		}
+	return NULL;
+}
+
+static struct device *nd_namespace_blk_create(struct nd_region *nd_region)
+{
+	struct nd_namespace_blk *nsblk;
+	struct device *dev;
+
+	if (!is_nd_blk(&nd_region->dev))
+		return NULL;
+
+	nsblk = kzalloc(sizeof(*nsblk), GFP_KERNEL);
+	if (!nsblk)
+		return NULL;
+
+	dev = &nsblk->dev;
+	dev->type = &namespace_blk_device_type;
+	nsblk->id = ida_simple_get(&nd_region->ns_ida, 0, 0, GFP_KERNEL);
+	if (nsblk->id < 0) {
+		kfree(nsblk);
+		return NULL;
+	}
+	dev_set_name(dev, "namespace%d.%d", nd_region->id, nsblk->id);
+	dev->parent = &nd_region->dev;
+	dev->groups = nd_namespace_attribute_groups;
+
+	return &nsblk->dev;
+}
+
+void nd_region_create_blk_seed(struct nd_region *nd_region)
+{
+	WARN_ON(!is_nd_bus_locked(&nd_region->dev));
+	nd_region->ns_seed = nd_namespace_blk_create(nd_region);
+	/*
+	 * Seed creation failures are not fatal, provisioning is simply
+	 * disabled until memory becomes available
+	 */
+	if (!nd_region->ns_seed)
+		dev_err(&nd_region->dev, "failed to create blk namespace\n");
+	else
+		nd_device_register(nd_region->ns_seed);
+}
+
+static struct device **create_namespace_blk(struct nd_region *nd_region)
+{
+	struct nd_mapping *nd_mapping = &nd_region->mapping[0];
+	struct nd_namespace_label __iomem *nd_label;
+	struct device *dev, **devs = NULL;
+	u8 label_uuid[NSLABEL_UUID_LEN];
+	struct nd_namespace_blk *nsblk;
+	struct nd_dimm_drvdata *ndd;
+	int i, l, count = 0;
+	struct resource *res;
+
+	if (nd_region->ndr_mappings == 0)
+		return NULL;
+
+	ndd = to_ndd(nd_mapping);
+	for_each_label(l, nd_label, nd_mapping->labels) {
+		u32 flags = readl(&nd_label->flags);
+		char name[NSLABEL_NAME_LEN];
+		struct device **__devs;
+
+		if (flags & NSLABEL_FLAG_LOCAL)
+			/* pass */;
+		else
+			continue;
+
+		memcpy_fromio(label_uuid, nd_label->uuid, NSLABEL_UUID_LEN);
+		for (i = 0; i < count; i++) {
+			nsblk = to_nd_namespace_blk(devs[i]);
+			if (memcmp(nsblk->uuid, label_uuid,
+						NSLABEL_UUID_LEN) == 0) {
+				res = nsblk_add_resource(nd_region, ndd, nsblk,
+						readq(&nd_label->dpa));
+				if (!res)
+					goto err;
+				nd_dbg_dpa(nd_region, ndd, res, "%s assign\n",
+					dev_name(&nsblk->dev));
+				break;
+			}
+		}
+		if (i < count)
+			continue;
+		__devs = kcalloc(count + 2, sizeof(dev), GFP_KERNEL);
+		if (!__devs)
+			goto err;
+		memcpy(__devs, devs, sizeof(dev) * count);
+		kfree(devs);
+		devs = __devs;
+
+		nsblk = kzalloc(sizeof(*nsblk), GFP_KERNEL);
+		if (!nsblk)
+			goto err;
+		dev = &nsblk->dev;
+		dev->type = &namespace_blk_device_type;
+		dev_set_name(dev, "namespace%d.%d", nd_region->id, count);
+		devs[count++] = dev;
+		nsblk->id = -1;
+		nsblk->lbasize = readq(&nd_label->lbasize);
+		nsblk->uuid = kmemdup(label_uuid, NSLABEL_UUID_LEN, GFP_KERNEL);
+		if (!nsblk->uuid)
+			goto err;
+		memcpy_fromio(name, nd_label->name, NSLABEL_NAME_LEN);
+		if (name[0])
+			nsblk->alt_name = kmemdup(name, NSLABEL_NAME_LEN,
+					GFP_KERNEL);
+		res = nsblk_add_resource(nd_region, ndd, nsblk,
+				readq(&nd_label->dpa));
+		if (!res)
+			goto err;
+		nd_dbg_dpa(nd_region, ndd, res, "%s assign\n",
+				dev_name(&nsblk->dev));
+	}
+
+	dev_dbg(&nd_region->dev, "%s: discovered %d blk namespace%s\n",
+			__func__, count, count == 1 ? "" : "s");
+
+	if (count == 0) {
+		/* Publish a zero-sized namespace for userspace to configure. */
+		for (i = 0; i < nd_region->ndr_mappings; i++) {
+			struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+
+			kfree(nd_mapping->labels);
+			nd_mapping->labels = NULL;
+		}
+
+		devs = kcalloc(2, sizeof(dev), GFP_KERNEL);
+		if (!devs)
+			goto err;
+		nsblk = kzalloc(sizeof(*nsblk), GFP_KERNEL);
+		if (!nsblk)
+			goto err;
+		dev = &nsblk->dev;
+		dev->type = &namespace_blk_device_type;
+		devs[count++] = dev;
+	}
+
+	return devs;
+
+err:
+	for (i = 0; i < count; i++) {
+		nsblk = to_nd_namespace_blk(devs[i]);
+		namespace_blk_release(&nsblk->dev);
+	}
+	kfree(devs);
+	return NULL;
+}
+
 static int init_active_labels(struct nd_region *nd_region)
 {
 	int i;
@@ -1092,6 +1474,9 @@ int nd_region_register_namespaces(struct nd_region *nd_region, int *err)
 	case ND_DEVICE_NAMESPACE_PMEM:
 		devs = create_namespace_pmem(nd_region);
 		break;
+	case ND_DEVICE_NAMESPACE_BLK:
+		devs = create_namespace_blk(nd_region);
+		break;
 	default:
 		break;
 	}
@@ -1102,26 +1487,59 @@ int nd_region_register_namespaces(struct nd_region *nd_region, int *err)
 		goto err;
 	}
 
-	nd_region->ns_seed = devs[0];
 	for (i = 0; devs[i]; i++) {
 		struct device *dev = devs[i];
+		int id;
+
+		if (type == ND_DEVICE_NAMESPACE_BLK) {
+			struct nd_namespace_blk *nsblk;
 
-		dev_set_name(dev, "namespace%d.%d", nd_region->id, i);
+			nsblk = to_nd_namespace_blk(dev);
+			id = ida_simple_get(&nd_region->ns_ida, 0, 0,
+					GFP_KERNEL);
+			nsblk->id = id;
+		} else
+			id = i;
+
+		if (id < 0)
+			break;
+		dev_set_name(dev, "namespace%d.%d", nd_region->id, id);
 		dev->parent = &nd_region->dev;
 		dev->groups = nd_namespace_attribute_groups;
 		nd_device_register(dev);
 	}
-	kfree(devs);
+	if (i)
+		nd_region->ns_seed = devs[0];
 
-	return i;
+	if (devs[i]) {
+		int j;
+
+		for (j = i; devs[j]; j++) {
+			struct device *dev = devs[j];
+
+			device_initialize(dev);
+			put_device(dev);
+		}
+		*err = j - i;
+		/*
+		 * All of the namespaces we tried to register failed, so
+		 * fail region activation.
+		 */
+		if (*err == 0)
+			rc = -ENODEV;
+	}
+	kfree(devs);
 
  err:
-	for (i = 0; i < nd_region->ndr_mappings; i++) {
-		struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+	if (rc == -ENODEV) {
+		for (i = 0; i < nd_region->ndr_mappings; i++) {
+			struct nd_mapping *nd_mapping = &nd_region->mapping[i];
 
-		kfree(nd_mapping->labels);
-		nd_mapping->labels = NULL;
+			kfree(nd_mapping->labels);
+			nd_mapping->labels = NULL;
+		}
+		return rc;
 	}
 
-	return rc;
+	return i;
 }
diff --git a/drivers/block/nd/nd-private.h b/drivers/block/nd/nd-private.h
index 814843454417..fe852175a3b8 100644
--- a/drivers/block/nd/nd-private.h
+++ b/drivers/block/nd/nd-private.h
@@ -16,6 +16,7 @@
 #include <linux/libnd.h>
 #include <linux/sizes.h>
 #include <linux/mutex.h>
+#include <linux/nd.h>
 
 extern struct list_head nd_bus_list;
 extern struct mutex nd_bus_list_mutex;
@@ -52,6 +53,8 @@ void nd_dimm_exit(void);
 int nd_region_exit(void);
 void nd_region_probe_start(struct nd_bus *nd_bus, struct device *dev);
 void nd_region_probe_end(struct nd_bus *nd_bus, struct device *dev, int rc);
+struct nd_region;
+void nd_region_create_blk_seed(struct nd_region *nd_region);
 void nd_region_notify_remove(struct nd_bus *nd_bus, struct device *dev, int rc);
 int nd_bus_create_ndctl(struct nd_bus *nd_bus);
 void nd_bus_destroy_ndctl(struct nd_bus *nd_bus);
@@ -68,7 +71,12 @@ struct nd_dimm_drvdata;
 struct nd_mapping;
 resource_size_t nd_pmem_available_dpa(struct nd_region *nd_region,
 		struct nd_mapping *nd_mapping, resource_size_t *overlap);
+resource_size_t nd_blk_available_dpa(struct nd_mapping *nd_mapping);
 resource_size_t nd_region_available_dpa(struct nd_region *nd_region);
 resource_size_t nd_dimm_allocated_dpa(struct nd_dimm_drvdata *ndd,
 		struct nd_label_id *label_id);
+struct nd_mapping;
+struct resource *nsblk_add_resource(struct nd_region *nd_region,
+		struct nd_dimm_drvdata *ndd, struct nd_namespace_blk *nsblk,
+		resource_size_t start);
 #endif /* __ND_PRIVATE_H__ */
diff --git a/drivers/block/nd/nd.h b/drivers/block/nd/nd.h
index d9d221a7006e..3876d0c7db87 100644
--- a/drivers/block/nd/nd.h
+++ b/drivers/block/nd/nd.h
@@ -90,6 +90,7 @@ static inline struct nd_namespace_label __iomem *nd_get_label(
 
 struct nd_region {
 	struct device dev;
+	struct ida ns_ida;
 	struct device *ns_seed;
 	u16 ndr_mappings;
 	u64 ndr_size;
@@ -119,6 +120,10 @@ void nd_device_register(struct device *dev);
 void nd_device_unregister(struct device *dev, enum nd_async_mode mode);
 int nd_uuid_store(struct device *dev, u8 **uuid_out, const char *buf,
 		size_t len);
+ssize_t nd_sector_size_show(unsigned long current_lbasize,
+		const unsigned long *supported, char *buf);
+ssize_t nd_sector_size_store(struct device *dev, const char *buf,
+		unsigned long *current_lbasize, const unsigned long *supported);
 struct nd_dimm;
 struct nd_dimm_drvdata *to_ndd(struct nd_mapping *nd_mapping);
 int nd_dimm_init_nsarea(struct nd_dimm_drvdata *ndd);
diff --git a/drivers/block/nd/region_devs.c b/drivers/block/nd/region_devs.c
index 6b43a5c901cd..1ae6bb44c371 100644
--- a/drivers/block/nd/region_devs.c
+++ b/drivers/block/nd/region_devs.c
@@ -118,7 +118,12 @@ static int is_uuid_busy(struct device *dev, void *data)
 		break;
 	}
 	case ND_DEVICE_NAMESPACE_BLK: {
-		/* TODO: blk namespace support */
+		struct nd_namespace_blk *nsblk = to_nd_namespace_blk(dev);
+
+		if (!nsblk->uuid)
+			break;
+		if (memcmp(uuid, nsblk->uuid, NSLABEL_UUID_LEN) == 0)
+			return -EBUSY;
 		break;
 	}
 	default:
@@ -230,7 +235,7 @@ resource_size_t nd_region_available_dpa(struct nd_region *nd_region)
 				goto retry;
 			}
 		} else if (is_nd_blk(&nd_region->dev)) {
-			/* TODO: BLK Namespace support */
+			available += nd_blk_available_dpa(nd_mapping);
 		}
 	}
 
@@ -360,6 +365,11 @@ static void nd_region_notify_driver_action(struct nd_bus *nd_bus,
 			else
 				atomic_dec(&nd_dimm->busy);
 		}
+	} else if (dev->parent && is_nd_blk(dev->parent) && probe && rc == 0) {
+		struct nd_region *nd_region = to_nd_region(dev->parent);
+
+		if (nd_region->ns_seed == dev)
+			nd_region_create_blk_seed(nd_region);
 	}
 }
 
@@ -546,6 +556,7 @@ static noinline struct nd_region *nd_region_create(struct nd_bus *nd_bus,
 	nd_region->ndr_mappings = ndr_desc->num_mappings;
 	nd_region->provider_data = ndr_desc->provider_data;
 	nd_region->nd_set = ndr_desc->nd_set;
+	ida_init(&nd_region->ns_ida);
 	dev = &nd_region->dev;
 	dev_set_name(dev, "region%d", nd_region->id);
 	dev->parent = &nd_bus->dev;
diff --git a/include/linux/libnd.h b/include/linux/libnd.h
index 3190a561ea59..43f58330d14c 100644
--- a/include/linux/libnd.h
+++ b/include/linux/libnd.h
@@ -26,6 +26,9 @@ enum {
 	ND_CMD_MAX_ENVELOPE = 16,
 	ND_CMD_ARS_QUERY_MAX = SZ_4K,
 	ND_MAX_MAPPINGS = 32,
+
+	/* mark newly adjusted resources as requiring a label update */
+	DPA_RESOURCE_ADJUSTED = 1 << 0,
 };
 
 extern struct attribute_group nd_bus_attribute_group;
diff --git a/include/linux/nd.h b/include/linux/nd.h
index 255c38a83083..23276ea91690 100644
--- a/include/linux/nd.h
+++ b/include/linux/nd.h
@@ -50,6 +50,26 @@ struct nd_namespace_pmem {
 	u8 *uuid;
 };
 
+/**
+ * struct nd_namespace_blk - namespace for dimm-bounded persistent memory
+ * @dev: namespace device created by the nd region driver
+ * @alt_name: namespace name supplied in the dimm label
+ * @uuid: namespace uuid supplied in the dimm label
+ * @id: ida allocated id
+ * @lbasize: blk namespaces have a native sector size when btt not present
+ * @num_resources: number of dpa extents to claim
+ * @res: discontiguous dpa extents for given dimm
+ */
+struct nd_namespace_blk {
+	struct device dev;
+	char *alt_name;
+	u8 *uuid;
+	int id;
+	unsigned long lbasize;
+	int num_resources;
+	struct resource **res;
+};
+
 static inline struct nd_namespace_io *to_nd_namespace_io(struct device *dev)
 {
 	return container_of(dev, struct nd_namespace_io, dev);
@@ -62,6 +82,11 @@ static inline struct nd_namespace_pmem *to_nd_namespace_pmem(struct device *dev)
 	return container_of(nsio, struct nd_namespace_pmem, nsio);
 }
 
+static inline struct nd_namespace_blk *to_nd_namespace_blk(struct device *dev)
+{
+	return container_of(dev, struct nd_namespace_blk, dev);
+}
+
 #define MODULE_ALIAS_ND_DEVICE(type) \
 	MODULE_ALIAS("nd:t" __stringify(type) "*")
 #define ND_DEVICE_MODALIAS_FMT "nd:t%d"


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [PATCH v3 15/21] libnd: write pmem label set
  2015-05-20 20:56 ` Dan Williams
@ 2015-05-20 20:57   ` Dan Williams
  -1 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2015-05-20 20:57 UTC (permalink / raw)
  To: axboe
  Cc: linux-nvdimm, neilb, gregkh, linux-kernel, mingo, linux-acpi,
	jmoyer, hch

After 'uuid', 'size', and optionally 'alt_name' have been set to valid
values, the labels on the dimms can be updated.

The write procedure is:
1/ Allocate and write new labels in the "next" index
2/ Free the old labels in the working copy
3/ Write the bitmap and the label space on the dimm
4/ Write the index to make the update valid

Label ranges directly mirror the dpa resource values for the given
label_id of the namespace.

Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
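
The write procedure above can be sketched in userspace C as follows.  This
is a toy model under stated assumptions, not the kernel implementation: the
names toy_dimm, toy_index, toy_init, toy_update and NSLOT are hypothetical,
the free bitmap is a single byte, nothing is actually written to a dimm, and
the sequence number is a plain counter rather than the cycling value that
nd_inc_seq() produces.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NSLOT 8 /* hypothetical number of label slots (fits one byte) */

/* Toy index block: a set bit in 'free' means the slot is unused. */
struct toy_index {
	uint32_t seq;
	uint8_t free;
};

struct toy_dimm {
	struct toy_index index[2]; /* two index copies, as in the label area */
	int current, next;         /* live index and staging index */
	char label[NSLOT][16];     /* stand-in for the label storage */
};

static void toy_init(struct toy_dimm *d)
{
	memset(d, 0, sizeof(*d));
	d->index[0].free = 0xff;
	d->index[1].free = 0xff;
	d->current = 0;
	d->next = 1;
}

/*
 * Replace the label in old_slot (-1 if there is none) with a new label.
 * Returns the new slot, or -1 when the staging index has no free slot.
 */
static int toy_update(struct toy_dimm *d, int old_slot, const char *name)
{
	struct toy_index *stage = &d->index[d->next];
	int slot, tmp;

	assert(old_slot >= -1 && old_slot < NSLOT);

	/* 1/ allocate a slot in the "next" index and write the new label */
	for (slot = 0; slot < NSLOT; slot++)
		if (stage->free & (1u << slot))
			break;
	if (slot == NSLOT)
		return -1;
	stage->free &= (uint8_t)~(1u << slot);
	snprintf(d->label[slot], sizeof(d->label[slot]), "%s", name);

	/* 2/ free the old label in the working copy */
	if (old_slot >= 0)
		stage->free |= (uint8_t)(1u << old_slot);

	/* 3/ the kernel would push the label and bitmap to the dimm here */

	/* 4/ bump the sequence number so the staged index becomes valid */
	stage->seq++;

	/* mirror the committed index into the old one, then swap roles */
	d->index[d->current] = *stage;
	tmp = d->current;
	d->current = d->next;
	d->next = tmp;

	return slot;
}
```

Because the old label's slot is only released in the staging copy, the
previously valid index keeps describing a consistent label set until step 4
completes.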
 drivers/block/nd/dimm_devs.c      |   49 ++++++
 drivers/block/nd/label.c          |  328 +++++++++++++++++++++++++++++++++++++
 drivers/block/nd/label.h          |    6 +
 drivers/block/nd/namespace_devs.c |   82 ++++++++-
 drivers/block/nd/nd.h             |    3 
 5 files changed, 454 insertions(+), 14 deletions(-)
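
nd_dimm_set_config_data(), added to dimm_devs.c by this patch, splits a
write into commands no larger than the dimm's maximum transfer size.  The
chunking loop can be sketched in userspace C roughly as below.  The names
xfer_fn, chunked_write, toy_area and toy_xfer are hypothetical, the callback
stands in for the bus ->ndctl() call, and bounds checking is left to the
callback here, whereas the kernel function rejects an out-of-range request
before sending anything.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical transport: write len bytes at offset, 0 on success. */
typedef int (*xfer_fn)(void *ctx, size_t offset, const void *buf, size_t len);

/* Send buf in pieces of at most max_xfer bytes; stop at the first error. */
static int chunked_write(xfer_fn xfer, void *ctx, size_t offset,
		const void *buf, size_t len, size_t max_xfer)
{
	const unsigned char *p = buf;
	size_t done = 0;

	assert(xfer != NULL);
	if (max_xfer == 0)
		return -1;

	while (done < len) {
		size_t left = len - done;
		size_t chunk = left < max_xfer ? left : max_xfer;
		int rc = xfer(ctx, offset + done, p + done, chunk);

		if (rc)
			return rc;
		done += chunk;
	}
	return 0;
}

/* Toy config area used to exercise the loop. */
struct toy_area {
	unsigned char data[64];
	int calls;
};

static int toy_xfer(void *ctx, size_t offset, const void *buf, size_t len)
{
	struct toy_area *area = ctx;

	if (offset + len > sizeof(area->data))
		return -1;
	memcpy(area->data + offset, buf, len);
	area->calls++;
	return 0;
}
```

As in the patch, an error part-way through leaves the earlier chunks written,
so callers cannot treat a failed multi-chunk write as a no-op.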

diff --git a/drivers/block/nd/dimm_devs.c b/drivers/block/nd/dimm_devs.c
index 4aa5654354ac..358b2a06d680 100644
--- a/drivers/block/nd/dimm_devs.c
+++ b/drivers/block/nd/dimm_devs.c
@@ -132,6 +132,55 @@ int nd_dimm_init_config_data(struct nd_dimm_drvdata *ndd)
 	return rc;
 }
 
+int nd_dimm_set_config_data(struct nd_dimm_drvdata *ndd, size_t offset,
+		void *buf, size_t len)
+{
+	int rc = validate_dimm(ndd);
+	size_t max_cmd_size, buf_offset;
+	struct nd_cmd_set_config_hdr *cmd;
+	struct nd_bus *nd_bus = walk_to_nd_bus(ndd->dev);
+	struct nd_bus_descriptor *nd_desc = nd_bus->nd_desc;
+
+	if (rc)
+		return rc;
+
+	if (!ndd->data)
+		return -ENXIO;
+
+	if (offset + len > ndd->nsarea.config_size)
+		return -ENXIO;
+
+	max_cmd_size = min_t(u32, PAGE_SIZE, len);
+	max_cmd_size = min_t(u32, max_cmd_size, ndd->nsarea.max_xfer);
+	cmd = kzalloc(max_cmd_size + sizeof(*cmd) + sizeof(u32), GFP_KERNEL);
+	if (!cmd)
+		return -ENOMEM;
+
+	for (buf_offset = 0; len; len -= cmd->in_length,
+			buf_offset += cmd->in_length) {
+		size_t cmd_size;
+		u32 *status;
+
+		cmd->in_offset = offset + buf_offset;
+		cmd->in_length = min(max_cmd_size, len);
+		memcpy(cmd->in_buf, buf + buf_offset, cmd->in_length);
+
+		/* status is output in the last 4-bytes of the command buffer */
+		cmd_size = sizeof(*cmd) + cmd->in_length + sizeof(u32);
+		status = ((void *) cmd) + cmd_size - sizeof(u32);
+
+		rc = nd_desc->ndctl(nd_desc, to_nd_dimm(ndd->dev),
+				ND_CMD_SET_CONFIG_DATA, cmd, cmd_size);
+		if (rc || *status) {
+			rc = rc ? rc : -ENXIO;
+			break;
+		}
+	}
+	kfree(cmd);
+
+	return rc;
+}
+
 static void nd_dimm_release(struct device *dev)
 {
 	struct nd_dimm *nd_dimm = to_nd_dimm(dev);
diff --git a/drivers/block/nd/label.c b/drivers/block/nd/label.c
index ecd196b42d57..a4746f1fe99c 100644
--- a/drivers/block/nd/label.c
+++ b/drivers/block/nd/label.c
@@ -12,6 +12,7 @@
  */
 #include <linux/device.h>
 #include <linux/ndctl.h>
+#include <linux/slab.h>
 #include <linux/io.h>
 #include <linux/nd.h>
 #include "nd-private.h"
@@ -57,6 +58,11 @@ size_t sizeof_namespace_index(struct nd_dimm_drvdata *ndd)
 	return ndd->nsindex_size;
 }
 
+static int nd_dimm_num_label_slots(struct nd_dimm_drvdata *ndd)
+{
+	return ndd->nsarea.config_size / 129;
+}
+
 int nd_label_validate(struct nd_dimm_drvdata *ndd)
 {
 	/*
@@ -203,23 +209,30 @@ static struct nd_namespace_label __iomem *nd_label_base(struct nd_dimm_drvdata *
 	return base + 2 * sizeof_namespace_index(ndd);
 }
 
+static int to_slot(struct nd_dimm_drvdata *ndd,
+		struct nd_namespace_label __iomem *nd_label)
+{
+	return nd_label - nd_label_base(ndd);
+}
+
 #define for_each_clear_bit_le(bit, addr, size) \
 	for ((bit) = find_next_zero_bit_le((addr), (size), 0);  \
 	     (bit) < (size);                                    \
 	     (bit) = find_next_zero_bit_le((addr), (size), (bit) + 1))
 
 /**
- * preamble_current - common variable initialization for nd_label_* routines
+ * preamble_index - common variable initialization for nd_label_* routines
  * @nd_dimm: dimm container for the relevant label set
+ * @idx: namespace_index index
  * @nsindex: on return set to the currently active namespace index
  * @free: on return set to the free label bitmap in the index
  * @nslot: on return set to the number of slots in the label space
  */
-static bool preamble_current(struct nd_dimm_drvdata *ndd,
+static bool preamble_index(struct nd_dimm_drvdata *ndd, int idx,
 		struct nd_namespace_index **nsindex,
 		unsigned long **free, u32 *nslot)
 {
-	*nsindex = to_current_namespace_index(ndd);
+	*nsindex = to_namespace_index(ndd, idx);
 	if (*nsindex == NULL)
 		return false;
 
@@ -238,6 +251,22 @@ char *nd_label_gen_id(struct nd_label_id *label_id, u8 *uuid, u32 flags)
 	return label_id->id;
 }
 
+static bool preamble_current(struct nd_dimm_drvdata *ndd,
+		struct nd_namespace_index **nsindex,
+		unsigned long **free, u32 *nslot)
+{
+	return preamble_index(ndd, ndd->ns_current, nsindex,
+			free, nslot);
+}
+
+static bool preamble_next(struct nd_dimm_drvdata *ndd,
+		struct nd_namespace_index **nsindex,
+		unsigned long **free, u32 *nslot)
+{
+	return preamble_index(ndd, ndd->ns_next, nsindex,
+			free, nslot);
+}
+
 static bool slot_valid(struct nd_namespace_label __iomem *nd_label, u32 slot)
 {
 	/* check that we are written where we expect to be written */
@@ -337,3 +366,296 @@ struct nd_namespace_label __iomem *nd_label_active(
 
 	return NULL;
 }
+
+u32 nd_label_alloc_slot(struct nd_dimm_drvdata *ndd)
+{
+	struct nd_namespace_index __iomem *nsindex;
+	unsigned long *free;
+	u32 nslot, slot;
+
+	if (!preamble_next(ndd, &nsindex, &free, &nslot))
+		return UINT_MAX;
+
+	WARN_ON(!is_nd_bus_locked(ndd->dev));
+
+	slot = find_next_bit_le(free, nslot, 0);
+	if (slot == nslot)
+		return UINT_MAX;
+
+	clear_bit_le(slot, free);
+
+	return slot;
+}
+
+bool nd_label_free_slot(struct nd_dimm_drvdata *ndd, u32 slot)
+{
+	struct nd_namespace_index __iomem *nsindex;
+	unsigned long *free;
+	u32 nslot;
+
+	if (!preamble_next(ndd, &nsindex, &free, &nslot))
+		return false;
+
+	WARN_ON(!is_nd_bus_locked(ndd->dev));
+
+	if (slot < nslot)
+		return !test_and_set_bit_le(slot, free);
+	return false;
+}
+
+u32 nd_label_nfree(struct nd_dimm_drvdata *ndd)
+{
+	struct nd_namespace_index __iomem *nsindex;
+	unsigned long *free;
+	u32 nslot;
+
+	WARN_ON(!is_nd_bus_locked(ndd->dev));
+
+	if (!preamble_next(ndd, &nsindex, &free, &nslot))
+		return 0;
+
+	return bitmap_weight(free, nslot);
+}
+
+static int nd_label_write_index(struct nd_dimm_drvdata *ndd, int index, u32 seq,
+		unsigned long flags)
+{
+	struct nd_namespace_index *nsindex = to_namespace_index(ndd, index);
+	unsigned long offset;
+	u64 checksum;
+	u32 nslot;
+	int rc;
+
+	if (flags & ND_NSINDEX_INIT)
+		nslot = nd_dimm_num_label_slots(ndd);
+	else
+		nslot = readl(&nsindex->nslot);
+
+	memcpy_toio(nsindex->sig, NSINDEX_SIGNATURE, NSINDEX_SIG_LEN);
+	writel(0, &nsindex->flags);
+	writel(seq, &nsindex->seq);
+	offset = (unsigned long) nsindex
+		- (unsigned long) to_namespace_index(ndd, 0);
+	writeq(offset, &nsindex->myoff);
+	writeq(sizeof_namespace_index(ndd), &nsindex->mysize);
+	offset = (unsigned long) to_namespace_index(ndd,
+			nd_label_next_nsindex(index))
+		- (unsigned long) to_namespace_index(ndd, 0);
+	writeq(offset, &nsindex->otheroff);
+	offset = (unsigned long) nd_label_base(ndd)
+		- (unsigned long) to_namespace_index(ndd, 0);
+	writeq(offset, &nsindex->labeloff);
+	writel(nslot, &nsindex->nslot);
+	writew(1, &nsindex->major);
+	writew(1, &nsindex->minor);
+	writeq(0, &nsindex->checksum);
+	if (flags & ND_NSINDEX_INIT) {
+		unsigned long *free = (unsigned long __force *) nsindex->free;
+		u32 nfree = ALIGN(nslot, BITS_PER_LONG);
+		int last_bits, i;
+
+		memset_io(nsindex->free, 0xff, nfree / 8);
+		for (i = 0, last_bits = nfree - nslot; i < last_bits; i++)
+			clear_bit_le(nslot + i, free);
+	}
+	checksum = nd_fletcher64((void * __force) nsindex,
+			sizeof_namespace_index(ndd), 1);
+	writeq(checksum, &nsindex->checksum);
+	rc = nd_dimm_set_config_data(ndd, readq(&nsindex->myoff),
+			nsindex, sizeof_namespace_index(ndd));
+	if (rc < 0)
+		return rc;
+
+	if (flags & ND_NSINDEX_INIT)
+		return 0;
+
+	/* copy the index we just wrote to the new 'next' */
+	WARN_ON(index != ndd->ns_next);
+	nd_label_copy(ndd, to_current_namespace_index(ndd), nsindex);
+	ndd->ns_current = nd_label_next_nsindex(ndd->ns_current);
+	ndd->ns_next = nd_label_next_nsindex(ndd->ns_next);
+	WARN_ON(ndd->ns_current == ndd->ns_next);
+
+	return 0;
+}
+
+static unsigned long nd_label_offset(struct nd_dimm_drvdata *ndd,
+		struct nd_namespace_label __iomem *nd_label)
+{
+	return (unsigned long) nd_label
+		- (unsigned long) to_namespace_index(ndd, 0);
+}
+
+static int __pmem_label_update(struct nd_region *nd_region,
+		struct nd_mapping *nd_mapping, struct nd_namespace_pmem *nspm,
+		int pos)
+{
+	u64 cookie = nd_region_interleave_set_cookie(nd_region), rawsize;
+	struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+	struct nd_namespace_label __iomem *victim_label;
+	struct nd_namespace_label __iomem *nd_label;
+	struct nd_namespace_index __iomem *nsindex;
+	unsigned long *free;
+	u32 nslot, slot;
+	size_t offset;
+	int rc;
+
+	if (!preamble_next(ndd, &nsindex, &free, &nslot))
+		return -ENXIO;
+
+	/* allocate and write the label to the staging (next) index */
+	slot = nd_label_alloc_slot(ndd);
+	if (slot == UINT_MAX)
+		return -ENXIO;
+	dev_dbg(ndd->dev, "%s: allocated: %d\n", __func__, slot);
+
+	nd_label = nd_label_base(ndd) + slot;
+	memset_io(nd_label, 0, sizeof(struct nd_namespace_label));
+	memcpy_toio(nd_label->uuid, nspm->uuid, NSLABEL_UUID_LEN);
+	if (nspm->alt_name)
+		memcpy_toio(nd_label->name, nspm->alt_name, NSLABEL_NAME_LEN);
+	writel(NSLABEL_FLAG_UPDATING, &nd_label->flags);
+	writew(nd_region->ndr_mappings, &nd_label->nlabel);
+	writew(pos, &nd_label->position);
+	writeq(cookie, &nd_label->isetcookie);
+	rawsize = div_u64(resource_size(&nspm->nsio.res),
+			nd_region->ndr_mappings);
+	writeq(rawsize, &nd_label->rawsize);
+	writeq(nd_mapping->start, &nd_label->dpa);
+	writel(slot, &nd_label->slot);
+
+	/* update label */
+	offset = nd_label_offset(ndd, nd_label);
+	rc = nd_dimm_set_config_data(ndd, offset, nd_label,
+			sizeof(struct nd_namespace_label));
+	if (rc < 0)
+		return rc;
+
+	/* Garbage collect the previous label */
+	victim_label = nd_get_label(nd_mapping->labels, 0);
+	if (victim_label) {
+		slot = to_slot(ndd, victim_label);
+		nd_label_free_slot(ndd, slot);
+		dev_dbg(ndd->dev, "%s: free: %d\n", __func__, slot);
+	}
+
+	/* update index */
+	rc = nd_label_write_index(ndd, ndd->ns_next,
+			nd_inc_seq(readl(&nsindex->seq)), 0);
+	if (rc < 0)
+		return rc;
+
+	nd_set_label(nd_mapping->labels, nd_label, 0);
+
+	return 0;
+}
+
+static int init_labels(struct nd_mapping *nd_mapping)
+{
+	int i;
+	struct nd_namespace_index __iomem *nsindex;
+	struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+
+	if (!nd_mapping->labels)
+		nd_mapping->labels = kcalloc(2, sizeof(void *), GFP_KERNEL);
+
+	if (!nd_mapping->labels)
+		return -ENOMEM;
+
+	if (ndd->ns_current == -1 || ndd->ns_next == -1)
+		/* pass */;
+	else
+		return 0;
+
+	nsindex = to_namespace_index(ndd, 0);
+	memset_io(nsindex, 0, ndd->nsarea.config_size);
+	for (i = 0; i < 2; i++) {
+		int rc = nd_label_write_index(ndd, i, i*2, ND_NSINDEX_INIT);
+
+		if (rc)
+			return rc;
+	}
+	ndd->ns_next = 1;
+	ndd->ns_current = 0;
+
+	return 0;
+}
+
+static int del_labels(struct nd_mapping *nd_mapping, u8 *uuid)
+{
+	struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+	struct nd_namespace_label __iomem *nd_label;
+	struct nd_namespace_index __iomem *nsindex;
+	u8 label_uuid[NSLABEL_UUID_LEN];
+	int l, num_freed = 0;
+	unsigned long *free;
+	u32 nslot, slot;
+
+	if (!uuid)
+		return 0;
+
+	/* no index || no labels == nothing to delete */
+	if (!preamble_next(ndd, &nsindex, &free, &nslot)
+			|| !nd_mapping->labels)
+		return 0;
+
+	for_each_label(l, nd_label, nd_mapping->labels) {
+		int j;
+
+		memcpy_fromio(label_uuid, nd_label->uuid, NSLABEL_UUID_LEN);
+		if (memcmp(label_uuid, uuid, NSLABEL_UUID_LEN) != 0)
+			continue;
+		slot = to_slot(ndd, nd_label);
+		nd_label_free_slot(ndd, slot);
+		dev_dbg(ndd->dev, "%s: free: %d\n", __func__, slot);
+		for (j = l; nd_get_label(nd_mapping->labels, j + 1); j++) {
+			struct nd_namespace_label __iomem *next_label;
+
+			next_label = nd_get_label(nd_mapping->labels, j + 1);
+			nd_set_label(nd_mapping->labels, next_label, j);
+		}
+		nd_set_label(nd_mapping->labels, NULL, j);
+		num_freed++;
+	}
+
+	if (num_freed > l) {
+		/*
+		 * num_freed will only ever be > l when we delete the last
+		 * label
+		 */
+		kfree(nd_mapping->labels);
+		nd_mapping->labels = NULL;
+		dev_dbg(ndd->dev, "%s: no more labels\n", __func__);
+	}
+
+	return nd_label_write_index(ndd, ndd->ns_next,
+			nd_inc_seq(readl(&nsindex->seq)), 0);
+}
+
+int nd_pmem_namespace_label_update(struct nd_region *nd_region,
+		struct nd_namespace_pmem *nspm, resource_size_t size)
+{
+	int i;
+
+	for (i = 0; i < nd_region->ndr_mappings; i++) {
+		struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+		int rc;
+
+		if (size == 0) {
+			rc = del_labels(nd_mapping, nspm->uuid);
+			if (rc)
+				return rc;
+			continue;
+		}
+
+		rc = init_labels(nd_mapping);
+		if (rc)
+			return rc;
+
+		rc = __pmem_label_update(nd_region, nd_mapping, nspm, i);
+		if (rc)
+			return rc;
+	}
+
+	return 0;
+}
diff --git a/drivers/block/nd/label.h b/drivers/block/nd/label.h
index 4436624f4146..e17958941e34 100644
--- a/drivers/block/nd/label.h
+++ b/drivers/block/nd/label.h
@@ -34,6 +34,7 @@ enum {
 	BTTINFO_MAJOR_VERSION = 1,
 	ND_LABEL_MIN_SIZE = 512 * 129, /* see sizeof_namespace_index() */
 	ND_LABEL_ID_SIZE = 50,
+	ND_NSINDEX_INIT = 0x1,
 };
 
 static const char NSINDEX_SIGNATURE[] = "NAMESPACE_INDEX\0";
@@ -129,4 +130,9 @@ size_t sizeof_namespace_index(struct nd_dimm_drvdata *ndd);
 int nd_label_active_count(struct nd_dimm_drvdata *ndd);
 struct nd_namespace_label __iomem *nd_label_active(
 		struct nd_dimm_drvdata *ndd, int n);
+u32 nd_label_nfree(struct nd_dimm_drvdata *ndd);
+struct nd_region;
+struct nd_namespace_pmem;
+int nd_pmem_namespace_label_update(struct nd_region *nd_region,
+		struct nd_namespace_pmem *nspm, resource_size_t size);
 #endif /* __LABEL_H__ */
diff --git a/drivers/block/nd/namespace_devs.c b/drivers/block/nd/namespace_devs.c
index d06b8abf6744..cdb78dddcfa9 100644
--- a/drivers/block/nd/namespace_devs.c
+++ b/drivers/block/nd/namespace_devs.c
@@ -151,20 +151,52 @@ static resource_size_t nd_namespace_blk_size(struct nd_namespace_blk *nsblk)
 	return size;
 }
 
+static int nd_namespace_label_update(struct nd_region *nd_region, struct device *dev)
+{
+	dev_WARN_ONCE(dev, dev->driver,
+			"namespace must be idle during label update\n");
+	if (dev->driver)
+		return 0;
+
+	/*
+	 * Only allow label writes that will result in a valid namespace
+	 * or deletion of an existing namespace.
+	 */
+	if (is_namespace_pmem(dev)) {
+		struct nd_namespace_pmem *nspm = to_nd_namespace_pmem(dev);
+		struct resource *res = &nspm->nsio.res;
+		resource_size_t size = resource_size(res);
+
+		if (size == 0 && nspm->uuid)
+			/* delete allocation */;
+		else if (!nspm->uuid)
+			return 0;
+
+		return nd_pmem_namespace_label_update(nd_region, nspm, size);
+	} else if (is_namespace_blk(dev)) {
+		/* TODO: implement blk labels */
+		return 0;
+	} else
+		return -ENXIO;
+}
+
 static ssize_t alt_name_store(struct device *dev,
 		struct device_attribute *attr, const char *buf, size_t len)
 {
+	struct nd_region *nd_region = to_nd_region(dev->parent);
 	ssize_t rc;
 
 	device_lock(dev);
 	nd_bus_lock(dev);
 	wait_nd_bus_probe_idle(dev);
 	rc = __alt_name_store(dev, buf, len);
+	if (rc >= 0)
+		rc = nd_namespace_label_update(nd_region, dev);
 	dev_dbg(dev, "%s: %s (%zd)\n", __func__, rc < 0 ? "fail" : "success", rc);
 	nd_bus_unlock(dev);
 	device_unlock(dev);
 
-	return rc;
+	return rc < 0 ? rc : len;
 }
 
 static ssize_t alt_name_show(struct device *dev,
@@ -707,6 +739,7 @@ static ssize_t __size_store(struct device *dev, unsigned long long val)
 static ssize_t size_store(struct device *dev,
 		struct device_attribute *attr, const char *buf, size_t len)
 {
+	struct nd_region *nd_region = to_nd_region(dev->parent);
 	unsigned long long val;
 	u8 **uuid = NULL;
 	int rc;
@@ -719,6 +752,8 @@ static ssize_t size_store(struct device *dev,
 	nd_bus_lock(dev);
 	wait_nd_bus_probe_idle(dev);
 	rc = __size_store(dev, val);
+	if (rc >= 0)
+		rc = nd_namespace_label_update(nd_region, dev);
 
 	if (is_namespace_pmem(dev)) {
 		struct nd_namespace_pmem *nspm = to_nd_namespace_pmem(dev);
@@ -742,7 +777,7 @@ static ssize_t size_store(struct device *dev,
 	nd_bus_unlock(dev);
 	device_unlock(dev);
 
-	return rc ? rc : len;
+	return rc < 0 ? rc : len;
 }
 
 static ssize_t size_show(struct device *dev,
@@ -802,17 +837,34 @@ static int namespace_update_uuid(struct nd_region *nd_region,
 	u32 flags = is_namespace_blk(dev) ? NSLABEL_FLAG_LOCAL : 0;
 	struct nd_label_id old_label_id;
 	struct nd_label_id new_label_id;
-	int i, rc;
+	int i;
 
-	rc = nd_is_uuid_unique(dev, new_uuid) ? 0 : -EINVAL;
-	if (rc) {
-		kfree(new_uuid);
-		return rc;
-	}
+	if (!nd_is_uuid_unique(dev, new_uuid))
+		return -EINVAL;
 
 	if (*old_uuid == NULL)
 		goto out;
 
+	/*
+	 * If we've already written a label with this uuid, then it's
+	 * too late to rename because we can't reliably update the uuid
+	 * without losing the old namespace.  Userspace must delete this
+	 * namespace to abandon the old uuid.
+	 */
+	for (i = 0; i < nd_region->ndr_mappings; i++) {
+		struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+
+		/*
+		 * This check by itself is sufficient because old_uuid
+		 * would be NULL above if this uuid did not exist in the
+		 * currently written set.
+		 *
+		 * FIXME: can we delete uuid with zero dpa allocated?
+		 */
+		if (nd_mapping->labels)
+			return -EBUSY;
+	}
+
 	nd_label_gen_id(&old_label_id, *old_uuid, flags);
 	nd_label_gen_id(&new_label_id, new_uuid, flags);
 	for (i = 0; i < nd_region->ndr_mappings; i++) {
@@ -856,12 +908,16 @@ static ssize_t uuid_store(struct device *dev,
 	rc = nd_uuid_store(dev, &uuid, buf, len);
 	if (rc >= 0)
 		rc = namespace_update_uuid(nd_region, dev, uuid, ns_uuid);
+	if (rc >= 0)
+		rc = nd_namespace_label_update(nd_region, dev);
+	else
+		kfree(uuid);
 	dev_dbg(dev, "%s: result: %zd wrote: %s%s", __func__,
 			rc, buf, buf[len - 1] == '\n' ? "" : "\n");
 	nd_bus_unlock(dev);
 	device_unlock(dev);
 
-	return rc ? rc : len;
+	return rc < 0 ? rc : len;
 }
 static DEVICE_ATTR_RW(uuid);
 
@@ -905,6 +961,7 @@ static ssize_t sector_size_store(struct device *dev,
 		struct device_attribute *attr, const char *buf, size_t len)
 {
 	struct nd_namespace_blk *nsblk = to_nd_namespace_blk(dev);
+	struct nd_region *nd_region = to_nd_region(dev->parent);
 	ssize_t rc;
 
 	if (!is_namespace_blk(dev))
@@ -914,8 +971,11 @@ static ssize_t sector_size_store(struct device *dev,
 	nd_bus_lock(dev);
 	rc = nd_sector_size_store(dev, buf, &nsblk->lbasize,
 			ns_lbasize_supported);
-	dev_dbg(dev, "%s: result: %zd wrote: %s%s", __func__,
-			rc, buf, buf[len - 1] == '\n' ? "" : "\n");
+	if (rc >= 0)
+		rc = nd_namespace_label_update(nd_region, dev);
+	dev_dbg(dev, "%s: result: %zd %s: %s%s", __func__,
+			rc, rc < 0 ? "tried" : "wrote", buf,
+			buf[len - 1] == '\n' ? "" : "\n");
 	nd_bus_unlock(dev);
 	device_unlock(dev);
 
diff --git a/drivers/block/nd/nd.h b/drivers/block/nd/nd.h
index 3876d0c7db87..24a440a23b2c 100644
--- a/drivers/block/nd/nd.h
+++ b/drivers/block/nd/nd.h
@@ -110,6 +110,7 @@ static inline unsigned nd_inc_seq(unsigned seq)
 
 	return next[seq & 3];
 }
+
 enum nd_async_mode {
 	ND_SYNC,
 	ND_ASYNC,
@@ -128,6 +129,8 @@ struct nd_dimm;
 struct nd_dimm_drvdata *to_ndd(struct nd_mapping *nd_mapping);
 int nd_dimm_init_nsarea(struct nd_dimm_drvdata *ndd);
 int nd_dimm_init_config_data(struct nd_dimm_drvdata *ndd);
+int nd_dimm_set_config_data(struct nd_dimm_drvdata *ndd, size_t offset,
+		void *buf, size_t len);
 struct nd_region *to_nd_region(struct device *dev);
 int nd_region_to_namespace_type(struct nd_region *nd_region);
 int nd_region_register_namespaces(struct nd_region *nd_region, int *err);
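
For reference, the ND_NSINDEX_INIT branch of nd_label_write_index() above
marks every real slot free and clears the padding bits up to the next word
boundary.  A minimal userspace sketch of that layout follows.  The names
WORD_BITS, align_up, free_bitmap_init and free_bitmap_count are hypothetical,
WORD_BITS assumes a 64-bit BITS_PER_LONG, and plain byte operations stand in
for the kernel's little-endian bit helpers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define WORD_BITS 64u /* assumption: stands in for BITS_PER_LONG */

/* Round n up to the next multiple of a. */
static uint32_t align_up(uint32_t n, uint32_t a)
{
	return (n + a - 1) / a * a;
}

/*
 * Little-endian bit order: bit n lives in byte n / 8 at position n % 8.
 * A set bit means "slot is free"; padding bits past nslot are cleared so
 * they can never be handed out as slots.
 */
static void free_bitmap_init(uint8_t *bitmap, uint32_t nslot)
{
	uint32_t nfree = align_up(nslot, WORD_BITS);
	uint32_t bit;

	assert(bitmap != NULL);
	memset(bitmap, 0xff, nfree / 8);
	for (bit = nslot; bit < nfree; bit++)
		bitmap[bit / 8] &= (uint8_t)~(1u << (bit % 8));
}

/* Count the free slots among the first nslot bits. */
static uint32_t free_bitmap_count(const uint8_t *bitmap, uint32_t nslot)
{
	uint32_t bit, n = 0;

	for (bit = 0; bit < nslot; bit++)
		if (bitmap[bit / 8] & (1u << (bit % 8)))
			n++;
	return n;
}
```

With this convention, allocating a slot clears its bit and freeing a slot
sets it, which matches nd_label_alloc_slot() and nd_label_free_slot() in the
patch.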



@@ -129,4 +130,9 @@ size_t sizeof_namespace_index(struct nd_dimm_drvdata *ndd);
 int nd_label_active_count(struct nd_dimm_drvdata *ndd);
 struct nd_namespace_label __iomem *nd_label_active(
 		struct nd_dimm_drvdata *ndd, int n);
+u32 nd_label_nfree(struct nd_dimm_drvdata *ndd);
+struct nd_region;
+struct nd_namespace_pmem;
+int nd_pmem_namespace_label_update(struct nd_region *nd_region,
+		struct nd_namespace_pmem *nspm, resource_size_t size);
 #endif /* __LABEL_H__ */
diff --git a/drivers/block/nd/namespace_devs.c b/drivers/block/nd/namespace_devs.c
index d06b8abf6744..cdb78dddcfa9 100644
--- a/drivers/block/nd/namespace_devs.c
+++ b/drivers/block/nd/namespace_devs.c
@@ -151,20 +151,52 @@ static resource_size_t nd_namespace_blk_size(struct nd_namespace_blk *nsblk)
 	return size;
 }
 
+static int nd_namespace_label_update(struct nd_region *nd_region, struct device *dev)
+{
+	dev_WARN_ONCE(dev, dev->driver,
+			"namespace must be idle during label update\n");
+	if (dev->driver)
+		return 0;
+
+	/*
+	 * Only allow label writes that will result in a valid namespace
+	 * or deletion of an existing namespace.
+	 */
+	if (is_namespace_pmem(dev)) {
+		struct nd_namespace_pmem *nspm = to_nd_namespace_pmem(dev);
+		struct resource *res = &nspm->nsio.res;
+		resource_size_t size = resource_size(res);
+
+		if (size == 0 && nspm->uuid)
+			/* delete allocation */;
+		else if (!nspm->uuid)
+			return 0;
+
+		return nd_pmem_namespace_label_update(nd_region, nspm, size);
+	} else if (is_namespace_blk(dev)) {
+		/* TODO: implement blk labels */
+		return 0;
+	} else
+		return -ENXIO;
+}
+
 static ssize_t alt_name_store(struct device *dev,
 		struct device_attribute *attr, const char *buf, size_t len)
 {
+	struct nd_region *nd_region = to_nd_region(dev->parent);
 	ssize_t rc;
 
 	device_lock(dev);
 	nd_bus_lock(dev);
 	wait_nd_bus_probe_idle(dev);
 	rc = __alt_name_store(dev, buf, len);
+	if (rc >= 0)
+		rc = nd_namespace_label_update(nd_region, dev);
 	dev_dbg(dev, "%s: %s (%zd)\n", __func__, rc < 0 ? "fail" : "success", rc);
 	nd_bus_unlock(dev);
 	device_unlock(dev);
 
-	return rc;
+	return rc < 0 ? rc : len;
 }
 
 static ssize_t alt_name_show(struct device *dev,
@@ -707,6 +739,7 @@ static ssize_t __size_store(struct device *dev, unsigned long long val)
 static ssize_t size_store(struct device *dev,
 		struct device_attribute *attr, const char *buf, size_t len)
 {
+	struct nd_region *nd_region = to_nd_region(dev->parent);
 	unsigned long long val;
 	u8 **uuid = NULL;
 	int rc;
@@ -719,6 +752,8 @@ static ssize_t size_store(struct device *dev,
 	nd_bus_lock(dev);
 	wait_nd_bus_probe_idle(dev);
 	rc = __size_store(dev, val);
+	if (rc >= 0)
+		rc = nd_namespace_label_update(nd_region, dev);
 
 	if (is_namespace_pmem(dev)) {
 		struct nd_namespace_pmem *nspm = to_nd_namespace_pmem(dev);
@@ -742,7 +777,7 @@ static ssize_t size_store(struct device *dev,
 	nd_bus_unlock(dev);
 	device_unlock(dev);
 
-	return rc ? rc : len;
+	return rc < 0 ? rc : len;
 }
 
 static ssize_t size_show(struct device *dev,
@@ -802,17 +837,34 @@ static int namespace_update_uuid(struct nd_region *nd_region,
 	u32 flags = is_namespace_blk(dev) ? NSLABEL_FLAG_LOCAL : 0;
 	struct nd_label_id old_label_id;
 	struct nd_label_id new_label_id;
-	int i, rc;
+	int i;
 
-	rc = nd_is_uuid_unique(dev, new_uuid) ? 0 : -EINVAL;
-	if (rc) {
-		kfree(new_uuid);
-		return rc;
-	}
+	if (!nd_is_uuid_unique(dev, new_uuid))
+		return -EINVAL;
 
 	if (*old_uuid == NULL)
 		goto out;
 
+	/*
+	 * If we've already written a label with this uuid, then it's
+	 * too late to rename because we can't reliably update the uuid
+	 * without losing the old namespace.  Userspace must delete this
+	 * namespace to abandon the old uuid.
+	 */
+	for (i = 0; i < nd_region->ndr_mappings; i++) {
+		struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+
+		/*
+		 * This check by itself is sufficient because old_uuid
+		 * would be NULL above if this uuid did not exist in the
+		 * currently written set.
+		 *
+		 * FIXME: can we delete uuid with zero dpa allocated?
+		 */
+		if (nd_mapping->labels)
+			return -EBUSY;
+	}
+
 	nd_label_gen_id(&old_label_id, *old_uuid, flags);
 	nd_label_gen_id(&new_label_id, new_uuid, flags);
 	for (i = 0; i < nd_region->ndr_mappings; i++) {
@@ -856,12 +908,16 @@ static ssize_t uuid_store(struct device *dev,
 	rc = nd_uuid_store(dev, &uuid, buf, len);
 	if (rc >= 0)
 		rc = namespace_update_uuid(nd_region, dev, uuid, ns_uuid);
+	if (rc >= 0)
+		rc = nd_namespace_label_update(nd_region, dev);
+	else
+		kfree(uuid);
 	dev_dbg(dev, "%s: result: %zd wrote: %s%s", __func__,
 			rc, buf, buf[len - 1] == '\n' ? "" : "\n");
 	nd_bus_unlock(dev);
 	device_unlock(dev);
 
-	return rc ? rc : len;
+	return rc < 0 ? rc : len;
 }
 static DEVICE_ATTR_RW(uuid);
 
@@ -905,6 +961,7 @@ static ssize_t sector_size_store(struct device *dev,
 		struct device_attribute *attr, const char *buf, size_t len)
 {
 	struct nd_namespace_blk *nsblk = to_nd_namespace_blk(dev);
+	struct nd_region *nd_region = to_nd_region(dev->parent);
 	ssize_t rc;
 
 	if (!is_namespace_blk(dev))
@@ -914,8 +971,11 @@ static ssize_t sector_size_store(struct device *dev,
 	nd_bus_lock(dev);
 	rc = nd_sector_size_store(dev, buf, &nsblk->lbasize,
 			ns_lbasize_supported);
-	dev_dbg(dev, "%s: result: %zd wrote: %s%s", __func__,
-			rc, buf, buf[len - 1] == '\n' ? "" : "\n");
+	if (rc >= 0)
+		rc = nd_namespace_label_update(nd_region, dev);
+	dev_dbg(dev, "%s: result: %zd %s: %s%s", __func__,
+			rc, rc < 0 ? "tried" : "wrote", buf,
+			buf[len - 1] == '\n' ? "" : "\n");
 	nd_bus_unlock(dev);
 	device_unlock(dev);
 
diff --git a/drivers/block/nd/nd.h b/drivers/block/nd/nd.h
index 3876d0c7db87..24a440a23b2c 100644
--- a/drivers/block/nd/nd.h
+++ b/drivers/block/nd/nd.h
@@ -110,6 +110,7 @@ static inline unsigned nd_inc_seq(unsigned seq)
 
 	return next[seq & 3];
 }
+
 enum nd_async_mode {
 	ND_SYNC,
 	ND_ASYNC,
@@ -128,6 +129,8 @@ struct nd_dimm;
 struct nd_dimm_drvdata *to_ndd(struct nd_mapping *nd_mapping);
 int nd_dimm_init_nsarea(struct nd_dimm_drvdata *ndd);
 int nd_dimm_init_config_data(struct nd_dimm_drvdata *ndd);
+int nd_dimm_set_config_data(struct nd_dimm_drvdata *ndd, size_t offset,
+		void *buf, size_t len);
 struct nd_region *to_nd_region(struct device *dev);
 int nd_region_to_namespace_type(struct nd_region *nd_region);
 int nd_region_register_namespaces(struct nd_region *nd_region, int *err);



* [PATCH v3 16/21] libnd: write blk label set
  2015-05-20 20:56 ` Dan Williams
@ 2015-05-20 20:57   ` Dan Williams
  -1 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2015-05-20 20:57 UTC (permalink / raw)
  To: axboe
  Cc: linux-nvdimm, neilb, gregkh, linux-kernel, mingo, linux-acpi,
	jmoyer, hch

After 'uuid', 'size', 'sector_size', and optionally 'alt_name' have been
set to valid values, the labels on the dimm can be updated.  Unlike the
pmem case, a blk namespace is limited to one dimm and can cover
discontiguous ranges in dpa (dimm physical address) space, so it is
written as one label per range.
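
Because each discontiguous dpa range needs its own label, an update has
to confirm there is enough label space before it touches any tracking
state.  The following is a minimal userspace sketch of the admission
check made in __blk_label_update() in the diff below.  The helper name
blk_label_space_ok() is hypothetical and is not part of the patch; only
the arithmetic is taken from it.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the label-space check in __blk_label_update():
 *   nfree   - free label slots in the staging (next) index
 *   alloc   - resources that are new and each need a fresh label
 *   victims - stale labels that are freed once the update is written
 * Returns true when the update may proceed, false where the patch
 * would return -ENOSPC.
 */
static bool blk_label_space_ok(int nfree, int alloc, int victims)
{
	assert(nfree >= 0 && alloc >= 0 && victims >= 0);

	/* every new label must fit in the slots that are free right now */
	if (nfree - alloc < 0)
		return false;

	/* never let an update consume the last label slot */
	return nfree - alloc + victims >= 1;
}
```

Note that victims do not help the first test: new labels are written
before the old ones are released, so they must fit in the currently
free slots.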

Also, after allocating label slots, it is useful for userspace to know
how many slots are left.  Export this information in sysfs via the
per-dimm 'available_slots' attribute.
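
As a rough illustration of what the exported value means, the sketch
below models the arithmetic in available_slots_show() from the diff
below: the attribute reports one slot fewer than the raw free count,
and an unsigned underflow test clamps the result to zero.  The
standalone function and its example routine are assumptions made for
illustration; the reason for holding one slot back is presumably the
"don't allow updates that consume the last label" rule stated elsewhere
in the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical userspace model of the value reported by the
 * 'available_slots' attribute: the raw free count minus one reserved
 * slot, clamped at zero.
 */
static uint32_t available_slots(uint32_t nfree)
{
	/* nfree - 1 wraps to UINT32_MAX when nfree is 0 */
	if (nfree - 1 > nfree)
		return 0;
	return nfree - 1;
}

/* Usage example covering the boundary cases. */
static void available_slots_example(void)
{
	assert(available_slots(0) == 0);	/* underflow is clamped */
	assert(available_slots(1) == 0);	/* only the reserved slot is left */
	assert(available_slots(8) == 7);
}
```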

Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/block/nd/bus.c            |    4 
 drivers/block/nd/dimm_devs.c      |   25 +++
 drivers/block/nd/label.c          |  297 +++++++++++++++++++++++++++++++++++--
 drivers/block/nd/label.h          |    5 +
 drivers/block/nd/namespace_devs.c |   57 +++++++
 drivers/block/nd/nd-private.h     |    1 
 6 files changed, 367 insertions(+), 22 deletions(-)

diff --git a/drivers/block/nd/bus.c b/drivers/block/nd/bus.c
index 65af6bcc5472..4a2185a99bd7 100644
--- a/drivers/block/nd/bus.c
+++ b/drivers/block/nd/bus.c
@@ -136,6 +136,10 @@ static void nd_async_device_unregister(void *d, async_cookie_t cookie)
 {
 	struct device *dev = d;
 
+	/* flush bus operations before delete */
+	nd_bus_lock(dev);
+	nd_bus_unlock(dev);
+
 	device_unregister(dev);
 	put_device(dev);
 }
diff --git a/drivers/block/nd/dimm_devs.c b/drivers/block/nd/dimm_devs.c
index 358b2a06d680..4b225c8b7d0a 100644
--- a/drivers/block/nd/dimm_devs.c
+++ b/drivers/block/nd/dimm_devs.c
@@ -19,6 +19,7 @@
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include "nd-private.h"
+#include "label.h"
 #include "nd.h"
 
 static DEFINE_IDA(dimm_ida);
@@ -262,9 +263,33 @@ static ssize_t state_show(struct device *dev, struct device_attribute *attr,
 }
 static DEVICE_ATTR_RO(state);
 
+static ssize_t available_slots_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_dimm_drvdata *ndd = dev_get_drvdata(dev);
+	ssize_t rc;
+	u32 nfree;
+
+	if (!ndd)
+		return -ENXIO;
+
+	nd_bus_lock(dev);
+	nfree = nd_label_nfree(ndd);
+	if (nfree - 1 > nfree) {
+		dev_WARN_ONCE(dev, 1, "we ate our last label?\n");
+		nfree = 0;
+	} else
+		nfree--;
+	rc = sprintf(buf, "%d\n", nfree);
+	nd_bus_unlock(dev);
+	return rc;
+}
+static DEVICE_ATTR_RO(available_slots);
+
 static struct attribute *nd_dimm_attributes[] = {
 	&dev_attr_state.attr,
 	&dev_attr_commands.attr,
+	&dev_attr_available_slots.attr,
 	NULL,
 };
 
diff --git a/drivers/block/nd/label.c b/drivers/block/nd/label.c
index a4746f1fe99c..5052db591bec 100644
--- a/drivers/block/nd/label.c
+++ b/drivers/block/nd/label.c
@@ -58,7 +58,7 @@ size_t sizeof_namespace_index(struct nd_dimm_drvdata *ndd)
 	return ndd->nsindex_size;
 }
 
-static int nd_dimm_num_label_slots(struct nd_dimm_drvdata *ndd)
+int nd_dimm_num_label_slots(struct nd_dimm_drvdata *ndd)
 {
 	return ndd->nsarea.config_size / 129;
 }
@@ -412,7 +412,7 @@ u32 nd_label_nfree(struct nd_dimm_drvdata *ndd)
 	WARN_ON(!is_nd_bus_locked(ndd->dev));
 
 	if (!preamble_next(ndd, &nsindex, &free, &nslot))
-		return 0;
+		return nd_dimm_num_label_slots(ndd);
 
 	return bitmap_weight(free, nslot);
 }
@@ -550,22 +550,270 @@ static int __pmem_label_update(struct nd_region *nd_region,
 	return 0;
 }
 
-static int init_labels(struct nd_mapping *nd_mapping)
+static void del_label(struct nd_mapping *nd_mapping, int l)
+{
+	struct nd_namespace_label __iomem *next_label, __iomem *nd_label;
+	struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+	unsigned int slot;
+	int j;
+
+	nd_label = nd_get_label(nd_mapping->labels, l);
+	slot = to_slot(ndd, nd_label);
+	dev_vdbg(ndd->dev, "%s: clear: %d\n", __func__, slot);
+
+	for (j = l; (next_label = nd_get_label(nd_mapping->labels, j + 1)); j++)
+		nd_set_label(nd_mapping->labels, next_label, j);
+	nd_set_label(nd_mapping->labels, NULL, j);
+}
+
+static bool is_old_resource(struct resource *res, struct resource **list, int n)
 {
 	int i;
+
+	if (res->flags & DPA_RESOURCE_ADJUSTED)
+		return false;
+	for (i = 0; i < n; i++)
+		if (res == list[i])
+			return true;
+	return false;
+}
+
+static struct resource *to_resource(struct nd_dimm_drvdata *ndd,
+		struct nd_namespace_label __iomem *nd_label)
+{
+	struct resource *res;
+
+	for_each_dpa_resource(ndd, res) {
+		if (res->start != readq(&nd_label->dpa))
+			continue;
+		if (resource_size(res) != readq(&nd_label->rawsize))
+			continue;
+		return res;
+	}
+
+	return NULL;
+}
+
+/*
+ * 1/ Account all the labels that can be freed after this update
+ * 2/ Allocate and write the label to the staging (next) index
+ * 3/ Record the resources in the namespace device
+ */
+static int __blk_label_update(struct nd_region *nd_region,
+		struct nd_mapping *nd_mapping, struct nd_namespace_blk *nsblk,
+		int num_labels)
+{
+	int i, l, alloc, victims, nfree, old_num_resources, nlabel, rc = -ENXIO;
+	struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+	struct nd_namespace_label __iomem *nd_label;
+	struct nd_namespace_index __iomem *nsindex;
+	unsigned long *free, *victim_map = NULL;
+	struct resource *res, **old_res_list;
+	struct nd_label_id label_id;
+	u8 uuid[NSLABEL_UUID_LEN];
+	u32 nslot, slot;
+
+	if (!preamble_next(ndd, &nsindex, &free, &nslot))
+		return -ENXIO;
+
+	old_res_list = nsblk->res;
+	nfree = nd_label_nfree(ndd);
+	old_num_resources = nsblk->num_resources;
+	nd_label_gen_id(&label_id, nsblk->uuid, NSLABEL_FLAG_LOCAL);
+
+	/*
+	 * We need to loop over the old resources a few times, which seems a
+	 * bit inefficient, but we need to know that we have the label
+	 * space before we start mutating the tracking structures.
+	 * Otherwise the recovery method of last resort for userspace is to
+	 * disable and re-enable the parent region.
+	 */
+	alloc = 0;
+	for_each_dpa_resource(ndd, res) {
+		if (strcmp(res->name, label_id.id) != 0)
+			continue;
+		if (!is_old_resource(res, old_res_list, old_num_resources))
+			alloc++;
+	}
+
+	victims = 0;
+	if (old_num_resources) {
+		/* convert old local-label-map to dimm-slot victim-map */
+		victim_map = kcalloc(BITS_TO_LONGS(nslot), sizeof(long),
+				GFP_KERNEL);
+		if (!victim_map)
+			return -ENOMEM;
+
+		/* mark unused labels for garbage collection */
+		for_each_clear_bit_le(slot, free, nslot) {
+			nd_label = nd_label_base(ndd) + slot;
+			memcpy_fromio(uuid, nd_label->uuid, NSLABEL_UUID_LEN);
+			if (memcmp(uuid, nsblk->uuid, NSLABEL_UUID_LEN) != 0)
+				continue;
+			res = to_resource(ndd, nd_label);
+			if (res && is_old_resource(res, old_res_list,
+						old_num_resources))
+				continue;
+			slot = to_slot(ndd, nd_label);
+			set_bit(slot, victim_map);
+			victims++;
+		}
+	}
+
+	/* don't allow updates that consume the last label */
+	if (nfree - alloc < 0 || nfree - alloc + victims < 1) {
+		dev_info(&nsblk->dev, "insufficient label space\n");
+		kfree(victim_map);
+		return -ENOSPC;
+	}
+	/* from here on we need to abort on error */
+
+
+	/* assign all resources to the namespace before writing the labels */
+	nsblk->res = NULL;
+	nsblk->num_resources = 0;
+	for_each_dpa_resource(ndd, res) {
+		if (strcmp(res->name, label_id.id) != 0)
+			continue;
+		if (!nsblk_add_resource(nd_region, ndd, nsblk, res->start)) {
+			rc = -ENOMEM;
+			goto abort;
+		}
+	}
+
+	for (i = 0; i < nsblk->num_resources; i++) {
+		size_t offset;
+
+		res = nsblk->res[i];
+		if (is_old_resource(res, old_res_list, old_num_resources))
+			continue; /* carry-over */
+		slot = nd_label_alloc_slot(ndd);
+		if (slot == UINT_MAX)
+			goto abort;
+		dev_dbg(ndd->dev, "%s: allocated: %d\n", __func__, slot);
+
+		nd_label = nd_label_base(ndd) + slot;
+		memset_io(nd_label, 0, sizeof(struct nd_namespace_label));
+		memcpy_toio(nd_label->uuid, nsblk->uuid, NSLABEL_UUID_LEN);
+		if (nsblk->alt_name)
+			memcpy_toio(nd_label->name, nsblk->alt_name,
+					NSLABEL_NAME_LEN);
+		writel(NSLABEL_FLAG_LOCAL, &nd_label->flags);
+		writew(0, &nd_label->nlabel); /* N/A */
+		writew(0, &nd_label->position); /* N/A */
+		writeq(0, &nd_label->isetcookie); /* N/A */
+		writeq(res->start, &nd_label->dpa);
+		writeq(resource_size(res), &nd_label->rawsize);
+		writeq(nsblk->lbasize, &nd_label->lbasize);
+		writel(slot, &nd_label->slot);
+
+		/* update label */
+		offset = nd_label_offset(ndd, nd_label);
+		rc = nd_dimm_set_config_data(ndd, offset, nd_label,
+				sizeof(struct nd_namespace_label));
+		if (rc < 0)
+			goto abort;
+	}
+
+	/* free up now unused slots in the new index */
+	for_each_set_bit(slot, victim_map, victim_map ? nslot : 0) {
+		dev_dbg(ndd->dev, "%s: free: %d\n", __func__, slot);
+		nd_label_free_slot(ndd, slot);
+	}
+
+	/* update index */
+	rc = nd_label_write_index(ndd, ndd->ns_next,
+			nd_inc_seq(readl(&nsindex->seq)), 0);
+	if (rc)
+		goto abort;
+
+	/*
+	 * Now that the on-dimm labels are up to date, fix up the tracking
+	 * entries in nd_mapping->labels
+	 */
+	nlabel = 0;
+	for_each_label(l, nd_label, nd_mapping->labels) {
+		nlabel++;
+		memcpy_fromio(uuid, nd_label->uuid, NSLABEL_UUID_LEN);
+		if (memcmp(uuid, nsblk->uuid, NSLABEL_UUID_LEN) != 0)
+			continue;
+		nlabel--;
+		del_label(nd_mapping, l);
+		l--; /* retry with the new label at this index */
+	}
+	if (nlabel + nsblk->num_resources > num_labels) {
+		/*
+		 * Bug, we can't end up with more resources than
+		 * available labels
+		 */
+		WARN_ON_ONCE(1);
+		rc = -ENXIO;
+		goto out;
+	}
+
+	for_each_clear_bit_le(slot, free, nslot) {
+		nd_label = nd_label_base(ndd) + slot;
+		memcpy_fromio(uuid, nd_label->uuid, NSLABEL_UUID_LEN);
+		if (memcmp(uuid, nsblk->uuid, NSLABEL_UUID_LEN) != 0)
+			continue;
+		res = to_resource(ndd, nd_label);
+		res->flags &= ~DPA_RESOURCE_ADJUSTED;
+		dev_vdbg(&nsblk->dev, "assign label[%d] slot: %d\n", l, slot);
+		nd_set_label(nd_mapping->labels, nd_label, l++);
+	}
+	nd_set_label(nd_mapping->labels, NULL, l);
+
+ out:
+	kfree(old_res_list);
+	kfree(victim_map);
+	return rc;
+
+ abort:
+	/*
+	 * 1/ repair the allocated label bitmap in the index
+	 * 2/ restore the resource list
+	 */
+	nd_label_copy(ndd, nsindex, to_current_namespace_index(ndd));
+	kfree(nsblk->res);
+	nsblk->res = old_res_list;
+	nsblk->num_resources = old_num_resources;
+	old_res_list = NULL;
+	goto out;
+}
+
+static int init_labels(struct nd_mapping *nd_mapping, int num_labels)
+{
+	int i, l, old_num_labels = 0;
 	struct nd_namespace_index __iomem *nsindex;
+	struct nd_namespace_label __iomem *nd_label;
 	struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+	size_t size = (num_labels + 1) * sizeof(struct nd_namespace_label *);
 
-	if (!nd_mapping->labels)
-		nd_mapping->labels = kcalloc(2, sizeof(void *), GFP_KERNEL);
+	for_each_label(l, nd_label, nd_mapping->labels)
+		old_num_labels++;
 
+	/*
+	 * We need to preserve all the old labels for the mapping so
+	 * they can be garbage collected after writing the new labels.
+	 */
+	if (num_labels > old_num_labels) {
+		struct nd_namespace_label **labels;
+
+		labels = krealloc(nd_mapping->labels, size, GFP_KERNEL);
+		if (!labels)
+			return -ENOMEM;
+		nd_mapping->labels = labels;
+	}
 	if (!nd_mapping->labels)
 		return -ENOMEM;
 
+	for (i = old_num_labels; i <= num_labels; i++)
+		nd_set_label(nd_mapping->labels, NULL, i);
+
 	if (ndd->ns_current == -1 || ndd->ns_next == -1)
 		/* pass */;
 	else
-		return 0;
+		return max(num_labels, old_num_labels);
 
 	nsindex = to_namespace_index(ndd, 0);
 	memset_io(nsindex, 0, ndd->nsarea.config_size);
@@ -578,7 +826,7 @@ static int init_labels(struct nd_mapping *nd_mapping)
 	ndd->ns_next = 1;
 	ndd->ns_current = 0;
 
-	return 0;
+	return max(num_labels, old_num_labels);
 }
 
 static int del_labels(struct nd_mapping *nd_mapping, u8 *uuid)
@@ -600,22 +848,15 @@ static int del_labels(struct nd_mapping *nd_mapping, u8 *uuid)
 		return 0;
 
 	for_each_label(l, nd_label, nd_mapping->labels) {
-		int j;
-
 		memcpy_fromio(label_uuid, nd_label->uuid, NSLABEL_UUID_LEN);
 		if (memcmp(label_uuid, uuid, NSLABEL_UUID_LEN) != 0)
 			continue;
 		slot = to_slot(ndd, nd_label);
 		nd_label_free_slot(ndd, slot);
 		dev_dbg(ndd->dev, "%s: free: %d\n", __func__, slot);
-		for (j = l; nd_get_label(nd_mapping->labels, j + 1); j++) {
-			struct nd_namespace_label __iomem *next_label;
-
-			next_label = nd_get_label(nd_mapping->labels, j + 1);
-			nd_set_label(nd_mapping->labels, next_label, j);
-		}
-		nd_set_label(nd_mapping->labels, NULL, j);
+		del_label(nd_mapping, l);
 		num_freed++;
+		l--; /* retry with new label at this index */
 	}
 
 	if (num_freed > l) {
@@ -648,8 +889,8 @@ int nd_pmem_namespace_label_update(struct nd_region *nd_region,
 			continue;
 		}
 
-		rc = init_labels(nd_mapping);
-		if (rc)
+		rc = init_labels(nd_mapping, 1);
+		if (rc < 0)
 			return rc;
 
 		rc = __pmem_label_update(nd_region, nd_mapping, nspm, i);
@@ -659,3 +900,23 @@ int nd_pmem_namespace_label_update(struct nd_region *nd_region,
 
 	return 0;
 }
+
+int nd_blk_namespace_label_update(struct nd_region *nd_region,
+		struct nd_namespace_blk *nsblk, resource_size_t size)
+{
+	struct nd_mapping *nd_mapping = &nd_region->mapping[0];
+	struct resource *res;
+	int count = 0;
+
+	if (size == 0)
+		return del_labels(nd_mapping, nsblk->uuid);
+
+	for_each_dpa_resource(to_ndd(nd_mapping), res)
+		count++;
+
+	count = init_labels(nd_mapping, count);
+	if (count < 0)
+		return count;
+
+	return __blk_label_update(nd_region, nd_mapping, nsblk, count);
+}
diff --git a/drivers/block/nd/label.h b/drivers/block/nd/label.h
index e17958941e34..a26cebc9f389 100644
--- a/drivers/block/nd/label.h
+++ b/drivers/block/nd/label.h
@@ -130,9 +130,14 @@ size_t sizeof_namespace_index(struct nd_dimm_drvdata *ndd);
 int nd_label_active_count(struct nd_dimm_drvdata *ndd);
 struct nd_namespace_label __iomem *nd_label_active(
 		struct nd_dimm_drvdata *ndd, int n);
+u32 nd_label_alloc_slot(struct nd_dimm_drvdata *ndd);
+bool nd_label_free_slot(struct nd_dimm_drvdata *ndd, u32 slot);
 u32 nd_label_nfree(struct nd_dimm_drvdata *ndd);
 struct nd_region;
 struct nd_namespace_pmem;
+struct nd_namespace_blk;
 int nd_pmem_namespace_label_update(struct nd_region *nd_region,
 		struct nd_namespace_pmem *nspm, resource_size_t size);
+int nd_blk_namespace_label_update(struct nd_region *nd_region,
+		struct nd_namespace_blk *nsblk, resource_size_t size);
 #endif /* __LABEL_H__ */
diff --git a/drivers/block/nd/namespace_devs.c b/drivers/block/nd/namespace_devs.c
index cdb78dddcfa9..c193ba6c6445 100644
--- a/drivers/block/nd/namespace_devs.c
+++ b/drivers/block/nd/namespace_devs.c
@@ -164,8 +164,7 @@ static int nd_namespace_label_update(struct nd_region *nd_region, struct device
 	 */
 	if (is_namespace_pmem(dev)) {
 		struct nd_namespace_pmem *nspm = to_nd_namespace_pmem(dev);
-		struct resource *res = &nspm->nsio.res;
-		resource_size_t size = resource_size(res);
+		resource_size_t size = resource_size(&nspm->nsio.res);
 
 		if (size == 0 && nspm->uuid)
 			/* delete allocation */;
@@ -174,8 +173,15 @@ static int nd_namespace_label_update(struct nd_region *nd_region, struct device
 
 		return nd_pmem_namespace_label_update(nd_region, nspm, size);
 	} else if (is_namespace_blk(dev)) {
-		/* TODO: implement blk labels */
-		return 0;
+		struct nd_namespace_blk *nsblk = to_nd_namespace_blk(dev);
+		resource_size_t size = nd_namespace_blk_size(nsblk);
+
+		if (size == 0 && nsblk->uuid)
+			/* delete allocation */;
+		else if (!nsblk->uuid || !nsblk->lbasize)
+			return 0;
+
+		return nd_blk_namespace_label_update(nd_region, nsblk, size);
 	} else
 		return -ENXIO;
 }
@@ -983,6 +989,48 @@ static ssize_t sector_size_store(struct device *dev,
 }
 static DEVICE_ATTR_RW(sector_size);
 
+static ssize_t dpa_extents_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_region *nd_region = to_nd_region(dev->parent);
+	struct nd_label_id label_id;
+	int count = 0, i;
+	u8 *uuid = NULL;
+	u32 flags = 0;
+
+	nd_bus_lock(dev);
+	if (is_namespace_pmem(dev)) {
+		struct nd_namespace_pmem *nspm = to_nd_namespace_pmem(dev);
+
+		uuid = nspm->uuid;
+		flags = 0;
+	} else if (is_namespace_blk(dev)) {
+		struct nd_namespace_blk *nsblk = to_nd_namespace_blk(dev);
+
+		uuid = nsblk->uuid;
+		flags = NSLABEL_FLAG_LOCAL;
+	}
+
+	if (!uuid)
+		goto out;
+
+	nd_label_gen_id(&label_id, uuid, flags);
+	for (i = 0; i < nd_region->ndr_mappings; i++) {
+		struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+		struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+		struct resource *res;
+
+		for_each_dpa_resource(ndd, res)
+			if (strcmp(res->name, label_id.id) == 0)
+				count++;
+	}
+ out:
+	nd_bus_unlock(dev);
+
+	return sprintf(buf, "%d\n", count);
+}
+static DEVICE_ATTR_RO(dpa_extents);
+
 static struct attribute *nd_namespace_attributes[] = {
 	&dev_attr_nstype.attr,
 	&dev_attr_size.attr,
@@ -990,6 +1038,7 @@ static struct attribute *nd_namespace_attributes[] = {
 	&dev_attr_resource.attr,
 	&dev_attr_alt_name.attr,
 	&dev_attr_sector_size.attr,
+	&dev_attr_dpa_extents.attr,
 	NULL,
 };
 
diff --git a/drivers/block/nd/nd-private.h b/drivers/block/nd/nd-private.h
index fe852175a3b8..fffd65436e2b 100644
--- a/drivers/block/nd/nd-private.h
+++ b/drivers/block/nd/nd-private.h
@@ -79,4 +79,5 @@ struct nd_mapping;
 struct resource *nsblk_add_resource(struct nd_region *nd_region,
 		struct nd_dimm_drvdata *ndd, struct nd_namespace_blk *nsblk,
 		resource_size_t start);
+int nd_dimm_num_label_slots(struct nd_dimm_drvdata *ndd);
 #endif /* __ND_PRIVATE_H__ */



* [PATCH v3 16/21] libnd: write blk label set
@ 2015-05-20 20:57   ` Dan Williams
  0 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2015-05-20 20:57 UTC (permalink / raw)
  To: axboe
  Cc: linux-nvdimm, neilb, gregkh, linux-kernel, mingo, linux-acpi,
	jmoyer, hch

After 'uuid', 'size', 'sector_size', and optionally 'alt_name' have been
set to valid values, the labels on the dimm can be updated.  Unlike the
pmem case, a blk namespace is limited to one dimm and can cover
discontiguous ranges in dpa (dimm physical address) space, so it is
written as one label per range.

Also, after allocating label slots, it is useful for userspace to know
how many slots are left.  Export this information in sysfs via the
per-dimm 'available_slots' attribute.

Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/block/nd/bus.c            |    4 
 drivers/block/nd/dimm_devs.c      |   25 +++
 drivers/block/nd/label.c          |  297 +++++++++++++++++++++++++++++++++++--
 drivers/block/nd/label.h          |    5 +
 drivers/block/nd/namespace_devs.c |   57 +++++++
 drivers/block/nd/nd-private.h     |    1 
 6 files changed, 367 insertions(+), 22 deletions(-)

diff --git a/drivers/block/nd/bus.c b/drivers/block/nd/bus.c
index 65af6bcc5472..4a2185a99bd7 100644
--- a/drivers/block/nd/bus.c
+++ b/drivers/block/nd/bus.c
@@ -136,6 +136,10 @@ static void nd_async_device_unregister(void *d, async_cookie_t cookie)
 {
 	struct device *dev = d;
 
+	/* flush bus operations before delete */
+	nd_bus_lock(dev);
+	nd_bus_unlock(dev);
+
 	device_unregister(dev);
 	put_device(dev);
 }
diff --git a/drivers/block/nd/dimm_devs.c b/drivers/block/nd/dimm_devs.c
index 358b2a06d680..4b225c8b7d0a 100644
--- a/drivers/block/nd/dimm_devs.c
+++ b/drivers/block/nd/dimm_devs.c
@@ -19,6 +19,7 @@
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include "nd-private.h"
+#include "label.h"
 #include "nd.h"
 
 static DEFINE_IDA(dimm_ida);
@@ -262,9 +263,33 @@ static ssize_t state_show(struct device *dev, struct device_attribute *attr,
 }
 static DEVICE_ATTR_RO(state);
 
+static ssize_t available_slots_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_dimm_drvdata *ndd = dev_get_drvdata(dev);
+	ssize_t rc;
+	u32 nfree;
+
+	if (!ndd)
+		return -ENXIO;
+
+	nd_bus_lock(dev);
+	nfree = nd_label_nfree(ndd);
+	if (nfree - 1 > nfree) {
+		dev_WARN_ONCE(dev, 1, "we ate our last label?\n");
+		nfree = 0;
+	} else
+		nfree--;
+	rc = sprintf(buf, "%d\n", nfree);
+	nd_bus_unlock(dev);
+	return rc;
+}
+static DEVICE_ATTR_RO(available_slots);
+
 static struct attribute *nd_dimm_attributes[] = {
 	&dev_attr_state.attr,
 	&dev_attr_commands.attr,
+	&dev_attr_available_slots.attr,
 	NULL,
 };
 
diff --git a/drivers/block/nd/label.c b/drivers/block/nd/label.c
index a4746f1fe99c..5052db591bec 100644
--- a/drivers/block/nd/label.c
+++ b/drivers/block/nd/label.c
@@ -58,7 +58,7 @@ size_t sizeof_namespace_index(struct nd_dimm_drvdata *ndd)
 	return ndd->nsindex_size;
 }
 
-static int nd_dimm_num_label_slots(struct nd_dimm_drvdata *ndd)
+int nd_dimm_num_label_slots(struct nd_dimm_drvdata *ndd)
 {
 	return ndd->nsarea.config_size / 129;
 }
@@ -412,7 +412,7 @@ u32 nd_label_nfree(struct nd_dimm_drvdata *ndd)
 	WARN_ON(!is_nd_bus_locked(ndd->dev));
 
 	if (!preamble_next(ndd, &nsindex, &free, &nslot))
-		return 0;
+		return nd_dimm_num_label_slots(ndd);
 
 	return bitmap_weight(free, nslot);
 }
@@ -550,22 +550,270 @@ static int __pmem_label_update(struct nd_region *nd_region,
 	return 0;
 }
 
-static int init_labels(struct nd_mapping *nd_mapping)
+static void del_label(struct nd_mapping *nd_mapping, int l)
+{
+	struct nd_namespace_label __iomem *next_label, __iomem *nd_label;
+	struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+	unsigned int slot;
+	int j;
+
+	nd_label = nd_get_label(nd_mapping->labels, l);
+	slot = to_slot(ndd, nd_label);
+	dev_vdbg(ndd->dev, "%s: clear: %d\n", __func__, slot);
+
+	for (j = l; (next_label = nd_get_label(nd_mapping->labels, j + 1)); j++)
+		nd_set_label(nd_mapping->labels, next_label, j);
+	nd_set_label(nd_mapping->labels, NULL, j);
+}
+
+static bool is_old_resource(struct resource *res, struct resource **list, int n)
 {
 	int i;
+
+	if (res->flags & DPA_RESOURCE_ADJUSTED)
+		return false;
+	for (i = 0; i < n; i++)
+		if (res == list[i])
+			return true;
+	return false;
+}
+
+static struct resource *to_resource(struct nd_dimm_drvdata *ndd,
+		struct nd_namespace_label __iomem *nd_label)
+{
+	struct resource *res;
+
+	for_each_dpa_resource(ndd, res) {
+		if (res->start != readq(&nd_label->dpa))
+			continue;
+		if (resource_size(res) != readq(&nd_label->rawsize))
+			continue;
+		return res;
+	}
+
+	return NULL;
+}
+
+/*
+ * 1/ Account all the labels that can be freed after this update
+ * 2/ Allocate and write the label to the staging (next) index
+ * 3/ Record the resources in the namespace device
+ */
+static int __blk_label_update(struct nd_region *nd_region,
+		struct nd_mapping *nd_mapping, struct nd_namespace_blk *nsblk,
+		int num_labels)
+{
+	int i, l, alloc, victims, nfree, old_num_resources, nlabel, rc = -ENXIO;
+	struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+	struct nd_namespace_label __iomem *nd_label;
+	struct nd_namespace_index __iomem *nsindex;
+	unsigned long *free, *victim_map = NULL;
+	struct resource *res, **old_res_list;
+	struct nd_label_id label_id;
+	u8 uuid[NSLABEL_UUID_LEN];
+	u32 nslot, slot;
+
+	if (!preamble_next(ndd, &nsindex, &free, &nslot))
+		return -ENXIO;
+
+	old_res_list = nsblk->res;
+	nfree = nd_label_nfree(ndd);
+	old_num_resources = nsblk->num_resources;
+	nd_label_gen_id(&label_id, nsblk->uuid, NSLABEL_FLAG_LOCAL);
+
+	/*
+	 * We need to loop over the old resources a few times, which seems a
+	 * bit inefficient, but we need to know that we have the label
+	 * space before we start mutating the tracking structures.
+	 * Otherwise the recovery method of last resort for userspace is
+	 * to disable and re-enable the parent region.
+	 */
+	alloc = 0;
+	for_each_dpa_resource(ndd, res) {
+		if (strcmp(res->name, label_id.id) != 0)
+			continue;
+		if (!is_old_resource(res, old_res_list, old_num_resources))
+			alloc++;
+	}
+
+	victims = 0;
+	if (old_num_resources) {
+		/* convert old local-label-map to dimm-slot victim-map */
+		victim_map = kcalloc(BITS_TO_LONGS(nslot), sizeof(long),
+				GFP_KERNEL);
+		if (!victim_map)
+			return -ENOMEM;
+
+		/* mark unused labels for garbage collection */
+		for_each_clear_bit_le(slot, free, nslot) {
+			nd_label = nd_label_base(ndd) + slot;
+			memcpy_fromio(uuid, nd_label->uuid, NSLABEL_UUID_LEN);
+			if (memcmp(uuid, nsblk->uuid, NSLABEL_UUID_LEN) != 0)
+				continue;
+			res = to_resource(ndd, nd_label);
+			if (res && is_old_resource(res, old_res_list,
+						old_num_resources))
+				continue;
+			slot = to_slot(ndd, nd_label);
+			set_bit(slot, victim_map);
+			victims++;
+		}
+	}
+
+	/* don't allow updates that consume the last label */
+	if (nfree - alloc < 0 || nfree - alloc + victims < 1) {
+		dev_info(&nsblk->dev, "insufficient label space\n");
+		kfree(victim_map);
+		return -ENOSPC;
+	}
+	/* from here on we need to abort on error */
+
+
+	/* assign all resources to the namespace before writing the labels */
+	nsblk->res = NULL;
+	nsblk->num_resources = 0;
+	for_each_dpa_resource(ndd, res) {
+		if (strcmp(res->name, label_id.id) != 0)
+			continue;
+		if (!nsblk_add_resource(nd_region, ndd, nsblk, res->start)) {
+			rc = -ENOMEM;
+			goto abort;
+		}
+	}
+
+	for (i = 0; i < nsblk->num_resources; i++) {
+		size_t offset;
+
+		res = nsblk->res[i];
+		if (is_old_resource(res, old_res_list, old_num_resources))
+			continue; /* carry-over */
+		slot = nd_label_alloc_slot(ndd);
+		if (slot == UINT_MAX)
+			goto abort;
+		dev_dbg(ndd->dev, "%s: allocated: %d\n", __func__, slot);
+
+		nd_label = nd_label_base(ndd) + slot;
+		memset_io(nd_label, 0, sizeof(struct nd_namespace_label));
+		memcpy_toio(nd_label->uuid, nsblk->uuid, NSLABEL_UUID_LEN);
+		if (nsblk->alt_name)
+			memcpy_toio(nd_label->name, nsblk->alt_name,
+					NSLABEL_NAME_LEN);
+		writel(NSLABEL_FLAG_LOCAL, &nd_label->flags);
+		writew(0, &nd_label->nlabel); /* N/A */
+		writew(0, &nd_label->position); /* N/A */
+		writeq(0, &nd_label->isetcookie); /* N/A */
+		writeq(res->start, &nd_label->dpa);
+		writeq(resource_size(res), &nd_label->rawsize);
+		writeq(nsblk->lbasize, &nd_label->lbasize);
+		writel(slot, &nd_label->slot);
+
+		/* update label */
+		offset = nd_label_offset(ndd, nd_label);
+		rc = nd_dimm_set_config_data(ndd, offset, nd_label,
+				sizeof(struct nd_namespace_label));
+		if (rc < 0)
+			goto abort;
+	}
+
+	/* free up now unused slots in the new index */
+	for_each_set_bit(slot, victim_map, victim_map ? nslot : 0) {
+		dev_dbg(ndd->dev, "%s: free: %d\n", __func__, slot);
+		nd_label_free_slot(ndd, slot);
+	}
+
+	/* update index */
+	rc = nd_label_write_index(ndd, ndd->ns_next,
+			nd_inc_seq(readl(&nsindex->seq)), 0);
+	if (rc)
+		goto abort;
+
+	/*
+	 * Now that the on-dimm labels are up to date, fix up the tracking
+	 * entries in nd_mapping->labels
+	 */
+	nlabel = 0;
+	for_each_label(l, nd_label, nd_mapping->labels) {
+		nlabel++;
+		memcpy_fromio(uuid, nd_label->uuid, NSLABEL_UUID_LEN);
+		if (memcmp(uuid, nsblk->uuid, NSLABEL_UUID_LEN) != 0)
+			continue;
+		nlabel--;
+		del_label(nd_mapping, l);
+		l--; /* retry with the new label at this index */
+	}
+	if (nlabel + nsblk->num_resources > num_labels) {
+		/*
+		 * Bug, we can't end up with more resources than
+		 * available labels
+		 */
+		WARN_ON_ONCE(1);
+		rc = -ENXIO;
+		goto out;
+	}
+
+	for_each_clear_bit_le(slot, free, nslot) {
+		nd_label = nd_label_base(ndd) + slot;
+		memcpy_fromio(uuid, nd_label->uuid, NSLABEL_UUID_LEN);
+		if (memcmp(uuid, nsblk->uuid, NSLABEL_UUID_LEN) != 0)
+			continue;
+		res = to_resource(ndd, nd_label);
+		res->flags &= ~DPA_RESOURCE_ADJUSTED;
+		dev_vdbg(&nsblk->dev, "assign label[%d] slot: %d\n", l, slot);
+		nd_set_label(nd_mapping->labels, nd_label, l++);
+	}
+	nd_set_label(nd_mapping->labels, NULL, l);
+
+ out:
+	kfree(old_res_list);
+	kfree(victim_map);
+	return rc;
+
+ abort:
+	/*
+	 * 1/ repair the allocated label bitmap in the index
+	 * 2/ restore the resource list
+	 */
+	nd_label_copy(ndd, nsindex, to_current_namespace_index(ndd));
+	kfree(nsblk->res);
+	nsblk->res = old_res_list;
+	nsblk->num_resources = old_num_resources;
+	old_res_list = NULL;
+	goto out;
+}
+
+static int init_labels(struct nd_mapping *nd_mapping, int num_labels)
+{
+	int i, l, old_num_labels = 0;
 	struct nd_namespace_index __iomem *nsindex;
+	struct nd_namespace_label __iomem *nd_label;
 	struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+	size_t size = (num_labels + 1) * sizeof(struct nd_namespace_label *);
 
-	if (!nd_mapping->labels)
-		nd_mapping->labels = kcalloc(2, sizeof(void *), GFP_KERNEL);
+	for_each_label(l, nd_label, nd_mapping->labels)
+		old_num_labels++;
 
+	/*
+	 * We need to preserve all the old labels for the mapping so
+	 * they can be garbage collected after writing the new labels.
+	 */
+	if (num_labels > old_num_labels) {
+		struct nd_namespace_label **labels;
+
+		labels = krealloc(nd_mapping->labels, size, GFP_KERNEL);
+		if (!labels)
+			return -ENOMEM;
+		nd_mapping->labels = labels;
+	}
 	if (!nd_mapping->labels)
 		return -ENOMEM;
 
+	for (i = old_num_labels; i <= num_labels; i++)
+		nd_set_label(nd_mapping->labels, NULL, i);
+
 	if (ndd->ns_current == -1 || ndd->ns_next == -1)
 		/* pass */;
 	else
-		return 0;
+		return max(num_labels, old_num_labels);
 
 	nsindex = to_namespace_index(ndd, 0);
 	memset_io(nsindex, 0, ndd->nsarea.config_size);
@@ -578,7 +826,7 @@ static int init_labels(struct nd_mapping *nd_mapping)
 	ndd->ns_next = 1;
 	ndd->ns_current = 0;
 
-	return 0;
+	return max(num_labels, old_num_labels);
 }
 
 static int del_labels(struct nd_mapping *nd_mapping, u8 *uuid)
@@ -600,22 +848,15 @@ static int del_labels(struct nd_mapping *nd_mapping, u8 *uuid)
 		return 0;
 
 	for_each_label(l, nd_label, nd_mapping->labels) {
-		int j;
-
 		memcpy_fromio(label_uuid, nd_label->uuid, NSLABEL_UUID_LEN);
 		if (memcmp(label_uuid, uuid, NSLABEL_UUID_LEN) != 0)
 			continue;
 		slot = to_slot(ndd, nd_label);
 		nd_label_free_slot(ndd, slot);
 		dev_dbg(ndd->dev, "%s: free: %d\n", __func__, slot);
-		for (j = l; nd_get_label(nd_mapping->labels, j + 1); j++) {
-			struct nd_namespace_label __iomem *next_label;
-
-			next_label = nd_get_label(nd_mapping->labels, j + 1);
-			nd_set_label(nd_mapping->labels, next_label, j);
-		}
-		nd_set_label(nd_mapping->labels, NULL, j);
+		del_label(nd_mapping, l);
 		num_freed++;
+		l--; /* retry with new label at this index */
 	}
 
 	if (num_freed > l) {
@@ -648,8 +889,8 @@ int nd_pmem_namespace_label_update(struct nd_region *nd_region,
 			continue;
 		}
 
-		rc = init_labels(nd_mapping);
-		if (rc)
+		rc = init_labels(nd_mapping, 1);
+		if (rc < 0)
 			return rc;
 
 		rc = __pmem_label_update(nd_region, nd_mapping, nspm, i);
@@ -659,3 +900,23 @@ int nd_pmem_namespace_label_update(struct nd_region *nd_region,
 
 	return 0;
 }
+
+int nd_blk_namespace_label_update(struct nd_region *nd_region,
+		struct nd_namespace_blk *nsblk, resource_size_t size)
+{
+	struct nd_mapping *nd_mapping = &nd_region->mapping[0];
+	struct resource *res;
+	int count = 0;
+
+	if (size == 0)
+		return del_labels(nd_mapping, nsblk->uuid);
+
+	for_each_dpa_resource(to_ndd(nd_mapping), res)
+		count++;
+
+	count = init_labels(nd_mapping, count);
+	if (count < 0)
+		return count;
+
+	return __blk_label_update(nd_region, nd_mapping, nsblk, count);
+}
diff --git a/drivers/block/nd/label.h b/drivers/block/nd/label.h
index e17958941e34..a26cebc9f389 100644
--- a/drivers/block/nd/label.h
+++ b/drivers/block/nd/label.h
@@ -130,9 +130,14 @@ size_t sizeof_namespace_index(struct nd_dimm_drvdata *ndd);
 int nd_label_active_count(struct nd_dimm_drvdata *ndd);
 struct nd_namespace_label __iomem *nd_label_active(
 		struct nd_dimm_drvdata *ndd, int n);
+u32 nd_label_alloc_slot(struct nd_dimm_drvdata *ndd);
+bool nd_label_free_slot(struct nd_dimm_drvdata *ndd, u32 slot);
 u32 nd_label_nfree(struct nd_dimm_drvdata *ndd);
 struct nd_region;
 struct nd_namespace_pmem;
+struct nd_namespace_blk;
 int nd_pmem_namespace_label_update(struct nd_region *nd_region,
 		struct nd_namespace_pmem *nspm, resource_size_t size);
+int nd_blk_namespace_label_update(struct nd_region *nd_region,
+		struct nd_namespace_blk *nsblk, resource_size_t size);
 #endif /* __LABEL_H__ */
diff --git a/drivers/block/nd/namespace_devs.c b/drivers/block/nd/namespace_devs.c
index cdb78dddcfa9..c193ba6c6445 100644
--- a/drivers/block/nd/namespace_devs.c
+++ b/drivers/block/nd/namespace_devs.c
@@ -164,8 +164,7 @@ static int nd_namespace_label_update(struct nd_region *nd_region, struct device
 	 */
 	if (is_namespace_pmem(dev)) {
 		struct nd_namespace_pmem *nspm = to_nd_namespace_pmem(dev);
-		struct resource *res = &nspm->nsio.res;
-		resource_size_t size = resource_size(res);
+		resource_size_t size = resource_size(&nspm->nsio.res);
 
 		if (size == 0 && nspm->uuid)
 			/* delete allocation */;
@@ -174,8 +173,15 @@ static int nd_namespace_label_update(struct nd_region *nd_region, struct device
 
 		return nd_pmem_namespace_label_update(nd_region, nspm, size);
 	} else if (is_namespace_blk(dev)) {
-		/* TODO: implement blk labels */
-		return 0;
+		struct nd_namespace_blk *nsblk = to_nd_namespace_blk(dev);
+		resource_size_t size = nd_namespace_blk_size(nsblk);
+
+		if (size == 0 && nsblk->uuid)
+			/* delete allocation */;
+		else if (!nsblk->uuid || !nsblk->lbasize)
+			return 0;
+
+		return nd_blk_namespace_label_update(nd_region, nsblk, size);
 	} else
 		return -ENXIO;
 }
@@ -983,6 +989,48 @@ static ssize_t sector_size_store(struct device *dev,
 }
 static DEVICE_ATTR_RW(sector_size);
 
+static ssize_t dpa_extents_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_region *nd_region = to_nd_region(dev->parent);
+	struct nd_label_id label_id;
+	int count = 0, i;
+	u8 *uuid = NULL;
+	u32 flags = 0;
+
+	nd_bus_lock(dev);
+	if (is_namespace_pmem(dev)) {
+		struct nd_namespace_pmem *nspm = to_nd_namespace_pmem(dev);
+
+		uuid = nspm->uuid;
+		flags = 0;
+	} else if (is_namespace_blk(dev)) {
+		struct nd_namespace_blk *nsblk = to_nd_namespace_blk(dev);
+
+		uuid = nsblk->uuid;
+		flags = NSLABEL_FLAG_LOCAL;
+	}
+
+	if (!uuid)
+		goto out;
+
+	nd_label_gen_id(&label_id, uuid, flags);
+	for (i = 0; i < nd_region->ndr_mappings; i++) {
+		struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+		struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+		struct resource *res;
+
+		for_each_dpa_resource(ndd, res)
+			if (strcmp(res->name, label_id.id) == 0)
+				count++;
+	}
+ out:
+	nd_bus_unlock(dev);
+
+	return sprintf(buf, "%d\n", count);
+}
+static DEVICE_ATTR_RO(dpa_extents);
+
 static struct attribute *nd_namespace_attributes[] = {
 	&dev_attr_nstype.attr,
 	&dev_attr_size.attr,
@@ -990,6 +1038,7 @@ static struct attribute *nd_namespace_attributes[] = {
 	&dev_attr_resource.attr,
 	&dev_attr_alt_name.attr,
 	&dev_attr_sector_size.attr,
+	&dev_attr_dpa_extents.attr,
 	NULL,
 };
 
diff --git a/drivers/block/nd/nd-private.h b/drivers/block/nd/nd-private.h
index fe852175a3b8..fffd65436e2b 100644
--- a/drivers/block/nd/nd-private.h
+++ b/drivers/block/nd/nd-private.h
@@ -79,4 +79,5 @@ struct nd_mapping;
 struct resource *nsblk_add_resource(struct nd_region *nd_region,
 		struct nd_dimm_drvdata *ndd, struct nd_namespace_blk *nsblk,
 		resource_size_t start);
+int nd_dimm_num_label_slots(struct nd_dimm_drvdata *ndd);
 #endif /* __ND_PRIVATE_H__ */



* [PATCH v3 17/21] libnd: infrastructure for btt devices
  2015-05-20 20:56 ` Dan Williams
@ 2015-05-20 20:57   ` Dan Williams
  -1 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2015-05-20 20:57 UTC (permalink / raw)
  To: axboe
  Cc: linux-nvdimm, neilb, gregkh, linux-kernel, mingo, linux-acpi,
	jmoyer, hch

Block devices from an nd bus, in addition to accepting "struct bio"
based requests, also have the capability to perform byte-aligned
accesses.  By default only the bio/block interface is used.  However, if
another driver can make effective use of the byte-aligned capability it
can claim/disable the block interface and use the byte-aligned "nd_io"
interface.

The BTT driver is the first consumer of this mechanism to allow
layering atomic sector update guarantees on top of nd_io capable
libnd-block-devices, or their partitions.
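
As a rough illustration of the claim model described above, here is a
minimal userspace sketch in C.  It is not the code in this patch: the
demo_* names, the singly linked claim list, and the flat media buffer
are all assumptions made for the example.  It only shows the shape of
the idea: a provider exposes a byte-granular rw_bytes() hook, a consumer
registers a claim with a notify_remove() callback, and tearing down the
provider notifies every claim holder.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct demo_ndio;

/* Hypothetical byte-aligned accessor, loosely modeled on ->rw_bytes(). */
typedef int (*demo_rw_bytes_fn)(struct demo_ndio *ndio, void *buf,
		size_t offset, size_t n, int write);

struct demo_claim {
	struct demo_claim *next;
	void *holder;
	void (*notify_remove)(struct demo_claim *claim);
};

struct demo_ndio {
	unsigned char *media;	/* stand-in for the persistent media */
	size_t size;
	demo_rw_bytes_fn rw_bytes;
	struct demo_claim *claims;
};

/* Hypothetical consumer state, standing in for a btt-like holder. */
struct demo_holder {
	int backing_attached;
};

/* Copy n bytes at an arbitrary byte offset; reject out-of-range access. */
static int demo_rw_bytes(struct demo_ndio *ndio, void *buf, size_t offset,
		size_t n, int write)
{
	if (offset > ndio->size || n > ndio->size - offset)
		return -1;
	if (write)
		memcpy(ndio->media + offset, buf, n);
	else
		memcpy(buf, ndio->media + offset, n);
	return 0;
}

/* Record that 'holder' now depends on this ndio. */
static void demo_add_claim(struct demo_ndio *ndio, struct demo_claim *claim,
		void *holder, void (*notify_remove)(struct demo_claim *claim))
{
	assert(ndio && claim && notify_remove);
	claim->holder = holder;
	claim->notify_remove = notify_remove;
	claim->next = ndio->claims;
	ndio->claims = claim;
}

/* Detach the claim list first, then tell each holder to let go. */
static void demo_unregister(struct demo_ndio *ndio)
{
	struct demo_claim *claim = ndio->claims;

	ndio->claims = NULL;
	while (claim) {
		struct demo_claim *next = claim->next;

		claim->notify_remove(claim);
		claim = next;
	}
}

/* The holder reacts to removal by dropping its backing device. */
static void demo_holder_notify_remove(struct demo_claim *claim)
{
	struct demo_holder *holder = claim->holder;

	holder->backing_attached = 0;
}
```

In this sketch a holder would call demo_add_claim() once it decides to
use the byte-aligned path, and the provider would call demo_unregister()
on teardown.  Detaching the list before walking it loosely mirrors the
splice-then-notify pattern in __nd_unregister_ndio() in this patch.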

Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/block/nd/Kconfig      |    3 
 drivers/block/nd/Makefile     |    1 
 drivers/block/nd/btt.h        |   45 ++++
 drivers/block/nd/btt_devs.c   |  442 +++++++++++++++++++++++++++++++++++++++++
 drivers/block/nd/bus.c        |  128 ++++++++++++
 drivers/block/nd/core.c       |   79 +++++++
 drivers/block/nd/nd-private.h |   28 +++
 drivers/block/nd/nd.h         |   94 +++++++++
 drivers/block/nd/pmem.c       |   29 +++
 include/uapi/linux/ndctl.h    |    2 
 10 files changed, 847 insertions(+), 4 deletions(-)
 create mode 100644 drivers/block/nd/btt.h
 create mode 100644 drivers/block/nd/btt_devs.c
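
Note for reviewers: the comment on nd_btt_sb_checksum() in this patch
says the info block checksum covers everything except the checksum
field itself.  The following userspace sketch in C shows that
"zero the field, sum, restore the field" pattern on a much smaller,
hypothetical structure.  The demo_* names and the 32-byte layout are
assumptions for illustration, the Fletcher-style loop is my own
stand-in rather than a copy of nd_fletcher64(), and the stored checksum
is kept in host byte order here for simplicity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, much smaller stand-in for struct btt_sb. */
struct demo_sb {
	uint8_t signature[16];
	uint32_t flags;
	uint32_t external_lbasize;
	uint64_t checksum;	/* last field, excluded from the sum */
};

/* Fletcher-style 64-bit sum over little-endian 32-bit words. */
static uint64_t demo_fletcher64(const void *addr, size_t len)
{
	const uint8_t *p = addr;
	uint32_t lo32 = 0;
	uint64_t hi32 = 0;
	size_t i;

	assert(addr);
	for (i = 0; i + 4 <= len; i += 4) {
		uint32_t w = (uint32_t)p[i] | (uint32_t)p[i + 1] << 8 |
			(uint32_t)p[i + 2] << 16 | (uint32_t)p[i + 3] << 24;

		lo32 += w;	/* running sum of words, wraps at 32 bits */
		hi32 += lo32;	/* running sum of the running sums */
	}
	return hi32 << 32 | lo32;
}

/* Sum the block with the checksum field zeroed, then restore it. */
static uint64_t demo_sb_checksum(struct demo_sb *sb)
{
	uint64_t sum, sum_save = sb->checksum;

	sb->checksum = 0;
	sum = demo_fletcher64(sb, sizeof(*sb));
	sb->checksum = sum_save;
	return sum;
}

/* A block is valid when the stored value matches the recomputed one. */
static bool demo_sb_valid(struct demo_sb *sb)
{
	return sb->checksum == demo_sb_checksum(sb);
}
```

Because the field is zeroed before summing, the result does not depend
on whatever value is currently stored there, which is what lets the
same helper serve both when writing a new info block and when
validating one read back from media.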

diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
index 03f572f0e3d0..00d9afe9475e 100644
--- a/drivers/block/nd/Kconfig
+++ b/drivers/block/nd/Kconfig
@@ -34,4 +34,7 @@ config BLK_DEV_PMEM
 
 	  Say Y if you want to use a NVDIMM described by NFIT
 
+config ND_BTT_DEVS
+	def_bool y
+
 endif
diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index 8d14510559e1..9866669d7738 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -11,3 +11,4 @@ libnd-y += region_devs.o
 libnd-y += region.o
 libnd-y += namespace_devs.o
 libnd-y += label.o
+libnd-$(CONFIG_ND_BTT_DEVS) += btt_devs.o
diff --git a/drivers/block/nd/btt.h b/drivers/block/nd/btt.h
new file mode 100644
index 000000000000..e8f6d8e0ddd3
--- /dev/null
+++ b/drivers/block/nd/btt.h
@@ -0,0 +1,45 @@
+/*
+ * Block Translation Table library
+ * Copyright (c) 2014-2015, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef _LINUX_BTT_H
+#define _LINUX_BTT_H
+
+#include <linux/types.h>
+
+#define BTT_SIG_LEN 16
+#define BTT_SIG "BTT_ARENA_INFO\0"
+
+struct btt_sb {
+	u8 signature[BTT_SIG_LEN];
+	u8 uuid[16];
+	u8 parent_uuid[16];
+	__le32 flags;
+	__le16 version_major;
+	__le16 version_minor;
+	__le32 external_lbasize;
+	__le32 external_nlba;
+	__le32 internal_lbasize;
+	__le32 internal_nlba;
+	__le32 nfree;
+	__le32 infosize;
+	__le64 nextoff;
+	__le64 dataoff;
+	__le64 mapoff;
+	__le64 logoff;
+	__le64 info2off;
+	u8 padding[3968];
+	__le64 checksum;
+};
+
+#endif
diff --git a/drivers/block/nd/btt_devs.c b/drivers/block/nd/btt_devs.c
new file mode 100644
index 000000000000..b3b813288092
--- /dev/null
+++ b/drivers/block/nd/btt_devs.c
@@ -0,0 +1,442 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/device.h>
+#include <linux/genhd.h>
+#include <linux/sizes.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include "nd-private.h"
+#include "btt.h"
+#include "nd.h"
+
+static DEFINE_IDA(btt_ida);
+
+static void nd_btt_release(struct device *dev)
+{
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+
+	dev_dbg(dev, "%s\n", __func__);
+	WARN_ON(nd_btt->backing_dev);
+	ndio_del_claim(nd_btt->ndio_claim);
+	ida_simple_remove(&btt_ida, nd_btt->id);
+	kfree(nd_btt->uuid);
+	kfree(nd_btt);
+}
+
+static struct device_type nd_btt_device_type = {
+	.name = "nd_btt",
+	.release = nd_btt_release,
+};
+
+bool is_nd_btt(struct device *dev)
+{
+	return dev->type == &nd_btt_device_type;
+}
+
+struct nd_btt *to_nd_btt(struct device *dev)
+{
+	struct nd_btt *nd_btt = container_of(dev, struct nd_btt, dev);
+
+	WARN_ON(!is_nd_btt(dev));
+	return nd_btt;
+}
+EXPORT_SYMBOL(to_nd_btt);
+
+static const unsigned long btt_lbasize_supported[] = { 512, 4096, 0 };
+
+static ssize_t sector_size_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+
+	return nd_sector_size_show(nd_btt->lbasize, btt_lbasize_supported, buf);
+}
+
+static ssize_t sector_size_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+	ssize_t rc;
+
+	device_lock(dev);
+	nd_bus_lock(dev);
+	rc = nd_sector_size_store(dev, buf, &nd_btt->lbasize,
+			btt_lbasize_supported);
+	dev_dbg(dev, "%s: result: %zd wrote: %s%s", __func__,
+			rc, buf, buf[len - 1] == '\n' ? "" : "\n");
+	nd_bus_unlock(dev);
+	device_unlock(dev);
+
+	return rc ? rc : len;
+}
+static DEVICE_ATTR_RW(sector_size);
+
+static ssize_t uuid_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+
+	if (nd_btt->uuid)
+		return sprintf(buf, "%pUb\n", nd_btt->uuid);
+	return sprintf(buf, "\n");
+}
+
+static ssize_t uuid_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+	ssize_t rc;
+
+	device_lock(dev);
+	rc = nd_uuid_store(dev, &nd_btt->uuid, buf, len);
+	dev_dbg(dev, "%s: result: %zd wrote: %s%s", __func__,
+			rc, buf, buf[len - 1] == '\n' ? "" : "\n");
+	device_unlock(dev);
+
+	return rc ? rc : len;
+}
+static DEVICE_ATTR_RW(uuid);
+
+static ssize_t backing_dev_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+	char name[BDEVNAME_SIZE];
+
+	if (nd_btt->backing_dev)
+		return sprintf(buf, "/dev/%s\n",
+				bdevname(nd_btt->backing_dev, name));
+	else
+		return sprintf(buf, "\n");
+}
+
+static const fmode_t nd_btt_devs_mode = FMODE_READ | FMODE_WRITE | FMODE_EXCL;
+
+static void nd_btt_ndio_notify_remove(struct nd_io_claim *ndio_claim)
+{
+	char bdev_name[BDEVNAME_SIZE];
+	struct nd_btt *nd_btt;
+
+	if (!ndio_claim || !ndio_claim->holder)
+		return;
+
+	nd_btt = to_nd_btt(ndio_claim->holder);
+	WARN_ON_ONCE(!is_nd_bus_locked(&nd_btt->dev));
+	dev_dbg(&nd_btt->dev, "%pf: %s: release /dev/%s\n",
+			__builtin_return_address(0), __func__,
+			bdevname(nd_btt->backing_dev, bdev_name));
+	blkdev_put(nd_btt->backing_dev, nd_btt_devs_mode);
+	nd_btt->backing_dev = NULL;
+
+	/*
+	 * Once we've had our backing device removed we need to be fully
+	 * reconfigured.  The bus will have already created a new seed
+	 * for this purpose, so now is a good time to clean up this
+	 * stale nd_btt instance.
+	 */
+	if (nd_btt->dev.driver)
+		nd_device_unregister(&nd_btt->dev, ND_ASYNC);
+	else {
+		ndio_del_claim(ndio_claim);
+		nd_btt->ndio_claim = NULL;
+	}
+}
+
+static ssize_t __backing_dev_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	struct nd_bus *nd_bus = walk_to_nd_bus(dev);
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+	char bdev_name[BDEVNAME_SIZE];
+	struct block_device *bdev;
+	struct nd_io *ndio;
+	char *path;
+
+	if (dev->driver) {
+		dev_dbg(dev, "%s: -EBUSY\n", __func__);
+		return -EBUSY;
+	}
+
+	path = kstrndup(buf, len, GFP_KERNEL);
+	if (!path)
+		return -ENOMEM;
+
+	/* detach the backing device */
+	if (strcmp(strim(path), "") == 0) {
+		if (!nd_btt->backing_dev)
+			goto out;
+		nd_btt_ndio_notify_remove(nd_btt->ndio_claim);
+		goto out;
+	} else if (nd_btt->backing_dev) {
+		dev_dbg(dev, "backing_dev already set\n");
+		len = -EBUSY;
+		goto out;
+	}
+
+	bdev = blkdev_get_by_path(strim(path), nd_btt_devs_mode, nd_btt);
+	if (IS_ERR(bdev)) {
+		dev_dbg(dev, "open '%s' failed: %ld\n", strim(path),
+				PTR_ERR(bdev));
+		len = PTR_ERR(bdev);
+		goto out;
+	}
+
+	if (get_capacity(bdev->bd_disk) < SZ_16M / 512) {
+		blkdev_put(bdev, nd_btt_devs_mode);
+		len = -ENXIO;
+		goto out;
+	}
+
+	ndio = ndio_lookup(nd_bus, bdevname(bdev->bd_contains, bdev_name));
+	if (!ndio) {
+		dev_dbg(dev, "%s does not have an ndio interface\n",
+				strim(path));
+		blkdev_put(bdev, nd_btt_devs_mode);
+		len = -ENXIO;
+		goto out;
+	}
+
+	nd_btt->ndio_claim = ndio_add_claim(ndio, &nd_btt->dev,
+			nd_btt_ndio_notify_remove);
+	if (!nd_btt->ndio_claim) {
+		blkdev_put(bdev, nd_btt_devs_mode);
+		len = -ENOMEM;
+		goto out;
+	}
+
+	WARN_ON_ONCE(!is_nd_bus_locked(&nd_btt->dev));
+	nd_btt->backing_dev = bdev;
+
+ out:
+	kfree(path);
+	return len;
+}
+
+static ssize_t backing_dev_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	ssize_t rc;
+
+	nd_bus_lock(dev);
+	device_lock(dev);
+	rc = __backing_dev_store(dev, attr, buf, len);
+	dev_dbg(dev, "%s: result: %zd wrote: %s%s", __func__,
+			rc, buf, buf[len - 1] == '\n' ? "" : "\n");
+	device_unlock(dev);
+	nd_bus_unlock(dev);
+
+	return rc;
+}
+static DEVICE_ATTR_RW(backing_dev);
+
+static bool is_nd_btt_idle(struct device *dev)
+{
+	struct nd_bus *nd_bus = walk_to_nd_bus(dev);
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+
+	if (nd_bus->nd_btt == nd_btt || dev->driver || nd_btt->backing_dev)
+		return false;
+	return true;
+}
+
+static ssize_t delete_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	/* return 1 if this btt can be deleted */
+	return sprintf(buf, "%d\n", is_nd_btt_idle(dev));
+}
+
+static ssize_t delete_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	unsigned long val;
+
+	/* write 1 to delete */
+	if (kstrtoul(buf, 0, &val) != 0 || val != 1)
+		return -EINVAL;
+
+	/* prevent deletion while this btt is active, or is the current seed */
+	if (!is_nd_btt_idle(dev))
+		return -EBUSY;
+
+	/*
+	 * userspace raced itself if device goes active here and it gets
+	 * to keep the pieces
+	 */
+	nd_device_unregister(dev, ND_ASYNC);
+
+	return len;
+}
+static DEVICE_ATTR_RW(delete);
+
+static struct attribute *nd_btt_attributes[] = {
+	&dev_attr_sector_size.attr,
+	&dev_attr_backing_dev.attr,
+	&dev_attr_delete.attr,
+	&dev_attr_uuid.attr,
+	NULL,
+};
+
+static struct attribute_group nd_btt_attribute_group = {
+	.attrs = nd_btt_attributes,
+};
+
+static const struct attribute_group *nd_btt_attribute_groups[] = {
+	&nd_btt_attribute_group,
+	&nd_device_attribute_group,
+	NULL,
+};
+
+static struct nd_btt *__nd_btt_create(struct nd_bus *nd_bus,
+		unsigned long lbasize, u8 *uuid)
+{
+	struct nd_btt *nd_btt = kzalloc(sizeof(*nd_btt), GFP_KERNEL);
+	struct device *dev;
+
+	if (!nd_btt)
+		return NULL;
+	nd_btt->id = ida_simple_get(&btt_ida, 0, 0, GFP_KERNEL);
+	if (nd_btt->id < 0) {
+		kfree(nd_btt);
+		return NULL;
+	}
+
+	nd_btt->lbasize = lbasize;
+	if (uuid)
+		uuid = kmemdup(uuid, 16, GFP_KERNEL);
+	nd_btt->uuid = uuid;
+	dev = &nd_btt->dev;
+	dev_set_name(dev, "btt%d", nd_btt->id);
+	dev->parent = &nd_bus->dev;
+	dev->type = &nd_btt_device_type;
+	dev->groups = nd_btt_attribute_groups;
+	return nd_btt;
+}
+
+struct nd_btt *nd_btt_create(struct nd_bus *nd_bus)
+{
+	struct nd_btt *nd_btt = __nd_btt_create(nd_bus, 0, NULL);
+
+	if (!nd_btt)
+		return NULL;
+	nd_device_register(&nd_btt->dev);
+	return nd_btt;
+}
+
+/*
+ * nd_btt_sb_checksum: compute checksum for btt info block
+ *
+ * Returns a fletcher64 checksum of everything in the given info block
+ * except the last field (since that's where the checksum lives).
+ */
+u64 nd_btt_sb_checksum(struct btt_sb *btt_sb)
+{
+	u64 sum, sum_save;
+
+	sum_save = btt_sb->checksum;
+	btt_sb->checksum = 0;
+	sum = nd_fletcher64(btt_sb, sizeof(*btt_sb), 1);
+	btt_sb->checksum = sum_save;
+	return sum;
+}
+EXPORT_SYMBOL(nd_btt_sb_checksum);
+
+static int nd_btt_autodetect(struct nd_bus *nd_bus, struct nd_io *ndio,
+		struct block_device *bdev)
+{
+	char name[BDEVNAME_SIZE];
+	struct nd_btt *nd_btt;
+	struct btt_sb *btt_sb;
+	u64 offset, checksum;
+	u32 lbasize;
+	u8 *uuid;
+	int rc;
+
+	btt_sb = kzalloc(sizeof(*btt_sb), GFP_KERNEL);
+	if (!btt_sb)
+		return -ENODEV;
+
+	offset = nd_partition_offset(bdev);
+	rc = ndio->rw_bytes(ndio, btt_sb, offset + SZ_4K, sizeof(*btt_sb), READ);
+	if (rc)
+		goto out_free_sb;
+
+	if (get_capacity(bdev->bd_disk) < SZ_16M / 512)
+		goto out_free_sb;
+
+	if (memcmp(btt_sb->signature, BTT_SIG, BTT_SIG_LEN) != 0)
+		goto out_free_sb;
+
+	checksum = le64_to_cpu(btt_sb->checksum);
+	btt_sb->checksum = 0;
+	if (checksum != nd_btt_sb_checksum(btt_sb))
+		goto out_free_sb;
+	btt_sb->checksum = cpu_to_le64(checksum);
+
+	uuid = kmemdup(btt_sb->uuid, 16, GFP_KERNEL);
+	if (!uuid)
+		goto out_free_sb;
+
+	lbasize = le32_to_cpu(btt_sb->external_lbasize);
+	nd_btt = __nd_btt_create(nd_bus, lbasize, uuid);
+	if (!nd_btt)
+		goto out_free_uuid;
+
+	device_initialize(&nd_btt->dev);
+	nd_btt->ndio_claim = ndio_add_claim(ndio, &nd_btt->dev,
+			nd_btt_ndio_notify_remove);
+	if (!nd_btt->ndio_claim)
+		goto out_free_btt;
+
+	nd_btt->backing_dev = bdev;
+	dev_dbg(&nd_btt->dev, "%s: activate %s\n", __func__,
+			bdevname(bdev, name));
+	__nd_device_register(&nd_btt->dev);
+	kfree(btt_sb);
+	return 0;
+
+ out_free_btt:
+	kfree(nd_btt);
+ out_free_uuid:
+	kfree(uuid);
+ out_free_sb:
+	kfree(btt_sb);
+
+	return -ENODEV;
+}
+
+void nd_btt_notify_ndio(struct nd_bus *nd_bus, struct nd_io *ndio)
+{
+	struct disk_part_iter piter;
+	struct hd_struct *part;
+
+	disk_part_iter_init(&piter, ndio->disk, DISK_PITER_INCL_PART0);
+	while ((part = disk_part_iter_next(&piter))) {
+		struct block_device *bdev;
+		int rc;
+
+		bdev = bdget_disk(ndio->disk, part->partno);
+		if (!bdev)
+			continue;
+		if (blkdev_get(bdev, nd_btt_devs_mode, nd_bus) != 0)
+			continue;
+		rc = nd_btt_autodetect(nd_bus, ndio, bdev);
+		if (rc)
+			blkdev_put(bdev, nd_btt_devs_mode);
+		/* no need to scan further in the case of whole disk btt */
+		if (rc == 0 && part->partno == 0)
+			break;
+	}
+	disk_part_iter_exit(&piter);
+}
diff --git a/drivers/block/nd/bus.c b/drivers/block/nd/bus.c
index 4a2185a99bd7..dc69ccfae53a 100644
--- a/drivers/block/nd/bus.c
+++ b/drivers/block/nd/bus.c
@@ -16,6 +16,7 @@
 #include <linux/module.h>
 #include <linux/fcntl.h>
 #include <linux/async.h>
+#include <linux/genhd.h>
 #include <linux/ndctl.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
@@ -40,6 +41,8 @@ static int to_nd_device_type(struct device *dev)
 		return ND_DEVICE_REGION_BLK;
 	else if (is_nd_pmem(dev->parent) || is_nd_blk(dev->parent))
 		return nd_region_to_namespace_type(to_nd_region(dev->parent));
+	else if (is_nd_btt(dev))
+		return ND_DEVICE_BTT;
 
 	return 0;
 }
@@ -84,6 +87,21 @@ static int nd_bus_probe(struct device *dev)
 
 	dev_dbg(&nd_bus->dev, "%s.probe(%s) = %d\n", dev->driver->name,
 			dev_name(dev), rc);
+
+	/* check if our btt-seed has sprouted, and plant another */
+	if (rc == 0 && is_nd_btt(dev) && dev == &nd_bus->nd_btt->dev) {
+		const char *sep = "", *name = "", *status = "failed";
+
+		nd_bus->nd_btt = nd_btt_create(nd_bus);
+		if (nd_bus->nd_btt) {
+			status = "succeeded";
+			sep = ": ";
+			name = dev_name(&nd_bus->nd_btt->dev);
+		}
+		dev_dbg(&nd_bus->dev, "btt seed creation %s%s%s\n",
+				status, sep, name);
+	}
+
 	if (rc != 0)
 		module_put(provider);
 	return rc;
@@ -144,14 +162,19 @@ static void nd_async_device_unregister(void *d, async_cookie_t cookie)
 	put_device(dev);
 }
 
-void nd_device_register(struct device *dev)
+void __nd_device_register(struct device *dev)
 {
 	dev->bus = &nd_bus_type;
-	device_initialize(dev);
 	get_device(dev);
 	async_schedule_domain(nd_async_device_register, dev,
 			&nd_async_domain);
 }
+
+void nd_device_register(struct device *dev)
+{
+	device_initialize(dev);
+	__nd_device_register(dev);
+}
 EXPORT_SYMBOL(nd_device_register);
 
 void nd_device_unregister(struct device *dev, enum nd_async_mode mode)
@@ -200,6 +223,107 @@ int __nd_driver_register(struct nd_device_driver *nd_drv, struct module *owner,
 }
 EXPORT_SYMBOL(__nd_driver_register);
 
+/**
+ * nd_register_ndio() - register byte-aligned access capability for an nd-bdev
+ * @disk: child gendisk of the ndio namespace device
+ * @ndio: initialized ndio instance to register
+ *
+ * LOCKING: hold nd_bus_lock() over the creation of ndio->disk and the
+ * subsequent nd_region_ndio event
+ */
+int nd_register_ndio(struct nd_io *ndio)
+{
+	struct nd_bus *nd_bus;
+	struct device *dev;
+
+	if (!ndio || !ndio->dev || !ndio->disk || !list_empty(&ndio->list)
+			|| !ndio->rw_bytes || !list_empty(&ndio->claims)) {
+		pr_debug("%s bad parameters from %pf\n", __func__,
+				__builtin_return_address(0));
+		return -EINVAL;
+	}
+
+	dev = ndio->dev;
+	nd_bus = walk_to_nd_bus(dev);
+	if (!nd_bus)
+		return -EINVAL;
+
+	WARN_ON_ONCE(!is_nd_bus_locked(&nd_bus->dev));
+	list_add(&ndio->list, &nd_bus->ndios);
+
+	/* TODO: generic infrastructure for 3rd party ndio claimers */
+	nd_btt_notify_ndio(nd_bus, ndio);
+
+	return 0;
+}
+EXPORT_SYMBOL(nd_register_ndio);
+
+/**
+ * __nd_unregister_ndio() - try to remove an ndio interface
+ * @ndio: interface to remove
+ */
+static int __nd_unregister_ndio(struct nd_io *ndio)
+{
+	struct nd_io_claim *ndio_claim, *_n;
+	struct nd_bus *nd_bus;
+	LIST_HEAD(claims);
+
+	nd_bus = walk_to_nd_bus(ndio->dev);
+	if (!nd_bus || list_empty(&ndio->list))
+		return -ENXIO;
+
+	spin_lock(&ndio->lock);
+	list_splice_init(&ndio->claims, &claims);
+	spin_unlock(&ndio->lock);
+
+	list_for_each_entry_safe(ndio_claim, _n, &claims, list)
+		ndio_claim->notify_remove(ndio_claim);
+
+	list_del_init(&ndio->list);
+
+	return 0;
+}
+
+int nd_unregister_ndio(struct nd_io *ndio)
+{
+	struct device *dev = ndio->dev;
+	int rc;
+
+	nd_bus_lock(dev);
+	rc = __nd_unregister_ndio(ndio);
+	nd_bus_unlock(dev);
+
+	/*
+	 * Flush in case ->notify_remove() kicked off asynchronous device
+	 * unregistration
+	 */
+	nd_synchronize();
+
+	return rc;
+}
+EXPORT_SYMBOL(nd_unregister_ndio);
+
+static struct nd_io *__ndio_lookup(struct nd_bus *nd_bus, const char *diskname)
+{
+	struct nd_io *ndio;
+
+	list_for_each_entry(ndio, &nd_bus->ndios, list)
+		if (strcmp(diskname, ndio->disk->disk_name) == 0)
+			return ndio;
+
+	return NULL;
+}
+
+struct nd_io *ndio_lookup(struct nd_bus *nd_bus, const char *diskname)
+{
+	struct nd_io *ndio;
+
+	WARN_ON_ONCE(!is_nd_bus_locked(&nd_bus->dev));
+	ndio = __ndio_lookup(nd_bus, diskname);
+
+	return ndio;
+}
+
 static ssize_t modalias_show(struct device *dev, struct device_attribute *attr,
 		char *buf)
 {
diff --git a/drivers/block/nd/core.c b/drivers/block/nd/core.c
index b45863343a48..a0709a2e302f 100644
--- a/drivers/block/nd/core.c
+++ b/drivers/block/nd/core.c
@@ -55,6 +55,62 @@ bool is_nd_bus_locked(struct device *dev)
 }
 EXPORT_SYMBOL(is_nd_bus_locked);
 
+void nd_init_ndio(struct nd_io *ndio, nd_rw_bytes_fn rw_bytes,
+		struct device *dev, struct gendisk *disk, unsigned long align)
+{
+	memset(ndio, 0, sizeof(*ndio));
+	INIT_LIST_HEAD(&ndio->claims);
+	INIT_LIST_HEAD(&ndio->list);
+	spin_lock_init(&ndio->lock);
+	ndio->dev = dev;
+	ndio->disk = disk;
+	ndio->align = align;
+	ndio->rw_bytes = rw_bytes;
+}
+EXPORT_SYMBOL(nd_init_ndio);
+
+void ndio_del_claim(struct nd_io_claim *ndio_claim)
+{
+	struct nd_io *ndio;
+	struct device *holder;
+
+	if (!ndio_claim)
+		return;
+	ndio = ndio_claim->parent;
+	holder = ndio_claim->holder;
+
+	dev_dbg(holder, "%s: drop %s\n", __func__, dev_name(ndio->dev));
+	spin_lock(&ndio->lock);
+	list_del(&ndio_claim->list);
+	spin_unlock(&ndio->lock);
+	put_device(ndio->dev);
+	kfree(ndio_claim);
+	put_device(holder);
+}
+
+struct nd_io_claim *ndio_add_claim(struct nd_io *ndio, struct device *holder,
+		ndio_notify_remove_fn notify_remove)
+{
+	struct nd_io_claim *ndio_claim = kzalloc(sizeof(*ndio_claim), GFP_KERNEL);
+
+	if (!ndio_claim)
+		return NULL;
+
+	INIT_LIST_HEAD(&ndio_claim->list);
+	ndio_claim->parent = ndio;
+	get_device(ndio->dev);
+
+	spin_lock(&ndio->lock);
+	list_add(&ndio_claim->list, &ndio->claims);
+	spin_unlock(&ndio->lock);
+
+	ndio_claim->holder = holder;
+	ndio_claim->notify_remove = notify_remove;
+	get_device(holder);
+
+	return ndio_claim;
+}
+
 u64 nd_fletcher64(void *addr, size_t len, bool le)
 {
 	u32 *buf = addr;
@@ -75,6 +131,8 @@ static void nd_bus_release(struct device *dev)
 {
 	struct nd_bus *nd_bus = container_of(dev, struct nd_bus, dev);
 
+	WARN_ON(!list_empty(&nd_bus->ndios));
+
 	ida_simple_remove(&nd_ida, nd_bus->id);
 	kfree(nd_bus);
 }
@@ -271,10 +329,28 @@ static ssize_t wait_probe_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(wait_probe);
 
+static ssize_t btt_seed_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_bus *nd_bus = to_nd_bus(dev);
+	ssize_t rc;
+
+	nd_bus_lock(dev);
+	if (nd_bus->nd_btt)
+		rc = sprintf(buf, "%s\n", dev_name(&nd_bus->nd_btt->dev));
+	else
+		rc = sprintf(buf, "\n");
+	nd_bus_unlock(dev);
+
+	return rc;
+}
+static DEVICE_ATTR_RO(btt_seed);
+
 static struct attribute *nd_bus_attributes[] = {
 	&dev_attr_commands.attr,
 	&dev_attr_wait_probe.attr,
 	&dev_attr_provider.attr,
+	&dev_attr_btt_seed.attr,
 	NULL,
 };
 
@@ -291,6 +367,7 @@ struct nd_bus *__nd_bus_register(struct device *parent,
 
 	if (!nd_bus)
 		return NULL;
+	INIT_LIST_HEAD(&nd_bus->ndios);
 	INIT_LIST_HEAD(&nd_bus->list);
 	init_waitqueue_head(&nd_bus->probe_wait);
 	nd_bus->id = ida_simple_get(&nd_ida, 0, 0, GFP_KERNEL);
@@ -319,6 +396,8 @@ struct nd_bus *__nd_bus_register(struct device *parent,
 	list_add_tail(&nd_bus->list, &nd_bus_list);
 	mutex_unlock(&nd_bus_list_mutex);
 
+	nd_bus->nd_btt = nd_btt_create(nd_bus);
+
 	return nd_bus;
  err:
 	put_device(&nd_bus->dev);
diff --git a/drivers/block/nd/nd-private.h b/drivers/block/nd/nd-private.h
index fffd65436e2b..6c89695956a4 100644
--- a/drivers/block/nd/nd-private.h
+++ b/drivers/block/nd/nd-private.h
@@ -22,14 +22,21 @@ extern struct list_head nd_bus_list;
 extern struct mutex nd_bus_list_mutex;
 extern int nd_dimm_major;
 
+struct block_device;
+struct nd_io_claim;
+struct nd_btt;
+struct nd_io;
+
 struct nd_bus {
 	struct nd_bus_descriptor *nd_desc;
 	wait_queue_head_t probe_wait;
 	struct module *module;
+	struct list_head ndios;
 	struct list_head list;
 	struct device dev;
 	int id, probe_active;
 	struct mutex reconfig_mutex;
+	struct nd_btt *nd_btt;
 };
 
 struct nd_dimm {
@@ -41,9 +48,29 @@ struct nd_dimm {
 	int id;
 };
 
+struct nd_io *ndio_lookup(struct nd_bus *nd_bus, const char *diskname);
 bool is_nd_dimm(struct device *dev);
 bool is_nd_blk(struct device *dev);
 bool is_nd_pmem(struct device *dev);
+#if IS_ENABLED(CONFIG_ND_BTT_DEVS)
+bool is_nd_btt(struct device *dev);
+struct nd_btt *nd_btt_create(struct nd_bus *nd_bus);
+void nd_btt_notify_ndio(struct nd_bus *nd_bus, struct nd_io *ndio);
+#else
+static inline bool is_nd_btt(struct device *dev)
+{
+	return false;
+}
+
+static inline struct nd_btt *nd_btt_create(struct nd_bus *nd_bus)
+{
+	return NULL;
+}
+
+static inline void nd_btt_notify_ndio(struct nd_bus *nd_bus, struct nd_io *ndio)
+{
+}
+#endif
 struct nd_bus *walk_to_nd_bus(struct device *nd_dev);
 int __init nd_bus_init(void);
 void nd_bus_exit(void);
@@ -62,6 +89,7 @@ void nd_synchronize(void);
 int nd_bus_register_dimms(struct nd_bus *nd_bus);
 int nd_bus_register_regions(struct nd_bus *nd_bus);
 int nd_bus_init_interleave_sets(struct nd_bus *nd_bus);
+void __nd_device_register(struct device *dev);
 int nd_match_dimm(struct device *dev, void *data);
 struct nd_label_id;
 char *nd_label_gen_id(struct nd_label_id *label_id, u8 *uuid, u32 flags);
diff --git a/drivers/block/nd/nd.h b/drivers/block/nd/nd.h
index 24a440a23b2c..73e830785f74 100644
--- a/drivers/block/nd/nd.h
+++ b/drivers/block/nd/nd.h
@@ -12,13 +12,19 @@
  */
 #ifndef __ND_H__
 #define __ND_H__
+#include <linux/genhd.h>
 #include <linux/device.h>
 #include <linux/libnd.h>
 #include <linux/mutex.h>
 #include <linux/ndctl.h>
 #include <linux/types.h>
+#include <linux/fs.h>
 #include "label.h"
 
+enum {
+	SECTOR_SHIFT = 9,
+};
+
 struct nd_dimm_drvdata {
 	struct device *dev;
 	int nsindex_size;
@@ -111,6 +117,84 @@ static inline unsigned nd_inc_seq(unsigned seq)
 	return next[seq & 3];
 }
 
+struct nd_io;
+/**
+ * nd_rw_bytes_fn() - access bytes relative to the "whole disk" namespace device
+ * @ndio: per-namespace context
+ * @buf: source / target for the write / read
+ * @offset: offset relative to the start of the namespace device
+ * @n: num bytes to access
+ * @flags: READ, WRITE, and other REQ_* flags
+ *
+ * Note: Implementations may assume that offset + n never crosses ndio->align
+ */
+typedef int (*nd_rw_bytes_fn)(struct nd_io *ndio, void *buf, size_t offset,
+		size_t n, unsigned long flags);
+#define nd_data_dir(flags) (flags & 1)
+
+/**
+ * struct nd_io - info for byte-aligned access to nd devices
+ * @rw_bytes: operation to perform byte-aligned access
+ * @align: a single ->rw_bytes() request may not cross this alignment
+ * @disk: whole disk block device for the namespace
+ * @list: for the core to cache a list of "ndio"s for later association
+ * @dev: namespace device
+ * @claims: list of clients using this interface
+ * @lock: protect @claims mutation
+ */
+struct nd_io {
+	nd_rw_bytes_fn rw_bytes;
+	unsigned long align;
+	struct gendisk *disk;
+	struct list_head list;
+	struct device *dev;
+	struct list_head claims;
+	spinlock_t lock;
+};
+
+struct nd_io_claim;
+typedef void (*ndio_notify_remove_fn)(struct nd_io_claim *ndio_claim);
+
+/**
+ * struct nd_io_claim - instance of a claim on a parent ndio
+ * @notify_remove: ndio is going away, release resources
+ * @holder: object that has claimed this ndio
+ * @parent: ndio in use
+ * @holder: holder device
+ * @list: claim peers
+ *
+ * An ndio may be claimed multiple times, consider the case of a btt
+ * instance per partition on a namespace.
+ */
+struct nd_io_claim {
+	struct nd_io *parent;
+	ndio_notify_remove_fn notify_remove;
+	struct list_head list;
+	struct device *holder;
+};
+
+struct nd_btt {
+	struct device dev;
+	struct nd_io *ndio;
+	struct block_device *backing_dev;
+	unsigned long lbasize;
+	u8 *uuid;
+	u64 offset;
+	int id;
+	struct nd_io_claim *ndio_claim;
+};
+
+static inline u64 nd_partition_offset(struct block_device *bdev)
+{
+	struct hd_struct *p;
+
+	if (bdev == bdev->bd_contains)
+		return 0;
+
+	p = bdev->bd_part;
+	return ((u64) p->start_sect) << SECTOR_SHIFT;
+}
+
 enum nd_async_mode {
 	ND_SYNC,
 	ND_ASYNC,
@@ -125,12 +209,22 @@ ssize_t nd_sector_size_show(unsigned long current_lbasize,
 		const unsigned long *supported, char *buf);
 ssize_t nd_sector_size_store(struct device *dev, const char *buf,
 		unsigned long *current_lbasize, const unsigned long *supported);
+int nd_register_ndio(struct nd_io *ndio);
+int nd_unregister_ndio(struct nd_io *ndio);
+void nd_init_ndio(struct nd_io *ndio, nd_rw_bytes_fn rw_bytes,
+		struct device *dev, struct gendisk *disk, unsigned long align);
+void ndio_del_claim(struct nd_io_claim *ndio_claim);
+struct nd_io_claim *ndio_add_claim(struct nd_io *ndio, struct device *holder,
+		ndio_notify_remove_fn notify_remove);
 struct nd_dimm;
 struct nd_dimm_drvdata *to_ndd(struct nd_mapping *nd_mapping);
 int nd_dimm_init_nsarea(struct nd_dimm_drvdata *ndd);
 int nd_dimm_init_config_data(struct nd_dimm_drvdata *ndd);
 int nd_dimm_set_config_data(struct nd_dimm_drvdata *ndd, size_t offset,
 		void *buf, size_t len);
+struct nd_btt *to_nd_btt(struct device *dev);
+struct btt_sb;
+u64 nd_btt_sb_checksum(struct btt_sb *btt_sb);
 struct nd_region *to_nd_region(struct device *dev);
 int nd_region_to_namespace_type(struct nd_region *nd_region);
 int nd_region_register_namespaces(struct nd_region *nd_region, int *err);
diff --git a/drivers/block/nd/pmem.c b/drivers/block/nd/pmem.c
index bf380393da92..7b5cedf1f2a4 100644
--- a/drivers/block/nd/pmem.c
+++ b/drivers/block/nd/pmem.c
@@ -29,6 +29,7 @@
 struct pmem_device {
 	struct request_queue	*pmem_queue;
 	struct gendisk		*pmem_disk;
+	struct nd_io		ndio;
 
 	/* One contiguous memory region per device */
 	phys_addr_t		phys_addr;
@@ -96,6 +97,26 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
 	return 0;
 }
 
+static int pmem_rw_bytes(struct nd_io *ndio, void *buf, size_t offset,
+		size_t n, unsigned long flags)
+{
+	struct pmem_device *pmem = container_of(ndio, typeof(*pmem), ndio);
+	int rw = nd_data_dir(flags);
+
+	if (unlikely(offset + n > pmem->size)) {
+		dev_WARN_ONCE(ndio->dev, 1, "%s: request out of range\n",
+				__func__);
+		return -EFAULT;
+	}
+
+	if (rw == READ)
+		memcpy(buf, pmem->virt_addr + offset, n);
+	else
+		memcpy(pmem->virt_addr + offset, buf, n);
+
+	return 0;
+}
+
 static long pmem_direct_access(struct block_device *bdev, sector_t sector,
 			      void **kaddr, unsigned long *pfn, long size)
 {
@@ -169,8 +190,6 @@ static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res,
 	set_capacity(disk, pmem->size >> 9);
 	pmem->pmem_disk = disk;
 
-	add_disk(disk);
-
 	return pmem;
 
 out_free_queue:
@@ -222,7 +241,12 @@ static int nd_pmem_probe(struct device *dev)
 	if (IS_ERR(pmem))
 		return PTR_ERR(pmem);
 
+	nd_bus_lock(dev);
+	add_disk(pmem->pmem_disk);
 	dev_set_drvdata(dev, pmem);
+	nd_init_ndio(&pmem->ndio, pmem_rw_bytes, dev, pmem->pmem_disk, 0);
+	nd_register_ndio(&pmem->ndio);
+	nd_bus_unlock(dev);
 
 	return 0;
 }
@@ -231,6 +255,7 @@ static int nd_pmem_remove(struct device *dev)
 {
 	struct pmem_device *pmem = dev_get_drvdata(dev);
 
+	nd_unregister_ndio(&pmem->ndio);
 	pmem_free(pmem);
 	return 0;
 }
diff --git a/include/uapi/linux/ndctl.h b/include/uapi/linux/ndctl.h
index 0b4dcabb248a..e595751c613d 100644
--- a/include/uapi/linux/ndctl.h
+++ b/include/uapi/linux/ndctl.h
@@ -181,6 +181,7 @@ static inline const char *nd_dimm_cmd_name(unsigned cmd)
 #define ND_DEVICE_NAMESPACE_IO 4    /* legacy persistent memory */
 #define ND_DEVICE_NAMESPACE_PMEM 5  /* persistent memory namespace (may alias) */
 #define ND_DEVICE_NAMESPACE_BLK 6   /* block-data-window namespace (may alias) */
+#define ND_DEVICE_BTT 7		    /* block-translation table device */
 
 enum nd_driver_flags {
 	ND_DRIVER_DIMM            = 1 << ND_DEVICE_DIMM,
@@ -189,6 +190,7 @@ enum nd_driver_flags {
 	ND_DRIVER_NAMESPACE_IO    = 1 << ND_DEVICE_NAMESPACE_IO,
 	ND_DRIVER_NAMESPACE_PMEM  = 1 << ND_DEVICE_NAMESPACE_PMEM,
 	ND_DRIVER_NAMESPACE_BLK   = 1 << ND_DEVICE_NAMESPACE_BLK,
+	ND_DRIVER_BTT		  = 1 << ND_DEVICE_BTT,
 };
 
 enum {



* [PATCH v3 17/21] libnd: infrastructure for btt devices
@ 2015-05-20 20:57   ` Dan Williams
  0 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2015-05-20 20:57 UTC (permalink / raw)
  To: axboe
  Cc: linux-nvdimm, neilb, gregkh, linux-kernel, mingo, linux-acpi,
	jmoyer, hch

Block devices from an nd bus, in addition to accepting "struct bio"
based requests, also have the capability to perform byte-aligned
accesses.  By default only the bio/block interface is used.  However, if
another driver can make effective use of the byte-aligned capability it
can claim/disable the block interface and use the byte-aligned "nd_io"
interface.

The BTT driver is the first consumer of this mechanism.  It layers
atomic sector update guarantees on top of nd_io capable libnd block
devices, or their partitions.
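
For readers unfamiliar with the claim pattern described above, here is a
minimal user-space sketch of the idea, not the kernel implementation.  All
sketch_* names are hypothetical stand-ins.  The real code in this patch uses
struct list_head, a spinlock, and device reference counts, all of which the
sketch omits.  It shows a provider exposing a byte-granular rw_bytes()
operation, a holder registering a claim with a notify_remove() callback, and
the provider detaching its claim list before notifying each holder on removal.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct sketch_io;

/* Hypothetical stand-in for a claim on a parent io provider. */
struct sketch_claim {
	struct sketch_io *parent;
	void (*notify_remove)(struct sketch_claim *claim);
	struct sketch_claim *next;
	int removed;
};

/* Hypothetical stand-in for a provider of byte-aligned access. */
struct sketch_io {
	int (*rw_bytes)(struct sketch_io *io, void *buf, size_t offset,
			size_t n, int write);
	struct sketch_claim *claims;
	unsigned char media[64];	/* pretend persistent memory */
};

/* Byte-granular access: any offset and length inside the media is valid. */
static int sketch_rw_bytes(struct sketch_io *io, void *buf, size_t offset,
		size_t n, int write)
{
	if (offset > sizeof(io->media) || n > sizeof(io->media) - offset)
		return -1;	/* request out of range */
	if (write)
		memcpy(io->media + offset, buf, n);
	else
		memcpy(buf, io->media + offset, n);
	return 0;
}

/* A holder registers interest in the provider. */
static void sketch_add_claim(struct sketch_io *io, struct sketch_claim *claim,
		void (*notify_remove)(struct sketch_claim *claim))
{
	claim->parent = io;
	claim->notify_remove = notify_remove;
	claim->removed = 0;
	claim->next = io->claims;
	io->claims = claim;
}

/* Provider teardown: detach the whole list first, then notify each holder. */
static int sketch_unregister(struct sketch_io *io)
{
	struct sketch_claim *claim = io->claims, *next;
	int notified = 0;

	io->claims = NULL;
	for (; claim; claim = next) {
		next = claim->next;	/* the callback may release the claim */
		claim->notify_remove(claim);
		notified++;
	}
	return notified;
}

/* Holder side: drop every reference to the provider that is going away. */
static void sketch_holder_notify_remove(struct sketch_claim *claim)
{
	claim->removed = 1;
	claim->parent = NULL;
}

/* Walk through the whole life cycle; returns 0 when every step behaves. */
static int sketch_demo(void)
{
	struct sketch_io io = { .rw_bytes = sketch_rw_bytes };
	struct sketch_claim claim;
	char in[] = "hello", out[sizeof(in)] = "";

	sketch_add_claim(&io, &claim, sketch_holder_notify_remove);

	/* unaligned 6-byte write and read-back at offset 3 */
	if (io.rw_bytes(&io, in, 3, sizeof(in), 1) != 0)
		return 1;
	if (io.rw_bytes(&io, out, 3, sizeof(out), 0) != 0)
		return 2;
	if (memcmp(in, out, sizeof(in)) != 0)
		return 3;

	if (sketch_unregister(&io) != 1 || !claim.removed)
		return 4;
	return 0;
}
```

Calling sketch_demo() from a test harness exercises the sequence end to
end.  Detaching the list before invoking the callbacks mirrors the
list_splice_init() step in __nd_unregister_ndio(), so a callback that frees
its own claim never touches a list the provider is still walking.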

Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/block/nd/Kconfig      |    3 
 drivers/block/nd/Makefile     |    1 
 drivers/block/nd/btt.h        |   45 ++++
 drivers/block/nd/btt_devs.c   |  442 +++++++++++++++++++++++++++++++++++++++++
 drivers/block/nd/bus.c        |  128 ++++++++++++
 drivers/block/nd/core.c       |   79 +++++++
 drivers/block/nd/nd-private.h |   28 +++
 drivers/block/nd/nd.h         |   94 +++++++++
 drivers/block/nd/pmem.c       |   29 +++
 include/uapi/linux/ndctl.h    |    2 
 10 files changed, 847 insertions(+), 4 deletions(-)
 create mode 100644 drivers/block/nd/btt.h
 create mode 100644 drivers/block/nd/btt_devs.c

diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
index 03f572f0e3d0..00d9afe9475e 100644
--- a/drivers/block/nd/Kconfig
+++ b/drivers/block/nd/Kconfig
@@ -34,4 +34,7 @@ config BLK_DEV_PMEM
 
 	  Say Y if you want to use a NVDIMM described by NFIT
 
+config ND_BTT_DEVS
+	def_bool y
+
 endif
diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index 8d14510559e1..9866669d7738 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -11,3 +11,4 @@ libnd-y += region_devs.o
 libnd-y += region.o
 libnd-y += namespace_devs.o
 libnd-y += label.o
+libnd-$(CONFIG_ND_BTT_DEVS) += btt_devs.o
diff --git a/drivers/block/nd/btt.h b/drivers/block/nd/btt.h
new file mode 100644
index 000000000000..e8f6d8e0ddd3
--- /dev/null
+++ b/drivers/block/nd/btt.h
@@ -0,0 +1,45 @@
+/*
+ * Block Translation Table library
+ * Copyright (c) 2014-2015, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef _LINUX_BTT_H
+#define _LINUX_BTT_H
+
+#include <linux/types.h>
+
+#define BTT_SIG_LEN 16
+#define BTT_SIG "BTT_ARENA_INFO\0"
+
+struct btt_sb {
+	u8 signature[BTT_SIG_LEN];
+	u8 uuid[16];
+	u8 parent_uuid[16];
+	__le32 flags;
+	__le16 version_major;
+	__le16 version_minor;
+	__le32 external_lbasize;
+	__le32 external_nlba;
+	__le32 internal_lbasize;
+	__le32 internal_nlba;
+	__le32 nfree;
+	__le32 infosize;
+	__le64 nextoff;
+	__le64 dataoff;
+	__le64 mapoff;
+	__le64 logoff;
+	__le64 info2off;
+	u8 padding[3968];
+	__le64 checksum;
+};
+
+#endif
diff --git a/drivers/block/nd/btt_devs.c b/drivers/block/nd/btt_devs.c
new file mode 100644
index 000000000000..b3b813288092
--- /dev/null
+++ b/drivers/block/nd/btt_devs.c
@@ -0,0 +1,442 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/device.h>
+#include <linux/genhd.h>
+#include <linux/sizes.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include "nd-private.h"
+#include "btt.h"
+#include "nd.h"
+
+static DEFINE_IDA(btt_ida);
+
+static void nd_btt_release(struct device *dev)
+{
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+
+	dev_dbg(dev, "%s\n", __func__);
+	WARN_ON(nd_btt->backing_dev);
+	ndio_del_claim(nd_btt->ndio_claim);
+	ida_simple_remove(&btt_ida, nd_btt->id);
+	kfree(nd_btt->uuid);
+	kfree(nd_btt);
+}
+
+static struct device_type nd_btt_device_type = {
+	.name = "nd_btt",
+	.release = nd_btt_release,
+};
+
+bool is_nd_btt(struct device *dev)
+{
+	return dev->type == &nd_btt_device_type;
+}
+
+struct nd_btt *to_nd_btt(struct device *dev)
+{
+	struct nd_btt *nd_btt = container_of(dev, struct nd_btt, dev);
+
+	WARN_ON(!is_nd_btt(dev));
+	return nd_btt;
+}
+EXPORT_SYMBOL(to_nd_btt);
+
+static const unsigned long btt_lbasize_supported[] = { 512, 4096, 0 };
+
+static ssize_t sector_size_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+
+	return nd_sector_size_show(nd_btt->lbasize, btt_lbasize_supported, buf);
+}
+
+static ssize_t sector_size_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+	ssize_t rc;
+
+	device_lock(dev);
+	nd_bus_lock(dev);
+	rc = nd_sector_size_store(dev, buf, &nd_btt->lbasize,
+			btt_lbasize_supported);
+	dev_dbg(dev, "%s: result: %zd wrote: %s%s", __func__,
+			rc, buf, buf[len - 1] == '\n' ? "" : "\n");
+	nd_bus_unlock(dev);
+	device_unlock(dev);
+
+	return rc ? rc : len;
+}
+static DEVICE_ATTR_RW(sector_size);
+
+static ssize_t uuid_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+
+	if (nd_btt->uuid)
+		return sprintf(buf, "%pUb\n", nd_btt->uuid);
+	return sprintf(buf, "\n");
+}
+
+static ssize_t uuid_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+	ssize_t rc;
+
+	device_lock(dev);
+	rc = nd_uuid_store(dev, &nd_btt->uuid, buf, len);
+	dev_dbg(dev, "%s: result: %zd wrote: %s%s", __func__,
+			rc, buf, buf[len - 1] == '\n' ? "" : "\n");
+	device_unlock(dev);
+
+	return rc ? rc : len;
+}
+static DEVICE_ATTR_RW(uuid);
+
+static ssize_t backing_dev_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+	char name[BDEVNAME_SIZE];
+
+	if (nd_btt->backing_dev)
+		return sprintf(buf, "/dev/%s\n",
+				bdevname(nd_btt->backing_dev, name));
+	else
+		return sprintf(buf, "\n");
+}
+
+static const fmode_t nd_btt_devs_mode = FMODE_READ | FMODE_WRITE | FMODE_EXCL;
+
+static void nd_btt_ndio_notify_remove(struct nd_io_claim *ndio_claim)
+{
+	char bdev_name[BDEVNAME_SIZE];
+	struct nd_btt *nd_btt;
+
+	if (!ndio_claim || !ndio_claim->holder)
+		return;
+
+	nd_btt = to_nd_btt(ndio_claim->holder);
+	WARN_ON_ONCE(!is_nd_bus_locked(&nd_btt->dev));
+	dev_dbg(&nd_btt->dev, "%pf: %s: release /dev/%s\n",
+			__builtin_return_address(0), __func__,
+			bdevname(nd_btt->backing_dev, bdev_name));
+	blkdev_put(nd_btt->backing_dev, nd_btt_devs_mode);
+	nd_btt->backing_dev = NULL;
+
+	/*
+	 * Once we've had our backing device removed we need to be fully
+	 * reconfigured.  The bus will have already created a new seed
+	 * for this purpose, so now is a good time to clean up this
+	 * stale nd_btt instance.
+	 */
+	if (nd_btt->dev.driver)
+		nd_device_unregister(&nd_btt->dev, ND_ASYNC);
+	else {
+		ndio_del_claim(ndio_claim);
+		nd_btt->ndio_claim = NULL;
+	}
+}
+
+static ssize_t __backing_dev_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	struct nd_bus *nd_bus = walk_to_nd_bus(dev);
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+	char bdev_name[BDEVNAME_SIZE];
+	struct block_device *bdev;
+	struct nd_io *ndio;
+	char *path;
+
+	if (dev->driver) {
+		dev_dbg(dev, "%s: -EBUSY\n", __func__);
+		return -EBUSY;
+	}
+
+	path = kstrndup(buf, len, GFP_KERNEL);
+	if (!path)
+		return -ENOMEM;
+
+	/* detach the backing device */
+	if (strcmp(strim(path), "") == 0) {
+		if (!nd_btt->backing_dev)
+			goto out;
+		nd_btt_ndio_notify_remove(nd_btt->ndio_claim);
+		goto out;
+	} else if (nd_btt->backing_dev) {
+		dev_dbg(dev, "backing_dev already set\n");
+		len = -EBUSY;
+		goto out;
+	}
+
+	bdev = blkdev_get_by_path(strim(path), nd_btt_devs_mode, nd_btt);
+	if (IS_ERR(bdev)) {
+		dev_dbg(dev, "open '%s' failed: %ld\n", strim(path),
+				PTR_ERR(bdev));
+		len = PTR_ERR(bdev);
+		goto out;
+	}
+
+	if (get_capacity(bdev->bd_disk) < SZ_16M / 512) {
+		blkdev_put(bdev, nd_btt_devs_mode);
+		len = -ENXIO;
+		goto out;
+	}
+
+	ndio = ndio_lookup(nd_bus, bdevname(bdev->bd_contains, bdev_name));
+	if (!ndio) {
+		dev_dbg(dev, "%s does not have an ndio interface\n",
+				strim(path));
+		blkdev_put(bdev, nd_btt_devs_mode);
+		len = -ENXIO;
+		goto out;
+	}
+
+	nd_btt->ndio_claim = ndio_add_claim(ndio, &nd_btt->dev,
+			nd_btt_ndio_notify_remove);
+	if (!nd_btt->ndio_claim) {
+		blkdev_put(bdev, nd_btt_devs_mode);
+		len = -ENOMEM;
+		goto out;
+	}
+
+	WARN_ON_ONCE(!is_nd_bus_locked(&nd_btt->dev));
+	nd_btt->backing_dev = bdev;
+
+ out:
+	kfree(path);
+	return len;
+}
+
+static ssize_t backing_dev_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	ssize_t rc;
+
+	nd_bus_lock(dev);
+	device_lock(dev);
+	rc = __backing_dev_store(dev, attr, buf, len);
+	dev_dbg(dev, "%s: result: %zd wrote: %s%s", __func__,
+			rc, buf, buf[len - 1] == '\n' ? "" : "\n");
+	device_unlock(dev);
+	nd_bus_unlock(dev);
+
+	return rc;
+}
+static DEVICE_ATTR_RW(backing_dev);
+
+static bool is_nd_btt_idle(struct device *dev)
+{
+	struct nd_bus *nd_bus = walk_to_nd_bus(dev);
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+
+	if (nd_bus->nd_btt == nd_btt || dev->driver || nd_btt->backing_dev)
+		return false;
+	return true;
+}
+
+static ssize_t delete_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	/* return 1 if can be deleted */
+	return sprintf(buf, "%d\n", is_nd_btt_idle(dev));
+}
+
+static ssize_t delete_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	unsigned long val;
+
+	/* write 1 to delete */
+	if (kstrtoul(buf, 0, &val) != 0 || val != 1)
+		return -EINVAL;
+
+	/* prevent deletion while this btt is active, or is the current seed */
+	if (!is_nd_btt_idle(dev))
+		return -EBUSY;
+
+	/*
+	 * userspace raced itself if device goes active here and it gets
+	 * to keep the pieces
+	 */
+	nd_device_unregister(dev, ND_ASYNC);
+
+	return len;
+}
+static DEVICE_ATTR_RW(delete);
+
+static struct attribute *nd_btt_attributes[] = {
+	&dev_attr_sector_size.attr,
+	&dev_attr_backing_dev.attr,
+	&dev_attr_delete.attr,
+	&dev_attr_uuid.attr,
+	NULL,
+};
+
+static struct attribute_group nd_btt_attribute_group = {
+	.attrs = nd_btt_attributes,
+};
+
+static const struct attribute_group *nd_btt_attribute_groups[] = {
+	&nd_btt_attribute_group,
+	&nd_device_attribute_group,
+	NULL,
+};
+
+static struct nd_btt *__nd_btt_create(struct nd_bus *nd_bus,
+		unsigned long lbasize, u8 *uuid)
+{
+	struct nd_btt *nd_btt = kzalloc(sizeof(*nd_btt), GFP_KERNEL);
+	struct device *dev;
+
+	if (!nd_btt)
+		return NULL;
+	nd_btt->id = ida_simple_get(&btt_ida, 0, 0, GFP_KERNEL);
+	if (nd_btt->id < 0) {
+		kfree(nd_btt);
+		return NULL;
+	}
+
+	nd_btt->lbasize = lbasize;
+	if (uuid)
+		uuid = kmemdup(uuid, 16, GFP_KERNEL);
+	nd_btt->uuid = uuid;
+	dev = &nd_btt->dev;
+	dev_set_name(dev, "btt%d", nd_btt->id);
+	dev->parent = &nd_bus->dev;
+	dev->type = &nd_btt_device_type;
+	dev->groups = nd_btt_attribute_groups;
+	return nd_btt;
+}
+
+struct nd_btt *nd_btt_create(struct nd_bus *nd_bus)
+{
+	struct nd_btt *nd_btt = __nd_btt_create(nd_bus, 0, NULL);
+
+	if (!nd_btt)
+		return NULL;
+	nd_device_register(&nd_btt->dev);
+	return nd_btt;
+}
+
+/*
+ * nd_btt_sb_checksum: compute checksum for btt info block
+ *
+ * Returns a fletcher64 checksum of everything in the given info block
+ * except the last field (since that's where the checksum lives).
+ */
+u64 nd_btt_sb_checksum(struct btt_sb *btt_sb)
+{
+	u64 sum, sum_save;
+
+	sum_save = btt_sb->checksum;
+	btt_sb->checksum = 0;
+	sum = nd_fletcher64(btt_sb, sizeof(*btt_sb), 1);
+	btt_sb->checksum = sum_save;
+	return sum;
+}
+EXPORT_SYMBOL(nd_btt_sb_checksum);
+
+static int nd_btt_autodetect(struct nd_bus *nd_bus, struct nd_io *ndio,
+		struct block_device *bdev)
+{
+	char name[BDEVNAME_SIZE];
+	struct nd_btt *nd_btt;
+	struct btt_sb *btt_sb;
+	u64 offset, checksum;
+	u32 lbasize;
+	u8 *uuid;
+	int rc;
+
+	btt_sb = kzalloc(sizeof(*btt_sb), GFP_KERNEL);
+	if (!btt_sb)
+		return -ENODEV;
+
+	offset = nd_partition_offset(bdev);
+	rc = ndio->rw_bytes(ndio, btt_sb, offset + SZ_4K, sizeof(*btt_sb), READ);
+	if (rc)
+		goto out_free_sb;
+
+	if (get_capacity(bdev->bd_disk) < SZ_16M / 512)
+		goto out_free_sb;
+
+	if (memcmp(btt_sb->signature, BTT_SIG, BTT_SIG_LEN) != 0)
+		goto out_free_sb;
+
+	checksum = le64_to_cpu(btt_sb->checksum);
+	btt_sb->checksum = 0;
+	if (checksum != nd_btt_sb_checksum(btt_sb))
+		goto out_free_sb;
+	btt_sb->checksum = cpu_to_le64(checksum);
+
+	uuid = kmemdup(btt_sb->uuid, 16, GFP_KERNEL);
+	if (!uuid)
+		goto out_free_sb;
+
+	lbasize = le32_to_cpu(btt_sb->external_lbasize);
+	nd_btt = __nd_btt_create(nd_bus, lbasize, uuid);
+	if (!nd_btt)
+		goto out_free_uuid;
+
+	device_initialize(&nd_btt->dev);
+	nd_btt->ndio_claim = ndio_add_claim(ndio, &nd_btt->dev,
+			nd_btt_ndio_notify_remove);
+	if (!nd_btt->ndio_claim)
+		goto out_free_btt;
+
+	nd_btt->backing_dev = bdev;
+	dev_dbg(&nd_btt->dev, "%s: activate %s\n", __func__,
+			bdevname(bdev, name));
+	__nd_device_register(&nd_btt->dev);
+	kfree(btt_sb);
+	return 0;
+
+ out_free_btt:
+	kfree(nd_btt);
+ out_free_uuid:
+	kfree(uuid);
+ out_free_sb:
+	kfree(btt_sb);
+
+	return -ENODEV;
+}
+
+void nd_btt_notify_ndio(struct nd_bus *nd_bus, struct nd_io *ndio)
+{
+	struct disk_part_iter piter;
+	struct hd_struct *part;
+
+	disk_part_iter_init(&piter, ndio->disk, DISK_PITER_INCL_PART0);
+	while ((part = disk_part_iter_next(&piter))) {
+		struct block_device *bdev;
+		int rc;
+
+		bdev = bdget_disk(ndio->disk, part->partno);
+		if (!bdev)
+			continue;
+		if (blkdev_get(bdev, nd_btt_devs_mode, nd_bus) != 0)
+			continue;
+		rc = nd_btt_autodetect(nd_bus, ndio, bdev);
+		if (rc)
+			blkdev_put(bdev, nd_btt_devs_mode);
+		/* no need to scan further in the case of whole disk btt */
+		if (rc == 0 && part->partno == 0)
+			break;
+	}
+	disk_part_iter_exit(&piter);
+}
diff --git a/drivers/block/nd/bus.c b/drivers/block/nd/bus.c
index 4a2185a99bd7..dc69ccfae53a 100644
--- a/drivers/block/nd/bus.c
+++ b/drivers/block/nd/bus.c
@@ -16,6 +16,7 @@
 #include <linux/module.h>
 #include <linux/fcntl.h>
 #include <linux/async.h>
+#include <linux/genhd.h>
 #include <linux/ndctl.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
@@ -40,6 +41,8 @@ static int to_nd_device_type(struct device *dev)
 		return ND_DEVICE_REGION_BLK;
 	else if (is_nd_pmem(dev->parent) || is_nd_blk(dev->parent))
 		return nd_region_to_namespace_type(to_nd_region(dev->parent));
+	else if (is_nd_btt(dev))
+		return ND_DEVICE_BTT;
 
 	return 0;
 }
@@ -84,6 +87,21 @@ static int nd_bus_probe(struct device *dev)
 
 	dev_dbg(&nd_bus->dev, "%s.probe(%s) = %d\n", dev->driver->name,
 			dev_name(dev), rc);
+
+	/* check if our btt-seed has sprouted, and plant another */
+	if (rc == 0 && is_nd_btt(dev) && dev == &nd_bus->nd_btt->dev) {
+		const char *sep = "", *name = "", *status = "failed";
+
+		nd_bus->nd_btt = nd_btt_create(nd_bus);
+		if (nd_bus->nd_btt) {
+			status = "succeeded";
+			sep = ": ";
+			name = dev_name(&nd_bus->nd_btt->dev);
+		}
+		dev_dbg(&nd_bus->dev, "btt seed creation %s%s%s\n",
+				status, sep, name);
+	}
+
 	if (rc != 0)
 		module_put(provider);
 	return rc;
@@ -144,14 +162,19 @@ static void nd_async_device_unregister(void *d, async_cookie_t cookie)
 	put_device(dev);
 }
 
-void nd_device_register(struct device *dev)
+void __nd_device_register(struct device *dev)
 {
 	dev->bus = &nd_bus_type;
-	device_initialize(dev);
 	get_device(dev);
 	async_schedule_domain(nd_async_device_register, dev,
 			&nd_async_domain);
 }
+
+void nd_device_register(struct device *dev)
+{
+	device_initialize(dev);
+	__nd_device_register(dev);
+}
 EXPORT_SYMBOL(nd_device_register);
 
 void nd_device_unregister(struct device *dev, enum nd_async_mode mode)
@@ -200,6 +223,107 @@ int __nd_driver_register(struct nd_device_driver *nd_drv, struct module *owner,
 }
 EXPORT_SYMBOL(__nd_driver_register);
 
+/**
+ * nd_register_ndio() - register byte-aligned access capability for an nd-bdev
+ * @disk: child gendisk of the ndio namespace device
+ * @ndio: initialized ndio instance to register
+ *
+ * LOCKING: hold nd_bus_lock() over the creation of ndio->disk and the
+ * subsequent nd_region_ndio event
+ */
+int nd_register_ndio(struct nd_io *ndio)
+{
+	struct nd_bus *nd_bus;
+	struct device *dev;
+
+	if (!ndio || !ndio->dev || !ndio->disk || !list_empty(&ndio->list)
+			|| !ndio->rw_bytes || !list_empty(&ndio->claims)) {
+		pr_debug("%s bad parameters from %pf\n", __func__,
+				__builtin_return_address(0));
+		return -EINVAL;
+	}
+
+	dev = ndio->dev;
+	nd_bus = walk_to_nd_bus(dev);
+	if (!nd_bus)
+		return -EINVAL;
+
+	WARN_ON_ONCE(!is_nd_bus_locked(&nd_bus->dev));
+	list_add(&ndio->list, &nd_bus->ndios);
+
+	/* TODO: generic infrastructure for 3rd party ndio claimers */
+	nd_btt_notify_ndio(nd_bus, ndio);
+
+	return 0;
+}
+EXPORT_SYMBOL(nd_register_ndio);
+
+/**
+ * __nd_unregister_ndio() - try to remove an ndio interface
+ * @ndio: interface to remove
+ */
+static int __nd_unregister_ndio(struct nd_io *ndio)
+{
+	struct nd_io_claim *ndio_claim, *_n;
+	struct nd_bus *nd_bus;
+	LIST_HEAD(claims);
+
+	nd_bus = walk_to_nd_bus(ndio->dev);
+	if (!nd_bus || list_empty(&ndio->list))
+		return -ENXIO;
+
+	spin_lock(&ndio->lock);
+	list_splice_init(&ndio->claims, &claims);
+	spin_unlock(&ndio->lock);
+
+	list_for_each_entry_safe(ndio_claim, _n, &claims, list)
+		ndio_claim->notify_remove(ndio_claim);
+
+	list_del_init(&ndio->list);
+
+	return 0;
+}
+
+int nd_unregister_ndio(struct nd_io *ndio)
+{
+	struct device *dev = ndio->dev;
+	int rc;
+
+	nd_bus_lock(dev);
+	rc = __nd_unregister_ndio(ndio);
+	nd_bus_unlock(dev);
+
+	/*
+	 * Flush in case ->notify_remove() kicked off asynchronous device
+	 * unregistration
+	 */
+	nd_synchronize();
+
+	return rc;
+}
+EXPORT_SYMBOL(nd_unregister_ndio);
+
+static struct nd_io *__ndio_lookup(struct nd_bus *nd_bus, const char *diskname)
+{
+	struct nd_io *ndio;
+
+	list_for_each_entry(ndio, &nd_bus->ndios, list)
+		if (strcmp(diskname, ndio->disk->disk_name) == 0)
+			return ndio;
+
+	return NULL;
+}
+
+struct nd_io *ndio_lookup(struct nd_bus *nd_bus, const char *diskname)
+{
+	struct nd_io *ndio;
+
+	WARN_ON_ONCE(!is_nd_bus_locked(&nd_bus->dev));
+	ndio = __ndio_lookup(nd_bus, diskname);
+
+	return ndio;
+}
+
 static ssize_t modalias_show(struct device *dev, struct device_attribute *attr,
 		char *buf)
 {
diff --git a/drivers/block/nd/core.c b/drivers/block/nd/core.c
index b45863343a48..a0709a2e302f 100644
--- a/drivers/block/nd/core.c
+++ b/drivers/block/nd/core.c
@@ -55,6 +55,62 @@ bool is_nd_bus_locked(struct device *dev)
 }
 EXPORT_SYMBOL(is_nd_bus_locked);
 
+void nd_init_ndio(struct nd_io *ndio, nd_rw_bytes_fn rw_bytes,
+		struct device *dev, struct gendisk *disk, unsigned long align)
+{
+	memset(ndio, 0, sizeof(*ndio));
+	INIT_LIST_HEAD(&ndio->claims);
+	INIT_LIST_HEAD(&ndio->list);
+	spin_lock_init(&ndio->lock);
+	ndio->dev = dev;
+	ndio->disk = disk;
+	ndio->align = align;
+	ndio->rw_bytes = rw_bytes;
+}
+EXPORT_SYMBOL(nd_init_ndio);
+
+void ndio_del_claim(struct nd_io_claim *ndio_claim)
+{
+	struct nd_io *ndio;
+	struct device *holder;
+
+	if (!ndio_claim)
+		return;
+	ndio = ndio_claim->parent;
+	holder = ndio_claim->holder;
+
+	dev_dbg(holder, "%s: drop %s\n", __func__, dev_name(ndio->dev));
+	spin_lock(&ndio->lock);
+	list_del(&ndio_claim->list);
+	spin_unlock(&ndio->lock);
+	put_device(ndio->dev);
+	kfree(ndio_claim);
+	put_device(holder);
+}
+
+struct nd_io_claim *ndio_add_claim(struct nd_io *ndio, struct device *holder,
+		ndio_notify_remove_fn notify_remove)
+{
+	struct nd_io_claim *ndio_claim = kzalloc(sizeof(*ndio_claim), GFP_KERNEL);
+
+	if (!ndio_claim)
+		return NULL;
+
+	INIT_LIST_HEAD(&ndio_claim->list);
+	ndio_claim->parent = ndio;
+	get_device(ndio->dev);
+
+	spin_lock(&ndio->lock);
+	list_add(&ndio_claim->list, &ndio->claims);
+	spin_unlock(&ndio->lock);
+
+	ndio_claim->holder = holder;
+	ndio_claim->notify_remove = notify_remove;
+	get_device(holder);
+
+	return ndio_claim;
+}
+
 u64 nd_fletcher64(void *addr, size_t len, bool le)
 {
 	u32 *buf = addr;
@@ -75,6 +131,8 @@ static void nd_bus_release(struct device *dev)
 {
 	struct nd_bus *nd_bus = container_of(dev, struct nd_bus, dev);
 
+	WARN_ON(!list_empty(&nd_bus->ndios));
+
 	ida_simple_remove(&nd_ida, nd_bus->id);
 	kfree(nd_bus);
 }
@@ -271,10 +329,28 @@ static ssize_t wait_probe_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(wait_probe);
 
+static ssize_t btt_seed_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_bus *nd_bus = to_nd_bus(dev);
+	ssize_t rc;
+
+	nd_bus_lock(dev);
+	if (nd_bus->nd_btt)
+		rc = sprintf(buf, "%s\n", dev_name(&nd_bus->nd_btt->dev));
+	else
+		rc = sprintf(buf, "\n");
+	nd_bus_unlock(dev);
+
+	return rc;
+}
+static DEVICE_ATTR_RO(btt_seed);
+
 static struct attribute *nd_bus_attributes[] = {
 	&dev_attr_commands.attr,
 	&dev_attr_wait_probe.attr,
 	&dev_attr_provider.attr,
+	&dev_attr_btt_seed.attr,
 	NULL,
 };
 
@@ -291,6 +367,7 @@ struct nd_bus *__nd_bus_register(struct device *parent,
 
 	if (!nd_bus)
 		return NULL;
+	INIT_LIST_HEAD(&nd_bus->ndios);
 	INIT_LIST_HEAD(&nd_bus->list);
 	init_waitqueue_head(&nd_bus->probe_wait);
 	nd_bus->id = ida_simple_get(&nd_ida, 0, 0, GFP_KERNEL);
@@ -319,6 +396,8 @@ struct nd_bus *__nd_bus_register(struct device *parent,
 	list_add_tail(&nd_bus->list, &nd_bus_list);
 	mutex_unlock(&nd_bus_list_mutex);
 
+	nd_bus->nd_btt = nd_btt_create(nd_bus);
+
 	return nd_bus;
  err:
 	put_device(&nd_bus->dev);
diff --git a/drivers/block/nd/nd-private.h b/drivers/block/nd/nd-private.h
index fffd65436e2b..6c89695956a4 100644
--- a/drivers/block/nd/nd-private.h
+++ b/drivers/block/nd/nd-private.h
@@ -22,14 +22,21 @@ extern struct list_head nd_bus_list;
 extern struct mutex nd_bus_list_mutex;
 extern int nd_dimm_major;
 
+struct block_device;
+struct nd_io_claim;
+struct nd_btt;
+struct nd_io;
+
 struct nd_bus {
 	struct nd_bus_descriptor *nd_desc;
 	wait_queue_head_t probe_wait;
 	struct module *module;
+	struct list_head ndios;
 	struct list_head list;
 	struct device dev;
 	int id, probe_active;
 	struct mutex reconfig_mutex;
+	struct nd_btt *nd_btt;
 };
 
 struct nd_dimm {
@@ -41,9 +48,29 @@ struct nd_dimm {
 	int id;
 };
 
+struct nd_io *ndio_lookup(struct nd_bus *nd_bus, const char *diskname);
 bool is_nd_dimm(struct device *dev);
 bool is_nd_blk(struct device *dev);
 bool is_nd_pmem(struct device *dev);
+#if IS_ENABLED(CONFIG_ND_BTT_DEVS)
+bool is_nd_btt(struct device *dev);
+struct nd_btt *nd_btt_create(struct nd_bus *nd_bus);
+void nd_btt_notify_ndio(struct nd_bus *nd_bus, struct nd_io *ndio);
+#else
+static inline bool is_nd_btt(struct device *dev)
+{
+	return false;
+}
+
+static inline struct nd_btt *nd_btt_create(struct nd_bus *nd_bus)
+{
+	return NULL;
+}
+
+static inline void nd_btt_notify_ndio(struct nd_bus *nd_bus, struct nd_io *ndio)
+{
+}
+#endif
 struct nd_bus *walk_to_nd_bus(struct device *nd_dev);
 int __init nd_bus_init(void);
 void nd_bus_exit(void);
@@ -62,6 +89,7 @@ void nd_synchronize(void);
 int nd_bus_register_dimms(struct nd_bus *nd_bus);
 int nd_bus_register_regions(struct nd_bus *nd_bus);
 int nd_bus_init_interleave_sets(struct nd_bus *nd_bus);
+void __nd_device_register(struct device *dev);
 int nd_match_dimm(struct device *dev, void *data);
 struct nd_label_id;
 char *nd_label_gen_id(struct nd_label_id *label_id, u8 *uuid, u32 flags);
diff --git a/drivers/block/nd/nd.h b/drivers/block/nd/nd.h
index 24a440a23b2c..73e830785f74 100644
--- a/drivers/block/nd/nd.h
+++ b/drivers/block/nd/nd.h
@@ -12,13 +12,19 @@
  */
 #ifndef __ND_H__
 #define __ND_H__
+#include <linux/genhd.h>
 #include <linux/device.h>
 #include <linux/libnd.h>
 #include <linux/mutex.h>
 #include <linux/ndctl.h>
 #include <linux/types.h>
+#include <linux/fs.h>
 #include "label.h"
 
+enum {
+	SECTOR_SHIFT = 9,
+};
+
 struct nd_dimm_drvdata {
 	struct device *dev;
 	int nsindex_size;
@@ -111,6 +117,84 @@ static inline unsigned nd_inc_seq(unsigned seq)
 	return next[seq & 3];
 }
 
+struct nd_io;
+/**
+ * nd_rw_bytes_fn() - access bytes relative to the "whole disk" namespace device
+ * @ndio: per-namespace context
+ * @buf: source / target for the write / read
+ * @offset: offset relative to the start of the namespace device
+ * @n: num bytes to access
+ * @flags: READ, WRITE, and other REQ_* flags
+ *
+ * Note: Implementations may assume that offset + n never crosses ndio->align
+ */
+typedef int (*nd_rw_bytes_fn)(struct nd_io *ndio, void *buf, size_t offset,
+		size_t n, unsigned long flags);
+#define nd_data_dir(flags) (flags & 1)
+
+/**
+ * struct nd_io - info for byte-aligned access to nd devices
+ * @rw_bytes: operation to perform byte-aligned access
+ * @align: a single ->rw_bytes() request may not cross this alignment
+ * @disk: whole disk block device for the namespace
+ * @list: for the core to cache a list of "ndio"s for later association
+ * @dev: namespace device
+ * @claims: list of clients using this interface
+ * @lock: protect @claims mutation
+ */
+struct nd_io {
+	nd_rw_bytes_fn rw_bytes;
+	unsigned long align;
+	struct gendisk *disk;
+	struct list_head list;
+	struct device *dev;
+	struct list_head claims;
+	spinlock_t lock;
+};
+
+struct nd_io_claim;
+typedef void (*ndio_notify_remove_fn)(struct nd_io_claim *ndio_claim);
+
+/**
+ * struct nd_io_claim - instance of a claim on a parent ndio
+ * @notify_remove: ndio is going away, release resources
+ * @holder: object that has claimed this ndio
+ * @parent: ndio in use
+ * @holder: holder device
+ * @list: claim peers
+ *
+ * An ndio may be claimed multiple times, consider the case of a btt
+ * instance per partition on a namespace.
+ */
+struct nd_io_claim {
+	struct nd_io *parent;
+	ndio_notify_remove_fn notify_remove;
+	struct list_head list;
+	struct device *holder;
+};
+
+struct nd_btt {
+	struct device dev;
+	struct nd_io *ndio;
+	struct block_device *backing_dev;
+	unsigned long lbasize;
+	u8 *uuid;
+	u64 offset;
+	int id;
+	struct nd_io_claim *ndio_claim;
+};
+
+static inline u64 nd_partition_offset(struct block_device *bdev)
+{
+	struct hd_struct *p;
+
+	if (bdev == bdev->bd_contains)
+		return 0;
+
+	p = bdev->bd_part;
+	return ((u64) p->start_sect) << SECTOR_SHIFT;
+}
+
 enum nd_async_mode {
 	ND_SYNC,
 	ND_ASYNC,
@@ -125,12 +209,22 @@ ssize_t nd_sector_size_show(unsigned long current_lbasize,
 		const unsigned long *supported, char *buf);
 ssize_t nd_sector_size_store(struct device *dev, const char *buf,
 		unsigned long *current_lbasize, const unsigned long *supported);
+int nd_register_ndio(struct nd_io *ndio);
+int nd_unregister_ndio(struct nd_io *ndio);
+void nd_init_ndio(struct nd_io *ndio, nd_rw_bytes_fn rw_bytes,
+		struct device *dev, struct gendisk *disk, unsigned long align);
+void ndio_del_claim(struct nd_io_claim *ndio_claim);
+struct nd_io_claim *ndio_add_claim(struct nd_io *ndio, struct device *holder,
+		ndio_notify_remove_fn notify_remove);
 struct nd_dimm;
 struct nd_dimm_drvdata *to_ndd(struct nd_mapping *nd_mapping);
 int nd_dimm_init_nsarea(struct nd_dimm_drvdata *ndd);
 int nd_dimm_init_config_data(struct nd_dimm_drvdata *ndd);
 int nd_dimm_set_config_data(struct nd_dimm_drvdata *ndd, size_t offset,
 		void *buf, size_t len);
+struct nd_btt *to_nd_btt(struct device *dev);
+struct btt_sb;
+u64 nd_btt_sb_checksum(struct btt_sb *btt_sb);
 struct nd_region *to_nd_region(struct device *dev);
 int nd_region_to_namespace_type(struct nd_region *nd_region);
 int nd_region_register_namespaces(struct nd_region *nd_region, int *err);
diff --git a/drivers/block/nd/pmem.c b/drivers/block/nd/pmem.c
index bf380393da92..7b5cedf1f2a4 100644
--- a/drivers/block/nd/pmem.c
+++ b/drivers/block/nd/pmem.c
@@ -29,6 +29,7 @@
 struct pmem_device {
 	struct request_queue	*pmem_queue;
 	struct gendisk		*pmem_disk;
+	struct nd_io		ndio;
 
 	/* One contiguous memory region per device */
 	phys_addr_t		phys_addr;
@@ -96,6 +97,26 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
 	return 0;
 }
 
+static int pmem_rw_bytes(struct nd_io *ndio, void *buf, size_t offset,
+		size_t n, unsigned long flags)
+{
+	struct pmem_device *pmem = container_of(ndio, typeof(*pmem), ndio);
+	int rw = nd_data_dir(flags);
+
+	if (unlikely(offset + n > pmem->size)) {
+		dev_WARN_ONCE(ndio->dev, 1, "%s: request out of range\n",
+				__func__);
+		return -EFAULT;
+	}
+
+	if (rw == READ)
+		memcpy(buf, pmem->virt_addr + offset, n);
+	else
+		memcpy(pmem->virt_addr + offset, buf, n);
+
+	return 0;
+}
+
 static long pmem_direct_access(struct block_device *bdev, sector_t sector,
 			      void **kaddr, unsigned long *pfn, long size)
 {
@@ -169,8 +190,6 @@ static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res,
 	set_capacity(disk, pmem->size >> 9);
 	pmem->pmem_disk = disk;
 
-	add_disk(disk);
-
 	return pmem;
 
 out_free_queue:
@@ -222,7 +241,12 @@ static int nd_pmem_probe(struct device *dev)
 	if (IS_ERR(pmem))
 		return PTR_ERR(pmem);
 
+	nd_bus_lock(dev);
+	add_disk(pmem->pmem_disk);
 	dev_set_drvdata(dev, pmem);
+	nd_init_ndio(&pmem->ndio, pmem_rw_bytes, dev, pmem->pmem_disk, 0);
+	nd_register_ndio(&pmem->ndio);
+	nd_bus_unlock(dev);
 
 	return 0;
 }
@@ -231,6 +255,7 @@ static int nd_pmem_remove(struct device *dev)
 {
 	struct pmem_device *pmem = dev_get_drvdata(dev);
 
+	nd_unregister_ndio(&pmem->ndio);
 	pmem_free(pmem);
 	return 0;
 }
diff --git a/include/uapi/linux/ndctl.h b/include/uapi/linux/ndctl.h
index 0b4dcabb248a..e595751c613d 100644
--- a/include/uapi/linux/ndctl.h
+++ b/include/uapi/linux/ndctl.h
@@ -181,6 +181,7 @@ static inline const char *nd_dimm_cmd_name(unsigned cmd)
 #define ND_DEVICE_NAMESPACE_IO 4    /* legacy persistent memory */
 #define ND_DEVICE_NAMESPACE_PMEM 5  /* persistent memory namespace (may alias) */
 #define ND_DEVICE_NAMESPACE_BLK 6   /* block-data-window namespace (may alias) */
+#define ND_DEVICE_BTT 7		    /* block-translation table device */
 
 enum nd_driver_flags {
 	ND_DRIVER_DIMM            = 1 << ND_DEVICE_DIMM,
@@ -189,6 +190,7 @@ enum nd_driver_flags {
 	ND_DRIVER_NAMESPACE_IO    = 1 << ND_DEVICE_NAMESPACE_IO,
 	ND_DRIVER_NAMESPACE_PMEM  = 1 << ND_DEVICE_NAMESPACE_PMEM,
 	ND_DRIVER_NAMESPACE_BLK   = 1 << ND_DEVICE_NAMESPACE_BLK,
+	ND_DRIVER_BTT		  = 1 << ND_DEVICE_BTT,
 };
 
 enum {



* [PATCH v3 18/21] nd_btt: atomic sector updates
  2015-05-20 20:56 ` Dan Williams
@ 2015-05-20 20:57   ` Dan Williams
From: Dan Williams @ 2015-05-20 20:57 UTC (permalink / raw)
  To: axboe
  Cc: Boaz Harrosh, Vishal Verma, neilb, gregkh, linux-nvdimm,
	Dave Chinner, linux-kernel, Andy Lutomirski, Jens Axboe,
	linux-acpi, jmoyer, H. Peter Anvin, hch, mingo

From: Vishal Verma <vishal.l.verma@linux.intel.com>

BTT stands for Block Translation Table, and is a way to provide power
fail sector atomicity semantics for block devices that have the ability
to perform byte granularity IO. It relies on the ->rw_bytes() capability
of libnd namespace devices.
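
For readers unfamiliar with ->rw_bytes(), a minimal user-space sketch of the
contract follows. It mirrors the shape of pmem_rw_bytes() in the pmem driver,
but the names used here (sk_region, sk_rw_bytes, SK_READ, SK_WRITE) are
hypothetical, and the bounds check is written in an overflow-safe form purely
as an illustration, not as a description of what the driver does:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel's READ (0) and WRITE (1). */
enum { SK_READ = 0, SK_WRITE = 1 };

/* Hypothetical byte-addressable region, standing in for a pmem mapping. */
struct sk_region {
	unsigned char *base;
	size_t size;
};

/*
 * Copy n bytes between buf and the region at 'offset'.
 * Bit 0 of flags selects the direction, as nd_data_dir() does.
 * Returns 0 on success or -EFAULT if the request is out of range.
 */
static int sk_rw_bytes(struct sk_region *r, void *buf, size_t offset,
		size_t n, unsigned long flags)
{
	/* Equivalent to "offset + n > size" without the risk of wrapping. */
	if (offset > r->size || n > r->size - offset)
		return -EFAULT;

	if ((flags & 1) == SK_READ)
		memcpy(buf, r->base + offset, n);
	else
		memcpy(r->base + offset, buf, n);

	return 0;
}
```

Nothing in this sketch makes a multi-byte write atomic; that is the gap the
BTT is meant to close.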

The BTT works as a stacked block device, and reserves a chunk of space
from the backing device for its accounting metadata.
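
To picture how indirection gives sector atomicity, here is a toy user-space
model of an "allocating write". It is only a sketch under simplifying
assumptions: one arena, a single free block (nfree = 1), no flog, no locking,
and everything held in memory. All names (toy_btt, toy_write, ...) are
hypothetical and do not appear in the driver:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SECTOR 512
#define NLBA   4	/* externally visible blocks (arbitrary toy size) */

/* Toy arena: NLBA mapped blocks plus one spare block. */
struct toy_btt {
	uint8_t  data[NLBA + 1][SECTOR];
	uint32_t map[NLBA];	/* external LBA -> internal block */
	uint32_t free_block;	/* the single free internal block */
};

static void toy_init(struct toy_btt *b)
{
	memset(b, 0, sizeof(*b));
	for (uint32_t i = 0; i < NLBA; i++)
		b->map[i] = i;	/* start with an identity mapping */
	b->free_block = NLBA;
}

/*
 * New data always lands in the free block; the block currently mapped for
 * 'lba' is not touched. Only the single-word map update makes the new data
 * visible. stop_before_commit simulates losing power before that update.
 */
static void toy_write(struct toy_btt *b, uint32_t lba, const void *buf,
		int stop_before_commit)
{
	uint32_t new_blk = b->free_block;
	uint32_t old_blk;

	assert(lba < NLBA);
	old_blk = b->map[lba];

	memcpy(b->data[new_blk], buf, SECTOR);
	if (stop_before_commit)
		return;

	b->map[lba] = new_blk;		/* commit point */
	b->free_block = old_blk;	/* old block becomes the free block */
}

static const uint8_t *toy_read(const struct toy_btt *b, uint32_t lba)
{
	assert(lba < NLBA);
	return b->data[b->map[lba]];
}
```

A reader of the toy device sees either the complete old sector or the
complete new one, never a mix. The real implementation additionally logs each
update in the flog so that the free list can be rebuilt after a crash.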

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Neil Brown <neilb@suse.de>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
[jmoyer: fix nmi watchdog timeout in btt_map_init]
[jmoyer: move btt initialization to module load path]
[jmoyer: fix memory leak in the btt initialization path]
[jmoyer: Don't overwrite corrupted arenas]
Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 Documentation/blockdev/btt.txt |  273 ++++++++
 drivers/acpi/nfit.c            |    1 
 drivers/block/nd/Kconfig       |   20 +
 drivers/block/nd/Makefile      |    3 
 drivers/block/nd/btt.c         | 1438 ++++++++++++++++++++++++++++++++++++++++
 drivers/block/nd/btt.h         |  141 ++++
 drivers/block/nd/btt_devs.c    |    3 
 drivers/block/nd/nd-private.h  |    1 
 drivers/block/nd/nd.h          |   10 
 drivers/block/nd/region.c      |   89 ++
 drivers/block/nd/region_devs.c |   10 
 include/linux/libnd.h          |    1 
 12 files changed, 1986 insertions(+), 4 deletions(-)
 create mode 100644 Documentation/blockdev/btt.txt
 create mode 100644 drivers/block/nd/btt.c
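
Note for reviewers: the map entry flag handling in btt_map_write() and
btt_map_read() is easy to misread, since a 'normal' entry has both flag bits
set and an all-zero entry means "never written, identity mapping". The
user-space sketch below restates that logic. The SK_* constants are
assumptions taken from the bit table in btt.txt (bit 31 TRIM, bit 30 ERROR),
not copied from btt.h, and the little-endian conversion is left out:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit positions, per the table in btt.txt. */
#define SK_TRIM_SHIFT 31
#define SK_ERR_SHIFT  30
#define SK_TRIM_MASK  (UINT32_C(1) << SK_TRIM_SHIFT)
#define SK_ERR_MASK   (UINT32_C(1) << SK_ERR_SHIFT)
#define SK_LBA_MASK   (~(SK_TRIM_MASK | SK_ERR_MASK))

/* Hypothetical decoded form of a map entry. */
struct sk_map_ent {
	uint32_t postmap;
	int trim;
	int error;
};

/* Returns 0 on success, -1 if both flags are requested at once. */
static int sk_map_encode(uint32_t postmap, int z, int e, uint32_t *raw)
{
	uint32_t v = postmap & SK_LBA_MASK;	/* strip any stray flag bits */

	if (z && e)
		return -1;	/* rejected, as in btt_map_write() */
	if (!z && !e)
		v |= SK_TRIM_MASK | SK_ERR_MASK;	/* 'normal' entry */
	else if (e)
		v |= SK_ERR_MASK;
	else
		v |= SK_TRIM_MASK;

	*raw = v;
	return 0;
}

static struct sk_map_ent sk_map_decode(uint32_t raw, uint32_t premap)
{
	int z = (raw & SK_TRIM_MASK) != 0;
	int e = (raw & SK_ERR_MASK) != 0;
	struct sk_map_ent ent = { raw & SK_LBA_MASK, 0, 0 };

	if (!z && !e)
		ent.postmap = premap;	/* initial state: identity mapping */
	else if (z && !e)
		ent.trim = 1;
	else if (!z && e)
		ent.error = 1;
	/* z && e: normal entry, no flags reported */

	return ent;
}
```

With these assumed constants, a normal mapping to postmap block 64 is stored
as 0xC0000040, while a raw value of 0 decodes to the premap block itself.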

diff --git a/Documentation/blockdev/btt.txt b/Documentation/blockdev/btt.txt
new file mode 100644
index 000000000000..95134d5ec4a0
--- /dev/null
+++ b/Documentation/blockdev/btt.txt
@@ -0,0 +1,273 @@
+BTT - Block Translation Table
+=============================
+
+
+1. Introduction
+---------------
+
+Persistent memory based storage is able to perform IO at byte (or more
+accurately, cache line) granularity. However, we often want to expose such
+storage as traditional block devices. The block drivers for persistent memory
+will do exactly this. However, they do not provide any atomicity guarantees.
+Traditional SSDs typically provide protection against torn sectors in hardware,
+using stored energy in capacitors to complete in-flight block writes, or perhaps
+in firmware. We don't have this luxury with persistent memory - if a write is in
+progress, and we experience a power failure, the block will contain a mix of old
+and new data. Applications may not be prepared to handle such a scenario.
+
+The Block Translation Table (BTT) provides atomic sector update semantics for
+persistent memory devices, so that applications that rely on sector writes not
+being torn can continue to do so. The BTT manifests itself as a stacked block
+device, and reserves a portion of the underlying storage for its metadata. At
+the heart of it is an indirection table that re-maps all the blocks on the
+volume. It can be thought of as an extremely simple file system that only
+provides atomic sector updates.
+
+
+2. Static Layout
+----------------
+
+The underlying storage on which a BTT can be laid out is not limited in any way.
+The BTT, however, splits the available space into chunks of up to 512 GiB,
+called "Arenas".
+
+Each arena follows the same layout for its metadata, and all references in an
+arena are internal to it (with the exception of one field that points to the
+next arena). The following depicts the "On-disk" metadata layout:
+
+
+  Backing Store     +------->  Arena
++---------------+   |   +------------------+
+|               |   |   | Arena info block |
+|    Arena 0    +---+   |       4K         |
+|     512G      |       +------------------+
+|               |       |                  |
++---------------+       |                  |
+|               |       |                  |
+|    Arena 1    |       |   Data Blocks    |
+|     512G      |       |                  |
+|               |       |                  |
++---------------+       |                  |
+|       .       |       |                  |
+|       .       |       |                  |
+|       .       |       |                  |
+|               |       |                  |
+|               |       |                  |
++---------------+       +------------------+
+                        |                  |
+                        |     BTT Map      |
+                        |                  |
+                        |                  |
+                        +------------------+
+                        |                  |
+                        |     BTT Flog     |
+                        |                  |
+                        +------------------+
+                        | Info block copy  |
+                        |       4K         |
+                        +------------------+
+
+
+3. Theory of Operation
+----------------------
+
+
+a. The BTT Map
+--------------
+
+The map is a simple lookup/indirection table that maps an LBA to an internal
+block. Each map entry is 32 bits. The two most significant bits are special
+flags, and the remaining form the internal block number.
+
+Bit      Description
+31     : TRIM flag - marks if the block was trimmed or discarded
+30     : ERROR flag - marks an error block. Cleared on write.
+29 - 0 : Mappings to internal 'postmap' blocks
+
+
+Some of the terminology that will be subsequently used:
+
+External LBA  : LBA as made visible to upper layers.
+ABA           : Arena Block Address - Block offset/number within an arena
+Premap ABA    : The block offset into an arena, which was decided upon by range
+		checking the External LBA
+Postmap ABA   : The block number in the "Data Blocks" area obtained after
+		indirection from the map
+nfree	      : The number of free blocks that are maintained at any given time.
+		This is the number of concurrent writes that can happen to the
+		arena.
+
+
+For example, after adding a BTT, we surface a disk of 1024G. We get a read for
+the external LBA at 768G. This falls into the second arena, and of the 512G
+worth of blocks that this arena contributes, this block is at 256G. Thus, the
+premap ABA is 256G. We now refer to the map, and find out the mapping for block
+'X' (256G) points to block 'Y', say '64'. Thus the postmap ABA is 64.
+
+
+b. The BTT Flog
+---------------
+
+The BTT provides sector atomicity by making every write an "allocating write",
+i.e. every write goes to a "free" block. A running list of free blocks is
+maintained in the form of the BTT flog. 'Flog' is a combination of the words
+"free list" and "log". The flog contains 'nfree' entries, and an entry contains:
+
+lba     : The premap ABA that is being written to
+old_map : The old postmap ABA - after 'this' write completes, this will be a
+	  free block.
+new_map : The new postmap ABA. The map will be updated to reflect this
+	  lba->postmap_aba mapping, but we log it here in case we have to
+	  recover.
+seq	: Sequence number to mark which of the 2 sections of this flog entry is
+	  valid/newest. It cycles between 01->10->11->01 (binary) under normal
+	  operation, with 00 indicating an uninitialized state.
+lba'	: alternate lba entry
+old_map': alternate old postmap entry
+new_map': alternate new postmap entry
+seq'	: alternate sequence number.
+
+Each of the above fields is 32-bit, so each of the two sections is 16 bytes.
+Flog updates are done such that for any entry being written, it:
+a. overwrites the 'old' section in the entry based on sequence numbers
+b. writes the new entry such that the sequence number is written last.
+
+
+c. The concept of lanes
+-----------------------
+
+While 'nfree' describes the number of IOs an arena can process concurrently,
+'nlanes' is the number of IOs the BTT device as a whole can process
+concurrently.
+ nlanes = min(nfree, num_cpus)
+A lane number is obtained at the start of any IO, and is used for indexing into
+all the on-disk and in-memory data structures for the duration of the IO. It is
+protected by a spinlock.
+
+
+d. In-memory data structure: Read Tracking Table (RTT)
+------------------------------------------------------
+
+Consider a case where we have two threads, one doing reads and the other,
+writes. We can hit a condition where the writer thread grabs a free block to do
+a new IO, but the (slow) reader thread is still reading from it. In other words,
+the reader consulted a map entry, and started reading the corresponding block. A
+writer started writing to the same external LBA, and finished the write updating
+the map for that external LBA to point to its new postmap ABA. At this point the
+internal, postmap block that the reader is (still) reading has been inserted
+into the list of free blocks. If another write comes in for the same LBA, it can
+grab this free block, and start writing to it, causing the reader to read
+incorrect data. To prevent this, we introduce the RTT.
+
+The RTT is a simple, per arena table with 'nfree' entries. Every reader inserts
+into rtt[lane_number], the postmap ABA it is reading, and clears it after the
+read is complete. Every writer thread, after grabbing a free block, checks the
+RTT for its presence. If the postmap free block is in the RTT, it waits till the
+reader clears the RTT entry, and only then starts writing to it.
+
+
+e. In-memory data structure: map locks
+--------------------------------------
+
+Consider a case where two writer threads are writing to the same LBA. There can
+be a race in the following sequence of steps:
+
+free[lane] = map[premap_aba]
+map[premap_aba] = postmap_aba
+
+Both threads can update their respective free[lane] with the same old, freed
+postmap_aba. This has made the layout inconsistent by losing a free entry, and
+at the same time, duplicating another free entry for two lanes.
+
+To solve this, we could have a single map lock (per arena) that has to be taken
+before performing the above sequence, but we feel that could be too contentious.
+Instead we use an array of (nfree) map_locks that is indexed by
+(premap_aba modulo nfree).
+
+
+f. Reconstruction from the Flog
+-------------------------------
+
+On startup, we analyze the BTT flog to create our list of free blocks. We walk
+through all the entries, and for each lane, of the set of two possible
+'sections', we always look at the most recent one only (based on the sequence
+number). The reconstruction rules/steps are simple:
+- Read map[log_entry.lba].
+- If log_entry.new matches the map entry, then log_entry.old is free.
+- If log_entry.new does not match the map entry, then log_entry.new is free.
+  (This case can only be caused by power-fails/unsafe shutdowns)
+
+
+g. Summarizing - Read and Write flows
+-------------------------------------
+
+Read:
+
+1.  Convert external LBA to arena number + pre-map ABA
+2.  Get a lane (and take lane_lock)
+3.  Read map to get the entry for this pre-map ABA
+4.  Enter post-map ABA into RTT[lane]
+5.  If TRIM flag set in map, return zeroes, and end IO (go to step 8)
+6.  If ERROR flag set in map, end IO with EIO (go to step 8)
+7.  Read data from this block
+8.  Remove post-map ABA entry from RTT[lane]
+9.  Release lane (and lane_lock)
+
+Write:
+
+1.  Convert external LBA to Arena number + pre-map ABA
+2.  Get a lane (and take lane_lock)
+3.  Use lane to index into in-memory free list and obtain a new block, next flog
+        index, next sequence number
+4.  Scan the RTT to check if free block is present, and spin/wait if it is.
+5.  Write data to this free block
+6.  Read map to get the existing post-map ABA entry for this pre-map ABA
+7.  Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq_num]
+8.  Write new post-map ABA into map.
+9.  Write old post-map entry into the free list
+10. Calculate next sequence number and write into the free list entry
+11. Release lane (and lane_lock)
+
+
+4. Error Handling
+-----------------
+
+An arena would be in an error state if any of the metadata is corrupted
+irrecoverably, either due to a bug or a media error. The following conditions
+indicate an error:
+- Info block checksum does not match (and recovering from the copy also fails)
+- All internal available blocks are not uniquely and entirely addressed by the
+  sum of mapped blocks and free blocks (from the BTT flog).
+- Rebuilding free list from the flog reveals missing/duplicate/impossible
+  entries
+- A map entry is out of bounds
+
+If any of these error conditions are encountered, the arena is put into a read
+only state using a flag in the info block.
+
+
+5. In-kernel usage
+------------------
+
+Any block driver that supports byte granularity IO to the storage may register
+with the BTT. It will have to provide the rw_bytes interface in its
+block_device_operations struct:
+
+	int (*rw_bytes)(struct gendisk *, void *, size_t, off_t, int rw);
+
+It may register with the BTT after it adds its own gendisk, using btt_init:
+
+	struct btt *btt_init(struct gendisk *disk, unsigned long long rawsize,
+			u32 lbasize, u8 uuid[], int maxlane);
+
+Note that maxlane is the maximum amount of concurrency the driver wishes to
+allow the BTT to use.
+
+The BTT 'disk' appears as a stacked block device that grabs the underlying block
+device in the O_EXCL mode.
+
+When the driver wishes to remove the backing disk, it should similarly call
+btt_fini using the same struct btt* handle that was provided to it by btt_init.
+
+	void btt_fini(struct btt *btt);
+
diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index 7c4d47492372..a9aca87301c6 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -892,6 +892,7 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
 			} else {
 				nd_mapping->size = nfit_mem->bdw->capacity;
 				nd_mapping->start = nfit_mem->bdw->start_address;
+				ndr_desc.num_lanes = nfit_mem->bdw->windows;
 			}
 
 			ndr_desc.nd_mapping = nd_mapping;
diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
index 00d9afe9475e..2b169806eac5 100644
--- a/drivers/block/nd/Kconfig
+++ b/drivers/block/nd/Kconfig
@@ -32,9 +32,25 @@ config BLK_DEV_PMEM
 	  capable of DAX (direct-access) file system mappings.  See
 	  Documentation/blockdev/nd.txt for more details.
 
-	  Say Y if you want to use a NVDIMM described by NFIT
+	  Say Y if you want to use a NVDIMM described by ACPI, E820, etc...
 
 config ND_BTT_DEVS
-	def_bool y
+	bool
+
+config ND_BTT
+	tristate "BTT: Block Translation Table (atomic sector updates)"
+	depends on LIBND
+	default LIBND
+	select ND_BTT_DEVS
+
+config ND_MAX_REGIONS
+	int "Maximum number of regions supported by the sub-system"
+	default 64
+	---help---
+	  A 'region' corresponds to an individual DIMM or an interleave
+	  set of DIMMs.  A typical maximally configured system may have
+	  up to 32 DIMMs.
+
+	  Leave the default of 64 if you are unsure.
 
 endif
diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index 9866669d7738..1e8fe93a0a42 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -1,8 +1,11 @@
 obj-$(CONFIG_LIBND) += libnd.o
 obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
+obj-$(CONFIG_ND_BTT) += nd_btt.o
 
 nd_pmem-y := pmem.o
 
+nd_btt-y := btt.o
+
 libnd-y := core.o
 libnd-y += bus.o
 libnd-y += dimm_devs.o
diff --git a/drivers/block/nd/btt.c b/drivers/block/nd/btt.c
new file mode 100644
index 000000000000..a4287b6f4224
--- /dev/null
+++ b/drivers/block/nd/btt.c
@@ -0,0 +1,1438 @@
+/*
+ * Block Translation Table
+ * Copyright (c) 2014-2015, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+#include <linux/highmem.h>
+#include <linux/debugfs.h>
+#include <linux/blkdev.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/mutex.h>
+#include <linux/hdreg.h>
+#include <linux/genhd.h>
+#include <linux/sizes.h>
+#include <linux/ndctl.h>
+#include <linux/fs.h>
+#include <linux/nd.h>
+#include "btt.h"
+#include "nd.h"
+
+enum log_ent_request {
+	LOG_NEW_ENT = 0,
+	LOG_OLD_ENT
+};
+
+static int btt_major;
+
+static int nd_btt_rw_bytes(struct nd_btt *nd_btt, void *buf, size_t offset,
+		size_t n, unsigned long flags)
+{
+	struct nd_io *ndio = nd_btt->ndio;
+
+	if (unlikely(nd_data_dir(flags) == WRITE)
+			&& bdev_read_only(nd_btt->backing_dev))
+		return -EACCES;
+
+	return ndio->rw_bytes(ndio, buf, offset + nd_btt->offset, n, flags);
+}
+
+static int arena_rw_bytes(struct arena_info *arena, void *buf, size_t n,
+		size_t offset, unsigned long flags)
+{
+	/* yes, FIXME,  'offset' and 'n' are swapped */
+	return nd_btt_rw_bytes(arena->nd_btt, buf, offset, n, flags);
+}
+
+static int btt_info_write(struct arena_info *arena, struct btt_sb *super)
+{
+	int ret;
+
+	ret = arena_rw_bytes(arena, super, sizeof(struct btt_sb),
+			arena->info2off, WRITE);
+	if (ret)
+		return ret;
+
+	return arena_rw_bytes(arena, super, sizeof(struct btt_sb),
+			arena->infooff, WRITE);
+}
+
+static int btt_info_read(struct arena_info *arena, struct btt_sb *super)
+{
+	WARN_ON(!super);
+	return arena_rw_bytes(arena, super, sizeof(struct btt_sb),
+			arena->infooff, READ);
+}
+
+/*
+ * 'raw' version of btt_map write
+ * Assumptions:
+ *   mapping is in little-endian
+ *   mapping contains 'E' and 'Z' flags as desired
+ */
+static int __btt_map_write(struct arena_info *arena, u32 lba, __le32 mapping)
+{
+	u64 ns_off = arena->mapoff + (lba * MAP_ENT_SIZE);
+
+	WARN_ON(lba >= arena->external_nlba);
+	return arena_rw_bytes(arena, &mapping, MAP_ENT_SIZE, ns_off, WRITE);
+}
+
+static int btt_map_write(struct arena_info *arena, u32 lba, u32 mapping,
+			u32 z_flag, u32 e_flag)
+{
+	u32 ze;
+	__le32 mapping_le;
+
+	/*
+	 * This 'mapping' is supposed to be just the LBA mapping, without
+	 * any flags set, so strip the flag bits.
+	 */
+	mapping &= MAP_LBA_MASK;
+
+	ze = (z_flag << 1) + e_flag;
+	switch (ze) {
+	case 0:
+		/*
+		 * We want to set neither of the Z or E flags, and
+		 * in the actual layout, this means setting the bit
+		 * positions of both to '1' to indicate a 'normal'
+		 * map entry
+		 */
+		mapping |= MAP_ENT_NORMAL;
+		break;
+	case 1:
+		mapping |= (1 << MAP_ERR_SHIFT);
+		break;
+	case 2:
+		mapping |= (1 << MAP_TRIM_SHIFT);
+		break;
+	default:
+		/*
+		 * The case where Z and E are both sent in as '1' could be
+		 * construed as a valid 'normal' case, but we decide not to,
+		 * to avoid confusion
+		 */
+		WARN_ONCE(1, "Invalid use of Z and E flags\n");
+		return -EIO;
+	}
+
+	mapping_le = cpu_to_le32(mapping);
+	return __btt_map_write(arena, lba, mapping_le);
+}
+
+static int btt_map_read(struct arena_info *arena, u32 lba, u32 *mapping,
+			int *trim, int *error)
+{
+	int ret;
+	__le32 in;
+	u32 raw_mapping, postmap, ze, z_flag, e_flag;
+	u64 ns_off = arena->mapoff + (lba * MAP_ENT_SIZE);
+
+	WARN_ON(lba >= arena->external_nlba);
+
+	ret = arena_rw_bytes(arena, &in, MAP_ENT_SIZE, ns_off, READ);
+	if (ret)
+		return ret;
+
+	raw_mapping = le32_to_cpu(in);
+
+	z_flag = (raw_mapping & MAP_TRIM_MASK) >> MAP_TRIM_SHIFT;
+	e_flag = (raw_mapping & MAP_ERR_MASK) >> MAP_ERR_SHIFT;
+	ze = (z_flag << 1) + e_flag;
+	postmap = raw_mapping & MAP_LBA_MASK;
+
+	/* Reuse the {z,e}_flag variables for *trim and *error */
+	z_flag = 0;
+	e_flag = 0;
+
+	switch (ze) {
+	case 0:
+		/* Initial state. Return postmap = premap */
+		*mapping = lba;
+		break;
+	case 1:
+		*mapping = postmap;
+		e_flag = 1;
+		break;
+	case 2:
+		*mapping = postmap;
+		z_flag = 1;
+		break;
+	case 3:
+		*mapping = postmap;
+		break;
+	default:
+		return -EIO;
+	}
+
+	if (trim)
+		*trim = z_flag;
+	if (error)
+		*error = e_flag;
+
+	return ret;
+}
+
+static int btt_log_read_pair(struct arena_info *arena, u32 lane,
+			struct log_entry *ent)
+{
+	WARN_ON(!ent);
+	return arena_rw_bytes(arena, ent, 2 * LOG_ENT_SIZE,
+			arena->logoff + (2 * lane * LOG_ENT_SIZE), READ);
+}
+
+static struct dentry *debugfs_root;
+
+static void arena_debugfs_init(struct arena_info *a, struct dentry *parent,
+				int idx)
+{
+	char dirname[32];
+	struct dentry *d;
+
+	/* If for some reason, parent bttN was not created, exit */
+	if (!parent)
+		return;
+
+	snprintf(dirname, 32, "arena%d", idx);
+	d = debugfs_create_dir(dirname, parent);
+	if (IS_ERR_OR_NULL(d))
+		return;
+	a->debugfs_dir = d;
+
+	debugfs_create_x64("size", S_IRUGO, d, &a->size);
+	debugfs_create_x64("external_lba_start", S_IRUGO, d,
+				&a->external_lba_start);
+	debugfs_create_x32("internal_nlba", S_IRUGO, d, &a->internal_nlba);
+	debugfs_create_u32("internal_lbasize", S_IRUGO, d,
+				&a->internal_lbasize);
+	debugfs_create_x32("external_nlba", S_IRUGO, d, &a->external_nlba);
+	debugfs_create_u32("external_lbasize", S_IRUGO, d,
+				&a->external_lbasize);
+	debugfs_create_u32("nfree", S_IRUGO, d, &a->nfree);
+	debugfs_create_u16("version_major", S_IRUGO, d, &a->version_major);
+	debugfs_create_u16("version_minor", S_IRUGO, d, &a->version_minor);
+	debugfs_create_x64("nextoff", S_IRUGO, d, &a->nextoff);
+	debugfs_create_x64("infooff", S_IRUGO, d, &a->infooff);
+	debugfs_create_x64("dataoff", S_IRUGO, d, &a->dataoff);
+	debugfs_create_x64("mapoff", S_IRUGO, d, &a->mapoff);
+	debugfs_create_x64("logoff", S_IRUGO, d, &a->logoff);
+	debugfs_create_x64("info2off", S_IRUGO, d, &a->info2off);
+	debugfs_create_x32("flags", S_IRUGO, d, &a->flags);
+}
+
+static void btt_debugfs_init(struct btt *btt)
+{
+	int i = 0;
+	struct arena_info *arena;
+
+	btt->debugfs_dir = debugfs_create_dir(dev_name(&btt->nd_btt->dev),
+						debugfs_root);
+	if (IS_ERR_OR_NULL(btt->debugfs_dir))
+		return;
+
+	list_for_each_entry(arena, &btt->arena_list, list) {
+		arena_debugfs_init(arena, btt->debugfs_dir, i);
+		i++;
+	}
+}
+
+/*
+ * This function accepts a pair of log entries and uses their sequence
+ * numbers to find the 'older' entry. If ent[0] has never been written
+ * (ent[0].seq == 0), it also sets ent[0].seq to 1, the initial value.
+ * It returns the index (0 or 1) of the older entry, or -EINVAL if the
+ * sequence numbers are inconsistent.
+ *
+ * TODO: the logic feels a bit kludgy; make it better.
+ */
+static int btt_log_get_old(struct log_entry *ent)
+{
+	int old;
+
+	/*
+	 * the first ever time this is seen, the entry goes into [0]
+	 * the next time, the following logic works out to put this
+	 * (next) entry into [1]
+	 */
+	if (ent[0].seq == 0) {
+		ent[0].seq = cpu_to_le32(1);
+		return 0;
+	}
+
+	if (ent[0].seq == ent[1].seq)
+		return -EINVAL;
+	if (le32_to_cpu(ent[0].seq) + le32_to_cpu(ent[1].seq) > 5)
+		return -EINVAL;
+
+	if (le32_to_cpu(ent[0].seq) < le32_to_cpu(ent[1].seq)) {
+		if (le32_to_cpu(ent[1].seq) - le32_to_cpu(ent[0].seq) == 1)
+			old = 0;
+		else
+			old = 1;
+	} else {
+		if (le32_to_cpu(ent[0].seq) - le32_to_cpu(ent[1].seq) == 1)
+			old = 1;
+		else
+			old = 0;
+	}
+
+	return old;
+}
+
+static struct device *to_dev(struct arena_info *arena)
+{
+	return &arena->nd_btt->dev;
+}
+
+/*
+ * This function copies the desired (old/new) log entry into ent if
+ * it is not NULL. It returns the sub-slot number (0 or 1)
+ * where the desired log entry was found. Negative return values
+ * indicate errors.
+ */
+static int btt_log_read(struct arena_info *arena, u32 lane,
+			struct log_entry *ent, int old_flag)
+{
+	int ret;
+	int old_ent, ret_ent;
+	struct log_entry log[2];
+
+	ret = btt_log_read_pair(arena, lane, log);
+	if (ret)
+		return -EIO;
+
+	old_ent = btt_log_get_old(log);
+	if (old_ent < 0 || old_ent > 1) {
+		dev_info(to_dev(arena), "log corruption (%d): lane %u seq [%u, %u]\n",
+				old_ent, lane, le32_to_cpu(log[0].seq),
+				le32_to_cpu(log[1].seq));
+		/* TODO set error state? */
+		return -EIO;
+	}
+
+	ret_ent = (old_flag ? old_ent : (1 - old_ent));
+
+	if (ent != NULL)
+		memcpy(ent, &log[ret_ent], LOG_ENT_SIZE);
+
+	return ret_ent;
+}
+
+/*
+ * This function commits a log entry to media.
+ * It does _not_ prepare the freelist entry for the next write;
+ * btt_flog_write is the wrapper that also updates the freelist.
+ */
+static int __btt_log_write(struct arena_info *arena, u32 lane,
+			u32 sub, struct log_entry *ent)
+{
+	int ret;
+	/*
+	 * Ignore the padding in log_entry for calculating log_half.
+	 * The entry is 'committed' when we write the sequence number,
+	 * and we want to ensure that that is the last thing written.
+	 * We don't bother writing the padding as that would be extra
+	 * media wear and write amplification
+	 */
+	unsigned int log_half = (LOG_ENT_SIZE - 2 * sizeof(u64)) / 2;
+	u64 ns_off = arena->logoff + (((2 * lane) + sub) * LOG_ENT_SIZE);
+	void *src = ent;
+
+	/* split the 16B write into atomic, durable halves */
+	ret = arena_rw_bytes(arena, src, log_half, ns_off, WRITE);
+	if (ret)
+		return ret;
+
+	ns_off += log_half;
+	src += log_half;
+	return arena_rw_bytes(arena, src, log_half, ns_off, WRITE);
+}
+
+static int btt_flog_write(struct arena_info *arena, u32 lane, u32 sub,
+			struct log_entry *ent)
+{
+	int ret;
+
+	ret = __btt_log_write(arena, lane, sub, ent);
+	if (ret)
+		return ret;
+
+	/* prepare the next free entry */
+	arena->freelist[lane].sub = 1 - arena->freelist[lane].sub;
+	if (++(arena->freelist[lane].seq) == 4)
+		arena->freelist[lane].seq = 1;
+	arena->freelist[lane].block = le32_to_cpu(ent->old_map);
+
+	return ret;
+}
+
+/*
+ * This function initializes the BTT map to the initial state, which is
+ * all-zeroes, and indicates an identity mapping
+ */
+static int btt_map_init(struct arena_info *arena)
+{
+	int ret = -EINVAL;
+	void *zerobuf;
+	size_t offset = 0;
+	size_t chunk_size = SZ_2M;
+	size_t mapsize = arena->logoff - arena->mapoff;
+
+	zerobuf = kzalloc(chunk_size, GFP_KERNEL);
+	if (!zerobuf)
+		return -ENOMEM;
+
+	while (mapsize) {
+		size_t size = min(mapsize, chunk_size);
+
+		ret = arena_rw_bytes(arena, zerobuf, size,
+				arena->mapoff + offset, WRITE);
+		if (ret)
+			goto free;
+
+		offset += size;
+		mapsize -= size;
+		cond_resched();
+	}
+
+ free:
+	kfree(zerobuf);
+	return ret;
+}
+
+/*
+ * This function initializes the BTT log with 'fake' entries pointing
+ * to the initial reserved set of blocks as being free
+ */
+static int btt_log_init(struct arena_info *arena)
+{
+	int ret;
+	u32 i;
+	struct log_entry log, zerolog;
+
+	memset(&zerolog, 0, sizeof(zerolog));
+
+	for (i = 0; i < arena->nfree; i++) {
+		log.lba = cpu_to_le32(i);
+		log.old_map = cpu_to_le32(arena->external_nlba + i);
+		log.new_map = cpu_to_le32(arena->external_nlba + i);
+		log.seq = cpu_to_le32(LOG_SEQ_INIT);
+		ret = __btt_log_write(arena, i, 0, &log);
+		if (ret)
+			return ret;
+		ret = __btt_log_write(arena, i, 1, &zerolog);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+static int btt_freelist_init(struct arena_info *arena)
+{
+	int old, new, ret;
+	u32 i, map_entry;
+	struct log_entry log_new, log_old;
+
+	arena->freelist = kcalloc(arena->nfree, sizeof(struct free_entry),
+					GFP_KERNEL);
+	if (!arena->freelist)
+		return -ENOMEM;
+
+	for (i = 0; i < arena->nfree; i++) {
+		old = btt_log_read(arena, i, &log_old, LOG_OLD_ENT);
+		if (old < 0)
+			return old;
+
+		new = btt_log_read(arena, i, &log_new, LOG_NEW_ENT);
+		if (new < 0)
+			return new;
+
+		/* sub points to the next one to be overwritten */
+		arena->freelist[i].sub = 1 - new;
+		arena->freelist[i].seq = nd_inc_seq(le32_to_cpu(log_new.seq));
+		arena->freelist[i].block = le32_to_cpu(log_new.old_map);
+
+		/* This implies a newly created or untouched flog entry */
+		if (log_new.old_map == log_new.new_map)
+			continue;
+
+		/* Check if map recovery is needed */
+		ret = btt_map_read(arena, le32_to_cpu(log_new.lba), &map_entry,
+				NULL, NULL);
+		if (ret)
+			return ret;
+		if ((le32_to_cpu(log_new.new_map) != map_entry) &&
+				(le32_to_cpu(log_new.old_map) == map_entry)) {
+			/*
+			 * Last transaction wrote the flog, but wasn't able
+			 * to complete the map write. So fix up the map.
+			 */
+			ret = btt_map_write(arena, le32_to_cpu(log_new.lba),
+					le32_to_cpu(log_new.new_map), 0, 0);
+			if (ret)
+				return ret;
+		}
+
+	}
+
+	return 0;
+}
+
+static int btt_rtt_init(struct arena_info *arena)
+{
+	arena->rtt = kcalloc(arena->nfree, sizeof(u32), GFP_KERNEL);
+	if (arena->rtt == NULL)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int btt_maplocks_init(struct arena_info *arena)
+{
+	u32 i;
+
+	arena->map_locks = kcalloc(arena->nfree, sizeof(struct aligned_lock),
+				GFP_KERNEL);
+	if (!arena->map_locks)
+		return -ENOMEM;
+
+	for (i = 0; i < arena->nfree; i++)
+		spin_lock_init(&arena->map_locks[i].lock);
+
+	return 0;
+}
+
+static struct arena_info *alloc_arena(struct btt *btt, size_t size,
+				size_t start, size_t arena_off)
+{
+	struct arena_info *arena;
+	u64 logsize, mapsize, datasize;
+	u64 available = size;
+
+	arena = kzalloc(sizeof(struct arena_info), GFP_KERNEL);
+	if (!arena)
+		return NULL;
+	arena->nd_btt = btt->nd_btt;
+
+	if (!size)
+		return arena;
+
+	arena->size = size;
+	arena->external_lba_start = start;
+	arena->external_lbasize = btt->lbasize;
+	arena->internal_lbasize = roundup(arena->external_lbasize,
+					INT_LBASIZE_ALIGNMENT);
+	arena->nfree = BTT_DEFAULT_NFREE;
+	arena->version_major = 1;
+	arena->version_minor = 1;
+
+	if (available % BTT_PG_SIZE)
+		available -= (available % BTT_PG_SIZE);
+
+	/* Two pages are reserved for the super block and its copy */
+	available -= 2 * BTT_PG_SIZE;
+
+	/* The log takes a fixed amount of space based on nfree */
+	logsize = roundup(2 * arena->nfree * sizeof(struct log_entry),
+				BTT_PG_SIZE);
+	available -= logsize;
+
+	/* Calculate optimal split between map and data area */
+	arena->internal_nlba = div_u64(available - BTT_PG_SIZE,
+			arena->internal_lbasize + MAP_ENT_SIZE);
+	arena->external_nlba = arena->internal_nlba - arena->nfree;
+
+	mapsize = roundup((arena->external_nlba * MAP_ENT_SIZE), BTT_PG_SIZE);
+	datasize = available - mapsize;
+
+	/* 'Absolute' values, relative to start of storage space */
+	arena->infooff = arena_off;
+	arena->dataoff = arena->infooff + BTT_PG_SIZE;
+	arena->mapoff = arena->dataoff + datasize;
+	arena->logoff = arena->mapoff + mapsize;
+	arena->info2off = arena->logoff + logsize;
+	return arena;
+}
+
+static void free_arenas(struct btt *btt)
+{
+	struct arena_info *arena, *next;
+
+	list_for_each_entry_safe(arena, next, &btt->arena_list, list) {
+		list_del(&arena->list);
+		kfree(arena->rtt);
+		kfree(arena->map_locks);
+		kfree(arena->freelist);
+		debugfs_remove_recursive(arena->debugfs_dir);
+		kfree(arena);
+	}
+}
+
+/*
+ * This function checks if the metadata layout is valid and error free
+ */
+static int arena_is_valid(struct arena_info *arena, struct btt_sb *super,
+				u8 *uuid, u32 lbasize)
+{
+	u64 checksum;
+
+	if (memcmp(super->uuid, uuid, 16))
+		return 0;
+
+	checksum = le64_to_cpu(super->checksum);
+	super->checksum = 0;
+	if (checksum != nd_btt_sb_checksum(super))
+		return 0;
+	super->checksum = cpu_to_le64(checksum);
+
+	if (lbasize != le32_to_cpu(super->external_lbasize))
+		return 0;
+
+	/* TODO: figure out action for this */
+	if ((le32_to_cpu(super->flags) & IB_FLAG_ERROR_MASK) != 0)
+		dev_info(to_dev(arena), "Found arena with an error flag\n");
+
+	return 1;
+}
+
+/*
+ * This function reads an existing valid btt superblock and
+ * populates the corresponding arena_info struct
+ */
+static void parse_arena_meta(struct arena_info *arena, struct btt_sb *super,
+				u64 arena_off)
+{
+	arena->internal_nlba = le32_to_cpu(super->internal_nlba);
+	arena->internal_lbasize = le32_to_cpu(super->internal_lbasize);
+	arena->external_nlba = le32_to_cpu(super->external_nlba);
+	arena->external_lbasize = le32_to_cpu(super->external_lbasize);
+	arena->nfree = le32_to_cpu(super->nfree);
+	arena->version_major = le16_to_cpu(super->version_major);
+	arena->version_minor = le16_to_cpu(super->version_minor);
+
+	arena->nextoff = (super->nextoff == 0) ? 0 : (arena_off +
+			le64_to_cpu(super->nextoff));
+	arena->infooff = arena_off;
+	arena->dataoff = arena_off + le64_to_cpu(super->dataoff);
+	arena->mapoff = arena_off + le64_to_cpu(super->mapoff);
+	arena->logoff = arena_off + le64_to_cpu(super->logoff);
+	arena->info2off = arena_off + le64_to_cpu(super->info2off);
+
+	arena->size = (super->nextoff > 0) ? (le64_to_cpu(super->nextoff)) :
+			(arena->info2off - arena->infooff + BTT_PG_SIZE);
+
+	arena->flags = le32_to_cpu(super->flags);
+}
+
+static int discover_arenas(struct btt *btt)
+{
+	int ret = 0;
+	struct arena_info *arena;
+	struct btt_sb *super;
+	size_t remaining = btt->rawsize;
+	u64 cur_nlba = 0;
+	size_t cur_off = 0;
+	int num_arenas = 0;
+
+	super = kzalloc(sizeof(*super), GFP_KERNEL);
+	if (!super)
+		return -ENOMEM;
+
+	while (remaining) {
+		/* Alloc memory for arena */
+		arena = alloc_arena(btt, 0, 0, 0);
+		if (!arena) {
+			ret = -ENOMEM;
+			goto out_super;
+		}
+
+		arena->infooff = cur_off;
+		ret = btt_info_read(arena, super);
+		if (ret)
+			goto out;
+
+		if (!arena_is_valid(arena, super, btt->nd_btt->uuid,
+				btt->lbasize)) {
+			if (remaining == btt->rawsize) {
+				btt->init_state = INIT_NOTFOUND;
+				dev_info(to_dev(arena), "No existing arenas\n");
+				goto out;
+			} else {
+				dev_info(to_dev(arena),
+						"Found corrupted metadata!\n");
+				ret = -ENODEV;
+				goto out;
+			}
+		}
+
+		arena->external_lba_start = cur_nlba;
+		parse_arena_meta(arena, super, cur_off);
+
+		ret = btt_freelist_init(arena);
+		if (ret)
+			goto out;
+
+		ret = btt_rtt_init(arena);
+		if (ret)
+			goto out;
+
+		ret = btt_maplocks_init(arena);
+		if (ret)
+			goto out;
+
+		list_add_tail(&arena->list, &btt->arena_list);
+
+		remaining -= arena->size;
+		cur_off += arena->size;
+		cur_nlba += arena->external_nlba;
+		num_arenas++;
+
+		if (arena->nextoff == 0)
+			break;
+	}
+	btt->num_arenas = num_arenas;
+	btt->nlba = cur_nlba;
+	btt->init_state = INIT_READY;
+
+	kfree(super);
+	return ret;
+
+ out:
+	kfree(arena);
+	free_arenas(btt);
+ out_super:
+	kfree(super);
+	return ret;
+}
+
+static int create_arenas(struct btt *btt)
+{
+	size_t remaining = btt->rawsize;
+	size_t cur_off = 0;
+
+	while (remaining) {
+		struct arena_info *arena;
+		size_t arena_size = min_t(u64, ARENA_MAX_SIZE, remaining);
+
+		remaining -= arena_size;
+		if (arena_size < ARENA_MIN_SIZE)
+			break;
+
+		arena = alloc_arena(btt, arena_size, btt->nlba, cur_off);
+		if (!arena) {
+			free_arenas(btt);
+			return -ENOMEM;
+		}
+		btt->nlba += arena->external_nlba;
+		if (remaining >= ARENA_MIN_SIZE)
+			arena->nextoff = arena->size;
+		else
+			arena->nextoff = 0;
+		cur_off += arena_size;
+		list_add_tail(&arena->list, &btt->arena_list);
+	}
+
+	return 0;
+}
+
+/*
+ * This function completes arena initialization by writing
+ * all the metadata.
+ * It is called from btt_meta_init() for each arena when no
+ * existing BTT layout was found on the backing device.
+ */
+static int btt_arena_write_layout(struct arena_info *arena, u8 *uuid)
+{
+	int ret;
+	struct btt_sb *super;
+
+	ret = btt_map_init(arena);
+	if (ret)
+		return ret;
+
+	ret = btt_log_init(arena);
+	if (ret)
+		return ret;
+
+	super = kzalloc(sizeof(struct btt_sb), GFP_NOIO);
+	if (!super)
+		return -ENOMEM;
+
+	strncpy(super->signature, BTT_SIG, BTT_SIG_LEN);
+	memcpy(super->uuid, uuid, 16);
+	super->flags = cpu_to_le32(arena->flags);
+	super->version_major = cpu_to_le16(arena->version_major);
+	super->version_minor = cpu_to_le16(arena->version_minor);
+	super->external_lbasize = cpu_to_le32(arena->external_lbasize);
+	super->external_nlba = cpu_to_le32(arena->external_nlba);
+	super->internal_lbasize = cpu_to_le32(arena->internal_lbasize);
+	super->internal_nlba = cpu_to_le32(arena->internal_nlba);
+	super->nfree = cpu_to_le32(arena->nfree);
+	super->infosize = cpu_to_le32(sizeof(struct btt_sb));
+	super->nextoff = cpu_to_le64(arena->nextoff);
+	/*
+	 * Subtract arena->infooff (arena start) so numbers are relative
+	 * to 'this' arena
+	 */
+	super->dataoff = cpu_to_le64(arena->dataoff - arena->infooff);
+	super->mapoff = cpu_to_le64(arena->mapoff - arena->infooff);
+	super->logoff = cpu_to_le64(arena->logoff - arena->infooff);
+	super->info2off = cpu_to_le64(arena->info2off - arena->infooff);
+
+	super->flags = 0;
+	super->checksum = cpu_to_le64(nd_btt_sb_checksum(super));
+
+	ret = btt_info_write(arena, super);
+
+	kfree(super);
+	return ret;
+}
+
+/*
+ * This function completes the initialization for the BTT namespace
+ * such that it is ready to accept IOs
+ */
+static int btt_meta_init(struct btt *btt)
+{
+	int ret = 0;
+	struct arena_info *arena;
+
+	mutex_lock(&btt->init_lock);
+	list_for_each_entry(arena, &btt->arena_list, list) {
+		ret = btt_arena_write_layout(arena, btt->nd_btt->uuid);
+		if (ret)
+			goto unlock;
+
+		ret = btt_freelist_init(arena);
+		if (ret)
+			goto unlock;
+
+		ret = btt_rtt_init(arena);
+		if (ret)
+			goto unlock;
+
+		ret = btt_maplocks_init(arena);
+		if (ret)
+			goto unlock;
+	}
+
+	btt->init_state = INIT_READY;
+
+ unlock:
+	mutex_unlock(&btt->init_lock);
+	return ret;
+}
+
+/*
+ * This function calculates the arena in which the given LBA lies
+ * by doing a linear walk. This is acceptable since we expect only
+ * a few arenas. If we have backing devices that get much larger,
+ * we can construct a balanced binary tree of arenas at init time
+ * so that this range search becomes faster.
+ */
+static int lba_to_arena(struct btt *btt, sector_t sector, __u32 *premap,
+				struct arena_info **arena)
+{
+	struct arena_info *arena_list;
+	__u64 lba = div_u64(sector << SECTOR_SHIFT, btt->sector_size);
+
+	list_for_each_entry(arena_list, &btt->arena_list, list) {
+		if (lba < arena_list->external_nlba) {
+			*arena = arena_list;
+			*premap = lba;
+			return 0;
+		}
+		lba -= arena_list->external_nlba;
+	}
+
+	return -EIO;
+}
+
+/*
+ * The following (lock_map, unlock_map) are mostly just to improve
+ * readability, since they index into an array of locks
+ */
+static void lock_map(struct arena_info *arena, u32 premap)
+{
+	u32 idx = (premap * MAP_ENT_SIZE / L1_CACHE_BYTES) % arena->nfree;
+
+	spin_lock(&arena->map_locks[idx].lock);
+}
+
+static void unlock_map(struct arena_info *arena, u32 premap)
+{
+	u32 idx = (premap * MAP_ENT_SIZE / L1_CACHE_BYTES) % arena->nfree;
+
+	spin_unlock(&arena->map_locks[idx].lock);
+}
+
+static u64 to_namespace_offset(struct arena_info *arena, u64 lba)
+{
+	return arena->dataoff + ((u64)lba * arena->internal_lbasize);
+}
+
+static int btt_data_read(struct arena_info *arena, struct page *page,
+			unsigned int off, u32 lba, u32 len)
+{
+	int ret;
+	u64 nsoff = to_namespace_offset(arena, lba);
+	void *mem = kmap_atomic(page);
+
+	ret = arena_rw_bytes(arena, mem + off, len, nsoff, READ);
+	kunmap_atomic(mem);
+
+	return ret;
+}
+
+static int btt_data_write(struct arena_info *arena, u32 lba,
+			struct page *page, unsigned int off, u32 len)
+{
+	int ret;
+	u64 nsoff = to_namespace_offset(arena, lba);
+	void *mem = kmap_atomic(page);
+
+	ret = arena_rw_bytes(arena, mem + off, len, nsoff, WRITE);
+	kunmap_atomic(mem);
+
+	return ret;
+}
+
+static void zero_fill_data(struct page *page, unsigned int off, u32 len)
+{
+	void *mem = kmap_atomic(page);
+
+	memset(mem + off, 0, len);
+	kunmap_atomic(mem);
+}
+
+static int btt_read_pg(struct btt *btt, struct page *page, unsigned int off,
+			sector_t sector, unsigned int len)
+{
+	int ret = 0;
+	int t_flag, e_flag;
+	struct arena_info *arena = NULL;
+	u32 lane = 0, premap, postmap;
+
+	while (len) {
+		u32 cur_len;
+
+		lane = nd_region_acquire_lane(btt->nd_region);
+
+		ret = lba_to_arena(btt, sector, &premap, &arena);
+		if (ret)
+			goto out_lane;
+
+		cur_len = min(btt->sector_size, len);
+
+		ret = btt_map_read(arena, premap, &postmap, &t_flag, &e_flag);
+		if (ret)
+			goto out_lane;
+
+		/*
+		 * We loop to make sure that the post map LBA didn't change
+		 * from under us between writing the RTT and doing the actual
+		 * read.
+		 */
+		while (1) {
+			u32 new_map;
+
+			if (t_flag) {
+				zero_fill_data(page, off, cur_len);
+				goto out_lane;
+			}
+
+			if (e_flag) {
+				ret = -EIO;
+				goto out_lane;
+			}
+
+			arena->rtt[lane] = RTT_VALID | postmap;
+			/*
+			 * Compiler barrier so that the verification map_read
+			 * below is not reordered ahead of the RTT store
+			 */
+			barrier();
+
+			ret = btt_map_read(arena, premap, &new_map, &t_flag,
+						&e_flag);
+			if (ret)
+				goto out_rtt;
+
+			if (postmap == new_map)
+				break;
+
+			postmap = new_map;
+		}
+
+		ret = btt_data_read(arena, page, off, postmap, cur_len);
+		if (ret)
+			goto out_rtt;
+
+		arena->rtt[lane] = RTT_INVALID;
+		nd_region_release_lane(btt->nd_region, lane);
+
+		len -= cur_len;
+		off += cur_len;
+		sector += btt->sector_size >> SECTOR_SHIFT;
+	}
+
+	return 0;
+
+ out_rtt:
+	arena->rtt[lane] = RTT_INVALID;
+ out_lane:
+	nd_region_release_lane(btt->nd_region, lane);
+	return ret;
+}
+
+static int btt_write_pg(struct btt *btt, sector_t sector, struct page *page,
+		unsigned int off, unsigned int len)
+{
+	int ret = 0;
+	struct arena_info *arena = NULL;
+	u32 premap = 0, old_postmap, new_postmap, lane = 0, i;
+	struct log_entry log;
+	int sub;
+
+	while (len) {
+		u32 cur_len;
+
+		lane = nd_region_acquire_lane(btt->nd_region);
+
+		ret = lba_to_arena(btt, sector, &premap, &arena);
+		if (ret)
+			goto out_lane;
+		cur_len = min(btt->sector_size, len);
+
+		if ((arena->flags & IB_FLAG_ERROR_MASK) != 0) {
+			ret = -EIO;
+			goto out_lane;
+		}
+
+		new_postmap = arena->freelist[lane].block;
+
+		/* Wait if the new block is being read from */
+		for (i = 0; i < arena->nfree; i++)
+			while (arena->rtt[i] == (RTT_VALID | new_postmap))
+				cpu_relax();
+
+
+		if (new_postmap >= arena->internal_nlba) {
+			ret = -EIO;
+			goto out_lane;
+		} else
+			ret = btt_data_write(arena, new_postmap, page,
+						off, cur_len);
+		if (ret)
+			goto out_lane;
+
+		lock_map(arena, premap);
+		ret = btt_map_read(arena, premap, &old_postmap, NULL, NULL);
+		if (ret)
+			goto out_map;
+		if (old_postmap >= arena->internal_nlba) {
+			ret = -EIO;
+			goto out_map;
+		}
+
+		log.lba = cpu_to_le32(premap);
+		log.old_map = cpu_to_le32(old_postmap);
+		log.new_map = cpu_to_le32(new_postmap);
+		log.seq = cpu_to_le32(arena->freelist[lane].seq);
+		sub = arena->freelist[lane].sub;
+		ret = btt_flog_write(arena, lane, sub, &log);
+		if (ret)
+			goto out_map;
+
+		ret = btt_map_write(arena, premap, new_postmap, 0, 0);
+		if (ret)
+			goto out_map;
+
+		unlock_map(arena, premap);
+		nd_region_release_lane(btt->nd_region, lane);
+
+		len -= cur_len;
+		off += cur_len;
+		sector += btt->sector_size >> SECTOR_SHIFT;
+	}
+
+	return 0;
+
+ out_map:
+	unlock_map(arena, premap);
+ out_lane:
+	nd_region_release_lane(btt->nd_region, lane);
+	return ret;
+}
+
+static int btt_do_bvec(struct btt *btt, struct page *page,
+			unsigned int len, unsigned int off, int rw,
+			sector_t sector)
+{
+	int ret;
+
+	if (rw == READ) {
+		ret = btt_read_pg(btt, page, off, sector, len);
+		flush_dcache_page(page);
+	} else {
+		flush_dcache_page(page);
+		ret = btt_write_pg(btt, sector, page, off, len);
+	}
+
+	return ret;
+}
+
+static void btt_make_request(struct request_queue *q, struct bio *bio)
+{
+	struct block_device *bdev = bio->bi_bdev;
+	struct btt *btt = q->queuedata;
+	int rw;
+	struct bio_vec bvec;
+	sector_t sector;
+	struct bvec_iter iter;
+	int err = 0;
+
+	sector = bio->bi_iter.bi_sector;
+	if (bio_end_sector(bio) > get_capacity(bdev->bd_disk)) {
+		err = -EIO;
+		goto out;
+	}
+
+	BUG_ON(bio->bi_rw & REQ_DISCARD);
+
+	rw = bio_rw(bio);
+	if (rw == READA)
+		rw = READ;
+
+	bio_for_each_segment(bvec, bio, iter) {
+		unsigned int len = bvec.bv_len;
+
+		BUG_ON(len > PAGE_SIZE);
+		/* Make sure len is in multiples of sector size. */
+		/* XXX is this right? */
+		BUG_ON(len < btt->sector_size);
+		BUG_ON(len % btt->sector_size);
+
+		err = btt_do_bvec(btt, bvec.bv_page, len, bvec.bv_offset,
+				rw, sector);
+		if (err) {
+			dev_info(&btt->nd_btt->dev,
+					"io error in %s sector %llu, len %u\n",
+					(rw == READ) ? "READ" : "WRITE",
+					(unsigned long long) sector, len);
+			goto out;
+		}
+		sector += len >> SECTOR_SHIFT;
+	}
+
+out:
+	bio_endio(bio, err);
+}
+
+static int btt_rw_page(struct block_device *bdev, sector_t sector,
+		struct page *page, int rw)
+{
+	struct btt *btt = bdev->bd_disk->private_data;
+
+	btt_do_bvec(btt, page, PAGE_CACHE_SIZE, 0, rw, sector);
+	page_endio(page, rw & WRITE, 0);
+	return 0;
+}
+
+
+static int btt_getgeo(struct block_device *bd, struct hd_geometry *geo)
+{
+	/* some standard values */
+	geo->heads = 1 << 6;
+	geo->sectors = 1 << 5;
+	geo->cylinders = get_capacity(bd->bd_disk) >> 11;
+	return 0;
+}
+
+static const struct block_device_operations btt_fops = {
+	.owner =		THIS_MODULE,
+	.rw_page =		btt_rw_page,
+	.getgeo =		btt_getgeo,
+};
+
+static int btt_blk_init(struct btt *btt)
+{
+	struct nd_btt *nd_btt = btt->nd_btt;
+	char name[BDEVNAME_SIZE];
+	int ret;
+
+	/* create a new disk and request queue for btt */
+	btt->btt_queue = blk_alloc_queue(GFP_KERNEL);
+	if (!btt->btt_queue)
+		return -ENOMEM;
+
+	btt->btt_disk = alloc_disk(0);
+	if (!btt->btt_disk) {
+		ret = -ENOMEM;
+		goto out_free_queue;
+	}
+
+	sprintf(btt->btt_disk->disk_name, "%ss",
+			bdevname(nd_btt->backing_dev, name));
+	btt->btt_disk->driverfs_dev = &btt->nd_btt->dev;
+	btt->btt_disk->major = btt_major;
+	btt->btt_disk->first_minor = 0;
+	btt->btt_disk->fops = &btt_fops;
+	btt->btt_disk->private_data = btt;
+	btt->btt_disk->queue = btt->btt_queue;
+	btt->btt_disk->flags = GENHD_FL_EXT_DEVT;
+
+	blk_queue_make_request(btt->btt_queue, btt_make_request);
+	blk_queue_max_hw_sectors(btt->btt_queue, 1024);
+	blk_queue_bounce_limit(btt->btt_queue, BLK_BOUNCE_ANY);
+	blk_queue_logical_block_size(btt->btt_queue, btt->sector_size);
+	btt->btt_queue->queuedata = btt;
+
+	set_capacity(btt->btt_disk, btt->nlba * btt->sector_size >> SECTOR_SHIFT);
+	add_disk(btt->btt_disk);
+
+	return 0;
+
+out_free_queue:
+	blk_cleanup_queue(btt->btt_queue);
+	return ret;
+}
+
+static void btt_blk_cleanup(struct btt *btt)
+{
+	del_gendisk(btt->btt_disk);
+	put_disk(btt->btt_disk);
+	blk_cleanup_queue(btt->btt_queue);
+}
+
+/**
+ * btt_init - initialize a block translation table for the given device
+ * @nd_btt:	device with BTT geometry and backing device info
+ * @rawsize:	raw size in bytes of the backing device
+ * @lbasize:	lba size of the backing device
+ * @uuid:	A uuid for the backing device - this is stored on media
+ * @nd_region:	parent region, used to acquire and release I/O lanes
+ *
+ * Initialize a Block Translation Table on a backing device to provide
+ * single sector power fail atomicity.
+ *
+ * Context:
+ * Might sleep.
+ *
+ * Returns:
+ * Pointer to a new struct btt on success, NULL on failure.
+ */
+static struct btt *btt_init(struct nd_btt *nd_btt, unsigned long long rawsize,
+		u32 lbasize, u8 *uuid, struct nd_region *nd_region)
+{
+	int ret;
+	struct btt *btt;
+	struct device *dev = &nd_btt->dev;
+
+	btt = kzalloc(sizeof(struct btt), GFP_KERNEL);
+	if (!btt)
+		return NULL;
+
+	btt->nd_btt = nd_btt;
+	btt->rawsize = rawsize;
+	btt->lbasize = lbasize;
+	btt->sector_size = ((lbasize >= 4096) ? 4096 : 512);
+	INIT_LIST_HEAD(&btt->arena_list);
+	mutex_init(&btt->init_lock);
+	btt->nd_region = nd_region;
+
+	ret = discover_arenas(btt);
+	if (ret) {
+		dev_err(dev, "init: error in discover_arenas: %d\n", ret);
+		goto out_free;
+	}
+
+	if (btt->init_state != INIT_READY) {
+		btt->num_arenas = (rawsize / ARENA_MAX_SIZE) +
+			((rawsize % ARENA_MAX_SIZE) ? 1 : 0);
+		dev_dbg(dev, "init: %d arenas for %llu rawsize\n",
+				btt->num_arenas, rawsize);
+
+		ret = create_arenas(btt);
+		if (ret) {
+			dev_info(dev, "init: create_arenas: %d\n", ret);
+			goto out_free;
+		}
+
+		ret = btt_meta_init(btt);
+		if (ret) {
+			dev_err(dev, "init: error in meta_init: %d\n", ret);
+			goto out_free;
+		}
+	}
+
+	ret = btt_blk_init(btt);
+	if (ret) {
+		dev_err(dev, "init: error in blk_init: %d\n", ret);
+		goto out_free;
+	}
+
+	btt_debugfs_init(btt);
+
+	return btt;
+
+ out_free:
+	kfree(btt);
+	return NULL;
+}
+
+/**
+ * btt_fini - de-initialize a BTT
+ * @btt:	the BTT handle that was generated by btt_init
+ *
+ * De-initialize a Block Translation Table on device removal
+ *
+ * Context:
+ * Might sleep.
+ */
+static void btt_fini(struct btt *btt)
+{
+	if (btt) {
+		btt_blk_cleanup(btt);
+		free_arenas(btt);
+		debugfs_remove_recursive(btt->debugfs_dir);
+		kfree(btt);
+	}
+}
+
+static int link_btt(struct nd_btt *nd_btt)
+{
+	struct block_device *bdev = nd_btt->backing_dev;
+	struct kobject *dir = &part_to_dev(bdev->bd_part)->kobj;
+
+	return sysfs_create_link(dir, &nd_btt->dev.kobj, "nd_btt");
+}
+
+static void unlink_btt(struct nd_btt *nd_btt)
+{
+	struct block_device *bdev = nd_btt->backing_dev;
+	struct kobject *dir;
+
+	/* if backing_dev was deleted first we may have nothing to unlink */
+	if (!nd_btt->backing_dev)
+		return;
+
+	dir = &part_to_dev(bdev->bd_part)->kobj;
+	sysfs_remove_link(dir, "nd_btt");
+}
+
+static int nd_btt_probe(struct device *dev)
+{
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+	struct nd_io_claim *ndio_claim = nd_btt->ndio_claim;
+	struct nd_region *nd_region;
+	struct block_device *bdev;
+	struct btt *btt;
+	size_t rawsize;
+	int rc;
+
+	if (!ndio_claim || !nd_btt->uuid || !nd_btt->backing_dev
+			|| !nd_btt->lbasize)
+		return -ENODEV;
+
+	rc = link_btt(nd_btt);
+	if (rc)
+		return rc;
+
+	bdev = nd_btt->backing_dev;
+	sync_blockdev(bdev);
+	invalidate_bdev(bdev);
+	/* the first 4K of a device is padding */
+	nd_btt->offset = nd_partition_offset(bdev) + SZ_4K;
+	rawsize = (bdev->bd_part->nr_sects << SECTOR_SHIFT) - SZ_4K;
+	if (rawsize < ARENA_MIN_SIZE) {
+		rc = -ENXIO;
+		goto err_btt;
+	}
+	nd_btt->ndio = nd_btt->ndio_claim->parent;
+	nd_region = to_nd_region(nd_btt->ndio->dev->parent);
+	btt = btt_init(nd_btt, rawsize, nd_btt->lbasize, nd_btt->uuid,
+			nd_region);
+	if (!btt) {
+		rc = -ENOMEM;
+		goto err_btt;
+	}
+	dev_set_drvdata(dev, btt);
+
+	return 0;
+ err_btt:
+	unlink_btt(nd_btt);
+	return rc;
+}
+
+static int nd_btt_remove(struct device *dev)
+{
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+	struct btt *btt = dev_get_drvdata(dev);
+
+	btt_fini(btt);
+	unlink_btt(nd_btt);
+
+	return 0;
+}
+
+static struct nd_device_driver nd_btt_driver = {
+	.probe = nd_btt_probe,
+	.remove = nd_btt_remove,
+	.drv = {
+		.name = "nd_btt",
+	},
+	.type = ND_DRIVER_BTT,
+};
+
+static int __init nd_btt_init(void)
+{
+	int rc;
+
+	BUILD_BUG_ON(sizeof(struct btt_sb) != SZ_4K);
+
+	btt_major = register_blkdev(0, "btt");
+	if (btt_major < 0)
+		return btt_major;
+
+	debugfs_root = debugfs_create_dir("btt", NULL);
+	if (IS_ERR_OR_NULL(debugfs_root)) {
+		rc = -ENXIO;
+		goto err_debugfs;
+	}
+
+	rc = nd_driver_register(&nd_btt_driver);
+	if (rc < 0)
+		goto err_driver;
+	return 0;
+
+ err_driver:
+	debugfs_remove_recursive(debugfs_root);
+ err_debugfs:
+	unregister_blkdev(btt_major, "btt");
+
+	return rc;
+}
+
+static void __exit nd_btt_exit(void)
+{
+	driver_unregister(&nd_btt_driver.drv);
+	debugfs_remove_recursive(debugfs_root);
+	unregister_blkdev(btt_major, "btt");
+}
+
+MODULE_ALIAS_ND_DEVICE(ND_DEVICE_BTT);
+MODULE_AUTHOR("Vishal Verma <vishal.l.verma@linux.intel.com>");
+MODULE_LICENSE("GPL v2");
+module_init(nd_btt_init);
+module_exit(nd_btt_exit);
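
Note on the log sequence numbers used by btt_log_get_old() and
btt_flog_write() above: the following is a minimal userspace sketch, not
part of the patch, of how the cycling 1 -> 2 -> 3 -> 1 sequence lets the
driver tell which entry of a log pair is older. The names next_seq() and
older_slot() are hypothetical, and the values are plain host-endian
integers rather than the __le32 fields of struct log_entry.

```c
#include <assert.h>
#include <stdint.h>

/* Advance a sequence number through the cycle 1 -> 2 -> 3 -> 1. */
static uint32_t next_seq(uint32_t seq)
{
	return (seq == 3) ? 1 : seq + 1;
}

/*
 * Return the index (0 or 1) of the older entry in a log pair, or -1 if
 * the two sequence numbers cannot both be valid. This follows the same
 * checks as btt_log_get_old(), minus the in-place update of slot 0.
 */
static int older_slot(uint32_t seq0, uint32_t seq1)
{
	if (seq0 == 0)
		return 0;	/* slot 0 was never written */
	if (seq0 == seq1 || seq0 + seq1 > 5)
		return -1;	/* equal or out-of-range pair */
	if (seq0 < seq1)
		/* adjacent values: lower is older; otherwise it wrapped */
		return (seq1 - seq0 == 1) ? 0 : 1;
	return (seq0 - seq1 == 1) ? 1 : 0;
}
```

With only three non-zero values in the cycle, a difference of one means
the larger number is newer, while a difference of two means the counter
wrapped from 3 back to 1 and the smaller number is newer. A freshly
initialized pair (seq 1 in slot 0, all zeroes in slot 1) falls out of
the same comparison with slot 1 as the older entry.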
diff --git a/drivers/block/nd/btt.h b/drivers/block/nd/btt.h
index e8f6d8e0ddd3..c9fe38e5b61a 100644
--- a/drivers/block/nd/btt.h
+++ b/drivers/block/nd/btt.h
@@ -19,6 +19,39 @@
 
 #define BTT_SIG_LEN 16
 #define BTT_SIG "BTT_ARENA_INFO\0"
+#define MAP_ENT_SIZE 4
+#define MAP_TRIM_SHIFT 31
+#define MAP_TRIM_MASK (1 << MAP_TRIM_SHIFT)
+#define MAP_ERR_SHIFT 30
+#define MAP_ERR_MASK (1 << MAP_ERR_SHIFT)
+#define MAP_LBA_MASK (~((1 << MAP_TRIM_SHIFT) | (1 << MAP_ERR_SHIFT)))
+#define MAP_ENT_NORMAL 0xC0000000
+#define LOG_ENT_SIZE sizeof(struct log_entry)
+#define ARENA_MIN_SIZE (1UL << 24)	/* 16 MB */
+#define ARENA_MAX_SIZE (1ULL << 39)	/* 512 GB */
+#define RTT_VALID (1UL << 31)
+#define RTT_INVALID 0
+#define INT_LBASIZE_ALIGNMENT 256
+#define BTT_PG_SIZE 4096
+#define BTT_DEFAULT_NFREE ND_MAX_LANES
+#define LOG_SEQ_INIT 1
+
+#define IB_FLAG_ERROR 0x00000001
+#define IB_FLAG_ERROR_MASK 0x00000001
+
+enum btt_init_state {
+	INIT_UNCHECKED = 0,
+	INIT_NOTFOUND,
+	INIT_READY
+};
+
+struct log_entry {
+	__le32 lba;
+	__le32 old_map;
+	__le32 new_map;
+	__le32 seq;
+	__le64 padding[2];
+};
 
 struct btt_sb {
 	u8 signature[BTT_SIG_LEN];
@@ -42,4 +75,112 @@ struct btt_sb {
 	__le64 checksum;
 };
 
+struct free_entry {
+	u32 block;
+	u8 sub;
+	u8 seq;
+};
+
+struct aligned_lock {
+	union {
+		spinlock_t lock;
+		u8 cacheline_padding[L1_CACHE_BYTES];
+	};
+};
+
+/**
+ * struct arena_info - handle for an arena
+ * @size:		Size in bytes this arena occupies on the raw device.
+ *			This includes arena metadata.
+ * @external_lba_start:	The first external LBA in this arena.
+ * @internal_nlba:	Number of internal blocks available in the arena
+ *			including nfree reserved blocks
+ * @internal_lbasize:	Internal and external lba sizes may be different as
+ *			we can round up 'odd' external lbasizes such as 520B
+ *			to be aligned.
+ * @external_nlba:	Number of blocks contributed by the arena to the number
+ *			reported to upper layers. (internal_nlba - nfree)
+ * @external_lbasize:	LBA size as exposed to upper layers.
+ * @nfree:		A reserve number of 'free' blocks that is used to
+ *			handle incoming writes.
+ * @version_major:	Metadata layout version major.
+ * @version_minor:	Metadata layout version minor.
+ * @nextoff:		Offset in bytes to the start of the next arena.
+ * @infooff:		Offset in bytes to the info block of this arena.
+ * @dataoff:		Offset in bytes to the data area of this arena.
+ * @mapoff:		Offset in bytes to the map area of this arena.
+ * @logoff:		Offset in bytes to the log area of this arena.
+ * @info2off:		Offset in bytes to the backup info block of this arena.
+ * @freelist:		Pointer to in-memory list of free blocks
+ * @rtt:		Pointer to in-memory "Read Tracking Table"
+ * @map_locks:		Spinlocks protecting concurrent map writes
+ * @nd_btt:		Pointer to parent nd_btt structure.
+ * @list:		List head for list of arenas
+ * @debugfs_dir:	Debugfs dentry
+ * @flags:		Arena flags - may signify error states.
+ *
+ * arena_info is a per-arena handle. Once an arena is narrowed down for an
+ * IO, this struct is passed around for the duration of the IO.
+ */
+struct arena_info {
+	u64 size;			/* Total bytes for this arena */
+	u64 external_lba_start;
+	u32 internal_nlba;
+	u32 internal_lbasize;
+	u32 external_nlba;
+	u32 external_lbasize;
+	u32 nfree;
+	u16 version_major;
+	u16 version_minor;
+	/* Byte offsets to the different on-media structures */
+	u64 nextoff;
+	u64 infooff;
+	u64 dataoff;
+	u64 mapoff;
+	u64 logoff;
+	u64 info2off;
+	/* Pointers to other in-memory structures for this arena */
+	struct free_entry *freelist;
+	u32 *rtt;
+	struct aligned_lock *map_locks;
+	struct nd_btt *nd_btt;
+	struct list_head list;
+	struct dentry *debugfs_dir;
+	/* Arena flags */
+	u32 flags;
+};
+
+/**
+ * struct btt - handle for a BTT instance
+ * @btt_disk:		Pointer to the gendisk for BTT device
+ * @btt_queue:		Pointer to the request queue for the BTT device
+ * @arena_list:		Head of the list of arenas
+ * @debugfs_dir:	Debugfs dentry
+ * @nd_btt:		Parent nd_btt struct
+ * @nlba:		Number of logical blocks exposed to the upper layers
+ *			after removing the amount of space needed by metadata
+ * @rawsize:		Total size in bytes of the available backing device
+ * @lbasize:		LBA size as requested and presented to upper layers.
+ * 			This is sector_size + size of any metadata.
+ * @sector_size:	The Linux sector size - 512 or 4096
+ * @lanes:		Per-lane spinlocks
+ * @init_lock:		Mutex used for the BTT initialization
+ * @init_state:		Flag describing the initialization state for the BTT
+ * @num_arenas:		Number of arenas in the BTT instance
+ */
+struct btt {
+	struct gendisk *btt_disk;
+	struct request_queue *btt_queue;
+	struct list_head arena_list;
+	struct dentry *debugfs_dir;
+	struct nd_btt *nd_btt;
+	u64 nlba;
+	unsigned long long rawsize;
+	u32 lbasize;
+	u32 sector_size;
+	struct nd_region *nd_region;
+	struct mutex init_lock;
+	int init_state;
+	int num_arenas;
+};
 #endif
diff --git a/drivers/block/nd/btt_devs.c b/drivers/block/nd/btt_devs.c
index b3b813288092..fd6755040751 100644
--- a/drivers/block/nd/btt_devs.c
+++ b/drivers/block/nd/btt_devs.c
@@ -342,7 +342,8 @@ struct nd_btt *nd_btt_create(struct nd_bus *nd_bus)
  */
 u64 nd_btt_sb_checksum(struct btt_sb *btt_sb)
 {
-	u64 sum, sum_save;
+	u64 sum;
+	__le64 sum_save;
 
 	sum_save = btt_sb->checksum;
 	btt_sb->checksum = 0;
diff --git a/drivers/block/nd/nd-private.h b/drivers/block/nd/nd-private.h
index 6c89695956a4..6a864e9ae97a 100644
--- a/drivers/block/nd/nd-private.h
+++ b/drivers/block/nd/nd-private.h
@@ -76,6 +76,7 @@ int __init nd_bus_init(void);
 void nd_bus_exit(void);
 int __init nd_dimm_init(void);
 int __init nd_region_init(void);
+void __init nd_region_init_locks(void);
 void nd_dimm_exit(void);
 int nd_region_exit(void);
 void nd_region_probe_start(struct nd_bus *nd_bus, struct device *dev);
diff --git a/drivers/block/nd/nd.h b/drivers/block/nd/nd.h
index 73e830785f74..b706f25da7e5 100644
--- a/drivers/block/nd/nd.h
+++ b/drivers/block/nd/nd.h
@@ -22,6 +22,12 @@
 #include "label.h"
 
 enum {
+	/*
+	 * Limits the maximum number of block apertures a dimm can
+	 * support and is an input to the geometry/on-disk-format of a
+	 * BTT instance
+	 */
+	ND_MAX_LANES = 256,
 	SECTOR_SHIFT = 9,
 };
 
@@ -101,7 +107,7 @@ struct nd_region {
 	u16 ndr_mappings;
 	u64 ndr_size;
 	u64 ndr_start;
-	int id;
+	int id, num_lanes;
 	void *provider_data;
 	struct nd_interleave_set *nd_set;
 	struct nd_mapping mapping[0];
@@ -226,6 +232,8 @@ struct nd_btt *to_nd_btt(struct device *dev);
 struct btt_sb;
 u64 nd_btt_sb_checksum(struct btt_sb *btt_sb);
 struct nd_region *to_nd_region(struct device *dev);
+unsigned int nd_region_acquire_lane(struct nd_region *nd_region);
+void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane);
 int nd_region_to_namespace_type(struct nd_region *nd_region);
 int nd_region_register_namespaces(struct nd_region *nd_region, int *err);
 u64 nd_region_interleave_set_cookie(struct nd_region *nd_region);
diff --git a/drivers/block/nd/region.c b/drivers/block/nd/region.c
index 31bb33962e14..0e872f54dcd2 100644
--- a/drivers/block/nd/region.c
+++ b/drivers/block/nd/region.c
@@ -10,18 +10,106 @@
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  * General Public License for more details.
  */
+#include <linux/cpumask.h>
 #include <linux/module.h>
 #include <linux/device.h>
 #include <linux/nd.h>
 #include "nd.h"
 
+struct nd_percpu_lane {
+	int count[CONFIG_ND_MAX_REGIONS];
+	spinlock_t lock[CONFIG_ND_MAX_REGIONS];
+};
+
+static DEFINE_PER_CPU(struct nd_percpu_lane, nd_percpu_lane);
+
+static void __init nd_region_init_locks(void)
+{
+	unsigned int i, j;
+
+	for (i = 0; i < nr_cpu_ids; i++)
+		for (j = 0; j < CONFIG_ND_MAX_REGIONS; j++) {
+			struct nd_percpu_lane *ndl;
+
+			ndl = per_cpu_ptr(&nd_percpu_lane, i);
+			spin_lock_init(&ndl->lock[j]);
+			ndl->count[j] = 0;
+		}
+}
+
+/**
+ * nd_region_acquire_lane - allocate and lock a lane
+ * @nd_region: region id and number of lanes possible
+ *
+ * A lane correlates to a BLK-data-window and/or a log slot in the BTT.
+ * We optimize for the common case where there are 256 lanes, one
+ * per-cpu.  For larger systems we need to lock to share lanes.  For now
+ * this implementation assumes the cost of maintaining an allocator for
+ * free lanes is on the order of the lock hold time, so it implements a
+ * static lane = cpu % num_lanes mapping.
+ *
+ * In the case of a BTT instance on top of a BLK namespace a lane may be
+ * acquired recursively.  We lock on the first instance.
+ *
+ * In the case of a BTT instance on top of PMEM, we only acquire a lane
+ * for the BTT metadata updates.
+ */
+unsigned int nd_region_acquire_lane(struct nd_region *nd_region)
+{
+	unsigned int cpu, lane;
+
+	cpu = get_cpu();
+	if (nd_region->num_lanes < nr_cpu_ids) {
+		struct nd_percpu_lane *ndl_lock, *ndl_count;
+		unsigned int id = nd_region->id;
+
+		lane = cpu % nd_region->num_lanes;
+		ndl_count = per_cpu_ptr(&nd_percpu_lane, cpu);
+		ndl_lock = per_cpu_ptr(&nd_percpu_lane, lane);
+		if (ndl_count->count[id]++ == 0)
+			spin_lock(&ndl_lock->lock[id]);
+	} else
+		lane = cpu;
+
+	return lane;
+}
+EXPORT_SYMBOL(nd_region_acquire_lane);
+
+void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane)
+{
+	if (nd_region->num_lanes < nr_cpu_ids) {
+		unsigned int cpu = get_cpu();
+		unsigned int id = nd_region->id;
+		struct nd_percpu_lane *ndl_lock, *ndl_count;
+
+		ndl_count = per_cpu_ptr(&nd_percpu_lane, cpu);
+		ndl_lock = per_cpu_ptr(&nd_percpu_lane, lane);
+		if (--ndl_count->count[id] == 0)
+			spin_unlock(&ndl_lock->lock[id]);
+		put_cpu();
+	}
+	put_cpu();
+}
+EXPORT_SYMBOL(nd_region_release_lane);
+
 static int nd_region_probe(struct device *dev)
 {
 	int err;
+	static unsigned long once;
 	struct nd_region_namespaces *num_ns;
 	struct nd_region *nd_region = to_nd_region(dev);
 	int rc = nd_region_register_namespaces(nd_region, &err);
 
+	if (nd_region->num_lanes > num_online_cpus()
+			&& nd_region->num_lanes < num_possible_cpus()
+			&& !test_and_set_bit(0, &once)) {
+		dev_info(dev, "online cpus (%d) < concurrent i/o lanes (%d) < possible cpus (%d)\n",
+				num_online_cpus(), nd_region->num_lanes,
+				num_possible_cpus());
+		dev_info(dev, "setting nr_cpus=%d may yield better libnd device performance\n",
+				nd_region->num_lanes);
+	}
+
 	num_ns = devm_kzalloc(dev, sizeof(*num_ns), GFP_KERNEL);
 	if (!num_ns)
 		return -ENOMEM;
@@ -84,6 +172,7 @@ static struct nd_device_driver nd_region_driver = {
 
 int __init nd_region_init(void)
 {
+	nd_region_init_locks();
 	return nd_driver_register(&nd_region_driver);
 }
 
diff --git a/drivers/block/nd/region_devs.c b/drivers/block/nd/region_devs.c
index 1ae6bb44c371..4965004147ae 100644
--- a/drivers/block/nd/region_devs.c
+++ b/drivers/block/nd/region_devs.c
@@ -543,6 +543,12 @@ static noinline struct nd_region *nd_region_create(struct nd_bus *nd_bus,
 	if (nd_region->id < 0) {
 		kfree(nd_region);
 		return NULL;
+	} else if (nd_region->id >= CONFIG_ND_MAX_REGIONS) {
+		dev_err(&nd_bus->dev, "max region limit %d reached\n",
+				CONFIG_ND_MAX_REGIONS);
+		ida_simple_remove(&region_ida, nd_region->id);
+		kfree(nd_region);
+		return NULL;
 	}
 
 	memcpy(nd_region->mapping, ndr_desc->nd_mapping,
@@ -556,6 +562,7 @@ static noinline struct nd_region *nd_region_create(struct nd_bus *nd_bus,
 	nd_region->ndr_mappings = ndr_desc->num_mappings;
 	nd_region->provider_data = ndr_desc->provider_data;
 	nd_region->nd_set = ndr_desc->nd_set;
+	nd_region->num_lanes = ndr_desc->num_lanes;
 	ida_init(&nd_region->ns_ida);
 	dev = &nd_region->dev;
 	dev_set_name(dev, "region%d", nd_region->id);
@@ -572,6 +579,7 @@ static noinline struct nd_region *nd_region_create(struct nd_bus *nd_bus,
 struct nd_region *nd_pmem_region_create(struct nd_bus *nd_bus,
 		struct nd_region_desc *ndr_desc)
 {
+	ndr_desc->num_lanes = ND_MAX_LANES;
 	return nd_region_create(nd_bus, ndr_desc, &nd_pmem_device_type);
 }
 EXPORT_SYMBOL_GPL(nd_pmem_region_create);
@@ -581,6 +589,7 @@ struct nd_region *nd_blk_region_create(struct nd_bus *nd_bus,
 {
 	if (ndr_desc->num_mappings > 1)
 		return NULL;
+	ndr_desc->num_lanes = min(ndr_desc->num_lanes, ND_MAX_LANES);
 	return nd_region_create(nd_bus, ndr_desc, &nd_blk_device_type);
 }
 EXPORT_SYMBOL_GPL(nd_blk_region_create);
@@ -588,6 +597,7 @@ EXPORT_SYMBOL_GPL(nd_blk_region_create);
 struct nd_region *nd_volatile_region_create(struct nd_bus *nd_bus,
 		struct nd_region_desc *ndr_desc)
 {
+	ndr_desc->num_lanes = ND_MAX_LANES;
 	return nd_region_create(nd_bus, ndr_desc, &nd_volatile_device_type);
 }
 EXPORT_SYMBOL_GPL(nd_volatile_region_create);
diff --git a/include/linux/libnd.h b/include/linux/libnd.h
index 43f58330d14c..6146690b23e7 100644
--- a/include/linux/libnd.h
+++ b/include/linux/libnd.h
@@ -76,6 +76,7 @@ struct nd_region_desc {
 	const struct attribute_group **attr_groups;
 	struct nd_interleave_set *nd_set;
 	void *provider_data;
+	int num_lanes;
 };
 
 struct nd_bus;
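
The static lane selection in nd_region_acquire_lane() above reduces to a small piece of arithmetic. A minimal userspace sketch, assuming hypothetical helpers lane_needs_lock() and lane_for_cpu() that are not part of this series; it models only the mapping, not preemption control or the per-cpu spinlocks and recursion counts:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; these helpers are not in the patch. */

/* Lanes are shared (and must be locked) only when cpus outnumber lanes. */
static bool lane_needs_lock(unsigned int num_lanes, unsigned int nr_cpus)
{
	return num_lanes < nr_cpus;
}

/* Static mapping from the patch: lane = cpu % num_lanes when shared. */
static unsigned int lane_for_cpu(unsigned int cpu, unsigned int num_lanes,
		unsigned int nr_cpus)
{
	assert(num_lanes > 0 && cpu < nr_cpus);
	if (lane_needs_lock(num_lanes, nr_cpus))
		return cpu % num_lanes;
	return cpu;
}
```

With 256 lanes on a hypothetical 512-cpu system, cpu 300 and cpu 44 would both map to lane 44, which is the case the spinlock exists for. With 256 lanes and 8 cpus every cpu gets a private lane and no lock is taken.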



* [PATCH v3 18/21] nd_btt: atomic sector updates
From: Dan Williams @ 2015-05-20 20:57 UTC
  To: axboe
  Cc: Boaz Harrosh, Vishal Verma, neilb, gregkh, linux-nvdimm,
	Dave Chinner, linux-kernel, Andy Lutomirski, Jens Axboe,
	linux-acpi, jmoyer, H. Peter Anvin, hch, mingo

From: Vishal Verma <vishal.l.verma@linux.intel.com>

BTT stands for Block Translation Table, and is a way to provide power
fail sector atomicity semantics for block devices that have the ability
to perform byte granularity IO. It relies on the ->rw_bytes() capability
of libnd namespace devices.

The BTT works as a stacked block device, and reserves a chunk of space
from the backing device for its accounting metadata.
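
To illustrate the idea (not the implementation): every write lands in a spare block and only becomes visible when the map entry is switched, so a reader sees either the old sector or the new one. A toy userspace sketch, assuming hypothetical names (struct toy_btt, toy_init(), toy_read(), toy_write()) and a single lane; the flog that makes the switch recoverable after power loss is left out:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_NBLOCKS 4	/* externally visible blocks */
#define TOY_BLKSZ   8	/* bytes per block */

/* Hypothetical toy model; none of these names exist in the patch. */
struct toy_btt {
	uint32_t map[TOY_NBLOCKS];		/* premap -> postmap block */
	uint32_t free_block;			/* one lane, one spare block */
	char data[TOY_NBLOCKS + 1][TOY_BLKSZ];	/* data blocks plus the spare */
};

static void toy_init(struct toy_btt *t)
{
	memset(t, 0, sizeof(*t));
	for (uint32_t i = 0; i < TOY_NBLOCKS; i++)
		t->map[i] = i;			/* identity mapping at start */
	t->free_block = TOY_NBLOCKS;
}

static const char *toy_read(const struct toy_btt *t, uint32_t lba)
{
	assert(lba < TOY_NBLOCKS);
	return t->data[t->map[lba]];
}

/*
 * buf must hold TOY_BLKSZ bytes.  A non-zero fail_before_commit simulates
 * losing power after the data was written but before the map was updated.
 */
static void toy_write(struct toy_btt *t, uint32_t lba, const char *buf,
		int fail_before_commit)
{
	uint32_t new_blk = t->free_block;
	uint32_t old_blk;

	assert(lba < TOY_NBLOCKS);
	memcpy(t->data[new_blk], buf, TOY_BLKSZ);	/* never in place */
	if (fail_before_commit)
		return;
	old_blk = t->map[lba];
	t->map[lba] = new_blk;		/* the single "commit" step */
	t->free_block = old_blk;	/* old block becomes the spare */
}
```

An interrupted toy_write() leaves the map pointing at the previous block, so toy_read() still returns the complete old contents rather than a mix of old and new data.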

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Neil Brown <neilb@suse.de>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
[jmoyer: fix nmi watchdog timeout in btt_map_init]
[jmoyer: move btt initialization to module load path]
[jmoyer: fix memory leak in the btt initialization path]
[jmoyer: Don't overwrite corrupted arenas]
Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
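The map entry encoding implemented by btt_map_write() and btt_map_read() in this patch can be modeled in userspace roughly as follows. This is a sketch under assumptions: the helper names map_encode() and map_decode() are hypothetical, the little-endian conversion is left out, and the shift constants are written as unsigned.

```c
#include <assert.h>
#include <stdint.h>

/* Same bit layout as the MAP_* macros in btt.h, with unsigned constants. */
#define MAP_TRIM_SHIFT 31
#define MAP_ERR_SHIFT  30
#define MAP_TRIM_MASK  (1u << MAP_TRIM_SHIFT)
#define MAP_ERR_MASK   (1u << MAP_ERR_SHIFT)
#define MAP_LBA_MASK   (~(MAP_TRIM_MASK | MAP_ERR_MASK))
#define MAP_ENT_NORMAL 0xC0000000u

/* Hypothetical helper: build a raw map entry; Z and E together is invalid. */
static uint32_t map_encode(uint32_t postmap, int z_flag, int e_flag)
{
	assert(!(z_flag && e_flag));
	postmap &= MAP_LBA_MASK;
	if (z_flag)
		return postmap | MAP_TRIM_MASK;
	if (e_flag)
		return postmap | MAP_ERR_MASK;
	return postmap | MAP_ENT_NORMAL;	/* both bits set: normal entry */
}

/* Hypothetical helper: returns the postmap block, fills *trim and *error. */
static uint32_t map_decode(uint32_t raw, uint32_t premap, int *trim,
		int *error)
{
	uint32_t z = (raw & MAP_TRIM_MASK) >> MAP_TRIM_SHIFT;
	uint32_t e = (raw & MAP_ERR_MASK) >> MAP_ERR_SHIFT;

	*trim = z && !e;
	*error = e && !z;
	if (!z && !e)
		return premap;	/* never-written entry: identity mapping */
	return raw & MAP_LBA_MASK;
}
```

Note the two cases the bit table in btt.txt does not spell out: both flag bits clear means the entry was never written (postmap equals premap), and both bits set marks a normal, written entry.
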
 Documentation/blockdev/btt.txt |  273 ++++++++
 drivers/acpi/nfit.c            |    1 
 drivers/block/nd/Kconfig       |   20 +
 drivers/block/nd/Makefile      |    3 
 drivers/block/nd/btt.c         | 1438 ++++++++++++++++++++++++++++++++++++++++
 drivers/block/nd/btt.h         |  141 ++++
 drivers/block/nd/btt_devs.c    |    3 
 drivers/block/nd/nd-private.h  |    1 
 drivers/block/nd/nd.h          |   10 
 drivers/block/nd/region.c      |   89 ++
 drivers/block/nd/region_devs.c |   10 
 include/linux/libnd.h          |    1 
 12 files changed, 1986 insertions(+), 4 deletions(-)
 create mode 100644 Documentation/blockdev/btt.txt
 create mode 100644 drivers/block/nd/btt.c
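
The flog reconstruction rules in section 3f of the btt.txt added by this patch can be sketched in userspace as below. All names here (struct toy_log, seq_next(), flog_newest(), flog_free_block()) are hypothetical and are not taken from btt.c; the map is assumed to hold bare postmap block numbers with the flag bits already stripped.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical in-memory copy of one flog section. */
struct toy_log {
	uint32_t lba, old_map, new_map, seq;
};

/* Sequence numbers cycle 1 -> 2 -> 3 -> 1; 0 means uninitialized. */
static uint32_t seq_next(uint32_t seq)
{
	assert(seq >= 1 && seq <= 3);
	return seq == 3 ? 1 : seq + 1;
}

/* Index (0 or 1) of the most recent section, or -1 if the pair is invalid. */
static int flog_newest(const struct toy_log pair[2])
{
	uint32_t s0 = pair[0].seq, s1 = pair[1].seq;

	if (s0 > 3 || s1 > 3 || (s0 == 0 && s1 == 0))
		return -1;
	if (s1 == 0)
		return 0;
	if (s0 == 0)
		return 1;
	if (seq_next(s0) == s1)
		return 1;
	if (seq_next(s1) == s0)
		return 0;
	return -1;	/* equal sequence numbers */
}

/*
 * If the map already points at new_map the write completed, so old_map is
 * free.  Otherwise the map update never happened and new_map is free.
 */
static uint32_t flog_free_block(const struct toy_log *ent, const uint32_t *map)
{
	return map[ent->lba] == ent->new_map ? ent->old_map : ent->new_map;
}
```

The second branch of flog_free_block() corresponds to the power-fail case in the documentation: the data and flog entry were written, but the map still refers to the old block.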

diff --git a/Documentation/blockdev/btt.txt b/Documentation/blockdev/btt.txt
new file mode 100644
index 000000000000..95134d5ec4a0
--- /dev/null
+++ b/Documentation/blockdev/btt.txt
@@ -0,0 +1,273 @@
+BTT - Block Translation Table
+=============================
+
+
+1. Introduction
+---------------
+
+Persistent memory based storage is able to perform IO at byte (or more
+accurately, cache line) granularity. However, we often want to expose such
+storage as traditional block devices. The block drivers for persistent memory
+will do exactly this. However, they do not provide any atomicity guarantees.
+Traditional SSDs typically provide protection against torn sectors in hardware,
+using stored energy in capacitors to complete in-flight block writes, or perhaps
+in firmware. We don't have this luxury with persistent memory - if a write is in
+progress, and we experience a power failure, the block will contain a mix of old
+and new data. Applications may not be prepared to handle such a scenario.
+
+The Block Translation Table (BTT) provides atomic sector update semantics for
+persistent memory devices, so that applications that rely on sector writes not
+being torn can continue to do so. The BTT manifests itself as a stacked block
+device, and reserves a portion of the underlying storage for its metadata. At
+the heart of it is an indirection table that re-maps all the blocks on the
+volume. It can be thought of as an extremely simple file system that only
+provides atomic sector updates.
+
+
+2. Static Layout
+----------------
+
+The underlying storage on which a BTT can be laid out is not limited in any way.
+The BTT, however, splits the available space into chunks of up to 512 GiB,
+called "Arenas".
+
+Each arena follows the same layout for its metadata, and all references in an
+arena are internal to it (with the exception of one field that points to the
+next arena). The following depicts the "On-disk" metadata layout:
+
+
+  Backing Store     +------->  Arena
++---------------+   |   +------------------+
+|               |   |   | Arena info block |
+|    Arena 0    +---+   |       4K         |
+|     512G      |       +------------------+
+|               |       |                  |
++---------------+       |                  |
+|               |       |                  |
+|    Arena 1    |       |   Data Blocks    |
+|     512G      |       |                  |
+|               |       |                  |
++---------------+       |                  |
+|       .       |       |                  |
+|       .       |       |                  |
+|       .       |       |                  |
+|               |       |                  |
+|               |       |                  |
++---------------+       +------------------+
+                        |                  |
+                        |     BTT Map      |
+                        |                  |
+                        |                  |
+                        +------------------+
+                        |                  |
+                        |     BTT Flog     |
+                        |                  |
+                        +------------------+
+                        | Info block copy  |
+                        |       4K         |
+                        +------------------+
+
+
+3. Theory of Operation
+----------------------
+
+
+a. The BTT Map
+--------------
+
+The map is a simple lookup/indirection table that maps an LBA to an internal
+block. Each map entry is 32 bits. The two most significant bits are special
+flags, and the remaining form the internal block number.
+
+Bit      Description
+31     : TRIM flag - marks if the block was trimmed or discarded
+30     : ERROR flag - marks an error block. Cleared on write.
+29 - 0 : Mappings to internal 'postmap' blocks
+
+
+Some of the terminology that will be subsequently used:
+
+External LBA  : LBA as made visible to upper layers.
+ABA           : Arena Block Address - Block offset/number within an arena
+Premap ABA    : The block offset into an arena, which was decided upon by range
+		checking the External LBA
+Postmap ABA   : The block number in the "Data Blocks" area obtained after
+		indirection from the map
+nfree	      : The number of free blocks that are maintained at any given time.
+		This is the number of concurrent writes that can happen to the
+		arena.
+
+
+For example, after adding a BTT, we surface a disk of 1024G. We get a read for
+the external LBA at 768G. This falls into the second arena, and of the 512G
+worth of blocks that this arena contributes, this block is at 256G. Thus, the
+premap ABA is 256G. We now refer to the map, and find out the mapping for block
+'X' (256G) points to block 'Y', say '64'. Thus the postmap ABA is 64.
+
+
+b. The BTT Flog
+---------------
+
+The BTT provides sector atomicity by making every write an "allocating write",
+i.e. every write goes to a "free" block. A running list of free blocks is
+maintained in the form of the BTT flog. 'Flog' is a combination of the words
+"free list" and "log". The flog contains 'nfree' entries, and an entry contains:
+
+lba     : The premap ABA that is being written to
+old_map : The old postmap ABA - after 'this' write completes, this will be a
+	  free block.
+new_map : The new postmap ABA. The map will be updated to reflect this
+	  lba->postmap_aba mapping, but we log it here in case we have to
+	  recover.
+seq	: Sequence number to mark which of the 2 sections of this flog entry is
+	  valid/newest. It cycles between 01->10->11->01 (binary) under normal
+	  operation, with 00 indicating an uninitialized state.
+lba'	: alternate lba entry
+old_map': alternate old postmap entry
+new_map': alternate new postmap entry
+seq'	: alternate sequence number.
+
+Each field is 32-bit, so each section holds 16 bytes (padded to 32 on media).
+Flog updates are done such that, for any entry being written, the update:
+a. overwrites the 'old' section in the entry based on sequence numbers
+b. writes the new entry such that the sequence number is written last.
+
+
+c. The concept of lanes
+-----------------------
+
+While 'nfree' describes the number of IOs an arena can process concurrently,
+'nlanes' is the number of IOs the BTT device as a whole can process
+concurrently.
+ nlanes = min(nfree, num_cpus)
+A lane number is obtained at the start of any IO, and is used for indexing into
+all the on-disk and in-memory data structures for the duration of the IO. It is
+protected by a spinlock.
+
+
+d. In-memory data structure: Read Tracking Table (RTT)
+------------------------------------------------------
+
+Consider a case where we have two threads, one doing reads and the other,
+writes. We can hit a condition where the writer thread grabs a free block to do
+a new IO, but the (slow) reader thread is still reading from it. In other words,
+the reader consulted a map entry, and started reading the corresponding block. A
+writer started writing to the same external LBA, and finished the write updating
+the map for that external LBA to point to its new postmap ABA. At this point the
+internal, postmap block that the reader is (still) reading has been inserted
+into the list of free blocks. If another write comes in for the same LBA, it can
+grab this free block, and start writing to it, causing the reader to read
+incorrect data. To prevent this, we introduce the RTT.
+
+The RTT is a simple, per arena table with 'nfree' entries. Every reader inserts
+into rtt[lane_number], the postmap ABA it is reading, and clears it after the
+read is complete. Every writer thread, after grabbing a free block, checks the
+RTT for its presence. If the postmap free block is in the RTT, it waits till the
+reader clears the RTT entry, and only then starts writing to it.
+
+
+e. In-memory data structure: map locks
+--------------------------------------
+
+Consider a case where two writer threads are writing to the same LBA. There can
+be a race in the following sequence of steps:
+
+free[lane] = map[premap_aba]
+map[premap_aba] = postmap_aba
+
+Both threads can update their respective free[lane] with the same old, freed
+postmap_aba. This has made the layout inconsistent by losing a free entry, and
+at the same time, duplicating another free entry for two lanes.
+
+To solve this, we could have a single map lock (per arena) that has to be taken
+before performing the above sequence, but we feel that could be too contentious.
+Instead we use an array of (nfree) map_locks that is indexed by
+(premap_aba modulo nfree).
+
+
+f. Reconstruction from the Flog
+-------------------------------
+
+On startup, we analyze the BTT flog to create our list of free blocks. We walk
+through all the entries, and for each lane, of the set of two possible
+'sections', we always look at the most recent one only (based on the sequence
+number). The reconstruction rules/steps are simple:
+- Read map[log_entry.lba].
+- If log_entry.new matches the map entry, then log_entry.old is free.
+- If log_entry.new does not match the map entry, then log_entry.new is free.
+  (This case can only be caused by power-fails/unsafe shutdowns)
+
+
+g. Summarizing - Read and Write flows
+-------------------------------------
+
+Read:
+
+1.  Convert external LBA to arena number + pre-map ABA
+2.  Get a lane (and take lane_lock)
+3.  Read map to get the entry for this pre-map ABA
+4.  Enter post-map ABA into RTT[lane]
+5.  If TRIM flag set in map, return zeroes, and end IO (go to step 8)
+6.  If ERROR flag set in map, end IO with EIO (go to step 8)
+7.  Read data from this block
+8.  Remove post-map ABA entry from RTT[lane]
+9.  Release lane (and lane_lock)
+
+Write:
+
+1.  Convert external LBA to Arena number + pre-map ABA
+2.  Get a lane (and take lane_lock)
+3.  Use lane to index into in-memory free list and obtain a new block, next flog
+        index, next sequence number
+4.  Scan the RTT to check if free block is present, and spin/wait if it is.
+5.  Write data to this free block
+6.  Read map to get the existing post-map ABA entry for this pre-map ABA
+7.  Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq_num]
+8.  Write new post-map ABA into map.
+9.  Write old post-map entry into the free list
+10. Calculate next sequence number and write into the free list entry
+11. Release lane (and lane_lock)
+
+
+4. Error Handling
+-----------------
+
+An arena would be in an error state if any of the metadata is corrupted
+irrecoverably, either due to a bug or a media error. The following conditions
+indicate an error:
+- Info block checksum does not match (and recovering from the copy also fails)
+- The mapped blocks and free blocks (from the BTT flog) together do not address
+  every internal available block exactly once.
+- Rebuilding free list from the flog reveals missing/duplicate/impossible
+  entries
+- A map entry is out of bounds
+
+If any of these error conditions are encountered, the arena is put into a read
+only state using a flag in the info block.
+
+
+5. In-kernel usage
+------------------
+
+Any block driver that supports byte granularity IO to the storage may register
+with the BTT. It will have to provide the rw_bytes interface in its
+block_device_operations struct:
+
+	int (*rw_bytes)(struct gendisk *, void *, size_t, off_t, int rw);
+
+It may register with the BTT after it adds its own gendisk, using btt_init:
+
+	struct btt *btt_init(struct gendisk *disk, unsigned long long rawsize,
+			u32 lbasize, u8 uuid[], int maxlane);
+
+Note that maxlane is the maximum amount of concurrency the driver wishes to
+allow the BTT to use.
+
+The BTT 'disk' appears as a stacked block device that grabs the underlying block
+device in the O_EXCL mode.
+
+When the driver wishes to remove the backing disk, it should similarly call
+btt_fini using the same struct btt* handle that was provided to it by btt_init.
+
+	void btt_fini(struct btt *btt);
+
diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index 7c4d47492372..a9aca87301c6 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -892,6 +892,7 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
 			} else {
 				nd_mapping->size = nfit_mem->bdw->capacity;
 				nd_mapping->start = nfit_mem->bdw->start_address;
+				ndr_desc.num_lanes = nfit_mem->bdw->windows;
 			}
 
 			ndr_desc.nd_mapping = nd_mapping;
diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
index 00d9afe9475e..2b169806eac5 100644
--- a/drivers/block/nd/Kconfig
+++ b/drivers/block/nd/Kconfig
@@ -32,9 +32,25 @@ config BLK_DEV_PMEM
 	  capable of DAX (direct-access) file system mappings.  See
 	  Documentation/blockdev/nd.txt for more details.
 
-	  Say Y if you want to use a NVDIMM described by NFIT
+	  Say Y if you want to use a NVDIMM described by ACPI, E820, etc...
 
 config ND_BTT_DEVS
-	def_bool y
+	bool
+
+config ND_BTT
+	tristate "BTT: Block Translation Table (atomic sector updates)"
+	depends on LIBND
+	default LIBND
+	select ND_BTT_DEVS
+
+config ND_MAX_REGIONS
+	int "Maximum number of regions supported by the sub-system"
+	default 64
+	---help---
+	  A 'region' corresponds to an individual DIMM or an interleave
+	  set of DIMMs.  A typical maximally configured system may have
+	  up to 32 DIMMs.
+
+	  Leave the default of 64 if you are unsure.
 
 endif
diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index 9866669d7738..1e8fe93a0a42 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -1,8 +1,11 @@
 obj-$(CONFIG_LIBND) += libnd.o
 obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
+obj-$(CONFIG_ND_BTT) += nd_btt.o
 
 nd_pmem-y := pmem.o
 
+nd_btt-y := btt.o
+
 libnd-y := core.o
 libnd-y += bus.o
 libnd-y += dimm_devs.o
diff --git a/drivers/block/nd/btt.c b/drivers/block/nd/btt.c
new file mode 100644
index 000000000000..a4287b6f4224
--- /dev/null
+++ b/drivers/block/nd/btt.c
@@ -0,0 +1,1438 @@
+/*
+ * Block Translation Table
+ * Copyright (c) 2014-2015, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+#include <linux/highmem.h>
+#include <linux/debugfs.h>
+#include <linux/blkdev.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/mutex.h>
+#include <linux/hdreg.h>
+#include <linux/genhd.h>
+#include <linux/sizes.h>
+#include <linux/ndctl.h>
+#include <linux/fs.h>
+#include <linux/nd.h>
+#include "btt.h"
+#include "nd.h"
+
+enum log_ent_request {
+	LOG_NEW_ENT = 0,
+	LOG_OLD_ENT
+};
+
+static int btt_major;
+
+static int nd_btt_rw_bytes(struct nd_btt *nd_btt, void *buf, size_t offset,
+		size_t n, unsigned long flags)
+{
+	struct nd_io *ndio = nd_btt->ndio;
+
+	if (unlikely(nd_data_dir(flags) == WRITE)
+			&& bdev_read_only(nd_btt->backing_dev))
+		return -EACCES;
+
+	return ndio->rw_bytes(ndio, buf, offset + nd_btt->offset, n, flags);
+}
+
+static int arena_rw_bytes(struct arena_info *arena, void *buf, size_t n,
+		size_t offset, unsigned long flags)
+{
+	/* FIXME: 'offset' and 'n' are swapped relative to nd_btt_rw_bytes() */
+	return nd_btt_rw_bytes(arena->nd_btt, buf, offset, n, flags);
+}
+
+static int btt_info_write(struct arena_info *arena, struct btt_sb *super)
+{
+	int ret;
+
+	ret = arena_rw_bytes(arena, super, sizeof(struct btt_sb),
+			arena->info2off, WRITE);
+	if (ret)
+		return ret;
+
+	return arena_rw_bytes(arena, super, sizeof(struct btt_sb),
+			arena->infooff, WRITE);
+}
+
+static int btt_info_read(struct arena_info *arena, struct btt_sb *super)
+{
+	WARN_ON(!super);
+	return arena_rw_bytes(arena, super, sizeof(struct btt_sb),
+			arena->infooff, READ);
+}
+
+/*
+ * 'raw' version of btt_map write
+ * Assumptions:
+ *   mapping is in little-endian
+ *   mapping contains 'E' and 'Z' flags as desired
+ */
+static int __btt_map_write(struct arena_info *arena, u32 lba, __le32 mapping)
+{
+	u64 ns_off = arena->mapoff + (lba * MAP_ENT_SIZE);
+
+	WARN_ON(lba >= arena->external_nlba);
+	return arena_rw_bytes(arena, &mapping, MAP_ENT_SIZE, ns_off, WRITE);
+}
+
+static int btt_map_write(struct arena_info *arena, u32 lba, u32 mapping,
+			u32 z_flag, u32 e_flag)
+{
+	u32 ze;
+	__le32 mapping_le;
+
+	/*
+	 * This 'mapping' is supposed to be just the LBA mapping, without
+	 * any flags set, so strip the flag bits.
+	 */
+	mapping &= MAP_LBA_MASK;
+
+	ze = (z_flag << 1) + e_flag;
+	switch (ze) {
+	case 0:
+		/*
+		 * We want to set neither of the Z or E flags, and
+		 * in the actual layout, this means setting the bit
+		 * positions of both to '1' to indicate a 'normal'
+		 * map entry
+		 */
+		mapping |= MAP_ENT_NORMAL;
+		break;
+	case 1:
+		mapping |= (1 << MAP_ERR_SHIFT);
+		break;
+	case 2:
+		mapping |= (1 << MAP_TRIM_SHIFT);
+		break;
+	default:
+		/*
+		 * The case where Z and E are both sent in as '1' could be
+		 * construed as a valid 'normal' case, but we decide not to,
+		 * to avoid confusion
+		 */
+		WARN_ONCE(1, "Invalid use of Z and E flags\n");
+		return -EIO;
+	}
+
+	mapping_le = cpu_to_le32(mapping);
+	return __btt_map_write(arena, lba, mapping_le);
+}
+
+static int btt_map_read(struct arena_info *arena, u32 lba, u32 *mapping,
+			int *trim, int *error)
+{
+	int ret;
+	__le32 in;
+	u32 raw_mapping, postmap, ze, z_flag, e_flag;
+	u64 ns_off = arena->mapoff + (lba * MAP_ENT_SIZE);
+
+	WARN_ON(lba >= arena->external_nlba);
+
+	ret = arena_rw_bytes(arena, &in, MAP_ENT_SIZE, ns_off, READ);
+	if (ret)
+		return ret;
+
+	raw_mapping = le32_to_cpu(in);
+
+	z_flag = (raw_mapping & MAP_TRIM_MASK) >> MAP_TRIM_SHIFT;
+	e_flag = (raw_mapping & MAP_ERR_MASK) >> MAP_ERR_SHIFT;
+	ze = (z_flag << 1) + e_flag;
+	postmap = raw_mapping & MAP_LBA_MASK;
+
+	/* Reuse the {z,e}_flag variables for *trim and *error */
+	z_flag = 0;
+	e_flag = 0;
+
+	switch (ze) {
+	case 0:
+		/* Initial state. Return postmap = premap */
+		*mapping = lba;
+		break;
+	case 1:
+		*mapping = postmap;
+		e_flag = 1;
+		break;
+	case 2:
+		*mapping = postmap;
+		z_flag = 1;
+		break;
+	case 3:
+		*mapping = postmap;
+		break;
+	default:
+		return -EIO;
+	}
+
+	if (trim)
+		*trim = z_flag;
+	if (error)
+		*error = e_flag;
+
+	return ret;
+}
+
+static int btt_log_read_pair(struct arena_info *arena, u32 lane,
+			struct log_entry *ent)
+{
+	WARN_ON(!ent);
+	return arena_rw_bytes(arena, ent, 2 * LOG_ENT_SIZE,
+			arena->logoff + (2 * lane * LOG_ENT_SIZE), READ);
+}
+
+static struct dentry *debugfs_root;
+
+static void arena_debugfs_init(struct arena_info *a, struct dentry *parent,
+				int idx)
+{
+	char dirname[32];
+	struct dentry *d;
+
+	/* If for some reason, parent bttN was not created, exit */
+	if (!parent)
+		return;
+
+	snprintf(dirname, 32, "arena%d", idx);
+	d = debugfs_create_dir(dirname, parent);
+	if (IS_ERR_OR_NULL(d))
+		return;
+	a->debugfs_dir = d;
+
+	debugfs_create_x64("size", S_IRUGO, d, &a->size);
+	debugfs_create_x64("external_lba_start", S_IRUGO, d,
+				&a->external_lba_start);
+	debugfs_create_x32("internal_nlba", S_IRUGO, d, &a->internal_nlba);
+	debugfs_create_u32("internal_lbasize", S_IRUGO, d,
+				&a->internal_lbasize);
+	debugfs_create_x32("external_nlba", S_IRUGO, d, &a->external_nlba);
+	debugfs_create_u32("external_lbasize", S_IRUGO, d,
+				&a->external_lbasize);
+	debugfs_create_u32("nfree", S_IRUGO, d, &a->nfree);
+	debugfs_create_u16("version_major", S_IRUGO, d, &a->version_major);
+	debugfs_create_u16("version_minor", S_IRUGO, d, &a->version_minor);
+	debugfs_create_x64("nextoff", S_IRUGO, d, &a->nextoff);
+	debugfs_create_x64("infooff", S_IRUGO, d, &a->infooff);
+	debugfs_create_x64("dataoff", S_IRUGO, d, &a->dataoff);
+	debugfs_create_x64("mapoff", S_IRUGO, d, &a->mapoff);
+	debugfs_create_x64("logoff", S_IRUGO, d, &a->logoff);
+	debugfs_create_x64("info2off", S_IRUGO, d, &a->info2off);
+	debugfs_create_x32("flags", S_IRUGO, d, &a->flags);
+}
+
+static void btt_debugfs_init(struct btt *btt)
+{
+	int i = 0;
+	struct arena_info *arena;
+
+	btt->debugfs_dir = debugfs_create_dir(dev_name(&btt->nd_btt->dev),
+						debugfs_root);
+	if (IS_ERR_OR_NULL(btt->debugfs_dir))
+		return;
+
+	list_for_each_entry(arena, &btt->arena_list, list) {
+		arena_debugfs_init(arena, btt->debugfs_dir, i);
+		i++;
+	}
+}
+
+/*
+ * This function accepts a pair of log entries and uses their
+ * sequence numbers to find the 'older' entry.
+ * If slot [0] has never been written (seq == 0), it is given the
+ * initial sequence number and reported as the older entry.
+ * Otherwise it returns the index of the older entry (0 or 1), or
+ * -EINVAL if the two sequence numbers do not form a valid pair.
+ * TODO: The logic feels a bit kludge-y; make it better.
+ */
+static int btt_log_get_old(struct log_entry *ent)
+{
+	int old;
+
+	/*
+	 * The first time a lane is ever used, the entry goes into [0].
+	 * The next time, the logic below works out that the following
+	 * entry goes into [1].
+	 */
+	if (ent[0].seq == 0) {
+		ent[0].seq = cpu_to_le32(1);
+		return 0;
+	}
+
+	if (ent[0].seq == ent[1].seq)
+		return -EINVAL;
+	if (le32_to_cpu(ent[0].seq) + le32_to_cpu(ent[1].seq) > 5)
+		return -EINVAL;
+
+	if (le32_to_cpu(ent[0].seq) < le32_to_cpu(ent[1].seq)) {
+		if (le32_to_cpu(ent[1].seq) - le32_to_cpu(ent[0].seq) == 1)
+			old = 0;
+		else
+			old = 1;
+	} else {
+		if (le32_to_cpu(ent[0].seq) - le32_to_cpu(ent[1].seq) == 1)
+			old = 1;
+		else
+			old = 0;
+	}
+
+	return old;
+}
+
+static struct device *to_dev(struct arena_info *arena)
+{
+	return &arena->nd_btt->dev;
+}
+
+/*
+ * This function copies the desired (old/new) log entry into ent if
+ * it is not NULL. It returns the sub-slot number (0 or 1)
+ * where the desired log entry was found. Negative return values
+ * indicate errors.
+ */
+static int btt_log_read(struct arena_info *arena, u32 lane,
+			struct log_entry *ent, int old_flag)
+{
+	int ret;
+	int old_ent, ret_ent;
+	struct log_entry log[2];
+
+	ret = btt_log_read_pair(arena, lane, log);
+	if (ret)
+		return -EIO;
+
+	old_ent = btt_log_get_old(log);
+	if (old_ent < 0 || old_ent > 1) {
+		dev_info(to_dev(arena), "log corruption (%d): lane %u seq [%u, %u]\n",
+				old_ent, lane, le32_to_cpu(log[0].seq),
+				le32_to_cpu(log[1].seq));
+		/* TODO set error state? */
+		return -EIO;
+	}
+
+	ret_ent = (old_flag ? old_ent : (1 - old_ent));
+
+	if (ent != NULL)
+		memcpy(ent, &log[ret_ent], LOG_ENT_SIZE);
+
+	return ret_ent;
+}
+
+/*
+ * This function commits a log entry to media.
+ * It does _not_ prepare the freelist entry for the next write;
+ * btt_flog_write is the wrapper that updates the freelist elements.
+ */
+static int __btt_log_write(struct arena_info *arena, u32 lane,
+			u32 sub, struct log_entry *ent)
+{
+	int ret;
+	/*
+	 * Ignore the padding in log_entry for calculating log_half.
+	 * The entry is 'committed' when we write the sequence number,
+	 * and we want to ensure that that is the last thing written.
+	 * We don't bother writing the padding as that would be extra
+	 * media wear and write amplification
+	 */
+	unsigned int log_half = (LOG_ENT_SIZE - 2 * sizeof(u64)) / 2;
+	u64 ns_off = arena->logoff + (((2 * lane) + sub) * LOG_ENT_SIZE);
+	void *src = ent;
+
+	/* split the 16B write into atomic, durable halves */
+	ret = arena_rw_bytes(arena, src, log_half, ns_off, WRITE);
+	if (ret)
+		return ret;
+
+	ns_off += log_half;
+	src += log_half;
+	return arena_rw_bytes(arena, src, log_half, ns_off, WRITE);
+}
+
+static int btt_flog_write(struct arena_info *arena, u32 lane, u32 sub,
+			struct log_entry *ent)
+{
+	int ret;
+
+	ret = __btt_log_write(arena, lane, sub, ent);
+	if (ret)
+		return ret;
+
+	/* prepare the next free entry */
+	arena->freelist[lane].sub = 1 - arena->freelist[lane].sub;
+	if (++(arena->freelist[lane].seq) == 4)
+		arena->freelist[lane].seq = 1;
+	arena->freelist[lane].block = le32_to_cpu(ent->old_map);
+
+	return ret;
+}
+
+/*
+ * This function initializes the BTT map to the initial state, which is
+ * all-zeroes, and indicates an identity mapping
+ */
+static int btt_map_init(struct arena_info *arena)
+{
+	int ret = -EINVAL;
+	void *zerobuf;
+	size_t offset = 0;
+	size_t chunk_size = SZ_2M;
+	size_t mapsize = arena->logoff - arena->mapoff;
+
+	zerobuf = kzalloc(chunk_size, GFP_KERNEL);
+	if (!zerobuf)
+		return -ENOMEM;
+
+	while (mapsize) {
+		size_t size = min(mapsize, chunk_size);
+
+		ret = arena_rw_bytes(arena, zerobuf, size,
+				arena->mapoff + offset, WRITE);
+		if (ret)
+			goto free;
+
+		offset += size;
+		mapsize -= size;
+		cond_resched();
+	}
+
+ free:
+	kfree(zerobuf);
+	return ret;
+}
+
+/*
+ * This function initializes the BTT log with 'fake' entries pointing
+ * to the initial reserved set of blocks as being free
+ */
+static int btt_log_init(struct arena_info *arena)
+{
+	int ret;
+	u32 i;
+	struct log_entry log, zerolog;
+
+	memset(&zerolog, 0, sizeof(zerolog));
+
+	for (i = 0; i < arena->nfree; i++) {
+		log.lba = cpu_to_le32(i);
+		log.old_map = cpu_to_le32(arena->external_nlba + i);
+		log.new_map = cpu_to_le32(arena->external_nlba + i);
+		log.seq = cpu_to_le32(LOG_SEQ_INIT);
+		ret = __btt_log_write(arena, i, 0, &log);
+		if (ret)
+			return ret;
+		ret = __btt_log_write(arena, i, 1, &zerolog);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+static int btt_freelist_init(struct arena_info *arena)
+{
+	int old, new, ret;
+	u32 i, map_entry;
+	struct log_entry log_new, log_old;
+
+	arena->freelist = kcalloc(arena->nfree, sizeof(struct free_entry),
+					GFP_KERNEL);
+	if (!arena->freelist)
+		return -ENOMEM;
+
+	for (i = 0; i < arena->nfree; i++) {
+		old = btt_log_read(arena, i, &log_old, LOG_OLD_ENT);
+		if (old < 0)
+			return old;
+
+		new = btt_log_read(arena, i, &log_new, LOG_NEW_ENT);
+		if (new < 0)
+			return new;
+
+		/* sub points to the next one to be overwritten */
+		arena->freelist[i].sub = 1 - new;
+		arena->freelist[i].seq = nd_inc_seq(le32_to_cpu(log_new.seq));
+		arena->freelist[i].block = le32_to_cpu(log_new.old_map);
+
+		/* This implies a newly created or untouched flog entry */
+		if (log_new.old_map == log_new.new_map)
+			continue;
+
+		/* Check if map recovery is needed */
+		ret = btt_map_read(arena, le32_to_cpu(log_new.lba), &map_entry,
+				NULL, NULL);
+		if (ret)
+			return ret;
+		if ((le32_to_cpu(log_new.new_map) != map_entry) &&
+				(le32_to_cpu(log_new.old_map) == map_entry)) {
+			/*
+			 * Last transaction wrote the flog, but wasn't able
+			 * to complete the map write. So fix up the map.
+			 */
+			ret = btt_map_write(arena, le32_to_cpu(log_new.lba),
+					le32_to_cpu(log_new.new_map), 0, 0);
+			if (ret)
+				return ret;
+		}
+
+	}
+
+	return 0;
+}
+
+static int btt_rtt_init(struct arena_info *arena)
+{
+	arena->rtt = kcalloc(arena->nfree, sizeof(u32), GFP_KERNEL);
+	if (arena->rtt == NULL)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int btt_maplocks_init(struct arena_info *arena)
+{
+	u32 i;
+
+	arena->map_locks = kcalloc(arena->nfree, sizeof(struct aligned_lock),
+				GFP_KERNEL);
+	if (!arena->map_locks)
+		return -ENOMEM;
+
+	for (i = 0; i < arena->nfree; i++)
+		spin_lock_init(&arena->map_locks[i].lock);
+
+	return 0;
+}
+
+static struct arena_info *alloc_arena(struct btt *btt, size_t size,
+				size_t start, size_t arena_off)
+{
+	struct arena_info *arena;
+	u64 logsize, mapsize, datasize;
+	u64 available = size;
+
+	arena = kzalloc(sizeof(struct arena_info), GFP_KERNEL);
+	if (!arena)
+		return NULL;
+	arena->nd_btt = btt->nd_btt;
+
+	if (!size)
+		return arena;
+
+	arena->size = size;
+	arena->external_lba_start = start;
+	arena->external_lbasize = btt->lbasize;
+	arena->internal_lbasize = roundup(arena->external_lbasize,
+					INT_LBASIZE_ALIGNMENT);
+	arena->nfree = BTT_DEFAULT_NFREE;
+	arena->version_major = 1;
+	arena->version_minor = 1;
+
+	if (available % BTT_PG_SIZE)
+		available -= (available % BTT_PG_SIZE);
+
+	/* Two pages are reserved for the super block and its copy */
+	available -= 2 * BTT_PG_SIZE;
+
+	/* The log takes a fixed amount of space based on nfree */
+	logsize = roundup(2 * arena->nfree * sizeof(struct log_entry),
+				BTT_PG_SIZE);
+	available -= logsize;
+
+	/* Calculate optimal split between map and data area */
+	arena->internal_nlba = div_u64(available - BTT_PG_SIZE,
+			arena->internal_lbasize + MAP_ENT_SIZE);
+	arena->external_nlba = arena->internal_nlba - arena->nfree;
+
+	mapsize = roundup((arena->external_nlba * MAP_ENT_SIZE), BTT_PG_SIZE);
+	datasize = available - mapsize;
+
+	/* 'Absolute' values, relative to start of storage space */
+	arena->infooff = arena_off;
+	arena->dataoff = arena->infooff + BTT_PG_SIZE;
+	arena->mapoff = arena->dataoff + datasize;
+	arena->logoff = arena->mapoff + mapsize;
+	arena->info2off = arena->logoff + logsize;
+	return arena;
+}
+
+static void free_arenas(struct btt *btt)
+{
+	struct arena_info *arena, *next;
+
+	list_for_each_entry_safe(arena, next, &btt->arena_list, list) {
+		list_del(&arena->list);
+		kfree(arena->rtt);
+		kfree(arena->map_locks);
+		kfree(arena->freelist);
+		debugfs_remove_recursive(arena->debugfs_dir);
+		kfree(arena);
+	}
+}
+
+/*
+ * This function checks if the metadata layout is valid and error free
+ */
+static int arena_is_valid(struct arena_info *arena, struct btt_sb *super,
+				u8 *uuid, u32 lbasize)
+{
+	u64 checksum;
+
+	if (memcmp(super->uuid, uuid, 16))
+		return 0;
+
+	checksum = le64_to_cpu(super->checksum);
+	super->checksum = 0;
+	if (checksum != nd_btt_sb_checksum(super))
+		return 0;
+	super->checksum = cpu_to_le64(checksum);
+
+	if (lbasize != le32_to_cpu(super->external_lbasize))
+		return 0;
+
+	/* TODO: figure out action for this */
+	if ((le32_to_cpu(super->flags) & IB_FLAG_ERROR_MASK) != 0)
+		dev_info(to_dev(arena), "Found arena with an error flag\n");
+
+	return 1;
+}
+
+/*
+ * This function reads an existing valid btt superblock and
+ * populates the corresponding arena_info struct
+ */
+static void parse_arena_meta(struct arena_info *arena, struct btt_sb *super,
+				u64 arena_off)
+{
+	arena->internal_nlba = le32_to_cpu(super->internal_nlba);
+	arena->internal_lbasize = le32_to_cpu(super->internal_lbasize);
+	arena->external_nlba = le32_to_cpu(super->external_nlba);
+	arena->external_lbasize = le32_to_cpu(super->external_lbasize);
+	arena->nfree = le32_to_cpu(super->nfree);
+	arena->version_major = le16_to_cpu(super->version_major);
+	arena->version_minor = le16_to_cpu(super->version_minor);
+
+	arena->nextoff = (super->nextoff == 0) ? 0 : (arena_off +
+			le64_to_cpu(super->nextoff));
+	arena->infooff = arena_off;
+	arena->dataoff = arena_off + le64_to_cpu(super->dataoff);
+	arena->mapoff = arena_off + le64_to_cpu(super->mapoff);
+	arena->logoff = arena_off + le64_to_cpu(super->logoff);
+	arena->info2off = arena_off + le64_to_cpu(super->info2off);
+
+	arena->size = (super->nextoff > 0) ? (le64_to_cpu(super->nextoff)) :
+			(arena->info2off - arena->infooff + BTT_PG_SIZE);
+
+	arena->flags = le32_to_cpu(super->flags);
+}
+
+static int discover_arenas(struct btt *btt)
+{
+	int ret = 0;
+	struct arena_info *arena;
+	struct btt_sb *super;
+	size_t remaining = btt->rawsize;
+	u64 cur_nlba = 0;
+	size_t cur_off = 0;
+	int num_arenas = 0;
+
+	super = kzalloc(sizeof(*super), GFP_KERNEL);
+	if (!super)
+		return -ENOMEM;
+
+	while (remaining) {
+		/* Alloc memory for arena */
+		arena = alloc_arena(btt, 0, 0, 0);
+		if (!arena) {
+			ret = -ENOMEM;
+			goto out_super;
+		}
+
+		arena->infooff = cur_off;
+		ret = btt_info_read(arena, super);
+		if (ret)
+			goto out;
+
+		if (!arena_is_valid(arena, super, btt->nd_btt->uuid,
+				btt->lbasize)) {
+			if (remaining == btt->rawsize) {
+				btt->init_state = INIT_NOTFOUND;
+				dev_info(to_dev(arena), "No existing arenas\n");
+				goto out;
+			} else {
+				dev_info(to_dev(arena),
+						"Found corrupted metadata!\n");
+				ret = -ENODEV;
+				goto out;
+			}
+		}
+
+		arena->external_lba_start = cur_nlba;
+		parse_arena_meta(arena, super, cur_off);
+
+		ret = btt_freelist_init(arena);
+		if (ret)
+			goto out;
+
+		ret = btt_rtt_init(arena);
+		if (ret)
+			goto out;
+
+		ret = btt_maplocks_init(arena);
+		if (ret)
+			goto out;
+
+		list_add_tail(&arena->list, &btt->arena_list);
+
+		remaining -= arena->size;
+		cur_off += arena->size;
+		cur_nlba += arena->external_nlba;
+		num_arenas++;
+
+		if (arena->nextoff == 0)
+			break;
+	}
+	btt->num_arenas = num_arenas;
+	btt->nlba = cur_nlba;
+	btt->init_state = INIT_READY;
+
+	kfree(super);
+	return ret;
+
+ out:
+	kfree(arena);
+	free_arenas(btt);
+ out_super:
+	kfree(super);
+	return ret;
+}
+
+static int create_arenas(struct btt *btt)
+{
+	size_t remaining = btt->rawsize;
+	size_t cur_off = 0;
+
+	while (remaining) {
+		struct arena_info *arena;
+		size_t arena_size = min_t(u64, ARENA_MAX_SIZE, remaining);
+
+		remaining -= arena_size;
+		if (arena_size < ARENA_MIN_SIZE)
+			break;
+
+		arena = alloc_arena(btt, arena_size, btt->nlba, cur_off);
+		if (!arena) {
+			free_arenas(btt);
+			return -ENOMEM;
+		}
+		btt->nlba += arena->external_nlba;
+		if (remaining >= ARENA_MIN_SIZE)
+			arena->nextoff = arena->size;
+		else
+			arena->nextoff = 0;
+		cur_off += arena_size;
+		list_add_tail(&arena->list, &btt->arena_list);
+	}
+
+	return 0;
+}
+
+/*
+ * This function completes arena initialization by writing
+ * all the metadata.
+ * It is called from btt_meta_init() for each arena when no valid
+ * BTT layout was found on the media.
+ */
+static int btt_arena_write_layout(struct arena_info *arena, u8 *uuid)
+{
+	int ret;
+	struct btt_sb *super;
+
+	ret = btt_map_init(arena);
+	if (ret)
+		return ret;
+
+	ret = btt_log_init(arena);
+	if (ret)
+		return ret;
+
+	super = kzalloc(sizeof(struct btt_sb), GFP_NOIO);
+	if (!super)
+		return -ENOMEM;
+
+	strncpy(super->signature, BTT_SIG, BTT_SIG_LEN);
+	memcpy(super->uuid, uuid, 16);
+	super->flags = cpu_to_le32(arena->flags);
+	super->version_major = cpu_to_le16(arena->version_major);
+	super->version_minor = cpu_to_le16(arena->version_minor);
+	super->external_lbasize = cpu_to_le32(arena->external_lbasize);
+	super->external_nlba = cpu_to_le32(arena->external_nlba);
+	super->internal_lbasize = cpu_to_le32(arena->internal_lbasize);
+	super->internal_nlba = cpu_to_le32(arena->internal_nlba);
+	super->nfree = cpu_to_le32(arena->nfree);
+	super->infosize = cpu_to_le32(sizeof(struct btt_sb));
+	super->nextoff = cpu_to_le64(arena->nextoff);
+	/*
+	 * Subtract arena->infooff (arena start) so numbers are relative
+	 * to 'this' arena
+	 */
+	super->dataoff = cpu_to_le64(arena->dataoff - arena->infooff);
+	super->mapoff = cpu_to_le64(arena->mapoff - arena->infooff);
+	super->logoff = cpu_to_le64(arena->logoff - arena->infooff);
+	super->info2off = cpu_to_le64(arena->info2off - arena->infooff);
+
+	super->flags = 0;
+	super->checksum = cpu_to_le64(nd_btt_sb_checksum(super));
+
+	ret = btt_info_write(arena, super);
+
+	kfree(super);
+	return ret;
+}
+
+/*
+ * This function completes the initialization for the BTT namespace
+ * such that it is ready to accept IOs
+ */
+static int btt_meta_init(struct btt *btt)
+{
+	int ret = 0;
+	struct arena_info *arena;
+
+	mutex_lock(&btt->init_lock);
+	list_for_each_entry(arena, &btt->arena_list, list) {
+		ret = btt_arena_write_layout(arena, btt->nd_btt->uuid);
+		if (ret)
+			goto unlock;
+
+		ret = btt_freelist_init(arena);
+		if (ret)
+			goto unlock;
+
+		ret = btt_rtt_init(arena);
+		if (ret)
+			goto unlock;
+
+		ret = btt_maplocks_init(arena);
+		if (ret)
+			goto unlock;
+	}
+
+	btt->init_state = INIT_READY;
+
+ unlock:
+	mutex_unlock(&btt->init_lock);
+	return ret;
+}
+
+/*
+ * This function calculates the arena in which the given LBA lies
+ * by doing a linear walk. This is acceptable since we expect only
+ * a few arenas. If we have backing devices that get much larger,
+ * we can construct a balanced binary tree of arenas at init time
+ * so that this range search becomes faster.
+ */
+static int lba_to_arena(struct btt *btt, sector_t sector, __u32 *premap,
+				struct arena_info **arena)
+{
+	struct arena_info *arena_list;
+	__u64 lba = div_u64(sector << SECTOR_SHIFT, btt->sector_size);
+
+	list_for_each_entry(arena_list, &btt->arena_list, list) {
+		if (lba < arena_list->external_nlba) {
+			*arena = arena_list;
+			*premap = lba;
+			return 0;
+		}
+		lba -= arena_list->external_nlba;
+	}
+
+	return -EIO;
+}
+
+/*
+ * The following (lock_map, unlock_map) are mostly just to improve
+ * readability, since they index into an array of locks
+ */
+static void lock_map(struct arena_info *arena, u32 premap)
+{
+	u32 idx = (premap * MAP_ENT_SIZE / L1_CACHE_BYTES) % arena->nfree;
+
+	spin_lock(&arena->map_locks[idx].lock);
+}
+
+static void unlock_map(struct arena_info *arena, u32 premap)
+{
+	u32 idx = (premap * MAP_ENT_SIZE / L1_CACHE_BYTES) % arena->nfree;
+
+	spin_unlock(&arena->map_locks[idx].lock);
+}
+
+static u64 to_namespace_offset(struct arena_info *arena, u64 lba)
+{
+	return arena->dataoff + ((u64)lba * arena->internal_lbasize);
+}
+
+static int btt_data_read(struct arena_info *arena, struct page *page,
+			unsigned int off, u32 lba, u32 len)
+{
+	int ret;
+	u64 nsoff = to_namespace_offset(arena, lba);
+	void *mem = kmap_atomic(page);
+
+	ret = arena_rw_bytes(arena, mem + off, len, nsoff, READ);
+	kunmap_atomic(mem);
+
+	return ret;
+}
+
+static int btt_data_write(struct arena_info *arena, u32 lba,
+			struct page *page, unsigned int off, u32 len)
+{
+	int ret;
+	u64 nsoff = to_namespace_offset(arena, lba);
+	void *mem = kmap_atomic(page);
+
+	ret = arena_rw_bytes(arena, mem + off, len, nsoff, WRITE);
+	kunmap_atomic(mem);
+
+	return ret;
+}
+
+static void zero_fill_data(struct page *page, unsigned int off, u32 len)
+{
+	void *mem = kmap_atomic(page);
+
+	memset(mem + off, 0, len);
+	kunmap_atomic(mem);
+}
+
+static int btt_read_pg(struct btt *btt, struct page *page, unsigned int off,
+			sector_t sector, unsigned int len)
+{
+	int ret = 0;
+	int t_flag, e_flag;
+	struct arena_info *arena = NULL;
+	u32 lane = 0, premap, postmap;
+
+	while (len) {
+		u32 cur_len;
+
+		lane = nd_region_acquire_lane(btt->nd_region);
+
+		ret = lba_to_arena(btt, sector, &premap, &arena);
+		if (ret)
+			goto out_lane;
+
+		cur_len = min(btt->sector_size, len);
+
+		ret = btt_map_read(arena, premap, &postmap, &t_flag, &e_flag);
+		if (ret)
+			goto out_lane;
+
+		/*
+		 * We loop to make sure that the post map LBA didn't change
+		 * from under us between writing the RTT and doing the actual
+		 * read.
+		 */
+		while (1) {
+			u32 new_map;
+
+			if (t_flag) {
+				zero_fill_data(page, off, cur_len);
+				goto out_lane;
+			}
+
+			if (e_flag) {
+				ret = -EIO;
+				goto out_lane;
+			}
+
+			arena->rtt[lane] = RTT_VALID | postmap;
+			/*
+			 * Barrier so that the verification map_read below
+			 * is not reordered ahead of the RTT store above
+			 */
+			barrier();
+
+			ret = btt_map_read(arena, premap, &new_map, &t_flag,
+						&e_flag);
+			if (ret)
+				goto out_rtt;
+
+			if (postmap == new_map)
+				break;
+
+			postmap = new_map;
+		}
+
+		ret = btt_data_read(arena, page, off, postmap, cur_len);
+		if (ret)
+			goto out_rtt;
+
+		arena->rtt[lane] = RTT_INVALID;
+		nd_region_release_lane(btt->nd_region, lane);
+
+		len -= cur_len;
+		off += cur_len;
+		sector += btt->sector_size >> SECTOR_SHIFT;
+	}
+
+	return 0;
+
+ out_rtt:
+	arena->rtt[lane] = RTT_INVALID;
+ out_lane:
+	nd_region_release_lane(btt->nd_region, lane);
+	return ret;
+}
+
+static int btt_write_pg(struct btt *btt, sector_t sector, struct page *page,
+		unsigned int off, unsigned int len)
+{
+	int ret = 0;
+	struct arena_info *arena = NULL;
+	u32 premap = 0, old_postmap, new_postmap, lane = 0, i;
+	struct log_entry log;
+	int sub;
+
+	while (len) {
+		u32 cur_len;
+
+		lane = nd_region_acquire_lane(btt->nd_region);
+
+		ret = lba_to_arena(btt, sector, &premap, &arena);
+		if (ret)
+			goto out_lane;
+		cur_len = min(btt->sector_size, len);
+
+		if ((arena->flags & IB_FLAG_ERROR_MASK) != 0) {
+			ret = -EIO;
+			goto out_lane;
+		}
+
+		new_postmap = arena->freelist[lane].block;
+
+		/* Wait if the new block is being read from */
+		for (i = 0; i < arena->nfree; i++)
+			while (arena->rtt[i] == (RTT_VALID | new_postmap))
+				cpu_relax();
+
+
+		if (new_postmap >= arena->internal_nlba) {
+			ret = -EIO;
+			goto out_lane;
+		} else
+			ret = btt_data_write(arena, new_postmap, page,
+						off, cur_len);
+		if (ret)
+			goto out_lane;
+
+		lock_map(arena, premap);
+		ret = btt_map_read(arena, premap, &old_postmap, NULL, NULL);
+		if (ret)
+			goto out_map;
+		if (old_postmap >= arena->internal_nlba) {
+			ret = -EIO;
+			goto out_map;
+		}
+
+		log.lba = cpu_to_le32(premap);
+		log.old_map = cpu_to_le32(old_postmap);
+		log.new_map = cpu_to_le32(new_postmap);
+		log.seq = cpu_to_le32(arena->freelist[lane].seq);
+		sub = arena->freelist[lane].sub;
+		ret = btt_flog_write(arena, lane, sub, &log);
+		if (ret)
+			goto out_map;
+
+		ret = btt_map_write(arena, premap, new_postmap, 0, 0);
+		if (ret)
+			goto out_map;
+
+		unlock_map(arena, premap);
+		nd_region_release_lane(btt->nd_region, lane);
+
+		len -= cur_len;
+		off += cur_len;
+		sector += btt->sector_size >> SECTOR_SHIFT;
+	}
+
+	return 0;
+
+ out_map:
+	unlock_map(arena, premap);
+ out_lane:
+	nd_region_release_lane(btt->nd_region, lane);
+	return ret;
+}
+
+static int btt_do_bvec(struct btt *btt, struct page *page,
+			unsigned int len, unsigned int off, int rw,
+			sector_t sector)
+{
+	int ret;
+
+	if (rw == READ) {
+		ret = btt_read_pg(btt, page, off, sector, len);
+		flush_dcache_page(page);
+	} else {
+		flush_dcache_page(page);
+		ret = btt_write_pg(btt, sector, page, off, len);
+	}
+
+	return ret;
+}
+
+static void btt_make_request(struct request_queue *q, struct bio *bio)
+{
+	struct block_device *bdev = bio->bi_bdev;
+	struct btt *btt = q->queuedata;
+	int rw;
+	struct bio_vec bvec;
+	sector_t sector;
+	struct bvec_iter iter;
+	int err = 0;
+
+	sector = bio->bi_iter.bi_sector;
+	if (bio_end_sector(bio) > get_capacity(bdev->bd_disk)) {
+		err = -EIO;
+		goto out;
+	}
+
+	BUG_ON(bio->bi_rw & REQ_DISCARD);
+
+	rw = bio_rw(bio);
+	if (rw == READA)
+		rw = READ;
+
+	bio_for_each_segment(bvec, bio, iter) {
+		unsigned int len = bvec.bv_len;
+
+		BUG_ON(len > PAGE_SIZE);
+		/* Make sure len is in multiples of sector size. */
+		/* XXX is this right? */
+		BUG_ON(len < btt->sector_size);
+		BUG_ON(len % btt->sector_size);
+
+		err = btt_do_bvec(btt, bvec.bv_page, len, bvec.bv_offset,
+				rw, sector);
+		if (err) {
+			dev_info(&btt->nd_btt->dev,
+					"io error in %s sector %llu, len %u\n",
+					(rw == READ) ? "READ" : "WRITE",
+					(unsigned long long) sector, len);
+			goto out;
+		}
+		sector += len >> SECTOR_SHIFT;
+	}
+
+out:
+	bio_endio(bio, err);
+}
+
+static int btt_rw_page(struct block_device *bdev, sector_t sector,
+		struct page *page, int rw)
+{
+	struct btt *btt = bdev->bd_disk->private_data;
+	int err = btt_do_bvec(btt, page, PAGE_CACHE_SIZE, 0, rw, sector);
+
+	page_endio(page, rw & WRITE, err);
+	return err;
+}
+
+
+static int btt_getgeo(struct block_device *bd, struct hd_geometry *geo)
+{
+	/* some standard values */
+	geo->heads = 1 << 6;
+	geo->sectors = 1 << 5;
+	geo->cylinders = get_capacity(bd->bd_disk) >> 11;
+	return 0;
+}
+
+static const struct block_device_operations btt_fops = {
+	.owner =		THIS_MODULE,
+	.rw_page =		btt_rw_page,
+	.getgeo =		btt_getgeo,
+};
+
+static int btt_blk_init(struct btt *btt)
+{
+	struct nd_btt *nd_btt = btt->nd_btt;
+	char name[BDEVNAME_SIZE];
+	int ret;
+
+	/* create a new disk and request queue for btt */
+	btt->btt_queue = blk_alloc_queue(GFP_KERNEL);
+	if (!btt->btt_queue)
+		return -ENOMEM;
+
+	btt->btt_disk = alloc_disk(0);
+	if (!btt->btt_disk) {
+		ret = -ENOMEM;
+		goto out_free_queue;
+	}
+
+	sprintf(btt->btt_disk->disk_name, "%ss",
+			bdevname(nd_btt->backing_dev, name));
+	btt->btt_disk->driverfs_dev = &btt->nd_btt->dev;
+	btt->btt_disk->major = btt_major;
+	btt->btt_disk->first_minor = 0;
+	btt->btt_disk->fops = &btt_fops;
+	btt->btt_disk->private_data = btt;
+	btt->btt_disk->queue = btt->btt_queue;
+	btt->btt_disk->flags = GENHD_FL_EXT_DEVT;
+
+	blk_queue_make_request(btt->btt_queue, btt_make_request);
+	blk_queue_max_hw_sectors(btt->btt_queue, 1024);
+	blk_queue_bounce_limit(btt->btt_queue, BLK_BOUNCE_ANY);
+	blk_queue_logical_block_size(btt->btt_queue, btt->sector_size);
+	btt->btt_queue->queuedata = btt;
+
+	set_capacity(btt->btt_disk, btt->nlba * btt->sector_size >> SECTOR_SHIFT);
+	add_disk(btt->btt_disk);
+
+	return 0;
+
+out_free_queue:
+	blk_cleanup_queue(btt->btt_queue);
+	return ret;
+}
+
+static void btt_blk_cleanup(struct btt *btt)
+{
+	del_gendisk(btt->btt_disk);
+	put_disk(btt->btt_disk);
+	blk_cleanup_queue(btt->btt_queue);
+}
+
+/**
+ * btt_init - initialize a block translation table for the given device
+ * @nd_btt:	device with BTT geometry and backing device info
+ * @rawsize:	raw size in bytes of the backing device
+ * @lbasize:	lba size of the backing device
+ * @uuid:	A uuid for the backing device - this is stored on media
+ * @nd_region:	parent region, used to acquire lanes for parallel requests
+ *
+ * Initialize a Block Translation Table on a backing device to provide
+ * single sector power fail atomicity.
+ *
+ * Context:
+ * Might sleep.
+ *
+ * Returns:
+ * Pointer to a new struct btt on success, NULL on failure.
+ */
+static struct btt *btt_init(struct nd_btt *nd_btt, unsigned long long rawsize,
+		u32 lbasize, u8 *uuid, struct nd_region *nd_region)
+{
+	int ret;
+	struct btt *btt;
+	struct device *dev = &nd_btt->dev;
+
+	btt = kzalloc(sizeof(struct btt), GFP_KERNEL);
+	if (!btt)
+		return NULL;
+
+	btt->nd_btt = nd_btt;
+	btt->rawsize = rawsize;
+	btt->lbasize = lbasize;
+	btt->sector_size = ((lbasize >= 4096) ? 4096 : 512);
+	INIT_LIST_HEAD(&btt->arena_list);
+	mutex_init(&btt->init_lock);
+	btt->nd_region = nd_region;
+
+	ret = discover_arenas(btt);
+	if (ret) {
+		dev_err(dev, "init: error in discover_arenas: %d\n", ret);
+		goto out_free;
+	}
+
+	if (btt->init_state != INIT_READY) {
+		btt->num_arenas = (rawsize / ARENA_MAX_SIZE) +
+			((rawsize % ARENA_MAX_SIZE) ? 1 : 0);
+		dev_dbg(dev, "init: %d arenas for %llu rawsize\n",
+				btt->num_arenas, rawsize);
+
+		ret = create_arenas(btt);
+		if (ret) {
+			dev_info(dev, "init: create_arenas: %d\n", ret);
+			goto out_free;
+		}
+
+		ret = btt_meta_init(btt);
+		if (ret) {
+			dev_err(dev, "init: error in meta_init: %d\n", ret);
+			goto out_free;
+		}
+	}
+
+	ret = btt_blk_init(btt);
+	if (ret) {
+		dev_err(dev, "init: error in blk_init: %d\n", ret);
+		goto out_free;
+	}
+
+	btt_debugfs_init(btt);
+
+	return btt;
+
+ out_free:
+	kfree(btt);
+	return NULL;
+}
+
+/**
+ * btt_fini - de-initialize a BTT
+ * @btt:	the BTT handle that was generated by btt_init
+ *
+ * De-initialize a Block Translation Table on device removal
+ *
+ * Context:
+ * Might sleep.
+ */
+static void btt_fini(struct btt *btt)
+{
+	if (btt) {
+		btt_blk_cleanup(btt);
+		free_arenas(btt);
+		debugfs_remove_recursive(btt->debugfs_dir);
+		kfree(btt);
+	}
+}
+
+static int link_btt(struct nd_btt *nd_btt)
+{
+	struct block_device *bdev = nd_btt->backing_dev;
+	struct kobject *dir = &part_to_dev(bdev->bd_part)->kobj;
+
+	return sysfs_create_link(dir, &nd_btt->dev.kobj, "nd_btt");
+}
+
+static void unlink_btt(struct nd_btt *nd_btt)
+{
+	struct block_device *bdev = nd_btt->backing_dev;
+	struct kobject *dir;
+
+	/* if backing_dev was deleted first we may have nothing to unlink */
+	if (!nd_btt->backing_dev)
+		return;
+
+	dir = &part_to_dev(bdev->bd_part)->kobj;
+	sysfs_remove_link(dir, "nd_btt");
+}
+
+static int nd_btt_probe(struct device *dev)
+{
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+	struct nd_io_claim *ndio_claim = nd_btt->ndio_claim;
+	struct nd_region *nd_region;
+	struct block_device *bdev;
+	struct btt *btt;
+	size_t rawsize;
+	int rc;
+
+	if (!ndio_claim || !nd_btt->uuid || !nd_btt->backing_dev
+			|| !nd_btt->lbasize)
+		return -ENODEV;
+
+	rc = link_btt(nd_btt);
+	if (rc)
+		return rc;
+
+	bdev = nd_btt->backing_dev;
+	sync_blockdev(bdev);
+	invalidate_bdev(bdev);
+	/* the first 4K of a device is padding */
+	nd_btt->offset = nd_partition_offset(bdev) + SZ_4K;
+	rawsize = (bdev->bd_part->nr_sects << SECTOR_SHIFT) - SZ_4K;
+	if (rawsize < ARENA_MIN_SIZE) {
+		rc = -ENXIO;
+		goto err_btt;
+	}
+	nd_btt->ndio = nd_btt->ndio_claim->parent;
+	nd_region = to_nd_region(nd_btt->ndio->dev->parent);
+	btt = btt_init(nd_btt, rawsize, nd_btt->lbasize, nd_btt->uuid,
+			nd_region);
+	if (!btt) {
+		rc = -ENOMEM;
+		goto err_btt;
+	}
+	dev_set_drvdata(dev, btt);
+
+	return 0;
+ err_btt:
+	unlink_btt(nd_btt);
+	return rc;
+}
+
+static int nd_btt_remove(struct device *dev)
+{
+	struct nd_btt *nd_btt = to_nd_btt(dev);
+	struct btt *btt = dev_get_drvdata(dev);
+
+	btt_fini(btt);
+	unlink_btt(nd_btt);
+
+	return 0;
+}
+
+static struct nd_device_driver nd_btt_driver = {
+	.probe = nd_btt_probe,
+	.remove = nd_btt_remove,
+	.drv = {
+		.name = "nd_btt",
+	},
+	.type = ND_DRIVER_BTT,
+};
+
+static int __init nd_btt_init(void)
+{
+	int rc;
+
+	BUILD_BUG_ON(sizeof(struct btt_sb) != SZ_4K);
+
+	btt_major = register_blkdev(0, "btt");
+	if (btt_major < 0)
+		return btt_major;
+
+	debugfs_root = debugfs_create_dir("btt", NULL);
+	if (IS_ERR_OR_NULL(debugfs_root)) {
+		rc = -ENXIO;
+		goto err_debugfs;
+	}
+
+	rc = nd_driver_register(&nd_btt_driver);
+	if (rc < 0)
+		goto err_driver;
+	return 0;
+
+ err_driver:
+	debugfs_remove_recursive(debugfs_root);
+ err_debugfs:
+	unregister_blkdev(btt_major, "btt");
+
+	return rc;
+}
+
+static void __exit nd_btt_exit(void)
+{
+	driver_unregister(&nd_btt_driver.drv);
+	debugfs_remove_recursive(debugfs_root);
+	unregister_blkdev(btt_major, "btt");
+}
+
+MODULE_ALIAS_ND_DEVICE(ND_DEVICE_BTT);
+MODULE_AUTHOR("Vishal Verma <vishal.l.verma@linux.intel.com>");
+MODULE_LICENSE("GPL v2");
+module_init(nd_btt_init);
+module_exit(nd_btt_exit);
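
Note on the log sequence numbers used by btt_log_get_old() and
btt_flog_write() above: the following is a minimal userspace sketch, not
part of the patch, of how the two slots of a lane alternate as the sequence
number cycles 1 -> 2 -> 3 -> 1. The names seq_next(), older_slot() and
log_seq_selftest() are hypothetical, sequence numbers are taken as
CPU-endian values, and -1 stands in for -EINVAL. The starting state
(slot 0 at seq 1, slot 1 zeroed) is the one written by btt_log_init(); the
first sequence number written after it is assumed to be 2.

```c
#include <assert.h>
#include <stdint.h>

/* Advance a sequence number through the cycle 1 -> 2 -> 3 -> 1. */
static uint32_t seq_next(uint32_t seq)
{
	return (seq >= 3) ? 1 : seq + 1;
}

/*
 * Userspace port of the comparison in btt_log_get_old(), without the
 * side effect of initializing slot 0. Returns the index (0 or 1) of
 * the older slot, or -1 for a pair that is not a valid state.
 */
static int older_slot(uint32_t seq0, uint32_t seq1)
{
	if (seq0 == 0)
		return 0;		/* slot 0 never written */
	if (seq0 == seq1 || seq0 + seq1 > 5)
		return -1;		/* not a valid pair */
	if (seq0 < seq1)
		return (seq1 - seq0 == 1) ? 0 : 1;
	return (seq0 - seq1 == 1) ? 1 : 0;
}

/*
 * Walk a lane through several writes and check that the slot about to
 * be overwritten is always the one reported as older.
 */
static void log_seq_selftest(void)
{
	uint32_t slot[2] = { 1, 0 };	/* state left by btt_log_init() */
	uint32_t seq = 2;		/* assumed next sequence number */
	int sub = 1;			/* next slot to overwrite */
	int i;

	for (i = 0; i < 12; i++) {
		assert(older_slot(slot[0], slot[1]) == sub);
		slot[sub] = seq;
		sub = 1 - sub;
		seq = seq_next(seq);
	}
}
```

Calling log_seq_selftest() from a small main() exercises two full trips
around the cycle. A difference of 2 between the slots (for example 3 and 1)
is the wrap-around case, where the numerically larger value is the older
one.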
diff --git a/drivers/block/nd/btt.h b/drivers/block/nd/btt.h
index e8f6d8e0ddd3..c9fe38e5b61a 100644
--- a/drivers/block/nd/btt.h
+++ b/drivers/block/nd/btt.h
@@ -19,6 +19,39 @@
 
 #define BTT_SIG_LEN 16
 #define BTT_SIG "BTT_ARENA_INFO\0"
+#define MAP_ENT_SIZE 4
+#define MAP_TRIM_SHIFT 31
+#define MAP_TRIM_MASK (1 << MAP_TRIM_SHIFT)
+#define MAP_ERR_SHIFT 30
+#define MAP_ERR_MASK (1 << MAP_ERR_SHIFT)
+#define MAP_LBA_MASK (~((1 << MAP_TRIM_SHIFT) | (1 << MAP_ERR_SHIFT)))
+#define MAP_ENT_NORMAL 0xC0000000
+#define LOG_ENT_SIZE sizeof(struct log_entry)
+#define ARENA_MIN_SIZE (1UL << 24)	/* 16 MB */
+#define ARENA_MAX_SIZE (1ULL << 39)	/* 512 GB */
+#define RTT_VALID (1UL << 31)
+#define RTT_INVALID 0
+#define INT_LBASIZE_ALIGNMENT 256
+#define BTT_PG_SIZE 4096
+#define BTT_DEFAULT_NFREE ND_MAX_LANES
+#define LOG_SEQ_INIT 1
+
+#define IB_FLAG_ERROR 0x00000001
+#define IB_FLAG_ERROR_MASK 0x00000001
+
+enum btt_init_state {
+	INIT_UNCHECKED = 0,
+	INIT_NOTFOUND,
+	INIT_READY
+};
+
+struct log_entry {
+	__le32 lba;
+	__le32 old_map;
+	__le32 new_map;
+	__le32 seq;
+	__le64 padding[2];
+};
 
 struct btt_sb {
 	u8 signature[BTT_SIG_LEN];
@@ -42,4 +75,112 @@ struct btt_sb {
 	__le64 checksum;
 };
 
+struct free_entry {
+	u32 block;
+	u8 sub;
+	u8 seq;
+};
+
+struct aligned_lock {
+	union {
+		spinlock_t lock;
+		u8 cacheline_padding[L1_CACHE_BYTES];
+	};
+};
+
+/**
+ * struct arena_info - handle for an arena
+ * @size:		Size in bytes this arena occupies on the raw device.
+ *			This includes arena metadata.
+ * @external_lba_start:	The first external LBA in this arena.
+ * @internal_nlba:	Number of internal blocks available in the arena
+ *			including nfree reserved blocks
+ * @internal_lbasize:	Internal and external lba sizes may be different as
+ *			we can round up 'odd' external lbasizes such as 520B
+ *			to be aligned.
+ * @external_nlba:	Number of blocks contributed by the arena to the number
+ *			reported to upper layers. (internal_nlba - nfree)
+ * @external_lbasize:	LBA size as exposed to upper layers.
+ * @nfree:		A reserve number of 'free' blocks that is used to
+ *			handle incoming writes.
+ * @version_major:	Metadata layout version major.
+ * @version_minor:	Metadata layout version minor.
+ * @nextoff:		Offset in bytes to the start of the next arena.
+ * @infooff:		Offset in bytes to the info block of this arena.
+ * @dataoff:		Offset in bytes to the data area of this arena.
+ * @mapoff:		Offset in bytes to the map area of this arena.
+ * @logoff:		Offset in bytes to the log area of this arena.
+ * @info2off:		Offset in bytes to the backup info block of this arena.
+ * @freelist:		Pointer to in-memory list of free blocks
+ * @rtt:		Pointer to in-memory "Read Tracking Table"
+ * @map_locks:		Spinlocks protecting concurrent map writes
+ * @nd_btt:		Pointer to parent nd_btt structure.
+ * @list:		List head for list of arenas
+ * @debugfs_dir:	Debugfs dentry
+ * @flags:		Arena flags - may signify error states.
+ *
+ * arena_info is a per-arena handle. Once an arena is narrowed down for an
+ * IO, this struct is passed around for the duration of the IO.
+ */
+struct arena_info {
+	u64 size;			/* Total bytes for this arena */
+	u64 external_lba_start;
+	u32 internal_nlba;
+	u32 internal_lbasize;
+	u32 external_nlba;
+	u32 external_lbasize;
+	u32 nfree;
+	u16 version_major;
+	u16 version_minor;
+	/* Byte offsets to the different on-media structures */
+	u64 nextoff;
+	u64 infooff;
+	u64 dataoff;
+	u64 mapoff;
+	u64 logoff;
+	u64 info2off;
+	/* Pointers to other in-memory structures for this arena */
+	struct free_entry *freelist;
+	u32 *rtt;
+	struct aligned_lock *map_locks;
+	struct nd_btt *nd_btt;
+	struct list_head list;
+	struct dentry *debugfs_dir;
+	/* Arena flags */
+	u32 flags;
+};
+
+/**
+ * struct btt - handle for a BTT instance
+ * @btt_disk:		Pointer to the gendisk for BTT device
+ * @btt_queue:		Pointer to the request queue for the BTT device
+ * @arena_list:		Head of the list of arenas
+ * @debugfs_dir:	Debugfs dentry
+ * @nd_btt:		Parent nd_btt struct
+ * @nlba:		Number of logical blocks exposed to the upper layers
+ *			after removing the amount of space needed by metadata
+ * @rawsize:		Total size in bytes of the available backing device
+ * @lbasize:		LBA size as requested and presented to upper layers.
+ *			This is sector_size + size of any metadata.
+ * @sector_size:	The Linux sector size - 512 or 4096
+ * @nd_region:		Pointer to the region that supplies the I/O lanes
+ * @init_lock:		Mutex used for the BTT initialization
+ * @init_state:		Flag describing the initialization state for the BTT
+ * @num_arenas:		Number of arenas in the BTT instance
+ */
+struct btt {
+	struct gendisk *btt_disk;
+	struct request_queue *btt_queue;
+	struct list_head arena_list;
+	struct dentry *debugfs_dir;
+	struct nd_btt *nd_btt;
+	u64 nlba;
+	unsigned long long rawsize;
+	u32 lbasize;
+	u32 sector_size;
+	struct nd_region *nd_region;
+	struct mutex init_lock;
+	int init_state;
+	int num_arenas;
+};
 #endif
diff --git a/drivers/block/nd/btt_devs.c b/drivers/block/nd/btt_devs.c
index b3b813288092..fd6755040751 100644
--- a/drivers/block/nd/btt_devs.c
+++ b/drivers/block/nd/btt_devs.c
@@ -342,7 +342,8 @@ struct nd_btt *nd_btt_create(struct nd_bus *nd_bus)
  */
 u64 nd_btt_sb_checksum(struct btt_sb *btt_sb)
 {
-	u64 sum, sum_save;
+	u64 sum;
+	__le64 sum_save;
 
 	sum_save = btt_sb->checksum;
 	btt_sb->checksum = 0;
diff --git a/drivers/block/nd/nd-private.h b/drivers/block/nd/nd-private.h
index 6c89695956a4..6a864e9ae97a 100644
--- a/drivers/block/nd/nd-private.h
+++ b/drivers/block/nd/nd-private.h
@@ -76,6 +76,7 @@ int __init nd_bus_init(void);
 void nd_bus_exit(void);
 int __init nd_dimm_init(void);
 int __init nd_region_init(void);
+void __init nd_region_init_locks(void);
 void nd_dimm_exit(void);
 int nd_region_exit(void);
 void nd_region_probe_start(struct nd_bus *nd_bus, struct device *dev);
diff --git a/drivers/block/nd/nd.h b/drivers/block/nd/nd.h
index 73e830785f74..b706f25da7e5 100644
--- a/drivers/block/nd/nd.h
+++ b/drivers/block/nd/nd.h
@@ -22,6 +22,12 @@
 #include "label.h"
 
 enum {
+	/*
+	 * Limits the maximum number of block apertures a dimm can
+	 * support and is an input to the geometry/on-disk-format of a
+	 * BTT instance
+	 */
+	ND_MAX_LANES = 256,
 	SECTOR_SHIFT = 9,
 };
 
@@ -101,7 +107,7 @@ struct nd_region {
 	u16 ndr_mappings;
 	u64 ndr_size;
 	u64 ndr_start;
-	int id;
+	int id, num_lanes;
 	void *provider_data;
 	struct nd_interleave_set *nd_set;
 	struct nd_mapping mapping[0];
@@ -226,6 +232,8 @@ struct nd_btt *to_nd_btt(struct device *dev);
 struct btt_sb;
 u64 nd_btt_sb_checksum(struct btt_sb *btt_sb);
 struct nd_region *to_nd_region(struct device *dev);
+unsigned int nd_region_acquire_lane(struct nd_region *nd_region);
+void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane);
 int nd_region_to_namespace_type(struct nd_region *nd_region);
 int nd_region_register_namespaces(struct nd_region *nd_region, int *err);
 u64 nd_region_interleave_set_cookie(struct nd_region *nd_region);
diff --git a/drivers/block/nd/region.c b/drivers/block/nd/region.c
index 31bb33962e14..0e872f54dcd2 100644
--- a/drivers/block/nd/region.c
+++ b/drivers/block/nd/region.c
@@ -10,18 +10,106 @@
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  * General Public License for more details.
  */
+#include <linux/cpumask.h>
 #include <linux/module.h>
 #include <linux/device.h>
 #include <linux/nd.h>
 #include "nd.h"
 
+struct nd_percpu_lane {
+	int count[CONFIG_ND_MAX_REGIONS];
+	spinlock_t lock[CONFIG_ND_MAX_REGIONS];
+};
+
+static DEFINE_PER_CPU(struct nd_percpu_lane, nd_percpu_lane);
+
+static void __init nd_region_init_locks(void)
+{
+	unsigned int i, j;
+
+	for (i = 0; i < nr_cpu_ids; i++)
+		for (j = 0; j < CONFIG_ND_MAX_REGIONS; j++) {
+			struct nd_percpu_lane *ndl;
+
+			ndl = per_cpu_ptr(&nd_percpu_lane, i);
+			spin_lock_init(&ndl->lock[j]);
+			ndl->count[j] = 0;
+		}
+}
+
+/**
+ * nd_region_acquire_lane - allocate and lock a lane
+ * @nd_region: region id and number of lanes possible
+ *
+ * A lane correlates to a BLK-data-window and/or a log slot in the BTT.
+ * We optimize for the common case where there are 256 lanes, one
+ * per-cpu.  For larger systems we need to lock to share lanes.  For now
+ * this implementation assumes the cost of maintaining an allocator for
+ * free lanes is on the order of the lock hold time, so it implements a
+ * static lane = cpu % num_lanes mapping.
+ *
+ * In the case of a BTT instance on top of a BLK namespace a lane may be
+ * acquired recursively.  We lock on the first instance.
+ *
+ * In the case of a BTT instance on top of PMEM, we only acquire a lane
+ * for the BTT metadata updates.
+ */
+unsigned int nd_region_acquire_lane(struct nd_region *nd_region)
+{
+	unsigned int cpu, lane;
+
+	cpu = get_cpu();
+	if (nd_region->num_lanes < nr_cpu_ids) {
+		struct nd_percpu_lane *ndl_lock, *ndl_count;
+		unsigned int id = nd_region->id;
+
+		lane = cpu % nd_region->num_lanes;
+		ndl_count = per_cpu_ptr(&nd_percpu_lane, cpu);
+		ndl_lock = per_cpu_ptr(&nd_percpu_lane, lane);
+		if (ndl_count->count[id]++ == 0)
+			spin_lock(&ndl_lock->lock[id]);
+	} else
+		lane = cpu;
+
+	return lane;
+}
+EXPORT_SYMBOL(nd_region_acquire_lane);
+
+void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane)
+{
+	if (nd_region->num_lanes < nr_cpu_ids) {
+		unsigned int cpu = get_cpu();
+		unsigned int id = nd_region->id;
+		struct nd_percpu_lane *ndl_lock, *ndl_count;
+
+		ndl_count = per_cpu_ptr(&nd_percpu_lane, cpu);
+		ndl_lock = per_cpu_ptr(&nd_percpu_lane, lane);
+		if (--ndl_count->count[id] == 0)
+			spin_unlock(&ndl_lock->lock[id]);
+		put_cpu();
+	}
+	put_cpu();
+}
+EXPORT_SYMBOL(nd_region_release_lane);
+
 static int nd_region_probe(struct device *dev)
 {
 	int err;
+	static unsigned long once;
 	struct nd_region_namespaces *num_ns;
 	struct nd_region *nd_region = to_nd_region(dev);
 	int rc = nd_region_register_namespaces(nd_region, &err);
 
+	if (nd_region->num_lanes > num_online_cpus()
+			&& nd_region->num_lanes < num_possible_cpus()
+			&& !test_and_set_bit(0, &once)) {
+		dev_info(dev, "online cpus (%d) < concurrent i/o lanes (%d) < possible cpus (%d)\n",
+				num_online_cpus(), nd_region->num_lanes,
+				num_possible_cpus());
+		dev_info(dev, "setting nr_cpus=%d may yield better libnd device performance\n",
+				nd_region->num_lanes);
+	}
+
 	num_ns = devm_kzalloc(dev, sizeof(*num_ns), GFP_KERNEL);
 	if (!num_ns)
 		return -ENOMEM;
@@ -84,6 +172,7 @@ static struct nd_device_driver nd_region_driver = {
 
 int __init nd_region_init(void)
 {
+	nd_region_init_locks();
 	return nd_driver_register(&nd_region_driver);
 }
 
diff --git a/drivers/block/nd/region_devs.c b/drivers/block/nd/region_devs.c
index 1ae6bb44c371..4965004147ae 100644
--- a/drivers/block/nd/region_devs.c
+++ b/drivers/block/nd/region_devs.c
@@ -543,6 +543,12 @@ static noinline struct nd_region *nd_region_create(struct nd_bus *nd_bus,
 	if (nd_region->id < 0) {
 		kfree(nd_region);
 		return NULL;
+	} else if (nd_region->id >= CONFIG_ND_MAX_REGIONS) {
+		dev_err(&nd_bus->dev, "max region limit %d reached\n",
+				CONFIG_ND_MAX_REGIONS);
+		ida_simple_remove(&region_ida, nd_region->id);
+		kfree(nd_region);
+		return NULL;
 	}
 
 	memcpy(nd_region->mapping, ndr_desc->nd_mapping,
@@ -556,6 +562,7 @@ static noinline struct nd_region *nd_region_create(struct nd_bus *nd_bus,
 	nd_region->ndr_mappings = ndr_desc->num_mappings;
 	nd_region->provider_data = ndr_desc->provider_data;
 	nd_region->nd_set = ndr_desc->nd_set;
+	nd_region->num_lanes = ndr_desc->num_lanes;
 	ida_init(&nd_region->ns_ida);
 	dev = &nd_region->dev;
 	dev_set_name(dev, "region%d", nd_region->id);
@@ -572,6 +579,7 @@ static noinline struct nd_region *nd_region_create(struct nd_bus *nd_bus,
 struct nd_region *nd_pmem_region_create(struct nd_bus *nd_bus,
 		struct nd_region_desc *ndr_desc)
 {
+	ndr_desc->num_lanes = ND_MAX_LANES;
 	return nd_region_create(nd_bus, ndr_desc, &nd_pmem_device_type);
 }
 EXPORT_SYMBOL_GPL(nd_pmem_region_create);
@@ -581,6 +589,7 @@ struct nd_region *nd_blk_region_create(struct nd_bus *nd_bus,
 {
 	if (ndr_desc->num_mappings > 1)
 		return NULL;
+	ndr_desc->num_lanes = min(ndr_desc->num_lanes, ND_MAX_LANES);
 	return nd_region_create(nd_bus, ndr_desc, &nd_blk_device_type);
 }
 EXPORT_SYMBOL_GPL(nd_blk_region_create);
@@ -588,6 +597,7 @@ EXPORT_SYMBOL_GPL(nd_blk_region_create);
 struct nd_region *nd_volatile_region_create(struct nd_bus *nd_bus,
 		struct nd_region_desc *ndr_desc)
 {
+	ndr_desc->num_lanes = ND_MAX_LANES;
 	return nd_region_create(nd_bus, ndr_desc, &nd_volatile_device_type);
 }
 EXPORT_SYMBOL_GPL(nd_volatile_region_create);
diff --git a/include/linux/libnd.h b/include/linux/libnd.h
index 43f58330d14c..6146690b23e7 100644
--- a/include/linux/libnd.h
+++ b/include/linux/libnd.h
@@ -76,6 +76,7 @@ struct nd_region_desc {
 	const struct attribute_group **attr_groups;
 	struct nd_interleave_set *nd_set;
 	void *provider_data;
+	int num_lanes;
 };
 
 struct nd_bus;

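For readers following the lane logic in nd_region_acquire_lane() and
nd_region_release_lane() above, here is a minimal userspace sketch of the
static "lane = cpu % num_lanes" mapping with a per-cpu recursion count.
It is a model only: MODEL_NR_CPUS, struct lane_model and the model_*
helpers are hypothetical names, and a plain flag stands in for the
per-lane spinlock.

```c
#include <assert.h>

/* Hypothetical userspace model; none of these names are kernel API. */
enum { MODEL_NR_CPUS = 8 };

struct lane_model {
	unsigned int num_lanes;		/* lanes the region supports */
	int count[MODEL_NR_CPUS];	/* per-cpu recursion depth */
	int locked[MODEL_NR_CPUS];	/* stand-in for the per-lane spinlock */
};

/* Return the lane for @cpu, taking the lane lock on the outermost acquire. */
static unsigned int model_acquire_lane(struct lane_model *m, unsigned int cpu)
{
	unsigned int lane;

	assert(cpu < MODEL_NR_CPUS && m->num_lanes > 0);
	if (m->num_lanes < MODEL_NR_CPUS) {
		/* fewer lanes than cpus: cpus share lanes and must lock */
		lane = cpu % m->num_lanes;
		if (m->count[cpu]++ == 0) {
			/* a real spinlock would spin here instead */
			assert(!m->locked[lane]);
			m->locked[lane] = 1;
		}
	} else {
		/* at least one lane per cpu: no sharing, no lock */
		lane = cpu;
	}
	return lane;
}

/* Drop one reference; unlock the lane when the outermost user leaves. */
static void model_release_lane(struct lane_model *m, unsigned int cpu,
		unsigned int lane)
{
	if (m->num_lanes < MODEL_NR_CPUS) {
		if (--m->count[cpu] == 0)
			m->locked[lane] = 0;
	}
}
```

The nested acquire models the BTT-on-BLK case described in the comment:
the second acquire on the same cpu only bumps the count, and the lock is
dropped when the count returns to zero.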


* [PATCH v3 19/21] libnd, nfit, nd_blk: driver for BLK-mode access persistent memory
  2015-05-20 20:56 ` Dan Williams
@ 2015-05-20 20:57   ` Dan Williams
  -1 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2015-05-20 20:57 UTC (permalink / raw)
  To: axboe
  Cc: Boaz Harrosh, linux-nvdimm, neilb, gregkh, linux-kernel,
	Andy Lutomirski, Jens Axboe, linux-acpi, jmoyer, H. Peter Anvin,
	Ross Zwisler, hch, mingo

From: Ross Zwisler <ross.zwisler@linux.intel.com>

The libnd implementation handles allocating dimm address space (DPA)
between PMEM and BLK mode interfaces.  After DPA has been allocated from
a BLK-region to a BLK-namespace the nd_blk driver attaches to handle I/O
as a struct bio based block device. Unlike PMEM, BLK is required to
handle platform specific details like mmio register formats and memory
controller interleave.  For this reason the libnd generic nd_blk driver
calls back into the bus provider to carry out the I/O.
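
The callback split described above can be sketched in userspace roughly
as follows.  This is an illustration under assumptions, not the code in
this patch: struct model_blk_region, struct model_provider and the
model_* functions are hypothetical, and a byte array stands in for the
media behind the provider.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a generic driver delegating I/O to a provider. */
struct model_blk_region {
	void *provider_data;
	int (*do_io)(struct model_blk_region *r, void *iobuf, size_t len,
			int write, size_t dpa);
};

/* Example provider: backs the region with a plain byte array. */
struct model_provider {
	unsigned char media[256];
};

/* Provider side: the only code that knows how to reach the media. */
static int model_provider_do_io(struct model_blk_region *r, void *iobuf,
		size_t len, int write, size_t dpa)
{
	struct model_provider *p = r->provider_data;

	if (dpa > sizeof(p->media) || len > sizeof(p->media) - dpa)
		return -1;	/* request falls outside the media */
	if (write)
		memcpy(p->media + dpa, iobuf, len);
	else
		memcpy(iobuf, p->media + dpa, len);
	return 0;
}

/* Generic side: knows nothing about the media, only the callback. */
static int model_blk_rw(struct model_blk_region *r, void *buf, size_t len,
		int write, size_t dpa)
{
	assert(r->do_io);
	return r->do_io(r, buf, len, write, dpa);
}
```

In the patch itself the equivalent hooks are the enable, disable and
do_io members that acpi_nfit_register_region() fills in on the
nd_blk_region_desc.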

This initial implementation handles the BLK interface defined by the
ACPI 6 NFIT [1] and the NVDIMM DSM Interface Example [2].  The
interface is composed of the DCR (dimm control region), BDW (block data
window), and IDT (interleave descriptor) NFIT structures together with
the hardware register format.

[1]: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
[2]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
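
As a worked illustration of the register format and interleave handling,
the sketch below mirrors the arithmetic of write_blk_ctl() and
to_interleave_offset() from this patch in plain userspace C.  The
model_* names and struct model_interleave are hypothetical, the 64-byte
cache line (MODEL_CACHE_SHIFT) is an assumption, and the write flag is
normalized to 0/1, which the patch does not do.

```c
#include <assert.h>
#include <stdint.h>

enum { MODEL_CACHE_SHIFT = 6 };	/* assumption: 64-byte cache lines */

/*
 * Pack a block control word: bits 0-47 hold the dpa in cache lines,
 * bits 48-55 the length in cache lines, bit 56 the write flag.
 */
static uint64_t model_pack_bcw(uint64_t dpa, uint32_t len, int write)
{
	const uint64_t offset_mask = (1ULL << 48) - 1;
	const uint64_t len_mask = (1ULL << 8) - 1;
	uint64_t cmd;

	cmd = (dpa >> MODEL_CACHE_SHIFT) & offset_mask;
	cmd |= ((uint64_t)(len >> MODEL_CACHE_SHIFT) & len_mask) << 48;
	cmd |= (uint64_t)(write != 0) << 56;
	return cmd;
}

/* Hypothetical stand-in for the interleave fields of struct nfit_blk_mmio. */
struct model_interleave {
	uint32_t line_size;		/* bytes per interleave line */
	uint32_t num_lines;		/* entries in line_offset[] */
	uint64_t table_size;		/* num_lines * ways * line_size */
	uint64_t base_offset;		/* region offset of this dimm */
	const uint32_t *line_offset;	/* per-line offsets, in lines */
};

/* Translate a dimm-relative offset into an offset in the mapped range. */
static uint64_t model_interleave_offset(uint64_t offset,
		const struct model_interleave *m)
{
	uint64_t line_no = offset / m->line_size;
	uint32_t sub_line_offset = (uint32_t)(offset % m->line_size);
	uint64_t table_skip_count = line_no / m->num_lines;
	uint32_t line_index = (uint32_t)(line_no % m->num_lines);
	uint64_t line_offset = (uint64_t)m->line_offset[line_index]
		* m->line_size;

	return m->base_offset + line_offset
		+ table_skip_count * m->table_size + sub_line_offset;
}
```

For example, with 256-byte lines, two lines per table and a 2-way
interleave (table_size 1024), an offset of 520 lands in the second pass
over the table, 8 bytes into this dimm's first line.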

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/acpi/nfit.c               |  442 +++++++++++++++++++++++++++++++++++--
 drivers/acpi/nfit.h               |   50 ++++
 drivers/block/nd/Kconfig          |   13 +
 drivers/block/nd/Makefile         |    3 
 drivers/block/nd/blk.c            |  252 +++++++++++++++++++++
 drivers/block/nd/dimm_devs.c      |    9 +
 drivers/block/nd/namespace_devs.c |   47 ++++
 drivers/block/nd/nd-private.h     |    3 
 drivers/block/nd/nd.h             |   13 +
 drivers/block/nd/region.c         |    8 +
 drivers/block/nd/region_devs.c    |   90 ++++++--
 include/linux/libnd.h             |   17 +
 12 files changed, 909 insertions(+), 38 deletions(-)
 create mode 100644 drivers/block/nd/blk.c

diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index a9aca87301c6..c4ce498da9eb 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -13,12 +13,16 @@
 #include <linux/list_sort.h>
 #include <linux/module.h>
 #include <linux/libnd.h>
+#include <linux/mutex.h>
 #include <linux/ndctl.h>
 #include <linux/list.h>
 #include <linux/acpi.h>
 #include <linux/sort.h>
+#include <linux/io.h>
 #include "nfit.h"
 
+#include <asm-generic/io-64-nonatomic-hi-lo.h>
+
 static bool force_enable_dimms;
 module_param(force_enable_dimms, bool, S_IRUGO|S_IWUSR);
 MODULE_PARM_DESC(force_enable_dimms, "Ignore _STA (ACPI DIMM device) status");
@@ -71,7 +75,7 @@ static int acpi_nfit_ctl(struct nd_bus_descriptor *nd_desc,
 
 		if (!adev)
 			return -ENOTTY;
-		dimm_name = dev_name(&adev->dev);
+		dimm_name = nd_dimm_name(nd_dimm);
 		cmd_name = nd_dimm_cmd_name(cmd);
 		dsm_mask = nfit_mem->dsm_mask;
 		desc = nd_cmd_dimm_desc(cmd);
@@ -266,10 +270,20 @@ static void *add_table(struct acpi_nfit_desc *acpi_desc, void *table, const void
 				bdw->region_index, bdw->windows);
 		break;
 	}
-	/* TODO */
-	case ACPI_NFIT_TYPE_INTERLEAVE:
-		dev_dbg(dev, "%s: idt\n", __func__);
+	case ACPI_NFIT_TYPE_INTERLEAVE: {
+		struct nfit_idt *nfit_idt = devm_kzalloc(dev, sizeof(*nfit_idt),
+				GFP_KERNEL);
+		struct acpi_nfit_interleave *idt = table;
+
+		if (!nfit_idt)
+			return err;
+		INIT_LIST_HEAD(&nfit_idt->list);
+		nfit_idt->idt = idt;
+		list_add_tail(&nfit_idt->list, &acpi_desc->idts);
+		dev_dbg(dev, "%s: idt index: %d num_lines: %d\n", __func__,
+				idt->interleave_index, idt->line_count);
 		break;
+	}
 	case ACPI_NFIT_TYPE_FLUSH_ADDRESS:
 		dev_dbg(dev, "%s: flush\n", __func__);
 		break;
@@ -321,8 +335,11 @@ static int nfit_mem_add(struct acpi_nfit_desc *acpi_desc,
 		struct nfit_mem *nfit_mem, struct acpi_nfit_system_address *spa)
 {
 	u16 dcr_index = __to_nfit_memdev(nfit_mem)->region_index;
+	struct nfit_memdev *nfit_memdev;
 	struct nfit_dcr *nfit_dcr;
 	struct nfit_bdw *nfit_bdw;
+	struct nfit_idt *nfit_idt;
+	u16 idt_index, range_index;
 
 	list_for_each_entry(nfit_dcr, &acpi_desc->dcrs, list) {
 		if (nfit_dcr->dcr->region_index != dcr_index)
@@ -355,6 +372,26 @@ static int nfit_mem_add(struct acpi_nfit_desc *acpi_desc,
 		return 0;
 
 	nfit_mem_find_spa_bdw(acpi_desc, nfit_mem);
+
+	if (!nfit_mem->spa_bdw)
+		return 0;
+
+	range_index = nfit_mem->spa_bdw->range_index;
+	list_for_each_entry(nfit_memdev, &acpi_desc->memdevs, list) {
+		if (nfit_memdev->memdev->range_index != range_index ||
+				nfit_memdev->memdev->region_index != dcr_index)
+			continue;
+		nfit_mem->memdev_bdw = nfit_memdev->memdev;
+		idt_index = nfit_memdev->memdev->interleave_index;
+		list_for_each_entry(nfit_idt, &acpi_desc->idts, list) {
+			if (nfit_idt->idt->interleave_index != idt_index)
+				continue;
+			nfit_mem->idt_bdw = nfit_idt->idt;
+			break;
+		}
+		break;
+	}
+
 	return 0;
 }
 
@@ -398,9 +435,19 @@ static int nfit_mem_dcr_init(struct acpi_nfit_desc *acpi_desc,
 		}
 
 		if (type == NFIT_SPA_DCR) {
+			struct nfit_idt *nfit_idt;
+			u16 idt_index;
+
 			/* multiple dimms may share a SPA when interleaved */
 			nfit_mem->spa_dcr = spa;
 			nfit_mem->memdev_dcr = nfit_memdev->memdev;
+			idt_index = nfit_memdev->memdev->interleave_index;
+			list_for_each_entry(nfit_idt, &acpi_desc->idts, list) {
+				if (nfit_idt->idt->interleave_index != idt_index)
+					continue;
+				nfit_mem->idt_dcr = nfit_idt->idt;
+				break;
+			}
 		} else {
 			/*
 			 * A single dimm may belong to multiple SPA-PM
@@ -830,13 +877,362 @@ static int acpi_nfit_init_interleave_set(struct acpi_nfit_desc *acpi_desc,
 	return 0;
 }
 
+static u64 to_interleave_offset(u64 offset, struct nfit_blk_mmio *mmio)
+{
+	struct acpi_nfit_interleave *idt = mmio->idt;
+	u32 sub_line_offset, line_index, line_offset;
+	u64 line_no, table_skip_count, table_offset;
+
+	line_no = div_u64_rem(offset, mmio->line_size, &sub_line_offset);
+	table_skip_count = div_u64_rem(line_no, mmio->num_lines, &line_index);
+	line_offset = idt->line_offset[line_index]
+		* mmio->line_size;
+	table_offset = table_skip_count * mmio->table_size;
+
+	return mmio->base_offset + line_offset + table_offset + sub_line_offset;
+}
+
+static u64 read_blk_stat(struct nfit_blk *nfit_blk, unsigned int bw)
+{
+	struct nfit_blk_mmio *mmio = &nfit_blk->mmio[DCR];
+	u64 offset = nfit_blk->stat_offset + mmio->size * bw;
+
+	if (mmio->num_lines)
+		offset = to_interleave_offset(offset, mmio);
+
+	return readq(mmio->base + offset);
+}
+
+static void write_blk_ctl(struct nfit_blk *nfit_blk, unsigned int bw,
+		resource_size_t dpa, unsigned int len, unsigned int write)
+{
+	u64 cmd, offset;
+	struct nfit_blk_mmio *mmio = &nfit_blk->mmio[DCR];
+
+	enum {
+		BCW_OFFSET_MASK = (1ULL << 48)-1,
+		BCW_LEN_SHIFT = 48,
+		BCW_LEN_MASK = (1ULL << 8) - 1,
+		BCW_CMD_SHIFT = 56,
+	};
+
+	cmd = (dpa >> L1_CACHE_SHIFT) & BCW_OFFSET_MASK;
+	len = len >> L1_CACHE_SHIFT;
+	cmd |= ((u64) len & BCW_LEN_MASK) << BCW_LEN_SHIFT;
+	cmd |= ((u64) write) << BCW_CMD_SHIFT;
+
+	offset = nfit_blk->cmd_offset + mmio->size * bw;
+	if (mmio->num_lines)
+		offset = to_interleave_offset(offset, mmio);
+
+	writeq(cmd, mmio->base + offset);
+	/* FIXME: conditionally perform read-back if mandated by firmware */
+}
+
+static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk, void *iobuf,
+		unsigned int len, int write, resource_size_t dpa,
+		unsigned int bw)
+{
+	struct nfit_blk_mmio *mmio = &nfit_blk->mmio[BDW];
+	unsigned int copied = 0;
+	u64 base_offset;
+	int rc;
+
+	base_offset = nfit_blk->bdw_offset + dpa % L1_CACHE_BYTES + bw * mmio->size;
+	/* TODO: non-temporal access, flush hints, cache management etc... */
+	write_blk_ctl(nfit_blk, bw, dpa, len, write);
+	while (len) {
+		unsigned int c;
+		u64 offset;
+
+		if (mmio->num_lines) {
+			u32 line_offset;
+
+			offset = to_interleave_offset(base_offset + copied,
+					mmio);
+			div_u64_rem(offset, mmio->line_size, &line_offset);
+			c = min(len, mmio->line_size - line_offset);
+		} else {
+			offset = base_offset + nfit_blk->bdw_offset;
+			c = len;
+		}
+
+		if (write)
+			memcpy(mmio->base + offset, iobuf + copied, c);
+		else
+			memcpy(iobuf + copied, mmio->base + offset, c);
+
+		copied += c;
+		len -= c;
+	}
+	rc = read_blk_stat(nfit_blk, bw) ? -EIO : 0;
+	return rc;
+}
+
+static int acpi_nfit_blk_region_do_io(struct nd_blk_region *ndbr, void *iobuf,
+		u64 len, int write, resource_size_t dpa)
+{
+	struct nfit_blk *nfit_blk = nd_blk_region_provider_data(ndbr);
+	struct nfit_blk_mmio *mmio = &nfit_blk->mmio[BDW];
+	struct nd_region *nd_region = nfit_blk->nd_region;
+	unsigned int bw, copied = 0;
+	int rc = 0;
+
+	bw = nd_region_acquire_lane(nd_region);
+	while (len) {
+		u64 c = min(len, mmio->size);
+
+		rc = acpi_nfit_blk_single_io(nfit_blk, iobuf + copied, c, write,
+				dpa + copied, bw);
+		if (rc)
+			break;
+
+		copied += c;
+		len -= c;
+	}
+	nd_region_release_lane(nd_region, bw);
+
+	return rc;
+}
+
+static void nfit_spa_mapping_release(struct kref *kref)
+{
+	struct nfit_spa_mapping *spa_map = to_spa_map(kref);
+	struct acpi_nfit_system_address *spa = spa_map->spa;
+	struct acpi_nfit_desc *acpi_desc = spa_map->acpi_desc;
+
+	WARN_ON(!mutex_is_locked(&acpi_desc->spa_map_mutex));
+	dev_dbg(acpi_desc->dev, "%s: SPA%d\n", __func__, spa->range_index);
+	iounmap(spa_map->iomem);
+	release_mem_region(spa->address, spa->length);
+	list_del(&spa_map->list);
+	kfree(spa_map);
+}
+
+static struct nfit_spa_mapping *find_spa_mapping(struct acpi_nfit_desc *acpi_desc,
+		struct acpi_nfit_system_address *spa)
+{
+	struct nfit_spa_mapping *spa_map;
+
+	WARN_ON(!mutex_is_locked(&acpi_desc->spa_map_mutex));
+	list_for_each_entry(spa_map, &acpi_desc->spa_maps, list)
+		if (spa_map->spa == spa)
+			return spa_map;
+
+	return NULL;
+}
+
+static void nfit_spa_unmap(struct acpi_nfit_desc *acpi_desc,
+		struct acpi_nfit_system_address *spa)
+{
+	struct nfit_spa_mapping *spa_map;
+
+	mutex_lock(&acpi_desc->spa_map_mutex);
+	spa_map = find_spa_mapping(acpi_desc, spa);
+
+	if (spa_map)
+		kref_put(&spa_map->kref, nfit_spa_mapping_release);
+	mutex_unlock(&acpi_desc->spa_map_mutex);
+}
+
+static void *__nfit_spa_map(struct acpi_nfit_desc *acpi_desc,
+		struct acpi_nfit_system_address *spa)
+{
+	resource_size_t start = spa->address;
+	resource_size_t n = spa->length;
+	struct nfit_spa_mapping *spa_map;
+	struct resource *res;
+
+	WARN_ON(!mutex_is_locked(&acpi_desc->spa_map_mutex));
+
+	spa_map = find_spa_mapping(acpi_desc, spa);
+	if (spa_map) {
+		kref_get(&spa_map->kref);
+		return spa_map->iomem;
+	}
+
+	spa_map = kzalloc(sizeof(*spa_map), GFP_KERNEL);
+	if (!spa_map)
+		return NULL;
+
+	INIT_LIST_HEAD(&spa_map->list);
+	spa_map->spa = spa;
+	kref_init(&spa_map->kref);
+	spa_map->acpi_desc = acpi_desc;
+
+	res = request_mem_region(start, n, dev_name(acpi_desc->dev));
+	if (!res)
+		goto err_mem;
+
+	/* TODO: cacheability based on the spa type */
+	spa_map->iomem = ioremap_nocache(start, n);
+	if (!spa_map->iomem)
+		goto err_map;
+
+	list_add_tail(&spa_map->list, &acpi_desc->spa_maps);
+	return spa_map->iomem;
+
+ err_map:
+	release_mem_region(start, n);
+ err_mem:
+	kfree(spa_map);
+	return NULL;
+}
+
+/**
+ * nfit_spa_map - interleave-aware managed-mappings of acpi_nfit_system_address ranges
+ * @acpi_desc: NFIT-bus descriptor that provided the spa table entry
+ * @spa: spa table to map
+ *
+ * In the case where block-data-window apertures and
+ * dimm-control-regions are interleaved they will end up sharing a
+ * single request_mem_region() + ioremap() for the address range.  In
+ * the style of devm nfit_spa_map() mappings are automatically dropped
+ * when all region devices referencing the same mapping are disabled /
+ * unbound.
+ */
+static void *nfit_spa_map(struct acpi_nfit_desc *acpi_desc,
+		struct acpi_nfit_system_address *spa)
+{
+	void *iomem;
+
+	mutex_lock(&acpi_desc->spa_map_mutex);
+	iomem = __nfit_spa_map(acpi_desc, spa);
+	mutex_unlock(&acpi_desc->spa_map_mutex);
+
+	return iomem;
+}
+
+static int nfit_blk_init_interleave(struct nfit_blk_mmio *mmio,
+		struct acpi_nfit_interleave *idt, u16 interleave_ways)
+{
+	if (idt) {
+		mmio->num_lines = idt->line_count;
+		mmio->line_size = idt->line_size;
+		if (interleave_ways == 0)
+			return -ENXIO;
+		mmio->table_size = mmio->num_lines * interleave_ways
+			* mmio->line_size;
+	}
+
+	return 0;
+}
+
+static int acpi_nfit_blk_region_enable(struct nd_bus *nd_bus, struct device *dev)
+{
+	struct nd_bus_descriptor *nd_desc = to_nd_desc(nd_bus);
+	struct acpi_nfit_desc *acpi_desc = to_acpi_desc(nd_desc);
+	struct nd_blk_region *ndbr = to_nd_blk_region(dev);
+	struct nfit_blk_mmio *mmio;
+	struct nfit_blk *nfit_blk;
+	struct nfit_mem *nfit_mem;
+	struct nd_dimm *nd_dimm;
+	int rc;
+
+	nd_dimm = nd_blk_region_to_dimm(ndbr);
+	nfit_mem = nd_dimm_provider_data(nd_dimm);
+	if (!nfit_mem || !nfit_mem->dcr || !nfit_mem->bdw) {
+		dev_dbg(dev, "%s: missing%s%s%s\n", __func__,
+				nfit_mem ? "" : " nfit_mem",
+				(nfit_mem && nfit_mem->dcr) ? "" : " dcr",
+				(nfit_mem && nfit_mem->bdw) ? "" : " bdw");
+		return -ENXIO;
+	}
+
+	nfit_blk = devm_kzalloc(dev, sizeof(*nfit_blk), GFP_KERNEL);
+	if (!nfit_blk)
+		return -ENOMEM;
+	nd_blk_region_set_provider_data(ndbr, nfit_blk);
+	nfit_blk->nd_region = to_nd_region(dev);
+
+	/* map block aperture memory */
+	nfit_blk->bdw_offset = nfit_mem->bdw->offset;
+	mmio = &nfit_blk->mmio[BDW];
+	mmio->base = nfit_spa_map(acpi_desc, nfit_mem->spa_bdw);
+	if (!mmio->base) {
+		dev_dbg(dev, "%s: %s failed to map bdw\n", __func__,
+				nd_dimm_name(nd_dimm));
+		return -ENOMEM;
+	}
+	mmio->size = nfit_mem->bdw->size;
+	mmio->base_offset = nfit_mem->memdev_bdw->region_offset;
+	mmio->idt = nfit_mem->idt_bdw;
+	mmio->spa = nfit_mem->spa_bdw;
+	rc = nfit_blk_init_interleave(mmio, nfit_mem->idt_bdw,
+			nfit_mem->memdev_bdw->interleave_ways);
+	if (rc) {
+		dev_dbg(dev, "%s: %s failed to init bdw interleave\n",
+				__func__, nd_dimm_name(nd_dimm));
+		return rc;
+	}
+
+	/* map block control memory */
+	nfit_blk->cmd_offset = nfit_mem->dcr->command_offset;
+	nfit_blk->stat_offset = nfit_mem->dcr->status_offset;
+	mmio = &nfit_blk->mmio[DCR];
+	mmio->base = nfit_spa_map(acpi_desc, nfit_mem->spa_dcr);
+	if (!mmio->base) {
+		dev_dbg(dev, "%s: %s failed to map dcr\n", __func__,
+				nd_dimm_name(nd_dimm));
+		return -ENOMEM;
+	}
+	mmio->size = nfit_mem->dcr->window_size;
+	mmio->base_offset = nfit_mem->memdev_dcr->region_offset;
+	mmio->idt = nfit_mem->idt_dcr;
+	mmio->spa = nfit_mem->spa_dcr;
+	rc = nfit_blk_init_interleave(mmio, nfit_mem->idt_dcr,
+			nfit_mem->memdev_dcr->interleave_ways);
+	if (rc) {
+		dev_dbg(dev, "%s: %s failed to init dcr interleave\n",
+				__func__, nd_dimm_name(nd_dimm));
+		return rc;
+	}
+
+	if (mmio->line_size == 0)
+		return 0;
+
+	if ((u32) nfit_blk->cmd_offset % mmio->line_size + 8 > mmio->line_size) {
+		dev_dbg(dev, "cmd_offset crosses interleave boundary\n");
+		return -ENXIO;
+	} else if ((u32) nfit_blk->stat_offset % mmio->line_size + 8 > mmio->line_size) {
+		dev_dbg(dev, "stat_offset crosses interleave boundary\n");
+		return -ENXIO;
+	}
+
+	return 0;
+}
+
+static void acpi_nfit_blk_region_disable(struct nd_bus *nd_bus,
+		struct device *dev)
+{
+	struct nd_bus_descriptor *nd_desc = to_nd_desc(nd_bus);
+	struct acpi_nfit_desc *acpi_desc = to_acpi_desc(nd_desc);
+	struct nd_blk_region *ndbr = to_nd_blk_region(dev);
+	struct nfit_blk *nfit_blk = nd_blk_region_provider_data(ndbr);
+	int i;
+
+	if (!nfit_blk)
+		return; /* never enabled */
+
+	/* auto-free BLK spa mappings */
+	for (i = 0; i < 2; i++) {
+		struct nfit_blk_mmio *mmio = &nfit_blk->mmio[i];
+
+		if (mmio->base)
+			nfit_spa_unmap(acpi_desc, mmio->spa);
+	}
+	nd_blk_region_set_provider_data(ndbr, NULL);
+	/* devm will free nfit_blk */
+}
+
 static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
 		struct nfit_spa *nfit_spa)
 {
 	static struct nd_mapping nd_mappings[ND_MAX_MAPPINGS];
 	struct acpi_nfit_system_address *spa = nfit_spa->spa;
+	struct nd_blk_region_desc ndbr_desc;
+	struct nd_region_desc *ndr_desc;
 	struct nfit_memdev *nfit_memdev;
-	struct nd_region_desc ndr_desc;
 	int spa_type, count = 0, rc;
 	struct resource res;
 	u16 range_index;
@@ -851,12 +1247,13 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
 
 	memset(&res, 0, sizeof(res));
 	memset(&nd_mappings, 0, sizeof(nd_mappings));
-	memset(&ndr_desc, 0, sizeof(ndr_desc));
+	memset(&ndbr_desc, 0, sizeof(ndbr_desc));
 	res.start = spa->address;
 	res.end = res.start + spa->length - 1;
-	ndr_desc.res = &res;
-	ndr_desc.provider_data = nfit_spa;
-	ndr_desc.attr_groups = acpi_nfit_region_attribute_groups;
+	ndr_desc = &ndbr_desc.ndr_desc;
+	ndr_desc->res = &res;
+	ndr_desc->provider_data = nfit_spa;
+	ndr_desc->attr_groups = acpi_nfit_region_attribute_groups;
 	list_for_each_entry(nfit_memdev, &acpi_desc->memdevs, list) {
 		struct acpi_nfit_memory_map *memdev = nfit_memdev->memdev;
 		struct nd_mapping *nd_mapping;
@@ -892,26 +1289,29 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
 			} else {
 				nd_mapping->size = nfit_mem->bdw->capacity;
 				nd_mapping->start = nfit_mem->bdw->start_address;
-				ndr_desc.num_lanes = nfit_mem->bdw->windows;
+				ndr_desc->num_lanes = nfit_mem->bdw->windows;
 			}
 
-			ndr_desc.nd_mapping = nd_mapping;
-			ndr_desc.num_mappings = blk_valid;
-			if (!nd_blk_region_create(acpi_desc->nd_bus, &ndr_desc))
+			ndr_desc->nd_mapping = nd_mapping;
+			ndr_desc->num_mappings = blk_valid;
+			ndbr_desc.enable = acpi_desc->blk_enable;
+			ndbr_desc.disable = acpi_desc->blk_disable;
+			ndbr_desc.do_io = acpi_desc->blk_do_io;
+			if (!nd_blk_region_create(acpi_desc->nd_bus, ndr_desc))
 				return -ENOMEM;
 		}
 	}
 
-	ndr_desc.nd_mapping = nd_mappings;
-	ndr_desc.num_mappings = count;
-	rc = acpi_nfit_init_interleave_set(acpi_desc, &ndr_desc, spa);
+	ndr_desc->nd_mapping = nd_mappings;
+	ndr_desc->num_mappings = count;
+	rc = acpi_nfit_init_interleave_set(acpi_desc, ndr_desc, spa);
 	if (rc)
 		return rc;
 	if (spa_type == NFIT_SPA_PM) {
-		if (!nd_pmem_region_create(acpi_desc->nd_bus, &ndr_desc))
+		if (!nd_pmem_region_create(acpi_desc->nd_bus, ndr_desc))
 			return -ENOMEM;
 	} else if (spa_type == NFIT_SPA_VOLATILE) {
-		if (!nd_volatile_region_create(acpi_desc->nd_bus, &ndr_desc))
+		if (!nd_volatile_region_create(acpi_desc->nd_bus, ndr_desc))
 			return -ENOMEM;
 	}
 	return 0;
@@ -937,11 +1337,14 @@ static int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
 	u8 *data;
 	int rc;
 
+	INIT_LIST_HEAD(&acpi_desc->spa_maps);
 	INIT_LIST_HEAD(&acpi_desc->spas);
 	INIT_LIST_HEAD(&acpi_desc->dcrs);
 	INIT_LIST_HEAD(&acpi_desc->bdws);
+	INIT_LIST_HEAD(&acpi_desc->idts);
 	INIT_LIST_HEAD(&acpi_desc->memdevs);
 	INIT_LIST_HEAD(&acpi_desc->dimms);
+	mutex_init(&acpi_desc->spa_map_mutex);
 
 	data = (u8 *) acpi_desc->nfit;
 	end = data + sz;
@@ -990,6 +1393,9 @@ static int acpi_nfit_add(struct acpi_device *adev)
 	dev_set_drvdata(dev, acpi_desc);
 	acpi_desc->dev = dev;
 	acpi_desc->nfit = (struct acpi_table_nfit *) tbl;
+	acpi_desc->blk_enable = acpi_nfit_blk_region_enable;
+	acpi_desc->blk_disable = acpi_nfit_blk_region_disable;
+	acpi_desc->blk_do_io = acpi_nfit_blk_region_do_io;
 	nd_desc = &acpi_desc->nd_desc;
 	nd_desc->provider_name = "ACPI.NFIT";
 	nd_desc->ndctl = acpi_nfit_ctl;
diff --git a/drivers/acpi/nfit.h b/drivers/acpi/nfit.h
index cc496ba6bbd2..1fc49cc51d4a 100644
--- a/drivers/acpi/nfit.h
+++ b/drivers/acpi/nfit.h
@@ -52,6 +52,11 @@ struct nfit_bdw {
 	struct list_head list;
 };
 
+struct nfit_idt {
+	struct acpi_nfit_interleave *idt;
+	struct list_head list;
+};
+
 struct nfit_memdev {
 	struct acpi_nfit_memory_map *memdev;
 	struct list_head list;
@@ -62,10 +67,13 @@ struct nfit_mem {
 	struct nd_dimm *nd_dimm;
 	struct acpi_nfit_memory_map *memdev_dcr;
 	struct acpi_nfit_memory_map *memdev_pmem;
+	struct acpi_nfit_memory_map *memdev_bdw;
 	struct acpi_nfit_control_region *dcr;
 	struct acpi_nfit_data_region *bdw;
 	struct acpi_nfit_system_address *spa_dcr;
 	struct acpi_nfit_system_address *spa_bdw;
+	struct acpi_nfit_interleave *idt_dcr;
+	struct acpi_nfit_interleave *idt_bdw;
 	struct list_head list;
 	struct acpi_device *adev;
 	unsigned long dsm_mask;
@@ -74,16 +82,58 @@ struct nfit_mem {
 struct acpi_nfit_desc {
 	struct nd_bus_descriptor nd_desc;
 	struct acpi_table_nfit *nfit;
+	struct mutex spa_map_mutex;
+	struct list_head spa_maps;
 	struct list_head memdevs;
 	struct list_head dimms;
 	struct list_head spas;
 	struct list_head dcrs;
 	struct list_head bdws;
+	struct list_head idts;
 	struct nd_bus *nd_bus;
 	struct device *dev;
 	unsigned long dimm_dsm_force_en;
+	int (*blk_enable)(struct nd_bus *nd_bus, struct device *dev);
+	void (*blk_disable)(struct nd_bus *nd_bus, struct device *dev);
+	int (*blk_do_io)(struct nd_blk_region *ndbr, void *iobuf,
+			u64 len, int write, resource_size_t dpa);
+};
+
+enum nd_blk_mmio_selector {
+	BDW,
+	DCR,
+};
+
+struct nfit_blk {
+	struct nfit_blk_mmio {
+		void *base;
+		u64 size;
+		u64 base_offset;
+		u32 line_size;
+		u32 num_lines;
+		u32 table_size;
+		struct acpi_nfit_interleave *idt;
+		struct acpi_nfit_system_address *spa;
+	} mmio[2];
+	struct nd_region *nd_region;
+	u64 bdw_offset; /* post interleave offset */
+	u64 stat_offset;
+	u64 cmd_offset;
 };
 
+struct nfit_spa_mapping {
+	struct acpi_nfit_desc *acpi_desc;
+	struct acpi_nfit_system_address *spa;
+	struct list_head list;
+	struct kref kref;
+	void *iomem;
+};
+
+static inline struct nfit_spa_mapping *to_spa_map(struct kref *kref)
+{
+	return container_of(kref, struct nfit_spa_mapping, kref);
+}
+
 static inline struct acpi_nfit_memory_map *__to_nfit_memdev(struct nfit_mem *nfit_mem)
 {
 	if (nfit_mem->memdev_dcr)
diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
index 2b169806eac5..f97bf0db6519 100644
--- a/drivers/block/nd/Kconfig
+++ b/drivers/block/nd/Kconfig
@@ -34,6 +34,19 @@ config BLK_DEV_PMEM
 
 	  Say Y if you want to use a NVDIMM described by ACPI, E820, etc...
 
+config ND_BLK
+	tristate "BLK: Block data window (aperture) device support"
+	depends on LIBND
+	default LIBND
+	help
+	  Support NVDIMMs, or other devices, that implement a BLK-mode
+	  access capability.  BLK-mode access uses memory-mapped-i/o
+	  apertures to access persistent media.
+
+	  Say Y if your platform firmware emits an ACPI.NFIT table
+	  (CONFIG_ACPI_NFIT), or otherwise exposes BLK-mode
+	  capabilities.
+
 config ND_BTT_DEVS
 	bool
 
diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index 1e8fe93a0a42..29a797686429 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -1,11 +1,14 @@
 obj-$(CONFIG_LIBND) += libnd.o
 obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
 obj-$(CONFIG_ND_BTT) += nd_btt.o
+obj-$(CONFIG_ND_BLK) += nd_blk.o
 
 nd_pmem-y := pmem.o
 
 nd_btt-y := btt.o
 
+nd_blk-y := blk.o
+
 libnd-y := core.o
 libnd-y += bus.o
 libnd-y += dimm_devs.o
diff --git a/drivers/block/nd/blk.c b/drivers/block/nd/blk.c
new file mode 100644
index 000000000000..464a3442fd40
--- /dev/null
+++ b/drivers/block/nd/blk.c
@@ -0,0 +1,252 @@
+/*
+ * NVDIMM Block Window Driver
+ * Copyright (c) 2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/blkdev.h>
+#include <linux/fs.h>
+#include <linux/genhd.h>
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/nd.h>
+#include <linux/sizes.h>
+#include "nd.h"
+
+struct nd_blk_device {
+	struct request_queue *queue;
+	struct gendisk *disk;
+	struct nd_namespace_blk *nsblk;
+	struct nd_blk_region *ndbr;
+	struct nd_io ndio;
+	size_t disk_size;
+};
+
+static int nd_blk_major;
+
+static resource_size_t to_dev_offset(struct nd_namespace_blk *nsblk,
+				resource_size_t ns_offset, unsigned int len)
+{
+	int i;
+
+	for (i = 0; i < nsblk->num_resources; i++) {
+		if (ns_offset < resource_size(nsblk->res[i])) {
+			if (ns_offset + len > resource_size(nsblk->res[i])) {
+				dev_WARN_ONCE(&nsblk->dev, 1,
+					"%s: illegal request\n", __func__);
+				return SIZE_MAX;
+			}
+			return nsblk->res[i]->start + ns_offset;
+		}
+		ns_offset -= resource_size(nsblk->res[i]);
+	}
+
+	dev_WARN_ONCE(&nsblk->dev, 1, "%s: request out of range\n", __func__);
+	return SIZE_MAX;
+}
+
+static void nd_blk_make_request(struct request_queue *q, struct bio *bio)
+{
+	struct block_device *bdev = bio->bi_bdev;
+	struct gendisk *disk = bdev->bd_disk;
+	struct nd_namespace_blk *nsblk;
+	struct nd_blk_device *blk_dev;
+	struct nd_blk_region *ndbr;
+	struct bvec_iter iter;
+	struct bio_vec bvec;
+	int err = 0, rw;
+	sector_t sector;
+
+	sector = bio->bi_iter.bi_sector;
+	if (bio_end_sector(bio) > get_capacity(disk)) {
+		err = -EIO;
+		goto out;
+	}
+
+	BUG_ON(bio->bi_rw & REQ_DISCARD);
+
+	rw = bio_data_dir(bio);
+
+	blk_dev = disk->private_data;
+	nsblk = blk_dev->nsblk;
+	ndbr = blk_dev->ndbr;
+	bio_for_each_segment(bvec, bio, iter) {
+		unsigned int len = bvec.bv_len;
+		resource_size_t	dev_offset;
+		void *iobuf;
+
+		BUG_ON(len > PAGE_SIZE);
+
+		dev_offset = to_dev_offset(nsblk, sector << SECTOR_SHIFT, len);
+		if (dev_offset == SIZE_MAX) {
+			err = -EIO;
+			goto out;
+		}
+
+		iobuf = kmap_atomic(bvec.bv_page);
+		err = ndbr->do_io(ndbr, iobuf + bvec.bv_offset, len, rw, dev_offset);
+		kunmap_atomic(iobuf);
+		if (err)
+			goto out;
+
+		sector += len >> SECTOR_SHIFT;
+	}
+
+ out:
+	bio_endio(bio, err);
+}
+
+static int nd_blk_rw_bytes(struct nd_io *ndio, void *iobuf, size_t offset,
+		size_t n, unsigned long flags)
+{
+	struct nd_namespace_blk *nsblk;
+	struct nd_blk_device *blk_dev;
+	int rw = nd_data_dir(flags);
+	struct nd_blk_region *ndbr;
+	resource_size_t	dev_offset;
+
+	blk_dev = container_of(ndio, typeof(*blk_dev), ndio);
+	ndbr = blk_dev->ndbr;
+	nsblk = blk_dev->nsblk;
+	dev_offset = to_dev_offset(nsblk, offset, n);
+
+	if (unlikely(offset + n > blk_dev->disk_size)) {
+		dev_WARN_ONCE(ndio->dev, 1, "%s: request out of range\n",
+				__func__);
+		return -EFAULT;
+	}
+
+	if (dev_offset == SIZE_MAX)
+		return -EIO;
+
+	return ndbr->do_io(ndbr, iobuf, n, rw, dev_offset);
+}
+
+static const struct block_device_operations nd_blk_fops = {
+	.owner =		THIS_MODULE,
+};
+
+static int nd_blk_probe(struct device *dev)
+{
+	struct nd_namespace_blk *nsblk = to_nd_namespace_blk(dev);
+	struct nd_region *nd_region = to_nd_region(dev->parent);
+	struct nd_blk_device *blk_dev;
+	resource_size_t disk_size;
+	struct gendisk *disk;
+	int err;
+
+	disk_size = nd_namespace_blk_validate(nsblk);
+	if (disk_size < ND_MIN_NAMESPACE_SIZE)
+		return -ENXIO;
+
+	blk_dev = kzalloc(sizeof(*blk_dev), GFP_KERNEL);
+	if (!blk_dev)
+		return -ENOMEM;
+
+	blk_dev->disk_size	= disk_size;
+
+	blk_dev->queue = blk_alloc_queue(GFP_KERNEL);
+	if (!blk_dev->queue) {
+		err = -ENOMEM;
+		goto err_alloc_queue;
+	}
+
+	blk_queue_make_request(blk_dev->queue, nd_blk_make_request);
+	blk_queue_max_hw_sectors(blk_dev->queue, 1024);
+	blk_queue_bounce_limit(blk_dev->queue, BLK_BOUNCE_ANY);
+	blk_queue_logical_block_size(blk_dev->queue, nsblk->lbasize);
+
+	disk = blk_dev->disk = alloc_disk(0);
+	if (!disk) {
+		err = -ENOMEM;
+		goto err_alloc_disk;
+	}
+
+	blk_dev->ndbr = to_nd_blk_region(nsblk->dev.parent);
+	blk_dev->nsblk = nsblk;
+
+	disk->driverfs_dev	= dev;
+	disk->major		= nd_blk_major;
+	disk->first_minor	= 0;
+	disk->fops		= &nd_blk_fops;
+	disk->private_data	= blk_dev;
+	disk->queue		= blk_dev->queue;
+	disk->flags		= GENHD_FL_EXT_DEVT;
+	sprintf(disk->disk_name, "ndblk%d.%d", nd_region->id, nsblk->id);
+	set_capacity(disk, disk_size >> SECTOR_SHIFT);
+
+	nd_bus_lock(dev);
+	dev_set_drvdata(dev, blk_dev);
+
+	add_disk(disk);
+	nd_init_ndio(&blk_dev->ndio, nd_blk_rw_bytes, dev, disk, 0);
+	nd_register_ndio(&blk_dev->ndio);
+	nd_bus_unlock(dev);
+
+	return 0;
+
+ err_alloc_disk:
+	blk_cleanup_queue(blk_dev->queue);
+ err_alloc_queue:
+	kfree(blk_dev);
+	return err;
+}
+
+static int nd_blk_remove(struct device *dev)
+{
+	struct nd_blk_device *blk_dev = dev_get_drvdata(dev);
+
+	nd_unregister_ndio(&blk_dev->ndio);
+	del_gendisk(blk_dev->disk);
+	put_disk(blk_dev->disk);
+	blk_cleanup_queue(blk_dev->queue);
+	kfree(blk_dev);
+
+	return 0;
+}
+
+static struct nd_device_driver nd_blk_driver = {
+	.probe = nd_blk_probe,
+	.remove = nd_blk_remove,
+	.drv = {
+		.name = "nd_blk",
+	},
+	.type = ND_DRIVER_NAMESPACE_BLK,
+};
+
+static int __init nd_blk_init(void)
+{
+	int rc;
+
+	rc = register_blkdev(0, "nd_blk");
+	if (rc < 0)
+		return rc;
+
+	nd_blk_major = rc;
+	rc = nd_driver_register(&nd_blk_driver);
+
+	if (rc < 0)
+		unregister_blkdev(nd_blk_major, "nd_blk");
+
+	return rc;
+}
+
+static void __exit nd_blk_exit(void)
+{
+	driver_unregister(&nd_blk_driver.drv);
+	unregister_blkdev(nd_blk_major, "nd_blk");
+}
+
+MODULE_AUTHOR("Ross Zwisler <ross.zwisler@linux.intel.com>");
+MODULE_LICENSE("GPL v2");
+MODULE_ALIAS_ND_DEVICE(ND_DEVICE_NAMESPACE_BLK);
+module_init(nd_blk_init);
+module_exit(nd_blk_exit);
diff --git a/drivers/block/nd/dimm_devs.c b/drivers/block/nd/dimm_devs.c
index 4b225c8b7d0a..df6c98fc2ae6 100644
--- a/drivers/block/nd/dimm_devs.c
+++ b/drivers/block/nd/dimm_devs.c
@@ -209,6 +209,15 @@ struct nd_dimm *to_nd_dimm(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(to_nd_dimm);
 
+struct nd_dimm *nd_blk_region_to_dimm(struct nd_blk_region *ndbr)
+{
+	struct nd_region *nd_region = &ndbr->nd_region;
+	struct nd_mapping *nd_mapping = &nd_region->mapping[0];
+
+	return nd_mapping->nd_dimm;
+}
+EXPORT_SYMBOL_GPL(nd_blk_region_to_dimm);
+
 struct nd_dimm_drvdata *to_ndd(struct nd_mapping *nd_mapping)
 {
 	struct nd_dimm *nd_dimm = nd_mapping->nd_dimm;
diff --git a/drivers/block/nd/namespace_devs.c b/drivers/block/nd/namespace_devs.c
index c193ba6c6445..0734b1a4a0a3 100644
--- a/drivers/block/nd/namespace_devs.c
+++ b/drivers/block/nd/namespace_devs.c
@@ -151,6 +151,53 @@ static resource_size_t nd_namespace_blk_size(struct nd_namespace_blk *nsblk)
 	return size;
 }
 
+resource_size_t nd_namespace_blk_validate(struct nd_namespace_blk *nsblk)
+{
+	struct nd_region *nd_region = to_nd_region(nsblk->dev.parent);
+	struct nd_mapping *nd_mapping = &nd_region->mapping[0];
+	struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+	struct nd_label_id label_id;
+	struct resource *res;
+	int count, i;
+
+	if (!nsblk->uuid || !nsblk->lbasize)
+		return 0;
+
+	count = 0;
+	nd_label_gen_id(&label_id, nsblk->uuid, NSLABEL_FLAG_LOCAL);
+	for_each_dpa_resource(ndd, res) {
+		if (strcmp(res->name, label_id.id) != 0)
+			continue;
+		/*
+		 * Resources with unacknowledged adjustments indicate a
+		 * failure to update labels
+		 */
+		if (res->flags & DPA_RESOURCE_ADJUSTED)
+			return 0;
+		count++;
+	}
+
+	/* These values match after a successful label update */
+	if (count != nsblk->num_resources)
+		return 0;
+
+	for (i = 0; i < nsblk->num_resources; i++) {
+		struct resource *found = NULL;
+
+		for_each_dpa_resource(ndd, res)
+			if (res == nsblk->res[i]) {
+				found = res;
+				break;
+			}
+		/* stale resource */
+		if (!found)
+			return 0;
+	}
+
+	return nd_namespace_blk_size(nsblk);
+}
+EXPORT_SYMBOL(nd_namespace_blk_validate);
+
 static int nd_namespace_label_update(struct nd_region *nd_region, struct device *dev)
 {
 	dev_WARN_ONCE(dev, dev->driver,
diff --git a/drivers/block/nd/nd-private.h b/drivers/block/nd/nd-private.h
index 6a864e9ae97a..b0571e334af9 100644
--- a/drivers/block/nd/nd-private.h
+++ b/drivers/block/nd/nd-private.h
@@ -22,7 +22,6 @@ extern struct list_head nd_bus_list;
 extern struct mutex nd_bus_list_mutex;
 extern int nd_dimm_major;
 
-struct block_device;
 struct nd_io_claim;
 struct nd_btt;
 struct nd_io;
@@ -50,8 +49,8 @@ struct nd_dimm {
 
 struct nd_io *ndio_lookup(struct nd_bus *nd_bus, const char *diskname);
 bool is_nd_dimm(struct device *dev);
-bool is_nd_blk(struct device *dev);
 bool is_nd_pmem(struct device *dev);
+bool is_nd_blk(struct device *dev);
 #if IS_ENABLED(CONFIG_ND_BTT_DEVS)
 bool is_nd_btt(struct device *dev);
 struct nd_btt *nd_btt_create(struct nd_bus *nd_bus);
diff --git a/drivers/block/nd/nd.h b/drivers/block/nd/nd.h
index b706f25da7e5..b830801c9892 100644
--- a/drivers/block/nd/nd.h
+++ b/drivers/block/nd/nd.h
@@ -113,6 +113,15 @@ struct nd_region {
 	struct nd_mapping mapping[0];
 };
 
+struct nd_blk_region {
+	int (*enable)(struct nd_bus *nd_bus, struct device *dev);
+	void (*disable)(struct nd_bus *nd_bus, struct device *dev);
+	int (*do_io)(struct nd_blk_region *ndbr, void *iobuf, u64 len,
+			int write, resource_size_t dpa);
+	void *blk_provider_data;
+	struct nd_region nd_region;
+};
+
 /*
  * Lookup next in the repeating sequence of 01, 10, and 11.
  */
@@ -232,8 +241,6 @@ struct nd_btt *to_nd_btt(struct device *dev);
 struct btt_sb;
 u64 nd_btt_sb_checksum(struct btt_sb *btt_sb);
 struct nd_region *to_nd_region(struct device *dev);
-unsigned int nd_region_acquire_lane(struct nd_region *nd_region);
-void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane);
 int nd_region_to_namespace_type(struct nd_region *nd_region);
 int nd_region_register_namespaces(struct nd_region *nd_region, int *err);
 u64 nd_region_interleave_set_cookie(struct nd_region *nd_region);
@@ -245,4 +252,6 @@ void nd_dimm_free_dpa(struct nd_dimm_drvdata *ndd, struct resource *res);
 struct resource *nd_dimm_allocate_dpa(struct nd_dimm_drvdata *ndd,
 		struct nd_label_id *label_id, resource_size_t start,
 		resource_size_t n);
+int nd_blk_region_init(struct nd_region *nd_region);
+resource_size_t nd_namespace_blk_validate(struct nd_namespace_blk *nsblk);
 #endif /* __ND_H__ */
diff --git a/drivers/block/nd/region.c b/drivers/block/nd/region.c
index 0e872f54dcd2..75ae27279f0e 100644
--- a/drivers/block/nd/region.c
+++ b/drivers/block/nd/region.c
@@ -94,11 +94,10 @@ EXPORT_SYMBOL(nd_region_release_lane);
 
 static int nd_region_probe(struct device *dev)
 {
-	int err;
+	int err, rc;
 	static unsigned long once;
 	struct nd_region_namespaces *num_ns;
 	struct nd_region *nd_region = to_nd_region(dev);
-	int rc = nd_region_register_namespaces(nd_region, &err);
 
 	if (nd_region->num_lanes > num_online_cpus()
 			&& nd_region->num_lanes < num_possible_cpus()
@@ -110,6 +109,11 @@ static int nd_region_probe(struct device *dev)
 				nd_region->num_lanes);
 	}
 
+	rc = nd_blk_region_init(nd_region);
+	if (rc)
+		return rc;
+
+	rc = nd_region_register_namespaces(nd_region, &err);
 	num_ns = devm_kzalloc(dev, sizeof(*num_ns), GFP_KERNEL);
 	if (!num_ns)
 		return -ENOMEM;
diff --git a/drivers/block/nd/region_devs.c b/drivers/block/nd/region_devs.c
index 4965004147ae..88af2a42397f 100644
--- a/drivers/block/nd/region_devs.c
+++ b/drivers/block/nd/region_devs.c
@@ -11,6 +11,7 @@
  * General Public License for more details.
  */
 #include <linux/scatterlist.h>
+#include <linux/highmem.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
 #include <linux/sort.h>
@@ -33,7 +34,10 @@ static void nd_region_release(struct device *dev)
 		put_device(&nd_dimm->dev);
 	}
 	ida_simple_remove(&region_ida, nd_region->id);
-	kfree(nd_region);
+	if (is_nd_blk(dev))
+		kfree(to_nd_blk_region(dev));
+	else
+		kfree(nd_region);
 }
 
 static struct device_type nd_blk_device_type = {
@@ -70,6 +74,33 @@ struct nd_region *to_nd_region(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(to_nd_region);
 
+struct nd_blk_region *to_nd_blk_region(struct device *dev)
+{
+	struct nd_region *nd_region = to_nd_region(dev);
+
+	WARN_ON(!is_nd_blk(dev));
+	return container_of(nd_region, struct nd_blk_region, nd_region);
+}
+EXPORT_SYMBOL_GPL(to_nd_blk_region);
+
+void *nd_region_provider_data(struct nd_region *nd_region)
+{
+	return nd_region->provider_data;
+}
+EXPORT_SYMBOL_GPL(nd_region_provider_data);
+
+void *nd_blk_region_provider_data(struct nd_blk_region *ndbr)
+{
+	return ndbr->blk_provider_data;
+}
+EXPORT_SYMBOL_GPL(nd_blk_region_provider_data);
+
+void nd_blk_region_set_provider_data(struct nd_blk_region *ndbr, void *data)
+{
+	ndbr->blk_provider_data = data;
+}
+EXPORT_SYMBOL_GPL(nd_blk_region_set_provider_data);
+
 /**
  * nd_region_to_namespace_type() - region to an integer namespace type
  * @nd_region: region-device to interrogate
@@ -344,14 +375,13 @@ u64 nd_region_interleave_set_cookie(struct nd_region *nd_region)
 
 /*
  * Upon successful probe/remove, take/release a reference on the
- * associated interleave set (if present)
+ * associated dimms in the interleave set; on successful probe of a BLK
+ * namespace, check if we need a new seed; and on remove or failed probe
+ * of a BLK region, notify the provider to disable the region.
  */
 static void nd_region_notify_driver_action(struct nd_bus *nd_bus,
 		struct device *dev, int rc, bool probe)
 {
-	if (rc)
-		return;
-
 	if (is_nd_pmem(dev) || is_nd_blk(dev)) {
 		struct nd_region *nd_region = to_nd_region(dev);
 		int i;
@@ -360,11 +390,16 @@ static void nd_region_notify_driver_action(struct nd_bus *nd_bus,
 			struct nd_mapping *nd_mapping = &nd_region->mapping[i];
 			struct nd_dimm *nd_dimm = nd_mapping->nd_dimm;
 
-			if (probe)
+			if (probe && rc == 0)
 				atomic_inc(&nd_dimm->busy);
-			else
+			else if (!probe)
 				atomic_dec(&nd_dimm->busy);
 		}
+
+		if (is_nd_pmem(dev) || (probe && rc == 0))
+			return;
+
+		to_nd_blk_region(dev)->disable(nd_bus, dev);
 	} else if (dev->parent && is_nd_blk(dev->parent) && probe && rc == 0) {
 		struct nd_region *nd_region = to_nd_region(dev->parent);
 
@@ -508,11 +543,21 @@ struct attribute_group nd_mapping_attribute_group = {
 };
 EXPORT_SYMBOL_GPL(nd_mapping_attribute_group);
 
-void *nd_region_provider_data(struct nd_region *nd_region)
+int nd_blk_region_init(struct nd_region *nd_region)
 {
-	return nd_region->provider_data;
+	struct device *dev = &nd_region->dev;
+	struct nd_bus *nd_bus = walk_to_nd_bus(dev);
+
+	if (!is_nd_blk(dev))
+		return 0;
+
+	if (nd_region->ndr_mappings < 1) {
+		dev_err(dev, "invalid BLK region\n");
+		return -ENXIO;
+	}
+
+	return to_nd_blk_region(dev)->enable(nd_bus, dev);
 }
-EXPORT_SYMBOL_GPL(nd_region_provider_data);
 
 static noinline struct nd_region *nd_region_create(struct nd_bus *nd_bus,
 		struct nd_region_desc *ndr_desc, struct device_type *dev_type)
@@ -534,9 +579,28 @@ static noinline struct nd_region *nd_region_create(struct nd_bus *nd_bus,
 		}
 	}
 
-	nd_region = kzalloc(sizeof(struct nd_region)
-			+ sizeof(struct nd_mapping) * ndr_desc->num_mappings,
-			GFP_KERNEL);
+	if (dev_type == &nd_blk_device_type) {
+		struct nd_blk_region_desc *ndbr_desc;
+		struct nd_blk_region *ndbr;
+
+		ndbr_desc = container_of(ndr_desc, typeof(*ndbr_desc), ndr_desc);
+		ndbr = kzalloc(sizeof(*ndbr) + sizeof(struct nd_mapping)
+				* ndr_desc->num_mappings,
+				GFP_KERNEL);
+		if (ndbr) {
+			nd_region = &ndbr->nd_region;
+			ndbr->enable = ndbr_desc->enable;
+			ndbr->disable = ndbr_desc->disable;
+			ndbr->do_io = ndbr_desc->do_io;
+		} else
+			nd_region = NULL;
+	} else {
+		nd_region = kzalloc(sizeof(struct nd_region)
+				+ sizeof(struct nd_mapping)
+				* ndr_desc->num_mappings,
+				GFP_KERNEL);
+	}
+
 	if (!nd_region)
 		return NULL;
 	nd_region->id = ida_simple_get(&region_ida, 0, 0, GFP_KERNEL);
diff --git a/include/linux/libnd.h b/include/linux/libnd.h
index 6146690b23e7..31969f082407 100644
--- a/include/linux/libnd.h
+++ b/include/linux/libnd.h
@@ -81,6 +81,15 @@ struct nd_region_desc {
 
 struct nd_bus;
 struct device;
+struct nd_blk_region;
+struct nd_blk_region_desc {
+	int (*enable)(struct nd_bus *nd_bus, struct device *dev);
+	void (*disable)(struct nd_bus *nd_bus, struct device *dev);
+	int (*do_io)(struct nd_blk_region *ndbr, void *iobuf, u64 len,
+			int write, resource_size_t dpa);
+	struct nd_region_desc ndr_desc;
+};
+
 struct nd_bus *__nd_bus_register(struct device *parent,
 		struct nd_bus_descriptor *nfit_desc, struct module *module);
 #define nd_bus_register(parent, desc) \
@@ -89,10 +98,10 @@ void nd_bus_unregister(struct nd_bus *nd_bus);
 struct nd_bus *to_nd_bus(struct device *dev);
 struct nd_dimm *to_nd_dimm(struct device *dev);
 struct nd_region *to_nd_region(struct device *dev);
+struct nd_blk_region *to_nd_blk_region(struct device *dev);
 struct nd_bus_descriptor *to_nd_desc(struct nd_bus *nd_bus);
 const char *nd_dimm_name(struct nd_dimm *nd_dimm);
 void *nd_dimm_provider_data(struct nd_dimm *nd_dimm);
-void *nd_region_provider_data(struct nd_region *nd_region);
 struct nd_dimm *nd_dimm_create(struct nd_bus *nd_bus, void *provider_data,
 		const struct attribute_group **groups, unsigned long flags,
 		unsigned long *dsm_mask);
@@ -110,5 +119,11 @@ struct nd_region *nd_blk_region_create(struct nd_bus *nd_bus,
 		struct nd_region_desc *ndr_desc);
 struct nd_region *nd_volatile_region_create(struct nd_bus *nd_bus,
 		struct nd_region_desc *ndr_desc);
+void *nd_region_provider_data(struct nd_region *nd_region);
+void *nd_blk_region_provider_data(struct nd_blk_region *ndbr);
+void nd_blk_region_set_provider_data(struct nd_blk_region *ndbr, void *data);
+struct nd_dimm *nd_blk_region_to_dimm(struct nd_blk_region *ndbr);
+unsigned int nd_region_acquire_lane(struct nd_region *nd_region);
+void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane);
 u64 nd_fletcher64(void *addr, size_t len, bool le);
 #endif /* __LIBND_H__ */



* [PATCH v3 19/21] libnd, nfit, nd_blk: driver for BLK-mode access persistent memory
@ 2015-05-20 20:57   ` Dan Williams
  0 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2015-05-20 20:57 UTC (permalink / raw)
  To: axboe
  Cc: Boaz Harrosh, linux-nvdimm, neilb, gregkh, linux-kernel,
	Andy Lutomirski, Jens Axboe, linux-acpi, jmoyer, H. Peter Anvin,
	Ross Zwisler, hch, mingo

From: Ross Zwisler <ross.zwisler@linux.intel.com>

The libnd implementation handles allocating DIMM physical address (DPA)
space between the PMEM and BLK mode interfaces.  After DPA has been
allocated from a BLK-region to a BLK-namespace, the nd_blk driver
attaches to handle I/O as a struct bio based block device.  Unlike PMEM,
BLK is required to handle platform-specific details like mmio register
formats and memory controller interleave.  For this reason the generic
libnd nd_blk driver calls back into the bus provider to carry out the I/O.
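
As a rough illustration of that callback split (a userspace sketch, not
code from this patch: every sketch_* name is hypothetical, and the
provider here is backed by a plain memory buffer rather than mmio
apertures), the generic layer only forwards the request through the
function pointer supplied by the provider:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct nd_blk_region: just the I/O hook. */
struct sketch_blk_region {
	int (*do_io)(struct sketch_blk_region *r, void *iobuf, uint64_t len,
			int write, uint64_t dpa);
	void *provider_data;
};

/* Hypothetical provider state: the "media" is an ordinary buffer. */
struct sketch_provider {
	uint8_t *media;
	uint64_t size;
};

/* Provider side: knows how to reach the media (here, a memcpy). */
static int sketch_provider_do_io(struct sketch_blk_region *r, void *iobuf,
		uint64_t len, int write, uint64_t dpa)
{
	struct sketch_provider *p = r->provider_data;

	/* Reject requests that fall outside the media. */
	if (dpa > p->size || len > p->size - dpa)
		return -1;

	if (write)
		memcpy(p->media + dpa, iobuf, len);
	else
		memcpy(iobuf, p->media + dpa, len);
	return 0;
}

/* Generic side: no knowledge of the media, it only calls the hook. */
static int sketch_generic_rw(struct sketch_blk_region *r, void *iobuf,
		uint64_t len, int write, uint64_t dpa)
{
	return r->do_io(r, iobuf, len, write, dpa);
}
```

In the patch itself the equivalent hook is the do_io() member of struct
nd_blk_region, which nd_blk_make_request() and nd_blk_rw_bytes() invoke
after translating a namespace offset to a DPA.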

This initial implementation handles the BLK interface defined by the
ACPI 6 NFIT [1] and the NVDIMM DSM Interface Example [2].  That interface
is composed from the DCR (dimm control region), BDW (block data window),
and IDT (interleave descriptor) NFIT structures, plus the hardware
register format.

[1]: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
[2]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
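
For readers unfamiliar with the interleave handling, the offset
translation and the block control window command word can be modeled in
userspace roughly as follows.  This is a sketch under assumptions: the
sketch_* names are hypothetical, the arithmetic mirrors
to_interleave_offset() and write_blk_ctl() from this patch, and a
64-byte cache line (L1_CACHE_SHIFT == 6) is assumed.

```c
#include <stdint.h>

/* Assumption: 64-byte cache lines, i.e. L1_CACHE_SHIFT == 6. */
#define SKETCH_CACHE_SHIFT 6

/* Hypothetical stand-in for the struct nfit_blk_mmio fields used here. */
struct sketch_mmio {
	uint64_t base_offset;
	uint32_t line_size;		/* bytes per interleave line */
	uint32_t num_lines;		/* entries in the interleave table */
	uint32_t table_size;		/* bytes covered by one table pass */
	const uint32_t *line_offset;	/* per-line offset, in lines */
};

/* Map a linear offset to its post-interleave offset. */
static uint64_t sketch_interleave_offset(uint64_t offset,
		const struct sketch_mmio *mmio)
{
	uint64_t line_no = offset / mmio->line_size;
	uint32_t sub_line_offset = offset % mmio->line_size;
	uint64_t table_skip_count = line_no / mmio->num_lines;
	uint32_t line_index = line_no % mmio->num_lines;
	/* Widened to 64 bits here to keep the multiply from wrapping. */
	uint64_t line_offset = (uint64_t)mmio->line_offset[line_index]
		* mmio->line_size;
	uint64_t table_offset = table_skip_count * mmio->table_size;

	return mmio->base_offset + line_offset + table_offset
		+ sub_line_offset;
}

/*
 * Pack a command word: bits 0-47 hold the DPA in cache-line units,
 * bits 48-55 the length in cache-line units, bit 56 the write flag.
 */
static uint64_t sketch_blk_cmd(uint64_t dpa, unsigned int len,
		unsigned int write)
{
	const uint64_t offset_mask = (1ULL << 48) - 1;
	const uint64_t len_mask = (1ULL << 8) - 1;
	uint64_t cmd;

	cmd = (dpa >> SKETCH_CACHE_SHIFT) & offset_mask;
	cmd |= ((uint64_t)(len >> SKETCH_CACHE_SHIFT) & len_mask) << 48;
	cmd |= (uint64_t)write << 56;
	return cmd;
}
```

For example, with a line_size of 256, two lines, and a line_offset table
of {0, 2}, linear offset 300 falls in line 1 at byte 44, so the
translated offset is 2 * 256 + 44 = 556.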

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/acpi/nfit.c               |  442 +++++++++++++++++++++++++++++++++++--
 drivers/acpi/nfit.h               |   50 ++++
 drivers/block/nd/Kconfig          |   13 +
 drivers/block/nd/Makefile         |    3 
 drivers/block/nd/blk.c            |  252 +++++++++++++++++++++
 drivers/block/nd/dimm_devs.c      |    9 +
 drivers/block/nd/namespace_devs.c |   47 ++++
 drivers/block/nd/nd-private.h     |    3 
 drivers/block/nd/nd.h             |   13 +
 drivers/block/nd/region.c         |    8 +
 drivers/block/nd/region_devs.c    |   90 ++++++--
 include/linux/libnd.h             |   17 +
 12 files changed, 909 insertions(+), 38 deletions(-)
 create mode 100644 drivers/block/nd/blk.c

diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index a9aca87301c6..c4ce498da9eb 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -13,12 +13,16 @@
 #include <linux/list_sort.h>
 #include <linux/module.h>
 #include <linux/libnd.h>
+#include <linux/mutex.h>
 #include <linux/ndctl.h>
 #include <linux/list.h>
 #include <linux/acpi.h>
 #include <linux/sort.h>
+#include <linux/io.h>
 #include "nfit.h"
 
+#include <asm-generic/io-64-nonatomic-hi-lo.h>
+
 static bool force_enable_dimms;
 module_param(force_enable_dimms, bool, S_IRUGO|S_IWUSR);
 MODULE_PARM_DESC(force_enable_dimms, "Ignore _STA (ACPI DIMM device) status");
@@ -71,7 +75,7 @@ static int acpi_nfit_ctl(struct nd_bus_descriptor *nd_desc,
 
 		if (!adev)
 			return -ENOTTY;
-		dimm_name = dev_name(&adev->dev);
+		dimm_name = nd_dimm_name(nd_dimm);
 		cmd_name = nd_dimm_cmd_name(cmd);
 		dsm_mask = nfit_mem->dsm_mask;
 		desc = nd_cmd_dimm_desc(cmd);
@@ -266,10 +270,20 @@ static void *add_table(struct acpi_nfit_desc *acpi_desc, void *table, const void
 				bdw->region_index, bdw->windows);
 		break;
 	}
-	/* TODO */
-	case ACPI_NFIT_TYPE_INTERLEAVE:
-		dev_dbg(dev, "%s: idt\n", __func__);
+	case ACPI_NFIT_TYPE_INTERLEAVE: {
+		struct nfit_idt *nfit_idt = devm_kzalloc(dev, sizeof(*nfit_idt),
+				GFP_KERNEL);
+		struct acpi_nfit_interleave *idt = table;
+
+		if (!nfit_idt)
+			return err;
+		INIT_LIST_HEAD(&nfit_idt->list);
+		nfit_idt->idt = idt;
+		list_add_tail(&nfit_idt->list, &acpi_desc->idts);
+		dev_dbg(dev, "%s: idt index: %d num_lines: %d\n", __func__,
+				idt->interleave_index, idt->line_count);
 		break;
+	}
 	case ACPI_NFIT_TYPE_FLUSH_ADDRESS:
 		dev_dbg(dev, "%s: flush\n", __func__);
 		break;
@@ -321,8 +335,11 @@ static int nfit_mem_add(struct acpi_nfit_desc *acpi_desc,
 		struct nfit_mem *nfit_mem, struct acpi_nfit_system_address *spa)
 {
 	u16 dcr_index = __to_nfit_memdev(nfit_mem)->region_index;
+	struct nfit_memdev *nfit_memdev;
 	struct nfit_dcr *nfit_dcr;
 	struct nfit_bdw *nfit_bdw;
+	struct nfit_idt *nfit_idt;
+	u16 idt_index, range_index;
 
 	list_for_each_entry(nfit_dcr, &acpi_desc->dcrs, list) {
 		if (nfit_dcr->dcr->region_index != dcr_index)
@@ -355,6 +372,26 @@ static int nfit_mem_add(struct acpi_nfit_desc *acpi_desc,
 		return 0;
 
 	nfit_mem_find_spa_bdw(acpi_desc, nfit_mem);
+
+	if (!nfit_mem->spa_bdw)
+		return 0;
+
+	range_index = nfit_mem->spa_bdw->range_index;
+	list_for_each_entry(nfit_memdev, &acpi_desc->memdevs, list) {
+		if (nfit_memdev->memdev->range_index != range_index ||
+				nfit_memdev->memdev->region_index != dcr_index)
+			continue;
+		nfit_mem->memdev_bdw = nfit_memdev->memdev;
+		idt_index = nfit_memdev->memdev->interleave_index;
+		list_for_each_entry(nfit_idt, &acpi_desc->idts, list) {
+			if (nfit_idt->idt->interleave_index != idt_index)
+				continue;
+			nfit_mem->idt_bdw = nfit_idt->idt;
+			break;
+		}
+		break;
+	}
+
 	return 0;
 }
 
@@ -398,9 +435,19 @@ static int nfit_mem_dcr_init(struct acpi_nfit_desc *acpi_desc,
 		}
 
 		if (type == NFIT_SPA_DCR) {
+			struct nfit_idt *nfit_idt;
+			u16 idt_index;
+
 			/* multiple dimms may share a SPA when interleaved */
 			nfit_mem->spa_dcr = spa;
 			nfit_mem->memdev_dcr = nfit_memdev->memdev;
+			idt_index = nfit_memdev->memdev->interleave_index;
+			list_for_each_entry(nfit_idt, &acpi_desc->idts, list) {
+				if (nfit_idt->idt->interleave_index != idt_index)
+					continue;
+				nfit_mem->idt_dcr = nfit_idt->idt;
+				break;
+			}
 		} else {
 			/*
 			 * A single dimm may belong to multiple SPA-PM
@@ -830,13 +877,362 @@ static int acpi_nfit_init_interleave_set(struct acpi_nfit_desc *acpi_desc,
 	return 0;
 }
 
+static u64 to_interleave_offset(u64 offset, struct nfit_blk_mmio *mmio)
+{
+	struct acpi_nfit_interleave *idt = mmio->idt;
+	u32 sub_line_offset, line_index, line_offset;
+	u64 line_no, table_skip_count, table_offset;
+
+	line_no = div_u64_rem(offset, mmio->line_size, &sub_line_offset);
+	table_skip_count = div_u64_rem(line_no, mmio->num_lines, &line_index);
+	line_offset = idt->line_offset[line_index]
+		* mmio->line_size;
+	table_offset = table_skip_count * mmio->table_size;
+
+	return mmio->base_offset + line_offset + table_offset + sub_line_offset;
+}
+
+static u64 read_blk_stat(struct nfit_blk *nfit_blk, unsigned int bw)
+{
+	struct nfit_blk_mmio *mmio = &nfit_blk->mmio[DCR];
+	u64 offset = nfit_blk->stat_offset + mmio->size * bw;
+
+	if (mmio->num_lines)
+		offset = to_interleave_offset(offset, mmio);
+
+	return readq(mmio->base + offset);
+}
+
+static void write_blk_ctl(struct nfit_blk *nfit_blk, unsigned int bw,
+		resource_size_t dpa, unsigned int len, unsigned int write)
+{
+	u64 cmd, offset;
+	struct nfit_blk_mmio *mmio = &nfit_blk->mmio[DCR];
+
+	enum {
+		BCW_OFFSET_MASK = (1ULL << 48)-1,
+		BCW_LEN_SHIFT = 48,
+		BCW_LEN_MASK = (1ULL << 8) - 1,
+		BCW_CMD_SHIFT = 56,
+	};
+
+	cmd = (dpa >> L1_CACHE_SHIFT) & BCW_OFFSET_MASK;
+	len = len >> L1_CACHE_SHIFT;
+	cmd |= ((u64) len & BCW_LEN_MASK) << BCW_LEN_SHIFT;
+	cmd |= ((u64) write) << BCW_CMD_SHIFT;
+
+	offset = nfit_blk->cmd_offset + mmio->size * bw;
+	if (mmio->num_lines)
+		offset = to_interleave_offset(offset, mmio);
+
+	writeq(cmd, mmio->base + offset);
+	/* FIXME: conditionally perform read-back if mandated by firmware */
+}
+
+static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk, void *iobuf,
+		unsigned int len, int write, resource_size_t dpa,
+		unsigned int bw)
+{
+	struct nfit_blk_mmio *mmio = &nfit_blk->mmio[BDW];
+	unsigned int copied = 0;
+	u64 base_offset;
+	int rc;
+
+	base_offset = nfit_blk->bdw_offset + dpa % L1_CACHE_BYTES + bw * mmio->size;
+	/* TODO: non-temporal access, flush hints, cache management etc... */
+	write_blk_ctl(nfit_blk, bw, dpa, len, write);
+	while (len) {
+		unsigned int c;
+		u64 offset;
+
+		if (mmio->num_lines) {
+			u32 line_offset;
+
+			offset = to_interleave_offset(base_offset + copied,
+					mmio);
+			div_u64_rem(offset, mmio->line_size, &line_offset);
+			c = min(len, mmio->line_size - line_offset);
+		} else {
+			offset = base_offset + nfit_blk->bdw_offset;
+			c = len;
+		}
+
+		if (write)
+			memcpy(mmio->base + offset, iobuf + copied, c);
+		else
+			memcpy(iobuf + copied, mmio->base + offset, c);
+
+		copied += c;
+		len -= c;
+	}
+	rc = read_blk_stat(nfit_blk, bw) ? -EIO : 0;
+	return rc;
+}
+
+static int acpi_nfit_blk_region_do_io(struct nd_blk_region *ndbr, void *iobuf,
+		u64 len, int write, resource_size_t dpa)
+{
+	struct nfit_blk *nfit_blk = nd_blk_region_provider_data(ndbr);
+	struct nfit_blk_mmio *mmio = &nfit_blk->mmio[BDW];
+	struct nd_region *nd_region = nfit_blk->nd_region;
+	unsigned int bw, copied = 0;
+	int rc = 0;
+
+	bw = nd_region_acquire_lane(nd_region);
+	while (len) {
+		u64 c = min(len, mmio->size);
+
+		rc = acpi_nfit_blk_single_io(nfit_blk, iobuf + copied, c, write,
+				dpa + copied, bw);
+		if (rc)
+			break;
+
+		copied += c;
+		len -= c;
+	}
+	nd_region_release_lane(nd_region, bw);
+
+	return rc;
+}
+
+static void nfit_spa_mapping_release(struct kref *kref)
+{
+	struct nfit_spa_mapping *spa_map = to_spa_map(kref);
+	struct acpi_nfit_system_address *spa = spa_map->spa;
+	struct acpi_nfit_desc *acpi_desc = spa_map->acpi_desc;
+
+	WARN_ON(!mutex_is_locked(&acpi_desc->spa_map_mutex));
+	dev_dbg(acpi_desc->dev, "%s: SPA%d\n", __func__, spa->range_index);
+	iounmap(spa_map->iomem);
+	release_mem_region(spa->address, spa->length);
+	list_del(&spa_map->list);
+	kfree(spa_map);
+}
+
+static struct nfit_spa_mapping *find_spa_mapping(struct acpi_nfit_desc *acpi_desc,
+		struct acpi_nfit_system_address *spa)
+{
+	struct nfit_spa_mapping *spa_map;
+
+	WARN_ON(!mutex_is_locked(&acpi_desc->spa_map_mutex));
+	list_for_each_entry(spa_map, &acpi_desc->spa_maps, list)
+		if (spa_map->spa == spa)
+			return spa_map;
+
+	return NULL;
+}
+
+static void nfit_spa_unmap(struct acpi_nfit_desc *acpi_desc,
+		struct acpi_nfit_system_address *spa)
+{
+	struct nfit_spa_mapping *spa_map;
+
+	mutex_lock(&acpi_desc->spa_map_mutex);
+	spa_map = find_spa_mapping(acpi_desc, spa);
+
+	if (spa_map)
+		kref_put(&spa_map->kref, nfit_spa_mapping_release);
+	mutex_unlock(&acpi_desc->spa_map_mutex);
+}
+
+static void *__nfit_spa_map(struct acpi_nfit_desc *acpi_desc,
+		struct acpi_nfit_system_address *spa)
+{
+	resource_size_t start = spa->address;
+	resource_size_t n = spa->length;
+	struct nfit_spa_mapping *spa_map;
+	struct resource *res;
+
+	WARN_ON(!mutex_is_locked(&acpi_desc->spa_map_mutex));
+
+	spa_map = find_spa_mapping(acpi_desc, spa);
+	if (spa_map) {
+		kref_get(&spa_map->kref);
+		return spa_map->iomem;
+	}
+
+	spa_map = kzalloc(sizeof(*spa_map), GFP_KERNEL);
+	if (!spa_map)
+		return NULL;
+
+	INIT_LIST_HEAD(&spa_map->list);
+	spa_map->spa = spa;
+	kref_init(&spa_map->kref);
+	spa_map->acpi_desc = acpi_desc;
+
+	res = request_mem_region(start, n, dev_name(acpi_desc->dev));
+	if (!res)
+		goto err_mem;
+
+	/* TODO: cacheability based on the spa type */
+	spa_map->iomem = ioremap_nocache(start, n);
+	if (!spa_map->iomem)
+		goto err_map;
+
+	list_add_tail(&spa_map->list, &acpi_desc->spa_maps);
+	return spa_map->iomem;
+
+ err_map:
+	release_mem_region(start, n);
+ err_mem:
+	kfree(spa_map);
+	return NULL;
+}
+
+/**
+ * nfit_spa_map - interleave-aware managed-mappings of acpi_nfit_system_address ranges
+ * @acpi_desc: NFIT-bus descriptor that provided the spa table entry
+ * @spa: spa table to map
+ *
+ * In the case where block-data-window apertures and
+ * dimm-control-regions are interleaved they will end up sharing a
+ * single request_mem_region() + ioremap() for the address range.  In
+ * the style of devm nfit_spa_map() mappings are automatically dropped
+ * when all region devices referencing the same mapping are disabled /
+ * unbound.
+ */
+static void *nfit_spa_map(struct acpi_nfit_desc *acpi_desc,
+		struct acpi_nfit_system_address *spa)
+{
+	void *iomem;
+
+	mutex_lock(&acpi_desc->spa_map_mutex);
+	iomem = __nfit_spa_map(acpi_desc, spa);
+	mutex_unlock(&acpi_desc->spa_map_mutex);
+
+	return iomem;
+}
+
+static int nfit_blk_init_interleave(struct nfit_blk_mmio *mmio,
+		struct acpi_nfit_interleave *idt, u16 interleave_ways)
+{
+	if (idt) {
+		mmio->num_lines = idt->line_count;
+		mmio->line_size = idt->line_size;
+		if (interleave_ways == 0)
+			return -ENXIO;
+		mmio->table_size = mmio->num_lines * interleave_ways
+			* mmio->line_size;
+	}
+
+	return 0;
+}
+
+static int acpi_nfit_blk_region_enable(struct nd_bus *nd_bus, struct device *dev)
+{
+	struct nd_bus_descriptor *nd_desc = to_nd_desc(nd_bus);
+	struct acpi_nfit_desc *acpi_desc = to_acpi_desc(nd_desc);
+	struct nd_blk_region *ndbr = to_nd_blk_region(dev);
+	struct nfit_blk_mmio *mmio;
+	struct nfit_blk *nfit_blk;
+	struct nfit_mem *nfit_mem;
+	struct nd_dimm *nd_dimm;
+	int rc;
+
+	nd_dimm = nd_blk_region_to_dimm(ndbr);
+	nfit_mem = nd_dimm_provider_data(nd_dimm);
+	if (!nfit_mem || !nfit_mem->dcr || !nfit_mem->bdw) {
+		dev_dbg(dev, "%s: missing%s%s%s\n", __func__,
+				nfit_mem ? "" : " nfit_mem",
+				(nfit_mem && nfit_mem->dcr) ? "" : " dcr",
+				(nfit_mem && nfit_mem->bdw) ? "" : " bdw");
+		return -ENXIO;
+	}
+
+	nfit_blk = devm_kzalloc(dev, sizeof(*nfit_blk), GFP_KERNEL);
+	if (!nfit_blk)
+		return -ENOMEM;
+	nd_blk_region_set_provider_data(ndbr, nfit_blk);
+	nfit_blk->nd_region = to_nd_region(dev);
+
+	/* map block aperture memory */
+	nfit_blk->bdw_offset = nfit_mem->bdw->offset;
+	mmio = &nfit_blk->mmio[BDW];
+	mmio->base = nfit_spa_map(acpi_desc, nfit_mem->spa_bdw);
+	if (!mmio->base) {
+		dev_dbg(dev, "%s: %s failed to map bdw\n", __func__,
+				nd_dimm_name(nd_dimm));
+		return -ENOMEM;
+	}
+	mmio->size = nfit_mem->bdw->size;
+	mmio->base_offset = nfit_mem->memdev_bdw->region_offset;
+	mmio->idt = nfit_mem->idt_bdw;
+	mmio->spa = nfit_mem->spa_bdw;
+	rc = nfit_blk_init_interleave(mmio, nfit_mem->idt_bdw,
+			nfit_mem->memdev_bdw->interleave_ways);
+	if (rc) {
+		dev_dbg(dev, "%s: %s failed to init bdw interleave\n",
+				__func__, nd_dimm_name(nd_dimm));
+		return rc;
+	}
+
+	/* map block control memory */
+	nfit_blk->cmd_offset = nfit_mem->dcr->command_offset;
+	nfit_blk->stat_offset = nfit_mem->dcr->status_offset;
+	mmio = &nfit_blk->mmio[DCR];
+	mmio->base = nfit_spa_map(acpi_desc, nfit_mem->spa_dcr);
+	if (!mmio->base) {
+		dev_dbg(dev, "%s: %s failed to map dcr\n", __func__,
+				nd_dimm_name(nd_dimm));
+		return -ENOMEM;
+	}
+	mmio->size = nfit_mem->dcr->window_size;
+	mmio->base_offset = nfit_mem->memdev_dcr->region_offset;
+	mmio->idt = nfit_mem->idt_dcr;
+	mmio->spa = nfit_mem->spa_dcr;
+	rc = nfit_blk_init_interleave(mmio, nfit_mem->idt_dcr,
+			nfit_mem->memdev_dcr->interleave_ways);
+	if (rc) {
+		dev_dbg(dev, "%s: %s failed to init dcr interleave\n",
+				__func__, nd_dimm_name(nd_dimm));
+		return rc;
+	}
+
+	if (mmio->line_size == 0)
+		return 0;
+
+	if ((u32) nfit_blk->cmd_offset % mmio->line_size + 8 > mmio->line_size) {
+		dev_dbg(dev, "cmd_offset crosses interleave boundary\n");
+		return -ENXIO;
+	} else if ((u32) nfit_blk->stat_offset % mmio->line_size + 8 > mmio->line_size) {
+		dev_dbg(dev, "stat_offset crosses interleave boundary\n");
+		return -ENXIO;
+	}
+
+	return 0;
+}
+
+static void acpi_nfit_blk_region_disable(struct nd_bus *nd_bus,
+		struct device *dev)
+{
+	struct nd_bus_descriptor *nd_desc = to_nd_desc(nd_bus);
+	struct acpi_nfit_desc *acpi_desc = to_acpi_desc(nd_desc);
+	struct nd_blk_region *ndbr = to_nd_blk_region(dev);
+	struct nfit_blk *nfit_blk = nd_blk_region_provider_data(ndbr);
+	int i;
+
+	if (!nfit_blk)
+		return; /* never enabled */
+
+	/* auto-free BLK spa mappings */
+	for (i = 0; i < 2; i++) {
+		struct nfit_blk_mmio *mmio = &nfit_blk->mmio[i];
+
+		if (mmio->base)
+			nfit_spa_unmap(acpi_desc, mmio->spa);
+	}
+	nd_blk_region_set_provider_data(ndbr, NULL);
+	/* devm will free nfit_blk */
+}
+
 static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
 		struct nfit_spa *nfit_spa)
 {
 	static struct nd_mapping nd_mappings[ND_MAX_MAPPINGS];
 	struct acpi_nfit_system_address *spa = nfit_spa->spa;
+	struct nd_blk_region_desc ndbr_desc;
+	struct nd_region_desc *ndr_desc;
 	struct nfit_memdev *nfit_memdev;
-	struct nd_region_desc ndr_desc;
 	int spa_type, count = 0, rc;
 	struct resource res;
 	u16 range_index;
@@ -851,12 +1247,13 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
 
 	memset(&res, 0, sizeof(res));
 	memset(&nd_mappings, 0, sizeof(nd_mappings));
-	memset(&ndr_desc, 0, sizeof(ndr_desc));
+	memset(&ndbr_desc, 0, sizeof(ndbr_desc));
 	res.start = spa->address;
 	res.end = res.start + spa->length - 1;
-	ndr_desc.res = &res;
-	ndr_desc.provider_data = nfit_spa;
-	ndr_desc.attr_groups = acpi_nfit_region_attribute_groups;
+	ndr_desc = &ndbr_desc.ndr_desc;
+	ndr_desc->res = &res;
+	ndr_desc->provider_data = nfit_spa;
+	ndr_desc->attr_groups = acpi_nfit_region_attribute_groups;
 	list_for_each_entry(nfit_memdev, &acpi_desc->memdevs, list) {
 		struct acpi_nfit_memory_map *memdev = nfit_memdev->memdev;
 		struct nd_mapping *nd_mapping;
@@ -892,26 +1289,29 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
 			} else {
 				nd_mapping->size = nfit_mem->bdw->capacity;
 				nd_mapping->start = nfit_mem->bdw->start_address;
-				ndr_desc.num_lanes = nfit_mem->bdw->windows;
+				ndr_desc->num_lanes = nfit_mem->bdw->windows;
 			}
 
-			ndr_desc.nd_mapping = nd_mapping;
-			ndr_desc.num_mappings = blk_valid;
-			if (!nd_blk_region_create(acpi_desc->nd_bus, &ndr_desc))
+			ndr_desc->nd_mapping = nd_mapping;
+			ndr_desc->num_mappings = blk_valid;
+			ndbr_desc.enable = acpi_desc->blk_enable;
+			ndbr_desc.disable = acpi_desc->blk_disable;
+			ndbr_desc.do_io = acpi_desc->blk_do_io;
+			if (!nd_blk_region_create(acpi_desc->nd_bus, ndr_desc))
 				return -ENOMEM;
 		}
 	}
 
-	ndr_desc.nd_mapping = nd_mappings;
-	ndr_desc.num_mappings = count;
-	rc = acpi_nfit_init_interleave_set(acpi_desc, &ndr_desc, spa);
+	ndr_desc->nd_mapping = nd_mappings;
+	ndr_desc->num_mappings = count;
+	rc = acpi_nfit_init_interleave_set(acpi_desc, ndr_desc, spa);
 	if (rc)
 		return rc;
 	if (spa_type == NFIT_SPA_PM) {
-		if (!nd_pmem_region_create(acpi_desc->nd_bus, &ndr_desc))
+		if (!nd_pmem_region_create(acpi_desc->nd_bus, ndr_desc))
 			return -ENOMEM;
 	} else if (spa_type == NFIT_SPA_VOLATILE) {
-		if (!nd_volatile_region_create(acpi_desc->nd_bus, &ndr_desc))
+		if (!nd_volatile_region_create(acpi_desc->nd_bus, ndr_desc))
 			return -ENOMEM;
 	}
 	return 0;
@@ -937,11 +1337,14 @@ static int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
 	u8 *data;
 	int rc;
 
+	INIT_LIST_HEAD(&acpi_desc->spa_maps);
 	INIT_LIST_HEAD(&acpi_desc->spas);
 	INIT_LIST_HEAD(&acpi_desc->dcrs);
 	INIT_LIST_HEAD(&acpi_desc->bdws);
+	INIT_LIST_HEAD(&acpi_desc->idts);
 	INIT_LIST_HEAD(&acpi_desc->memdevs);
 	INIT_LIST_HEAD(&acpi_desc->dimms);
+	mutex_init(&acpi_desc->spa_map_mutex);
 
 	data = (u8 *) acpi_desc->nfit;
 	end = data + sz;
@@ -990,6 +1393,9 @@ static int acpi_nfit_add(struct acpi_device *adev)
 	dev_set_drvdata(dev, acpi_desc);
 	acpi_desc->dev = dev;
 	acpi_desc->nfit = (struct acpi_table_nfit *) tbl;
+	acpi_desc->blk_enable = acpi_nfit_blk_region_enable;
+	acpi_desc->blk_disable = acpi_nfit_blk_region_disable;
+	acpi_desc->blk_do_io = acpi_nfit_blk_region_do_io;
 	nd_desc = &acpi_desc->nd_desc;
 	nd_desc->provider_name = "ACPI.NFIT";
 	nd_desc->ndctl = acpi_nfit_ctl;
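
For readers following the block control window encoding in write_blk_ctl(), the packing of the command word can be sketched as a standalone userspace function. This is only an illustration of the arithmetic in the patch: bcw_pack() is a hypothetical name, the field layout is read off the BCW_* constants above rather than from the specification, and a 64-byte cache line (L1_CACHE_SHIFT == 6) is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, i.e. L1_CACHE_SHIFT == 6. */
#define CACHE_LINE_SHIFT 6

#define BCW_OFFSET_MASK ((1ULL << 48) - 1)
#define BCW_LEN_SHIFT   48
#define BCW_LEN_MASK    ((1ULL << 8) - 1)
#define BCW_CMD_SHIFT   56

/* Hypothetical helper mirroring the arithmetic in write_blk_ctl(). */
static uint64_t bcw_pack(uint64_t dpa, unsigned int len, unsigned int write)
{
	uint64_t cmd;

	/* the sketch only models a read (0) or a write (1) request */
	assert(write <= 1);

	/* bits 47:0 hold the DPA in cache-line units */
	cmd = (dpa >> CACHE_LINE_SHIFT) & BCW_OFFSET_MASK;
	/* bits 55:48 hold the transfer length in cache-line units */
	cmd |= ((uint64_t)(len >> CACHE_LINE_SHIFT) & BCW_LEN_MASK)
		<< BCW_LEN_SHIFT;
	/* bit 56 is the write flag */
	cmd |= (uint64_t)write << BCW_CMD_SHIFT;

	return cmd;
}
```

Under these assumptions a 4096-byte write at DPA 0x1000 becomes offset 0x40, length 0x40 and the write bit set.
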
diff --git a/drivers/acpi/nfit.h b/drivers/acpi/nfit.h
index cc496ba6bbd2..1fc49cc51d4a 100644
--- a/drivers/acpi/nfit.h
+++ b/drivers/acpi/nfit.h
@@ -52,6 +52,11 @@ struct nfit_bdw {
 	struct list_head list;
 };
 
+struct nfit_idt {
+	struct acpi_nfit_interleave *idt;
+	struct list_head list;
+};
+
 struct nfit_memdev {
 	struct acpi_nfit_memory_map *memdev;
 	struct list_head list;
@@ -62,10 +67,13 @@ struct nfit_mem {
 	struct nd_dimm *nd_dimm;
 	struct acpi_nfit_memory_map *memdev_dcr;
 	struct acpi_nfit_memory_map *memdev_pmem;
+	struct acpi_nfit_memory_map *memdev_bdw;
 	struct acpi_nfit_control_region *dcr;
 	struct acpi_nfit_data_region *bdw;
 	struct acpi_nfit_system_address *spa_dcr;
 	struct acpi_nfit_system_address *spa_bdw;
+	struct acpi_nfit_interleave *idt_dcr;
+	struct acpi_nfit_interleave *idt_bdw;
 	struct list_head list;
 	struct acpi_device *adev;
 	unsigned long dsm_mask;
@@ -74,16 +82,58 @@ struct nfit_mem {
 struct acpi_nfit_desc {
 	struct nd_bus_descriptor nd_desc;
 	struct acpi_table_nfit *nfit;
+	struct mutex spa_map_mutex;
+	struct list_head spa_maps;
 	struct list_head memdevs;
 	struct list_head dimms;
 	struct list_head spas;
 	struct list_head dcrs;
 	struct list_head bdws;
+	struct list_head idts;
 	struct nd_bus *nd_bus;
 	struct device *dev;
 	unsigned long dimm_dsm_force_en;
+	int (*blk_enable)(struct nd_bus *nd_bus, struct device *dev);
+	void (*blk_disable)(struct nd_bus *nd_bus, struct device *dev);
+	int (*blk_do_io)(struct nd_blk_region *ndbr, void *iobuf,
+			u64 len, int write, resource_size_t dpa);
+};
+
+enum nd_blk_mmio_selector {
+	BDW,
+	DCR,
+};
+
+struct nfit_blk {
+	struct nfit_blk_mmio {
+		void *base;
+		u64 size;
+		u64 base_offset;
+		u32 line_size;
+		u32 num_lines;
+		u32 table_size;
+		struct acpi_nfit_interleave *idt;
+		struct acpi_nfit_system_address *spa;
+	} mmio[2];
+	struct nd_region *nd_region;
+	u64 bdw_offset; /* post interleave offset */
+	u64 stat_offset;
+	u64 cmd_offset;
 };
 
+struct nfit_spa_mapping {
+	struct acpi_nfit_desc *acpi_desc;
+	struct acpi_nfit_system_address *spa;
+	struct list_head list;
+	struct kref kref;
+	void *iomem;
+};
+
+static inline struct nfit_spa_mapping *to_spa_map(struct kref *kref)
+{
+	return container_of(kref, struct nfit_spa_mapping, kref);
+}
+
 static inline struct acpi_nfit_memory_map *__to_nfit_memdev(struct nfit_mem *nfit_mem)
 {
 	if (nfit_mem->memdev_dcr)
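
The shared, reference-counted mapping scheme behind nfit_spa_map() / nfit_spa_unmap() can be illustrated outside the kernel. The sketch below is not the driver code: struct range_map, range_map_get() and range_map_put() are hypothetical names, malloc() stands in for request_mem_region() + ioremap(), a plain counter stands in for the kref, and locking (spa_map_mutex in the patch) is omitted on the assumption of a single thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct nfit_spa_mapping. */
struct range_map {
	const void *key;	/* identifies the address range (the spa entry) */
	void *mem;		/* stand-in for the ioremap() result */
	unsigned int refs;	/* stand-in for the kref */
	struct range_map *next;
};

static struct range_map *maps;	/* stand-in for acpi_desc->spa_maps */

static struct range_map *find_map(const void *key)
{
	struct range_map *m;

	for (m = maps; m; m = m->next)
		if (m->key == key)
			return m;
	return NULL;
}

/* Return the shared mapping for @key, creating it on first use. */
static void *range_map_get(const void *key, size_t size)
{
	struct range_map *m = find_map(key);

	if (m) {
		m->refs++;
		return m->mem;
	}

	m = calloc(1, sizeof(*m));
	if (!m)
		return NULL;
	m->mem = malloc(size);
	if (!m->mem) {
		free(m);
		return NULL;
	}
	m->key = key;
	m->refs = 1;
	m->next = maps;
	maps = m;
	return m->mem;
}

/* Drop one reference; tear the mapping down when the last user is gone. */
static void range_map_put(const void *key)
{
	struct range_map **pp, *m;

	for (pp = &maps; (m = *pp) != NULL; pp = &m->next) {
		if (m->key != key)
			continue;
		assert(m->refs > 0);
		if (--m->refs == 0) {
			*pp = m->next;
			free(m->mem);
			free(m);
		}
		return;
	}
}
```

Two callers that pass the same key receive the same pointer, which is the property that lets an interleaved block data window and control region share one mapping.
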
diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
index 2b169806eac5..f97bf0db6519 100644
--- a/drivers/block/nd/Kconfig
+++ b/drivers/block/nd/Kconfig
@@ -34,6 +34,19 @@ config BLK_DEV_PMEM
 
 	  Say Y if you want to use a NVDIMM described by ACPI, E820, etc...
 
+config ND_BLK
+	tristate "BLK: Block data window (aperture) device support"
+	depends on LIBND
+	default LIBND
+	help
+	  Support NVDIMMs, or other devices, that implement a BLK-mode
+	  access capability.  BLK-mode access uses memory-mapped-i/o
+	  apertures to access persistent media.
+
+	  Say Y if your platform firmware emits an ACPI.NFIT table
+	  (CONFIG_ACPI_NFIT), or otherwise exposes BLK-mode
+	  capabilities.
+
 config ND_BTT_DEVS
 	bool
 
diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index 1e8fe93a0a42..29a797686429 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -1,11 +1,14 @@
 obj-$(CONFIG_LIBND) += libnd.o
 obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
 obj-$(CONFIG_ND_BTT) += nd_btt.o
+obj-$(CONFIG_ND_BLK) += nd_blk.o
 
 nd_pmem-y := pmem.o
 
 nd_btt-y := btt.o
 
+nd_blk-y := blk.o
+
 libnd-y := core.o
 libnd-y += bus.o
 libnd-y += dimm_devs.o
diff --git a/drivers/block/nd/blk.c b/drivers/block/nd/blk.c
new file mode 100644
index 000000000000..464a3442fd40
--- /dev/null
+++ b/drivers/block/nd/blk.c
@@ -0,0 +1,252 @@
+/*
+ * NVDIMM Block Window Driver
+ * Copyright (c) 2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/blkdev.h>
+#include <linux/fs.h>
+#include <linux/genhd.h>
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/nd.h>
+#include <linux/sizes.h>
+#include "nd.h"
+
+struct nd_blk_device {
+	struct request_queue *queue;
+	struct gendisk *disk;
+	struct nd_namespace_blk *nsblk;
+	struct nd_blk_region *ndbr;
+	struct nd_io ndio;
+	size_t disk_size;
+};
+
+static int nd_blk_major;
+
+static resource_size_t to_dev_offset(struct nd_namespace_blk *nsblk,
+				resource_size_t ns_offset, unsigned int len)
+{
+	int i;
+
+	for (i = 0; i < nsblk->num_resources; i++) {
+		if (ns_offset < resource_size(nsblk->res[i])) {
+			if (ns_offset + len > resource_size(nsblk->res[i])) {
+				dev_WARN_ONCE(&nsblk->dev, 1,
+					"%s: illegal request\n", __func__);
+				return SIZE_MAX;
+			}
+			return nsblk->res[i]->start + ns_offset;
+		}
+		ns_offset -= resource_size(nsblk->res[i]);
+	}
+
+	dev_WARN_ONCE(&nsblk->dev, 1, "%s: request out of range\n", __func__);
+	return SIZE_MAX;
+}
+
+static void nd_blk_make_request(struct request_queue *q, struct bio *bio)
+{
+	struct block_device *bdev = bio->bi_bdev;
+	struct gendisk *disk = bdev->bd_disk;
+	struct nd_namespace_blk *nsblk;
+	struct nd_blk_device *blk_dev;
+	struct nd_blk_region *ndbr;
+	struct bvec_iter iter;
+	struct bio_vec bvec;
+	int err = 0, rw;
+	sector_t sector;
+
+	sector = bio->bi_iter.bi_sector;
+	if (bio_end_sector(bio) > get_capacity(disk)) {
+		err = -EIO;
+		goto out;
+	}
+
+	BUG_ON(bio->bi_rw & REQ_DISCARD);
+
+	rw = bio_data_dir(bio);
+
+	blk_dev = disk->private_data;
+	nsblk = blk_dev->nsblk;
+	ndbr = blk_dev->ndbr;
+	bio_for_each_segment(bvec, bio, iter) {
+		unsigned int len = bvec.bv_len;
+		resource_size_t	dev_offset;
+		void *iobuf;
+
+		BUG_ON(len > PAGE_SIZE);
+
+		dev_offset = to_dev_offset(nsblk, sector << SECTOR_SHIFT, len);
+		if (dev_offset == SIZE_MAX) {
+			err = -EIO;
+			goto out;
+		}
+
+		iobuf = kmap_atomic(bvec.bv_page);
+		err = ndbr->do_io(ndbr, iobuf + bvec.bv_offset, len, rw, dev_offset);
+		kunmap_atomic(iobuf);
+		if (err)
+			goto out;
+
+		sector += len >> SECTOR_SHIFT;
+	}
+
+ out:
+	bio_endio(bio, err);
+}
+
+static int nd_blk_rw_bytes(struct nd_io *ndio, void *iobuf, size_t offset,
+		size_t n, unsigned long flags)
+{
+	struct nd_namespace_blk *nsblk;
+	struct nd_blk_device *blk_dev;
+	int rw = nd_data_dir(flags);
+	struct nd_blk_region *ndbr;
+	resource_size_t	dev_offset;
+
+	blk_dev = container_of(ndio, typeof(*blk_dev), ndio);
+	ndbr = blk_dev->ndbr;
+	nsblk = blk_dev->nsblk;
+	dev_offset = to_dev_offset(nsblk, offset, n);
+
+	if (unlikely(offset + n > blk_dev->disk_size)) {
+		dev_WARN_ONCE(ndio->dev, 1, "%s: request out of range\n",
+				__func__);
+		return -EFAULT;
+	}
+
+	if (dev_offset == SIZE_MAX)
+		return -EIO;
+
+	return ndbr->do_io(ndbr, iobuf, n, rw, dev_offset);
+}
+
+static const struct block_device_operations nd_blk_fops = {
+	.owner =		THIS_MODULE,
+};
+
+static int nd_blk_probe(struct device *dev)
+{
+	struct nd_namespace_blk *nsblk = to_nd_namespace_blk(dev);
+	struct nd_region *nd_region = to_nd_region(dev->parent);
+	struct nd_blk_device *blk_dev;
+	resource_size_t disk_size;
+	struct gendisk *disk;
+	int err;
+
+	disk_size = nd_namespace_blk_validate(nsblk);
+	if (disk_size < ND_MIN_NAMESPACE_SIZE)
+		return -ENXIO;
+
+	blk_dev = kzalloc(sizeof(*blk_dev), GFP_KERNEL);
+	if (!blk_dev)
+		return -ENOMEM;
+
+	blk_dev->disk_size	= disk_size;
+
+	blk_dev->queue = blk_alloc_queue(GFP_KERNEL);
+	if (!blk_dev->queue) {
+		err = -ENOMEM;
+		goto err_alloc_queue;
+	}
+
+	blk_queue_make_request(blk_dev->queue, nd_blk_make_request);
+	blk_queue_max_hw_sectors(blk_dev->queue, 1024);
+	blk_queue_bounce_limit(blk_dev->queue, BLK_BOUNCE_ANY);
+	blk_queue_logical_block_size(blk_dev->queue, nsblk->lbasize);
+
+	disk = blk_dev->disk = alloc_disk(0);
+	if (!disk) {
+		err = -ENOMEM;
+		goto err_alloc_disk;
+	}
+
+	blk_dev->ndbr = to_nd_blk_region(nsblk->dev.parent);
+	blk_dev->nsblk = nsblk;
+
+	disk->driverfs_dev	= dev;
+	disk->major		= nd_blk_major;
+	disk->first_minor	= 0;
+	disk->fops		= &nd_blk_fops;
+	disk->private_data	= blk_dev;
+	disk->queue		= blk_dev->queue;
+	disk->flags		= GENHD_FL_EXT_DEVT;
+	sprintf(disk->disk_name, "ndblk%d.%d", nd_region->id, nsblk->id);
+	set_capacity(disk, disk_size >> SECTOR_SHIFT);
+
+	nd_bus_lock(dev);
+	dev_set_drvdata(dev, blk_dev);
+
+	add_disk(disk);
+	nd_init_ndio(&blk_dev->ndio, nd_blk_rw_bytes, dev, disk, 0);
+	nd_register_ndio(&blk_dev->ndio);
+	nd_bus_unlock(dev);
+
+	return 0;
+
+ err_alloc_disk:
+	blk_cleanup_queue(blk_dev->queue);
+ err_alloc_queue:
+	kfree(blk_dev);
+	return err;
+}
+
+static int nd_blk_remove(struct device *dev)
+{
+	struct nd_blk_device *blk_dev = dev_get_drvdata(dev);
+
+	nd_unregister_ndio(&blk_dev->ndio);
+	del_gendisk(blk_dev->disk);
+	put_disk(blk_dev->disk);
+	blk_cleanup_queue(blk_dev->queue);
+	kfree(blk_dev);
+
+	return 0;
+}
+
+static struct nd_device_driver nd_blk_driver = {
+	.probe = nd_blk_probe,
+	.remove = nd_blk_remove,
+	.drv = {
+		.name = "nd_blk",
+	},
+	.type = ND_DRIVER_NAMESPACE_BLK,
+};
+
+static int __init nd_blk_init(void)
+{
+	int rc;
+
+	rc = register_blkdev(0, "nd_blk");
+	if (rc < 0)
+		return rc;
+
+	nd_blk_major = rc;
+	rc = nd_driver_register(&nd_blk_driver);
+
+	if (rc < 0)
+		unregister_blkdev(nd_blk_major, "nd_blk");
+
+	return rc;
+}
+
+static void __exit nd_blk_exit(void)
+{
+	driver_unregister(&nd_blk_driver.drv);
+	unregister_blkdev(nd_blk_major, "nd_blk");
+}
+
+MODULE_AUTHOR("Ross Zwisler <ross.zwisler@linux.intel.com>");
+MODULE_LICENSE("GPL v2");
+MODULE_ALIAS_ND_DEVICE(ND_DEVICE_NAMESPACE_BLK);
+module_init(nd_blk_init);
+module_exit(nd_blk_exit);
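
The namespace-offset to device-offset translation in to_dev_offset() can be tried in isolation. The sketch below restates the same walk in userspace C; struct extent, ns_to_dev_offset() and OFFSET_INVALID are hypothetical names standing in for the nsblk->res[] array, to_dev_offset() and SIZE_MAX, and the kernel warnings are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one DPA resource backing a BLK namespace. */
struct extent {
	uint64_t start;	/* device physical address of the first byte */
	uint64_t size;	/* length in bytes */
};

#define OFFSET_INVALID UINT64_MAX	/* plays the role of SIZE_MAX */

static uint64_t ns_to_dev_offset(const struct extent *ext, int num,
		uint64_t ns_offset, uint64_t len)
{
	int i;

	assert(num >= 0);
	for (i = 0; i < num; i++) {
		if (ns_offset < ext[i].size) {
			/* a request may not straddle two extents */
			if (ns_offset + len > ext[i].size)
				return OFFSET_INVALID;
			return ext[i].start + ns_offset;
		}
		/* skip this extent and rebase the offset onto the next one */
		ns_offset -= ext[i].size;
	}

	return OFFSET_INVALID;	/* beyond the end of the namespace */
}
```

With two extents of 0x2000 and 0x1000 bytes, namespace offset 0x2000 resolves to the first byte of the second extent, while a request that would cross from the first extent into the second is rejected.
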
diff --git a/drivers/block/nd/dimm_devs.c b/drivers/block/nd/dimm_devs.c
index 4b225c8b7d0a..df6c98fc2ae6 100644
--- a/drivers/block/nd/dimm_devs.c
+++ b/drivers/block/nd/dimm_devs.c
@@ -209,6 +209,15 @@ struct nd_dimm *to_nd_dimm(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(to_nd_dimm);
 
+struct nd_dimm *nd_blk_region_to_dimm(struct nd_blk_region *ndbr)
+{
+	struct nd_region *nd_region = &ndbr->nd_region;
+	struct nd_mapping *nd_mapping = &nd_region->mapping[0];
+
+	return nd_mapping->nd_dimm;
+}
+EXPORT_SYMBOL_GPL(nd_blk_region_to_dimm);
+
 struct nd_dimm_drvdata *to_ndd(struct nd_mapping *nd_mapping)
 {
 	struct nd_dimm *nd_dimm = nd_mapping->nd_dimm;
diff --git a/drivers/block/nd/namespace_devs.c b/drivers/block/nd/namespace_devs.c
index c193ba6c6445..0734b1a4a0a3 100644
--- a/drivers/block/nd/namespace_devs.c
+++ b/drivers/block/nd/namespace_devs.c
@@ -151,6 +151,53 @@ static resource_size_t nd_namespace_blk_size(struct nd_namespace_blk *nsblk)
 	return size;
 }
 
+resource_size_t nd_namespace_blk_validate(struct nd_namespace_blk *nsblk)
+{
+	struct nd_region *nd_region = to_nd_region(nsblk->dev.parent);
+	struct nd_mapping *nd_mapping = &nd_region->mapping[0];
+	struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+	struct nd_label_id label_id;
+	struct resource *res;
+	int count, i;
+
+	if (!nsblk->uuid || !nsblk->lbasize)
+		return 0;
+
+	count = 0;
+	nd_label_gen_id(&label_id, nsblk->uuid, NSLABEL_FLAG_LOCAL);
+	for_each_dpa_resource(ndd, res) {
+		if (strcmp(res->name, label_id.id) != 0)
+			continue;
+		/*
+		 * Resources with unacknowledged adjustments indicate a
+		 * failure to update labels
+		 */
+		if (res->flags & DPA_RESOURCE_ADJUSTED)
+			return 0;
+		count++;
+	}
+
+	/* These values match after a successful label update */
+	if (count != nsblk->num_resources)
+		return 0;
+
+	for (i = 0; i < nsblk->num_resources; i++) {
+		struct resource *found = NULL;
+
+		for_each_dpa_resource(ndd, res)
+			if (res == nsblk->res[i]) {
+				found = res;
+				break;
+			}
+		/* stale resource */
+		if (!found)
+			return 0;
+	}
+
+	return nd_namespace_blk_size(nsblk);
+}
+EXPORT_SYMBOL(nd_namespace_blk_validate);
+
 static int nd_namespace_label_update(struct nd_region *nd_region, struct device *dev)
 {
 	dev_WARN_ONCE(dev, dev->driver,
diff --git a/drivers/block/nd/nd-private.h b/drivers/block/nd/nd-private.h
index 6a864e9ae97a..b0571e334af9 100644
--- a/drivers/block/nd/nd-private.h
+++ b/drivers/block/nd/nd-private.h
@@ -22,7 +22,6 @@ extern struct list_head nd_bus_list;
 extern struct mutex nd_bus_list_mutex;
 extern int nd_dimm_major;
 
-struct block_device;
 struct nd_io_claim;
 struct nd_btt;
 struct nd_io;
@@ -50,8 +49,8 @@ struct nd_dimm {
 
 struct nd_io *ndio_lookup(struct nd_bus *nd_bus, const char *diskname);
 bool is_nd_dimm(struct device *dev);
-bool is_nd_blk(struct device *dev);
 bool is_nd_pmem(struct device *dev);
+bool is_nd_blk(struct device *dev);
 #if IS_ENABLED(CONFIG_ND_BTT_DEVS)
 bool is_nd_btt(struct device *dev);
 struct nd_btt *nd_btt_create(struct nd_bus *nd_bus);
diff --git a/drivers/block/nd/nd.h b/drivers/block/nd/nd.h
index b706f25da7e5..b830801c9892 100644
--- a/drivers/block/nd/nd.h
+++ b/drivers/block/nd/nd.h
@@ -113,6 +113,15 @@ struct nd_region {
 	struct nd_mapping mapping[0];
 };
 
+struct nd_blk_region {
+	int (*enable)(struct nd_bus *nd_bus, struct device *dev);
+	void (*disable)(struct nd_bus *nd_bus, struct device *dev);
+	int (*do_io)(struct nd_blk_region *ndbr, void *iobuf, u64 len,
+			int write, resource_size_t dpa);
+	void *blk_provider_data;
+	struct nd_region nd_region;
+};
+
 /*
  * Lookup next in the repeating sequence of 01, 10, and 11.
  */
@@ -232,8 +241,6 @@ struct nd_btt *to_nd_btt(struct device *dev);
 struct btt_sb;
 u64 nd_btt_sb_checksum(struct btt_sb *btt_sb);
 struct nd_region *to_nd_region(struct device *dev);
-unsigned int nd_region_acquire_lane(struct nd_region *nd_region);
-void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane);
 int nd_region_to_namespace_type(struct nd_region *nd_region);
 int nd_region_register_namespaces(struct nd_region *nd_region, int *err);
 u64 nd_region_interleave_set_cookie(struct nd_region *nd_region);
@@ -245,4 +252,6 @@ void nd_dimm_free_dpa(struct nd_dimm_drvdata *ndd, struct resource *res);
 struct resource *nd_dimm_allocate_dpa(struct nd_dimm_drvdata *ndd,
 		struct nd_label_id *label_id, resource_size_t start,
 		resource_size_t n);
+int nd_blk_region_init(struct nd_region *nd_region);
+resource_size_t nd_namespace_blk_validate(struct nd_namespace_blk *nsblk);
 #endif /* __ND_H__ */
diff --git a/drivers/block/nd/region.c b/drivers/block/nd/region.c
index 0e872f54dcd2..75ae27279f0e 100644
--- a/drivers/block/nd/region.c
+++ b/drivers/block/nd/region.c
@@ -94,11 +94,10 @@ EXPORT_SYMBOL(nd_region_release_lane);
 
 static int nd_region_probe(struct device *dev)
 {
-	int err;
+	int err, rc;
 	static unsigned long once;
 	struct nd_region_namespaces *num_ns;
 	struct nd_region *nd_region = to_nd_region(dev);
-	int rc = nd_region_register_namespaces(nd_region, &err);
 
 	if (nd_region->num_lanes > num_online_cpus()
 			&& nd_region->num_lanes < num_possible_cpus()
@@ -110,6 +109,11 @@ static int nd_region_probe(struct device *dev)
 				nd_region->num_lanes);
 	}
 
+	rc = nd_blk_region_init(nd_region);
+	if (rc)
+		return rc;
+
+	rc = nd_region_register_namespaces(nd_region, &err);
 	num_ns = devm_kzalloc(dev, sizeof(*num_ns), GFP_KERNEL);
 	if (!num_ns)
 		return -ENOMEM;
diff --git a/drivers/block/nd/region_devs.c b/drivers/block/nd/region_devs.c
index 4965004147ae..88af2a42397f 100644
--- a/drivers/block/nd/region_devs.c
+++ b/drivers/block/nd/region_devs.c
@@ -11,6 +11,7 @@
  * General Public License for more details.
  */
 #include <linux/scatterlist.h>
+#include <linux/highmem.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
 #include <linux/sort.h>
@@ -33,7 +34,10 @@ static void nd_region_release(struct device *dev)
 		put_device(&nd_dimm->dev);
 	}
 	ida_simple_remove(&region_ida, nd_region->id);
-	kfree(nd_region);
+	if (is_nd_blk(dev))
+		kfree(to_nd_blk_region(dev));
+	else
+		kfree(nd_region);
 }
 
 static struct device_type nd_blk_device_type = {
@@ -70,6 +74,33 @@ struct nd_region *to_nd_region(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(to_nd_region);
 
+struct nd_blk_region *to_nd_blk_region(struct device *dev)
+{
+	struct nd_region *nd_region = to_nd_region(dev);
+
+	WARN_ON(!is_nd_blk(dev));
+	return container_of(nd_region, struct nd_blk_region, nd_region);
+}
+EXPORT_SYMBOL_GPL(to_nd_blk_region);
+
+void *nd_region_provider_data(struct nd_region *nd_region)
+{
+	return nd_region->provider_data;
+}
+EXPORT_SYMBOL_GPL(nd_region_provider_data);
+
+void *nd_blk_region_provider_data(struct nd_blk_region *ndbr)
+{
+	return ndbr->blk_provider_data;
+}
+EXPORT_SYMBOL_GPL(nd_blk_region_provider_data);
+
+void nd_blk_region_set_provider_data(struct nd_blk_region *ndbr, void *data)
+{
+	ndbr->blk_provider_data = data;
+}
+EXPORT_SYMBOL_GPL(nd_blk_region_set_provider_data);
+
 /**
  * nd_region_to_namespace_type() - region to an integer namespace type
  * @nd_region: region-device to interrogate
@@ -344,14 +375,13 @@ u64 nd_region_interleave_set_cookie(struct nd_region *nd_region)
 
 /*
  * Upon successful probe/remove, take/release a reference on the
- * associated interleave set (if present)
+ * associated dimms in the interleave set, on successful probe of a BLK
+ * namespace check if we need a new seed, and on remove or failed probe
+ * of a BLK region notify the provider to disable the region.
  */
 static void nd_region_notify_driver_action(struct nd_bus *nd_bus,
 		struct device *dev, int rc, bool probe)
 {
-	if (rc)
-		return;
-
 	if (is_nd_pmem(dev) || is_nd_blk(dev)) {
 		struct nd_region *nd_region = to_nd_region(dev);
 		int i;
@@ -360,11 +390,16 @@ static void nd_region_notify_driver_action(struct nd_bus *nd_bus,
 			struct nd_mapping *nd_mapping = &nd_region->mapping[i];
 			struct nd_dimm *nd_dimm = nd_mapping->nd_dimm;
 
-			if (probe)
+			if (probe && rc == 0)
 				atomic_inc(&nd_dimm->busy);
-			else
+			else if (!probe)
 				atomic_dec(&nd_dimm->busy);
 		}
+
+		if (is_nd_pmem(dev) || (probe && rc == 0))
+			return;
+
+		to_nd_blk_region(dev)->disable(nd_bus, dev);
 	} else if (dev->parent && is_nd_blk(dev->parent) && probe && rc == 0) {
 		struct nd_region *nd_region = to_nd_region(dev->parent);
 
@@ -508,11 +543,21 @@ struct attribute_group nd_mapping_attribute_group = {
 };
 EXPORT_SYMBOL_GPL(nd_mapping_attribute_group);
 
-void *nd_region_provider_data(struct nd_region *nd_region)
+int nd_blk_region_init(struct nd_region *nd_region)
 {
-	return nd_region->provider_data;
+	struct device *dev = &nd_region->dev;
+	struct nd_bus *nd_bus = walk_to_nd_bus(dev);
+
+	if (!is_nd_blk(dev))
+		return 0;
+
+	if (nd_region->ndr_mappings < 1) {
+		dev_err(dev, "invalid BLK region\n");
+		return -ENXIO;
+	}
+
+	return to_nd_blk_region(dev)->enable(nd_bus, dev);
 }
-EXPORT_SYMBOL_GPL(nd_region_provider_data);
 
 static noinline struct nd_region *nd_region_create(struct nd_bus *nd_bus,
 		struct nd_region_desc *ndr_desc, struct device_type *dev_type)
@@ -534,9 +579,28 @@ static noinline struct nd_region *nd_region_create(struct nd_bus *nd_bus,
 		}
 	}
 
-	nd_region = kzalloc(sizeof(struct nd_region)
-			+ sizeof(struct nd_mapping) * ndr_desc->num_mappings,
-			GFP_KERNEL);
+	if (dev_type == &nd_blk_device_type) {
+		struct nd_blk_region_desc *ndbr_desc;
+		struct nd_blk_region *ndbr;
+
+		ndbr_desc = container_of(ndr_desc, typeof(*ndbr_desc), ndr_desc);
+		ndbr = kzalloc(sizeof(*ndbr) + sizeof(struct nd_mapping)
+				* ndr_desc->num_mappings,
+				GFP_KERNEL);
+		if (ndbr) {
+			nd_region = &ndbr->nd_region;
+			ndbr->enable = ndbr_desc->enable;
+			ndbr->disable = ndbr_desc->disable;
+			ndbr->do_io = ndbr_desc->do_io;
+		} else
+			nd_region = NULL;
+	} else {
+		nd_region = kzalloc(sizeof(struct nd_region)
+				+ sizeof(struct nd_mapping)
+				* ndr_desc->num_mappings,
+				GFP_KERNEL);
+	}
+
 	if (!nd_region)
 		return NULL;
 	nd_region->id = ida_simple_get(&region_ida, 0, 0, GFP_KERNEL);
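
The embedding pattern used for struct nd_blk_region (a struct nd_region embedded in a larger object and recovered with container_of()) can be shown with trimmed-down types. The sketch below is illustrative only: struct region, struct blk_region and to_blk_region() are hypothetical stand-ins, and the container_of() here is a portable approximation of the kernel macro without its type checking.

```c
#include <assert.h>
#include <stddef.h>

/* Portable container_of(); the kernel version adds type checking. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical, trimmed-down stand-ins for nd_region / nd_blk_region. */
struct region {
	int id;
};

struct blk_region {
	int (*do_io)(struct blk_region *br, int write);
	void *provider_data;
	struct region region;	/* embedded, not pointed to */
};

static struct blk_region *to_blk_region(struct region *r)
{
	/* only valid when @r really is embedded in a struct blk_region */
	assert(r != NULL);
	return container_of(r, struct blk_region, region);
}
```

Because the outer object owns the allocation, the release path has to free the struct blk_region pointer and not the embedded member, which is why nd_region_release() in the patch distinguishes the BLK case.
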
diff --git a/include/linux/libnd.h b/include/linux/libnd.h
index 6146690b23e7..31969f082407 100644
--- a/include/linux/libnd.h
+++ b/include/linux/libnd.h
@@ -81,6 +81,15 @@ struct nd_region_desc {
 
 struct nd_bus;
 struct device;
+struct nd_blk_region;
+struct nd_blk_region_desc {
+	int (*enable)(struct nd_bus *nd_bus, struct device *dev);
+	void (*disable)(struct nd_bus *nd_bus, struct device *dev);
+	int (*do_io)(struct nd_blk_region *ndbr, void *iobuf, u64 len,
+			int write, resource_size_t dpa);
+	struct nd_region_desc ndr_desc;
+};
+
 struct nd_bus *__nd_bus_register(struct device *parent,
 		struct nd_bus_descriptor *nfit_desc, struct module *module);
 #define nd_bus_register(parent, desc) \
@@ -89,10 +98,10 @@ void nd_bus_unregister(struct nd_bus *nd_bus);
 struct nd_bus *to_nd_bus(struct device *dev);
 struct nd_dimm *to_nd_dimm(struct device *dev);
 struct nd_region *to_nd_region(struct device *dev);
+struct nd_blk_region *to_nd_blk_region(struct device *dev);
 struct nd_bus_descriptor *to_nd_desc(struct nd_bus *nd_bus);
 const char *nd_dimm_name(struct nd_dimm *nd_dimm);
 void *nd_dimm_provider_data(struct nd_dimm *nd_dimm);
-void *nd_region_provider_data(struct nd_region *nd_region);
 struct nd_dimm *nd_dimm_create(struct nd_bus *nd_bus, void *provider_data,
 		const struct attribute_group **groups, unsigned long flags,
 		unsigned long *dsm_mask);
@@ -110,5 +119,11 @@ struct nd_region *nd_blk_region_create(struct nd_bus *nd_bus,
 		struct nd_region_desc *ndr_desc);
 struct nd_region *nd_volatile_region_create(struct nd_bus *nd_bus,
 		struct nd_region_desc *ndr_desc);
+void *nd_region_provider_data(struct nd_region *nd_region);
+void *nd_blk_region_provider_data(struct nd_blk_region *ndbr);
+void nd_blk_region_set_provider_data(struct nd_blk_region *ndbr, void *data);
+struct nd_dimm *nd_blk_region_to_dimm(struct nd_blk_region *ndbr);
+unsigned int nd_region_acquire_lane(struct nd_region *nd_region);
+void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane);
 u64 nd_fletcher64(void *addr, size_t len, bool le);
 #endif /* __LIBND_H__ */



* [PATCH v3 20/21] nfit-test: manufactured NFITs for interface development
  2015-05-20 20:56 ` Dan Williams
@ 2015-05-20 20:58   ` Dan Williams
From: Dan Williams @ 2015-05-20 20:58 UTC (permalink / raw)
  To: axboe
  Cc: mingo, linux-nvdimm, neilb, gregkh, Rafael J. Wysocki,
	linux-kernel, Robert Moore, linux-acpi, jmoyer, Lv Zheng, hch

Manually create and register NFITs to describe 2 topologies.  Topology1
is an advanced plausible configuration for BLK/PMEM aliased NVDIMMs.
Topology2 is an example configuration for current platforms that only
ship with a persistent address range.

 Kernel provider "nfit_test.0" produces an NFIT with the following attributes:

                              (a)               (b)           DIMM   BLK-REGION
           +-------------------+--------+--------+--------+
 +------+  |       pm0.0       | blk2.0 | pm1.0  | blk2.1 |    0      region2
 | imc0 +--+- - - region0- - - +--------+        +--------+
 +--+---+  |       pm0.0       | blk3.0 | pm1.0  | blk3.1 |    1      region3
    |      +-------------------+--------v        v--------+
 +--+---+                               |                 |
 | cpu0 |                                     region1
 +--+---+                               |                 |
    |      +----------------------------^        ^--------+
 +--+---+  |           blk4.0           | pm1.0  | blk4.0 |    2      region4
 | imc1 +--+----------------------------|        +--------+
 +------+  |           blk5.0           | pm1.0  | blk5.0 |    3      region5
           +----------------------------+--------+--------+

 *) In this layout we have four dimms and two memory controllers in one
    socket.  Each unique interface ("block" or "pmem") to DPA space
    is identified by a region device with a dynamically assigned id.

 *) The first portion of dimm0 and dimm1 are interleaved as REGION0.
    A single "pmem" namespace is created in the REGION0-"spa"-range
    that spans dimm0 and dimm1 with a user-specified name of "pm0.0".
    Some of that interleaved "spa" range is reclaimed as "bdw"
    accessed space starting at offset (a) into each dimm.  In that
    reclaimed space we create two "bdw" "namespaces" from REGION2 and
    REGION3 where "blk2.0" and "blk3.0" are just human readable names
    that could be set to any user-desired name in the label.

 *) In the last portion of dimm0 and dimm1 we have an interleaved
    "spa" range, REGION1, that spans those two dimms as well as dimm2
    and dimm3.  Some of REGION1 is allocated to a "pmem" namespace named
    "pm1.0"; the rest is reclaimed in 4 "bdw" namespaces (one for each
    dimm in the interleave set), "blk2.1", "blk3.1", "blk4.0", and
    "blk5.0".

 *) The portions of dimm2 and dimm3 that do not participate in the
    REGION1 interleaved "spa" range (i.e. the DPA addresses below
    offset (b)) are also included in the "blk4.0" and "blk5.0"
    namespaces.  Note that this example shows that "bdw" namespaces
    don't need to be contiguous in DPA-space.
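
As a rough illustration of the nfit_test.0 layout, the sketch below
computes which DPA extent each dimm contributes to the two interleaved
"spa" ranges.  The sizes (32M dimms, a 32M two-way range, a 64M four-way
range, region1 starting at DPA SPA0_SIZE/2) mirror the constants and
memdev entries in test/nfit.c from this patch; the struct and helper
names are made up for the example and are not part of libnd.

```c
#include <assert.h>
#include <stdint.h>

/* Sizes mirror the enum in test/nfit.c; the helpers are hypothetical. */
#define SZ_1M     (1024ULL * 1024ULL)
#define DIMM_SIZE (32 * SZ_1M)
#define SPA0_SIZE DIMM_SIZE        /* region0: 2-way over dimm0, dimm1 */
#define SPA1_SIZE (DIMM_SIZE * 2)  /* region1: 4-way over dimm0..dimm3 */

struct dpa_extent {
	uint64_t start;	/* first DPA byte a dimm contributes */
	uint64_t size;	/* number of bytes a dimm contributes */
};

/* Every dimm in an interleave set contributes an equal share. */
static struct dpa_extent dimm_share(uint64_t spa_size, unsigned int ways,
		uint64_t dpa_start)
{
	struct dpa_extent e;

	assert(ways > 0 && spa_size % ways == 0);
	e.start = dpa_start;
	e.size = spa_size / ways;
	return e;
}

/* Share of region0 on dimm0 and dimm1, starting at DPA 0. */
static struct dpa_extent region0_share(void)
{
	return dimm_share(SPA0_SIZE, 2, 0);
}

/* Share of region1 on each of the four dimms. */
static struct dpa_extent region1_share(void)
{
	return dimm_share(SPA1_SIZE, 4, SPA0_SIZE / 2);
}

/* BLK-only capacity on dimm2/dimm3: everything below region1's DPA base. */
static uint64_t blk_only_bytes(void)
{
	return region1_share().start;
}
```

With these numbers each dimm splits into two 16M halves: the upper half
always belongs to region1, while the lower half is region0 on dimm0 and
dimm1 and BLK-only capacity on dimm2 and dimm3.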

 Kernel provider "nfit_test.1" produces an NFIT with the following attributes:

 region2
 +---------------------+
 |---------------------|
 ||       pm2.0       ||
 |---------------------|
 +---------------------+

 *) Describes a simple system-physical-address range with no backing
    dimm or interleave description.
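
Both providers back their address ranges with ordinary kernel
allocations rather than real io-memory.  To make that work, test/iomap.c
in this patch wraps ioremap_cache(), ioremap_nocache() and iounmap() (via
ld --wrap) and returns the test buffer when the requested address falls
inside a registered test resource, falling back to the real
implementation otherwise.  A minimal user-space sketch of that
lookup-with-fallback shape follows; all names here (fake_res,
remap_or_fallback, no_mapping) are illustrative and not kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct nfit_test_resource. */
struct fake_res {
	uint64_t start;	/* first emulated physical address */
	uint64_t size;	/* length of the emulated range */
	char *buf;	/* ordinary memory backing the range */
};

typedef void *(*remap_fn)(uint64_t addr, size_t size);

/* Return the resource containing addr, or NULL if none does. */
static struct fake_res *lookup(struct fake_res *tbl, size_t n, uint64_t addr)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (addr >= tbl[i].start && addr - tbl[i].start < tbl[i].size)
			return &tbl[i];
	return NULL;
}

/* Emulated ranges resolve to their buffer; all else goes to fallback. */
static void *remap_or_fallback(struct fake_res *tbl, size_t n, uint64_t addr,
		size_t size, remap_fn fallback)
{
	struct fake_res *res = lookup(tbl, n, addr);

	assert(fallback != NULL);
	if (res)
		return res->buf + (addr - res->start);
	return fallback(addr, size);
}

/* Stand-in for the real ioremap: refuses every request. */
static void *no_mapping(uint64_t addr, size_t size)
{
	(void)addr;
	(void)size;
	return NULL;
}
```

Passing the fallback as a function pointer is what lets one helper serve
both the cached and uncached wrappers, as __nfit_test_ioremap() does in
the patch.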

Cc: <linux-acpi@vger.kernel.org>
Cc: Lv Zheng <lv.zheng@intel.com>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/acpi/nfit.c               |    6 
 drivers/acpi/nfit.h               |   12 
 drivers/block/nd/Kconfig          |   22 +
 drivers/block/nd/Makefile         |    9 
 drivers/block/nd/test/Makefile    |    5 
 drivers/block/nd/test/iomap.c     |  151 +++++
 drivers/block/nd/test/nfit.c      | 1171 +++++++++++++++++++++++++++++++++++++
 drivers/block/nd/test/nfit_test.h |   28 +
 8 files changed, 1402 insertions(+), 2 deletions(-)
 create mode 100644 drivers/block/nd/test/Makefile
 create mode 100644 drivers/block/nd/test/iomap.c
 create mode 100644 drivers/block/nd/test/nfit.c
 create mode 100644 drivers/block/nd/test/nfit_test.h
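
Note: the test dimms are identified by the handle layout that
NFIT_DIMM_HANDLE() in nfit.h packs: node in bits 27:16, socket in 15:12,
memory controller in 11:8, channel in 7:4 and dimm in 3:0.  Below is a
small user-space sketch of packing and unpacking under that layout.  The
function names are hypothetical; only the bit positions come from the
macro.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions follow NFIT_DIMM_HANDLE(); helper names are hypothetical. */
static uint32_t dimm_handle(uint32_t node, uint32_t socket, uint32_t imc,
		uint32_t chan, uint32_t dimm)
{
	assert(node <= 0xfff && socket <= 0xf && imc <= 0xf
			&& chan <= 0xf && dimm <= 0xf);
	return (node << 16) | (socket << 12) | (imc << 8) | (chan << 4) | dimm;
}

/* Each accessor shifts by the same amount its field was packed with. */
static uint32_t handle_node(uint32_t h)   { return (h >> 16) & 0xfff; }
static uint32_t handle_socket(uint32_t h) { return (h >> 12) & 0xf; }
static uint32_t handle_imc(uint32_t h)    { return (h >> 8) & 0xf; }
static uint32_t handle_chan(uint32_t h)   { return (h >> 4) & 0xf; }
static uint32_t handle_dimm(uint32_t h)   { return h & 0xf; }
```

For example, the fourth test dimm (imc 1, dimm 1, everything else 0)
packs to 0x101 under this layout.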

diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index c4ce498da9eb..1efbd01d4860 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -29,10 +29,11 @@ MODULE_PARM_DESC(force_enable_dimms, "Ignore _STA (ACPI DIMM device) status");
 
 static u8 nfit_uuid[NFIT_UUID_MAX][16];
 
-static const u8 *to_nfit_uuid(enum nfit_uuids id)
+const u8 *to_nfit_uuid(enum nfit_uuids id)
 {
 	return nfit_uuid[id];
 }
+EXPORT_SYMBOL(to_nfit_uuid);
 
 static struct acpi_nfit_desc *to_acpi_nfit_desc(struct nd_bus_descriptor *nd_desc)
 {
@@ -1330,7 +1331,7 @@ static int acpi_nfit_register_regions(struct acpi_nfit_desc *acpi_desc)
 	return 0;
 }
 
-static int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
+int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
 {
 	struct device *dev = acpi_desc->dev;
 	const void *end;
@@ -1369,6 +1370,7 @@ static int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
 
 	return acpi_nfit_register_regions(acpi_desc);
 }
+EXPORT_SYMBOL_GPL(acpi_nfit_init);
 
 static int acpi_nfit_add(struct acpi_device *adev)
 {
diff --git a/drivers/acpi/nfit.h b/drivers/acpi/nfit.h
index 1fc49cc51d4a..eedbd3d79e02 100644
--- a/drivers/acpi/nfit.h
+++ b/drivers/acpi/nfit.h
@@ -37,6 +37,15 @@ enum nfit_uuids {
 	NFIT_UUID_MAX,
 };
 
+#define NFIT_DIMM_HANDLE(node, socket, imc, chan, dimm) \
+       ((((node) & 0xfff) << 16) | (((socket) & 0xf) << 12) \
+        | (((imc) & 0xf) << 8) | (((chan) & 0xf) << 4) | ((dimm) & 0xf))
+#define NFIT_DIMM_NODE(handle) ((handle) >> 16 & 0xfff)
+#define NFIT_DIMM_SOCKET(handle) ((handle) >> 12 & 0xf)
+#define NFIT_DIMM_IMC(handle) ((handle) >> 8 & 0xf)
+#define NFIT_DIMM_CHAN(handle) ((handle) >> 4 & 0xf)
+#define NFIT_DIMM_DIMM(handle) ((handle) & 0xf)
+
 struct nfit_spa {
 	struct acpi_nfit_system_address *spa;
 	struct list_head list;
@@ -145,4 +154,7 @@ static inline struct acpi_nfit_desc *to_acpi_desc(struct nd_bus_descriptor *nd_d
 {
 	return container_of(nd_desc, struct acpi_nfit_desc, nd_desc);
 }
+
+const u8 *to_nfit_uuid(enum nfit_uuids id);
+int acpi_nfit_init(struct acpi_nfit_desc *nfit, acpi_size sz);
 #endif /* __NFIT_H__ */
diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
index f97bf0db6519..5f2935aefd41 100644
--- a/drivers/block/nd/Kconfig
+++ b/drivers/block/nd/Kconfig
@@ -17,6 +17,28 @@ if ND_DEVICES
 config LIBND
 	tristate
 
+config NFIT_TEST
+	tristate "NFIT TEST: Manufactured NFIT for interface testing"
+	default n
+	depends on EXPERT
+	depends on DMA_CMA
+	depends on LIBND=m
+	depends on ACPI_NFIT
+	depends on m
+	help
+	  For development purposes register a manufactured
+	  NFIT table to verify the resulting device model topology.
+	  Note, this module arranges for ioremap_cache() to be
+	  overridden locally to allow simulation of system-memory as an
+	  io-memory-resource.
+
+	  Note, this test expects to be able to find at least 256MB of
+	  contiguous DMA space (CONFIG_CMA_SIZE_MBYTES, cma=) or it
+	  will fail to load.  This much contiguous memory is needed to
+	  properly simulate a DAX capable memory region.
+
+	  Say N unless you are doing development of the 'libnd' subsystem.
+
 config BLK_DEV_PMEM
 	tristate "PMEM: Persistent memory block device support"
 	depends on LIBND
diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index 29a797686429..e1e0f01ae960 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -1,4 +1,13 @@
+ifdef CONFIG_NFIT_TEST
+ldflags-y += --wrap=ioremap_cache
+ldflags-y += --wrap=ioremap_nocache
+ldflags-y += --wrap=iounmap
+ldflags-y += --wrap=__request_region
+ldflags-y += --wrap=__release_region
+endif
+
 obj-$(CONFIG_LIBND) += libnd.o
+obj-$(CONFIG_NFIT_TEST) += test/
 obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
 obj-$(CONFIG_ND_BTT) += nd_btt.o
 obj-$(CONFIG_ND_BLK) += nd_blk.o
diff --git a/drivers/block/nd/test/Makefile b/drivers/block/nd/test/Makefile
new file mode 100644
index 000000000000..c7f319cbd082
--- /dev/null
+++ b/drivers/block/nd/test/Makefile
@@ -0,0 +1,5 @@
+obj-$(CONFIG_NFIT_TEST) += nfit_test.o
+obj-$(CONFIG_NFIT_TEST) += nfit_test_iomap.o
+
+nfit_test-y := nfit.o
+nfit_test_iomap-y := iomap.o
diff --git a/drivers/block/nd/test/iomap.c b/drivers/block/nd/test/iomap.c
new file mode 100644
index 000000000000..c85a6f6ba559
--- /dev/null
+++ b/drivers/block/nd/test/iomap.c
@@ -0,0 +1,151 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/rculist.h>
+#include <linux/export.h>
+#include <linux/ioport.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/io.h>
+#include "nfit_test.h"
+
+static LIST_HEAD(iomap_head);
+
+static struct iomap_ops {
+	nfit_test_lookup_fn nfit_test_lookup;
+	struct list_head list;
+} iomap_ops = {
+	.list = LIST_HEAD_INIT(iomap_ops.list),
+};
+
+void nfit_test_setup(nfit_test_lookup_fn lookup)
+{
+	iomap_ops.nfit_test_lookup = lookup;
+	list_add_rcu(&iomap_ops.list, &iomap_head);
+}
+EXPORT_SYMBOL(nfit_test_setup);
+
+void nfit_test_teardown(void)
+{
+	list_del_rcu(&iomap_ops.list);
+	synchronize_rcu();
+}
+EXPORT_SYMBOL(nfit_test_teardown);
+
+static struct nfit_test_resource *get_nfit_res(resource_size_t resource)
+{
+	struct iomap_ops *ops;
+
+	ops = list_first_or_null_rcu(&iomap_head, typeof(*ops), list);
+	if (ops)
+		return ops->nfit_test_lookup(resource);
+	return NULL;
+}
+
+void __iomem *__nfit_test_ioremap(resource_size_t offset, unsigned long size,
+		void __iomem *(*fallback_fn)(resource_size_t, unsigned long))
+{
+	struct nfit_test_resource *nfit_res;
+
+	rcu_read_lock();
+	nfit_res = get_nfit_res(offset);
+	rcu_read_unlock();
+	if (nfit_res)
+		return (void __iomem *) nfit_res->buf + offset
+			- nfit_res->res->start;
+	return fallback_fn(offset, size);
+}
+
+void __iomem *__wrap_ioremap_cache(resource_size_t offset, unsigned long size)
+{
+	return __nfit_test_ioremap(offset, size, ioremap_cache);
+}
+EXPORT_SYMBOL(__wrap_ioremap_cache);
+
+void __iomem *__wrap_ioremap_nocache(resource_size_t offset, unsigned long size)
+{
+	return __nfit_test_ioremap(offset, size, ioremap_nocache);
+}
+EXPORT_SYMBOL(__wrap_ioremap_nocache);
+
+void __wrap_iounmap(volatile void __iomem *addr)
+{
+	struct nfit_test_resource *nfit_res;
+
+	rcu_read_lock();
+	nfit_res = get_nfit_res((unsigned long) addr);
+	rcu_read_unlock();
+	if (nfit_res)
+		return;
+	return iounmap(addr);
+}
+EXPORT_SYMBOL(__wrap_iounmap);
+
+struct resource *__wrap___request_region(struct resource *parent,
+		resource_size_t start, resource_size_t n, const char *name,
+		int flags)
+{
+	struct nfit_test_resource *nfit_res;
+
+	if (parent == &iomem_resource) {
+		rcu_read_lock();
+		nfit_res = get_nfit_res(start);
+		rcu_read_unlock();
+		if (nfit_res) {
+			struct resource *res = nfit_res->res + 1;
+
+			if (start + n > nfit_res->res->start
+					+ resource_size(nfit_res->res)) {
+				pr_debug("%s: start: %llx n: %llx overflow: %pr\n",
+						__func__, start, n,
+						nfit_res->res);
+				return NULL;
+			}
+
+			res->start = start;
+			res->end = start + n - 1;
+			res->name = name;
+			res->flags = resource_type(parent);
+			res->flags |= IORESOURCE_BUSY | flags;
+			pr_debug("%s: %pr\n", __func__, res);
+			return res;
+		}
+	}
+	return __request_region(parent, start, n, name, flags);
+}
+EXPORT_SYMBOL(__wrap___request_region);
+
+void __wrap___release_region(struct resource *parent, resource_size_t start,
+				resource_size_t n)
+{
+	struct nfit_test_resource *nfit_res;
+
+	if (parent == &iomem_resource) {
+		rcu_read_lock();
+		nfit_res = get_nfit_res(start);
+		rcu_read_unlock();
+		if (nfit_res) {
+			struct resource *res = nfit_res->res + 1;
+
+			if (start != res->start || resource_size(res) != n)
+				pr_info("%s: start: %llx n: %llx mismatch: %pr\n",
+						__func__, start, n, res);
+			else
+				memset(res, 0, sizeof(*res));
+			return;
+		}
+	}
+	__release_region(parent, start, n);
+}
+EXPORT_SYMBOL(__wrap___release_region);
+
+MODULE_LICENSE("GPL v2");
diff --git a/drivers/block/nd/test/nfit.c b/drivers/block/nd/test/nfit.c
new file mode 100644
index 000000000000..973e46c06abc
--- /dev/null
+++ b/drivers/block/nd/test/nfit.c
@@ -0,0 +1,1171 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/platform_device.h>
+#include <linux/dma-mapping.h>
+#include <linux/device.h>
+#include <linux/module.h>
+#include <linux/libnd.h>
+#include <linux/ndctl.h>
+#include <linux/sizes.h>
+#include <linux/slab.h>
+#include "../../../acpi/nfit.h"
+#include "nfit_test.h"
+#include "../nd.h"
+
+/*
+ * Generate an NFIT table to describe the following topology:
+ *
+ * BUS0: Interleaved PMEM regions, and aliasing with BLK regions
+ *
+ *                     (a)                       (b)            DIMM   BLK-REGION
+ *           +----------+--------------+----------+---------+
+ * +------+  |  blk2.0  |     pm0.0    |  blk2.1  |  pm1.0  |    0      region2
+ * | imc0 +--+- - - - - region0 - - - -+----------+         +
+ * +--+---+  |  blk3.0  |     pm0.0    |  blk3.1  |  pm1.0  |    1      region3
+ *    |      +----------+--------------v----------v         v
+ * +--+---+                            |                    |
+ * | cpu0 |                                    region1
+ * +--+---+                            |                    |
+ *    |      +-------------------------^----------^         ^
+ * +--+---+  |                 blk4.0             |  pm1.0  |    2      region4
+ * | imc1 +--+-------------------------+----------+         +
+ * +------+  |                 blk5.0             |  pm1.0  |    3      region5
+ *           +-------------------------+----------+-+-------+
+ *
+ * *) In this layout we have four dimms and two memory controllers in one
+ *    socket.  Each unique interface (BLK or PMEM) to DPA space
+ *    is identified by a region device with a dynamically assigned id.
+ *
+ * *) The first portion of dimm0 and dimm1 are interleaved as REGION0.
+ *    A single PMEM namespace "pm0.0" is created using half of the
+ *    REGION0 SPA-range.  REGION0 spans dimm0 and dimm1.  PMEM namespaces
+ *    allocate from the bottom of a region.  The unallocated
+ *    portion of REGION0 aliases with REGION2 and REGION3.  That
+ *    unallocated capacity is reclaimed as BLK namespaces ("blk2.0" and
+ *    "blk3.0") starting at the base of each DIMM to offset (a) in those
+ *    DIMMs.  "pm0.0", "blk2.0" and "blk3.0" are free-form readable
+ *    names that can be assigned to a namespace.
+ *
+ * *) In the last portion of dimm0 and dimm1 we have an interleaved
+ *    SPA range, REGION1, that spans those two dimms as well as dimm2
+ *    and dimm3.  Some of REGION1 is allocated to a PMEM namespace named
+ *    "pm1.0"; the rest is reclaimed in 4 BLK namespaces (one for each
+ *    dimm in the interleave set), "blk2.1", "blk3.1", "blk4.0", and
+ *    "blk5.0".
+ *
+ * *) The portions of dimm2 and dimm3 that do not participate in the
+ *    REGION1 interleaved SPA range (i.e. the DPA addresses below offset
+ *    (b)) are also included in the "blk4.0" and "blk5.0" namespaces.
+ *    Note that BLK namespaces need not be contiguous in DPA-space, and
+ *    can consume aliased capacity from multiple interleave sets.
+ *
+ * BUS1: Legacy NVDIMM (single contiguous range)
+ *
+ *  region2
+ * +---------------------+
+ * |---------------------|
+ * ||       pm2.0       ||
+ * |---------------------|
+ * +---------------------+
+ *
+ * *) An NFIT table may describe a simple system-physical-address range
+ *    with no BLK aliasing.  This type of region may optionally
+ *    reference an NVDIMM.
+ */
+enum {
+	NUM_PM  = 2,
+	NUM_DCR = 4,
+	NUM_BDW = NUM_DCR,
+	NUM_SPA = NUM_PM + NUM_DCR + NUM_BDW,
+	NUM_MEM = NUM_DCR + NUM_BDW + 2 /* spa0 iset */ + 4 /* spa1 iset */,
+	DIMM_SIZE = SZ_32M,
+	LABEL_SIZE = SZ_128K,
+	SPA0_SIZE = DIMM_SIZE,
+	SPA1_SIZE = DIMM_SIZE*2,
+	SPA2_SIZE = DIMM_SIZE,
+	BDW_SIZE = 64 << 8,
+	DCR_SIZE = 12,
+	NUM_NFITS = 2, /* permit testing multiple NFITs per system */
+};
+
+struct nfit_test_dcr {
+	__le64 bdw_addr;
+	__le32 bdw_status;
+	__u8 aperature[BDW_SIZE];
+};
+
+static u32 handle[NUM_DCR] = {
+	[0] = NFIT_DIMM_HANDLE(0, 0, 0, 0, 0),
+	[1] = NFIT_DIMM_HANDLE(0, 0, 0, 0, 1),
+	[2] = NFIT_DIMM_HANDLE(0, 0, 1, 0, 0),
+	[3] = NFIT_DIMM_HANDLE(0, 0, 1, 0, 1),
+};
+
+struct nfit_test {
+	struct acpi_nfit_desc acpi_desc;
+	struct platform_device pdev;
+	struct list_head resources;
+	void *nfit_buf;
+	dma_addr_t nfit_dma;
+	size_t nfit_size;
+	int num_dcr;
+	int num_pm;
+	void **dimm;
+	dma_addr_t *dimm_dma;
+	void **label;
+	dma_addr_t *label_dma;
+	void **spa_set;
+	dma_addr_t *spa_set_dma;
+	struct nfit_test_dcr **dcr;
+	dma_addr_t *dcr_dma;
+	int (*alloc)(struct nfit_test *t);
+	void (*setup)(struct nfit_test *t);
+};
+
+static struct nfit_test *to_nfit_test(struct device *dev)
+{
+	struct platform_device *pdev = to_platform_device(dev);
+
+	return container_of(pdev, struct nfit_test, pdev);
+}
+
+static int nfit_test_ctl(struct nd_bus_descriptor *nd_desc,
+		struct nd_dimm *nd_dimm, unsigned int cmd, void *buf,
+		unsigned int buf_len)
+{
+	struct acpi_nfit_desc *acpi_desc = to_acpi_desc(nd_desc);
+	struct nfit_test *t = container_of(acpi_desc, typeof(*t), acpi_desc);
+	struct nfit_mem *nfit_mem = nd_dimm_provider_data(nd_dimm);
+	int i, rc;
+
+	if (!nfit_mem || !test_bit(cmd, &nfit_mem->dsm_mask))
+		return -ENXIO;
+
+	/* lookup label space for the given dimm */
+	for (i = 0; i < ARRAY_SIZE(handle); i++)
+		if (__to_nfit_memdev(nfit_mem)->device_handle == handle[i])
+			break;
+	if (i >= ARRAY_SIZE(handle))
+		return -ENXIO;
+
+	switch (cmd) {
+	case ND_CMD_GET_CONFIG_SIZE: {
+		struct nd_cmd_get_config_size *nd_cmd = buf;
+
+		if (buf_len < sizeof(*nd_cmd))
+			return -EINVAL;
+		nd_cmd->status = 0;
+		nd_cmd->config_size = LABEL_SIZE;
+		nd_cmd->max_xfer = SZ_4K;
+		rc = 0;
+		break;
+	}
+	case ND_CMD_GET_CONFIG_DATA: {
+		struct nd_cmd_get_config_data_hdr *nd_cmd = buf;
+		unsigned int len, offset = nd_cmd->in_offset;
+
+		if (buf_len < sizeof(*nd_cmd))
+			return -EINVAL;
+		if (offset >= LABEL_SIZE)
+			return -EINVAL;
+		if (nd_cmd->in_length + sizeof(*nd_cmd) > buf_len)
+			return -EINVAL;
+
+		nd_cmd->status = 0;
+		len = min(nd_cmd->in_length, LABEL_SIZE - offset);
+		memcpy(nd_cmd->out_buf, t->label[i] + offset, len);
+		rc = buf_len - sizeof(*nd_cmd) - len;
+		break;
+	}
+	case ND_CMD_SET_CONFIG_DATA: {
+		struct nd_cmd_set_config_hdr *nd_cmd = buf;
+		unsigned int len, offset = nd_cmd->in_offset;
+		u32 *status;
+
+		if (buf_len < sizeof(*nd_cmd))
+			return -EINVAL;
+		if (offset >= LABEL_SIZE)
+			return -EINVAL;
+		if (nd_cmd->in_length + sizeof(*nd_cmd) + 4 > buf_len)
+			return -EINVAL;
+
+		status = buf + nd_cmd->in_length + sizeof(*nd_cmd);
+		*status = 0;
+		len = min(nd_cmd->in_length, LABEL_SIZE - offset);
+		memcpy(t->label[i] + offset, nd_cmd->in_buf, len);
+		rc = buf_len - sizeof(*nd_cmd) - (len + 4);
+		break;
+	}
+	default:
+		return -ENOTTY;
+	}
+
+	return rc;
+}
+
+static DEFINE_SPINLOCK(nfit_test_lock);
+static struct nfit_test *instances[NUM_NFITS];
+
+static void release_nfit_res(void *data)
+{
+	struct nfit_test_resource *nfit_res = data;
+	struct resource *res = nfit_res->res;
+
+	spin_lock(&nfit_test_lock);
+	list_del(&nfit_res->list);
+	spin_unlock(&nfit_test_lock);
+
+	if (is_vmalloc_addr(nfit_res->buf))
+		vfree(nfit_res->buf);
+	else
+		dma_free_coherent(nfit_res->dev, resource_size(res),
+				nfit_res->buf, res->start);
+	kfree(res);
+	kfree(nfit_res);
+}
+
+static void *__test_alloc(struct nfit_test *t, size_t size, dma_addr_t *dma,
+		void *buf)
+{
+	struct device *dev = &t->pdev.dev;
+	struct resource *res = kzalloc(sizeof(*res) * 2, GFP_KERNEL);
+	struct nfit_test_resource *nfit_res = kzalloc(sizeof(*nfit_res),
+			GFP_KERNEL);
+	int rc;
+
+	if (!res || !buf || !nfit_res)
+		goto err;
+	rc = devm_add_action(dev, release_nfit_res, nfit_res);
+	if (rc)
+		goto err;
+	INIT_LIST_HEAD(&nfit_res->list);
+	memset(buf, 0, size);
+	nfit_res->dev = dev;
+	nfit_res->buf = buf;
+	nfit_res->res = res;
+	res->start = *dma;
+	res->end = *dma + size - 1;
+	res->name = "NFIT";
+	spin_lock(&nfit_test_lock);
+	list_add(&nfit_res->list, &t->resources);
+	spin_unlock(&nfit_test_lock);
+
+	return nfit_res->buf;
+ err:
+	if (buf && !is_vmalloc_addr(buf))
+		dma_free_coherent(dev, size, buf, *dma);
+	else if (buf)
+		vfree(buf);
+	kfree(res);
+	kfree(nfit_res);
+	return NULL;
+}
+
+static void *test_alloc(struct nfit_test *t, size_t size, dma_addr_t *dma)
+{
+	void *buf = vmalloc(size);
+
+	*dma = (unsigned long) buf;
+	return __test_alloc(t, size, dma, buf);
+}
+
+static void *test_alloc_coherent(struct nfit_test *t, size_t size, dma_addr_t *dma)
+{
+	struct device *dev = &t->pdev.dev;
+	void *buf = dma_alloc_coherent(dev, size, dma, GFP_KERNEL);
+
+	return __test_alloc(t, size, dma, buf);
+}
+
+static struct nfit_test_resource *nfit_test_lookup(resource_size_t addr)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(instances); i++) {
+		struct nfit_test_resource *n, *nfit_res = NULL;
+		struct nfit_test *t = instances[i];
+
+		if (!t)
+			continue;
+		spin_lock(&nfit_test_lock);
+		list_for_each_entry(n, &t->resources, list) {
+			if (addr >= n->res->start && (addr < n->res->start
+						+ resource_size(n->res))) {
+				nfit_res = n;
+				break;
+			} else if (addr >= (unsigned long) n->buf
+					&& (addr < (unsigned long) n->buf
+						+ resource_size(n->res))) {
+				nfit_res = n;
+				break;
+			}
+		}
+		spin_unlock(&nfit_test_lock);
+		if (nfit_res)
+			return nfit_res;
+	}
+
+	return NULL;
+}
+
+static int nfit_test0_alloc(struct nfit_test *t)
+{
+	size_t nfit_size = sizeof(struct acpi_table_nfit)
+			+ sizeof(struct acpi_nfit_system_address) * NUM_SPA
+			+ sizeof(struct acpi_nfit_memory_map) * NUM_MEM
+			+ sizeof(struct acpi_nfit_control_region) * NUM_DCR
+			+ sizeof(struct acpi_nfit_data_region) * NUM_BDW;
+	int i;
+
+	t->nfit_buf = test_alloc(t, nfit_size, &t->nfit_dma);
+	if (!t->nfit_buf)
+		return -ENOMEM;
+	t->nfit_size = nfit_size;
+
+	t->spa_set[0] = test_alloc_coherent(t, SPA0_SIZE, &t->spa_set_dma[0]);
+	if (!t->spa_set[0])
+		return -ENOMEM;
+
+	t->spa_set[1] = test_alloc_coherent(t, SPA1_SIZE, &t->spa_set_dma[1]);
+	if (!t->spa_set[1])
+		return -ENOMEM;
+
+	for (i = 0; i < NUM_DCR; i++) {
+		t->dimm[i] = test_alloc(t, DIMM_SIZE, &t->dimm_dma[i]);
+		if (!t->dimm[i])
+			return -ENOMEM;
+
+		t->label[i] = test_alloc(t, LABEL_SIZE, &t->label_dma[i]);
+		if (!t->label[i])
+			return -ENOMEM;
+		sprintf(t->label[i], "label%d", i);
+	}
+
+	for (i = 0; i < NUM_DCR; i++) {
+		t->dcr[i] = test_alloc(t, LABEL_SIZE, &t->dcr_dma[i]);
+		if (!t->dcr[i])
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static int nfit_test1_alloc(struct nfit_test *t)
+{
+	size_t nfit_size = sizeof(struct acpi_table_nfit)
+		+ sizeof(struct acpi_nfit_system_address) + sizeof(struct acpi_nfit_memory_map)
+		+ sizeof(struct acpi_nfit_control_region);
+
+	t->nfit_buf = test_alloc(t, nfit_size, &t->nfit_dma);
+	if (!t->nfit_buf)
+		return -ENOMEM;
+	t->nfit_size = nfit_size;
+
+	t->spa_set[0] = test_alloc_coherent(t, SPA2_SIZE, &t->spa_set_dma[0]);
+	if (!t->spa_set[0])
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void nfit_test_init_header(struct acpi_table_nfit *nfit, size_t size)
+{
+	memcpy(nfit->header.signature, ACPI_SIG_NFIT, 4);
+	nfit->header.length = size;
+	nfit->header.revision = 1;
+	memcpy(nfit->header.oem_id, "LIBND", 6);
+	memcpy(nfit->header.oem_table_id, "TEST", 5);
+	nfit->header.oem_revision = 1;
+	memcpy(nfit->header.asl_compiler_id, "TST", 4);
+	nfit->header.asl_compiler_revision = 1;
+}
+
+static void nfit_test0_setup(struct nfit_test *t)
+{
+	struct nd_bus_descriptor *nd_desc;
+	struct acpi_nfit_desc *acpi_desc;
+	struct acpi_nfit_memory_map *memdev;
+	void *nfit_buf = t->nfit_buf;
+	size_t size = t->nfit_size;
+	struct acpi_nfit_system_address *spa;
+	struct acpi_nfit_control_region *dcr;
+	struct acpi_nfit_data_region *bdw;
+	unsigned int offset;
+
+	nfit_test_init_header(nfit_buf, size);
+
+	/*
+	 * spa0 (interleave first half of dimm0 and dimm1, note storage
+	 * does not actually alias the related block-data-window
+	 * regions)
+	 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit);
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_PM), 16);
+	spa->range_index = 0+1;
+	spa->address = t->spa_set_dma[0];
+	spa->length = SPA0_SIZE;
+
+	/*
+	 * spa1 (interleave last half of the 4 DIMMS, note storage
+	 * does not actually alias the related block-data-window
+	 * regions)
+	 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa);
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_PM), 16);
+	spa->range_index = 1+1;
+	spa->address = t->spa_set_dma[1];
+	spa->length = SPA1_SIZE;
+
+	/* spa2 (dcr0) dimm0 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 2;
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_DCR), 16);
+	spa->range_index = 2+1;
+	spa->address = t->dcr_dma[0];
+	spa->length = DCR_SIZE;
+
+	/* spa3 (dcr1) dimm1 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 3;
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_DCR), 16);
+	spa->range_index = 3+1;
+	spa->address = t->dcr_dma[1];
+	spa->length = DCR_SIZE;
+
+	/* spa4 (dcr2) dimm2 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 4;
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_DCR), 16);
+	spa->range_index = 4+1;
+	spa->address = t->dcr_dma[2];
+	spa->length = DCR_SIZE;
+
+	/* spa5 (dcr3) dimm3 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 5;
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_DCR), 16);
+	spa->range_index = 5+1;
+	spa->address = t->dcr_dma[3];
+	spa->length = DCR_SIZE;
+
+	/* spa6 (bdw for dcr0) dimm0 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 6;
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_BDW), 16);
+	spa->range_index = 6+1;
+	spa->address = t->dimm_dma[0];
+	spa->length = DIMM_SIZE;
+
+	/* spa7 (bdw for dcr1) dimm1 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 7;
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_BDW), 16);
+	spa->range_index = 7+1;
+	spa->address = t->dimm_dma[1];
+	spa->length = DIMM_SIZE;
+
+	/* spa8 (bdw for dcr2) dimm2 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 8;
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_BDW), 16);
+	spa->range_index = 8+1;
+	spa->address = t->dimm_dma[2];
+	spa->length = DIMM_SIZE;
+
+	/* spa9 (bdw for dcr3) dimm3 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 9;
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_BDW), 16);
+	spa->range_index = 9+1;
+	spa->address = t->dimm_dma[3];
+	spa->length = DIMM_SIZE;
+
+	offset = sizeof(struct acpi_table_nfit) + sizeof(*spa) * 10;
+	/* mem-region0 (spa0, dimm0) */
+	memdev = nfit_buf + offset;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[0];
+	memdev->physical_id = 0;
+	memdev->region_id = 0;
+	memdev->range_index = 0+1;
+	memdev->region_index = 0+1;
+	memdev->region_size = SPA0_SIZE/2;
+	memdev->region_offset = t->spa_set_dma[0];
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 2;
+
+	/* mem-region1 (spa0, dimm1) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map);
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[1];
+	memdev->physical_id = 1;
+	memdev->region_id = 0;
+	memdev->range_index = 0+1;
+	memdev->region_index = 1+1;
+	memdev->region_size = SPA0_SIZE/2;
+	memdev->region_offset = t->spa_set_dma[0] + SPA0_SIZE/2;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 2;
+
+	/* mem-region2 (spa1, dimm0) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 2;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[0];
+	memdev->physical_id = 0;
+	memdev->region_id = 1;
+	memdev->range_index = 1+1;
+	memdev->region_index = 0+1;
+	memdev->region_size = SPA1_SIZE/4;
+	memdev->region_offset = t->spa_set_dma[1];
+	memdev->address = SPA0_SIZE/2;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 4;
+
+	/* mem-region3 (spa1, dimm1) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 3;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[1];
+	memdev->physical_id = 1;
+	memdev->region_id = 1;
+	memdev->range_index = 1+1;
+	memdev->region_index = 1+1;
+	memdev->region_size = SPA1_SIZE/4;
+	memdev->region_offset = t->spa_set_dma[1] + SPA1_SIZE/4;
+	memdev->address = SPA0_SIZE/2;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 4;
+
+	/* mem-region4 (spa1, dimm2) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 4;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[2];
+	memdev->physical_id = 2;
+	memdev->region_id = 0;
+	memdev->range_index = 1+1;
+	memdev->region_index = 2+1;
+	memdev->region_size = SPA1_SIZE/4;
+	memdev->region_offset = t->spa_set_dma[1] + 2*SPA1_SIZE/4;
+	memdev->address = SPA0_SIZE/2;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 4;
+
+	/* mem-region5 (spa1, dimm3) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 5;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[3];
+	memdev->physical_id = 3;
+	memdev->region_id = 0;
+	memdev->range_index = 1+1;
+	memdev->region_index = 3+1;
+	memdev->region_size = SPA1_SIZE/4;
+	memdev->region_offset = t->spa_set_dma[1] + 3*SPA1_SIZE/4;
+	memdev->address = SPA0_SIZE/2;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 4;
+
+	/* mem-region6 (spa/dcr0, dimm0) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 6;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[0];
+	memdev->physical_id = 0;
+	memdev->region_id = 0;
+	memdev->range_index = 2+1;
+	memdev->region_index = 0+1;
+	memdev->region_size = 0;
+	memdev->region_offset = 0;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 1;
+
+	/* mem-region7 (spa/dcr1, dimm1) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 7;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[1];
+	memdev->physical_id = 1;
+	memdev->region_id = 0;
+	memdev->range_index = 3+1;
+	memdev->region_index = 1+1;
+	memdev->region_size = 0;
+	memdev->region_offset = 0;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 1;
+
+	/* mem-region8 (spa/dcr2, dimm2) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 8;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[2];
+	memdev->physical_id = 2;
+	memdev->region_id = 0;
+	memdev->range_index = 4+1;
+	memdev->region_index = 2+1;
+	memdev->region_size = 0;
+	memdev->region_offset = 0;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 1;
+
+	/* mem-region9 (spa/dcr3, dimm3) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 9;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[3];
+	memdev->physical_id = 3;
+	memdev->region_id = 0;
+	memdev->range_index = 5+1;
+	memdev->region_index = 3+1;
+	memdev->region_size = 0;
+	memdev->region_offset = 0;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 1;
+
+	/* mem-region10 (spa/bdw0, dimm0) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 10;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[0];
+	memdev->physical_id = 0;
+	memdev->region_id = 0;
+	memdev->range_index = 6+1;
+	memdev->region_index = 0+1;
+	memdev->region_size = 0;
+	memdev->region_offset = 0;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 1;
+
+	/* mem-region11 (spa/bdw1, dimm1) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 11;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[1];
+	memdev->physical_id = 1;
+	memdev->region_id = 0;
+	memdev->range_index = 7+1;
+	memdev->region_index = 1+1;
+	memdev->region_size = 0;
+	memdev->region_offset = 0;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 1;
+
+	/* mem-region12 (spa/bdw2, dimm2) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 12;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[2];
+	memdev->physical_id = 2;
+	memdev->region_id = 0;
+	memdev->range_index = 8+1;
+	memdev->region_index = 2+1;
+	memdev->region_size = 0;
+	memdev->region_offset = 0;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 1;
+
+	/* mem-region13 (spa/bdw3, dimm3) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 13;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[3];
+	memdev->physical_id = 3;
+	memdev->region_id = 0;
+	memdev->range_index = 9+1;
+	memdev->region_index = 3+1;
+	memdev->region_size = 0;
+	memdev->region_offset = 0;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 1;
+
+	offset = offset + sizeof(struct acpi_nfit_memory_map) * 14;
+	/* dcr-descriptor0 */
+	dcr = nfit_buf + offset;
+	dcr->header.type = ACPI_NFIT_TYPE_CONTROL_REGION;
+	dcr->header.length = sizeof(struct acpi_nfit_control_region);
+	dcr->region_index = 0+1;
+	dcr->vendor_id = 0xabcd;
+	dcr->device_id = 0;
+	dcr->revision_id = 1;
+	dcr->serial_number = ~handle[0];
+	dcr->windows = 1;
+	dcr->window_size = DCR_SIZE;
+	dcr->command_offset = 0;
+	dcr->command_size = 8;
+	dcr->status_offset = 8;
+	dcr->status_size = 4;
+
+	/* dcr-descriptor1 */
+	dcr = nfit_buf + offset + sizeof(struct acpi_nfit_control_region);
+	dcr->header.type = ACPI_NFIT_TYPE_CONTROL_REGION;
+	dcr->header.length = sizeof(struct acpi_nfit_control_region);
+	dcr->region_index = 1+1;
+	dcr->vendor_id = 0xabcd;
+	dcr->device_id = 0;
+	dcr->revision_id = 1;
+	dcr->serial_number = ~handle[1];
+	dcr->windows = 1;
+	dcr->window_size = DCR_SIZE;
+	dcr->command_offset = 0;
+	dcr->command_size = 8;
+	dcr->status_offset = 8;
+	dcr->status_size = 4;
+
+	/* dcr-descriptor2 */
+	dcr = nfit_buf + offset + sizeof(struct acpi_nfit_control_region) * 2;
+	dcr->header.type = ACPI_NFIT_TYPE_CONTROL_REGION;
+	dcr->header.length = sizeof(struct acpi_nfit_control_region);
+	dcr->region_index = 2+1;
+	dcr->vendor_id = 0xabcd;
+	dcr->device_id = 0;
+	dcr->revision_id = 1;
+	dcr->serial_number = ~handle[2];
+	dcr->windows = 1;
+	dcr->window_size = DCR_SIZE;
+	dcr->command_offset = 0;
+	dcr->command_size = 8;
+	dcr->status_offset = 8;
+	dcr->status_size = 4;
+
+	/* dcr-descriptor3 */
+	dcr = nfit_buf + offset + sizeof(struct acpi_nfit_control_region) * 3;
+	dcr->header.type = ACPI_NFIT_TYPE_CONTROL_REGION;
+	dcr->header.length = sizeof(struct acpi_nfit_control_region);
+	dcr->region_index = 3+1;
+	dcr->vendor_id = 0xabcd;
+	dcr->device_id = 0;
+	dcr->revision_id = 1;
+	dcr->serial_number = ~handle[3];
+	dcr->windows = 1;
+	dcr->window_size = DCR_SIZE;
+	dcr->command_offset = 0;
+	dcr->command_size = 8;
+	dcr->status_offset = 8;
+	dcr->status_size = 4;
+
+	offset = offset + sizeof(struct acpi_nfit_control_region) * 4;
+	/* bdw0 (spa/dcr0, dimm0) */
+	bdw = nfit_buf + offset;
+	bdw->header.type = ACPI_NFIT_TYPE_DATA_REGION;
+	bdw->header.length = sizeof(struct acpi_nfit_data_region);
+	bdw->region_index = 0+1;
+	bdw->windows = 1;
+	bdw->offset = 0;
+	bdw->size = BDW_SIZE;
+	bdw->capacity = DIMM_SIZE;
+	bdw->start_address = 0;
+
+	/* bdw1 (spa/dcr1, dimm1) */
+	bdw = nfit_buf + offset + sizeof(struct acpi_nfit_data_region);
+	bdw->header.type = ACPI_NFIT_TYPE_DATA_REGION;
+	bdw->header.length = sizeof(struct acpi_nfit_data_region);
+	bdw->region_index = 1+1;
+	bdw->windows = 1;
+	bdw->offset = 0;
+	bdw->size = BDW_SIZE;
+	bdw->capacity = DIMM_SIZE;
+	bdw->start_address = 0;
+
+	/* bdw2 (spa/dcr2, dimm2) */
+	bdw = nfit_buf + offset + sizeof(struct acpi_nfit_data_region) * 2;
+	bdw->header.type = ACPI_NFIT_TYPE_DATA_REGION;
+	bdw->header.length = sizeof(struct acpi_nfit_data_region);
+	bdw->region_index = 2+1;
+	bdw->windows = 1;
+	bdw->offset = 0;
+	bdw->size = BDW_SIZE;
+	bdw->capacity = DIMM_SIZE;
+	bdw->start_address = 0;
+
+	/* bdw3 (spa/dcr3, dimm3) */
+	bdw = nfit_buf + offset + sizeof(struct acpi_nfit_data_region) * 3;
+	bdw->header.type = ACPI_NFIT_TYPE_DATA_REGION;
+	bdw->header.length = sizeof(struct acpi_nfit_data_region);
+	bdw->region_index = 3+1;
+	bdw->windows = 1;
+	bdw->offset = 0;
+	bdw->size = BDW_SIZE;
+	bdw->capacity = DIMM_SIZE;
+	bdw->start_address = 0;
+
+	acpi_desc = &t->acpi_desc;
+	set_bit(ND_CMD_GET_CONFIG_SIZE, &acpi_desc->dimm_dsm_force_en);
+	set_bit(ND_CMD_GET_CONFIG_DATA, &acpi_desc->dimm_dsm_force_en);
+	set_bit(ND_CMD_SET_CONFIG_DATA, &acpi_desc->dimm_dsm_force_en);
+	nd_desc = &acpi_desc->nd_desc;
+	nd_desc->ndctl = nfit_test_ctl;
+}
+
+static void nfit_test1_setup(struct nfit_test *t)
+{
+	size_t size = t->nfit_size, offset;
+	void *nfit_buf = t->nfit_buf;
+	struct acpi_nfit_memory_map *memdev;
+	struct acpi_nfit_control_region *dcr;
+	struct acpi_nfit_system_address *spa;
+
+	nfit_test_init_header(nfit_buf, size);
+
+	offset = sizeof(struct acpi_table_nfit);
+	/* spa0 (flat range with no bdw aliasing) */
+	spa = nfit_buf + offset;
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_PM), 16);
+	spa->range_index = 0+1;
+	spa->address = t->spa_set_dma[0];
+	spa->length = SPA2_SIZE;
+
+	offset += sizeof(*spa);
+	/* mem-region0 (spa0, dimm0) */
+	memdev = nfit_buf + offset;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = 0;
+	memdev->physical_id = 0;
+	memdev->region_id = 0;
+	memdev->range_index = 0+1;
+	memdev->region_index = 0+1;
+	memdev->region_size = SPA2_SIZE;
+	memdev->region_offset = 0;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 1;
+
+	offset += sizeof(*memdev);
+	/* dcr-descriptor0 */
+	dcr = nfit_buf + offset;
+	dcr->header.type = ACPI_NFIT_TYPE_CONTROL_REGION;
+	dcr->header.length = sizeof(struct acpi_nfit_control_region);
+	dcr->region_index = 0+1;
+	dcr->vendor_id = 0xabcd;
+	dcr->device_id = 0;
+	dcr->revision_id = 1;
+	dcr->serial_number = ~0;
+	dcr->code = 0x201;
+	dcr->windows = 0;
+	dcr->window_size = 0;
+	dcr->command_offset = 0;
+	dcr->command_size = 0;
+	dcr->status_offset = 0;
+	dcr->status_size = 0;
+}
+
+static int nfit_test_blk_region_enable(struct nd_bus *nd_bus, struct device *dev)
+{
+	struct nd_blk_region *ndbr = to_nd_blk_region(dev);
+	struct acpi_nfit_system_address *spa_bdw;
+	struct nfit_blk_mmio *mmio;
+	struct nfit_blk *nfit_blk;
+	struct nfit_mem *nfit_mem;
+	struct nd_dimm *nd_dimm;
+
+	nd_dimm = nd_blk_region_to_dimm(ndbr);
+	nfit_mem = nd_dimm_provider_data(nd_dimm);
+	if (!nfit_mem || !nfit_mem->dcr || !nfit_mem->bdw) {
+		dev_dbg(dev, "%s: missing%s%s%s\n", __func__,
+				nfit_mem ? "" : " nfit_mem",
+				(nfit_mem && nfit_mem->dcr) ? "" : " dcr",
+				(nfit_mem && nfit_mem->bdw) ? "" : " bdw");
+		return -ENXIO;
+	}
+
+	nfit_blk = devm_kzalloc(dev, sizeof(*nfit_blk), GFP_KERNEL);
+	if (!nfit_blk)
+		return -ENOMEM;
+	nd_blk_region_set_provider_data(ndbr, nfit_blk);
+	nfit_blk->nd_region = to_nd_region(dev);
+
+	/* block aperture memory is all we use in nfit_test */
+	nfit_blk->bdw_offset = nfit_mem->bdw->offset;
+	mmio = &nfit_blk->mmio[BDW];
+	spa_bdw = nfit_mem->spa_bdw;
+	mmio->base = __wrap_ioremap_nocache(spa_bdw->address, spa_bdw->length);
+	if (!mmio->base) {
+		release_mem_region(spa_bdw->address, spa_bdw->length);
+		dev_dbg(dev, "%s: %s failed to map bdw\n", __func__,
+				nd_dimm_name(nd_dimm));
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static void nfit_test_blk_region_disable(struct nd_bus *nd_bus, struct device *dev)
+{
+	struct nd_blk_region *ndbr = to_nd_blk_region(dev);
+	struct nfit_blk *nfit_blk = nd_blk_region_provider_data(ndbr);
+	struct acpi_nfit_system_address *spa_bdw;
+	struct nfit_blk_mmio *mmio;
+	struct nfit_mem *nfit_mem;
+	struct nd_dimm *nd_dimm;
+
+	if (!nfit_blk)
+		return; /* never enabled */
+
+	nd_dimm = nd_blk_region_to_dimm(ndbr);
+	nfit_mem = nd_dimm_provider_data(nd_dimm);
+	spa_bdw = nfit_mem->spa_bdw;
+	mmio = &nfit_blk->mmio[BDW];
+	__wrap_iounmap(mmio->base);
+	nd_blk_region_set_provider_data(ndbr, NULL);
+}
+
+static int nfit_test_blk_do_io(struct nd_blk_region *ndbr, void *iobuf,
+		u64 len, int rw, resource_size_t dpa)
+{
+	struct nfit_blk *nfit_blk = ndbr->blk_provider_data;
+	struct nfit_blk_mmio *mmio = &nfit_blk->mmio[BDW];
+	struct nd_region *nd_region = &ndbr->nd_region;
+	struct nfit_test_resource *nfit_res;
+	unsigned int bw;
+
+	nfit_res = nfit_test_lookup((unsigned long) mmio->base);
+	if (!nfit_res) {
+		dev_WARN_ONCE(&nd_region->dev, 1, "no test resource\n");
+		return -EIO;
+	}
+	dev_vdbg(&nd_region->dev, "%s: base: %p offset: %pa\n",
+			__func__, mmio->base, &dpa);
+	bw = nd_region_acquire_lane(nd_region);
+	if (rw)
+		memcpy(nfit_res->buf + dpa, iobuf, len);
+	else
+		memcpy(iobuf, nfit_res->buf + dpa, len);
+	nd_region_release_lane(nd_region, bw);
+
+	return 0;
+}
+
+extern const struct attribute_group *acpi_nfit_attribute_groups[];
+
+static int nfit_test_probe(struct platform_device *pdev)
+{
+	struct nd_bus_descriptor *nd_desc;
+	struct acpi_nfit_desc *acpi_desc;
+	struct device *dev = &pdev->dev;
+	struct nfit_test *nfit_test;
+	int rc;
+
+	nfit_test = to_nfit_test(&pdev->dev);
+
+	/* common alloc */
+	if (nfit_test->num_dcr) {
+		int num = nfit_test->num_dcr;
+
+		nfit_test->dimm = devm_kcalloc(dev, num, sizeof(void *), GFP_KERNEL);
+		nfit_test->dimm_dma = devm_kcalloc(dev, num, sizeof(dma_addr_t), GFP_KERNEL);
+		nfit_test->label = devm_kcalloc(dev, num, sizeof(void *), GFP_KERNEL);
+		nfit_test->label_dma = devm_kcalloc(dev, num, sizeof(dma_addr_t), GFP_KERNEL);
+		nfit_test->dcr = devm_kcalloc(dev, num, sizeof(struct nfit_test_dcr *), GFP_KERNEL);
+		nfit_test->dcr_dma = devm_kcalloc(dev, num, sizeof(dma_addr_t), GFP_KERNEL);
+		if (nfit_test->dimm && nfit_test->dimm_dma && nfit_test->label
+				&& nfit_test->label_dma && nfit_test->dcr
+				&& nfit_test->dcr_dma)
+			/* pass */;
+		else
+			return -ENOMEM;
+	}
+
+	if (nfit_test->num_pm) {
+		int num = nfit_test->num_pm;
+
+		nfit_test->spa_set = devm_kcalloc(dev, num, sizeof(void *), GFP_KERNEL);
+		nfit_test->spa_set_dma = devm_kcalloc(dev, num,
+				sizeof(dma_addr_t), GFP_KERNEL);
+		if (nfit_test->spa_set && nfit_test->spa_set_dma)
+			/* pass */;
+		else
+			return -ENOMEM;
+	}
+
+	/* per-nfit specific alloc */
+	if (nfit_test->alloc(nfit_test))
+		return -ENOMEM;
+
+	nfit_test->setup(nfit_test);
+	acpi_desc = &nfit_test->acpi_desc;
+	acpi_desc->dev = &pdev->dev;
+	acpi_desc->nfit = nfit_test->nfit_buf;
+	acpi_desc->blk_enable = nfit_test_blk_region_enable;
+	acpi_desc->blk_disable = nfit_test_blk_region_disable;
+	acpi_desc->blk_do_io = nfit_test_blk_do_io;
+	nd_desc = &acpi_desc->nd_desc;
+	nd_desc->attr_groups = acpi_nfit_attribute_groups;
+	acpi_desc->nd_bus = nd_bus_register(&pdev->dev, nd_desc);
+	if (!acpi_desc->nd_bus)
+		return -ENXIO;
+
+	rc = acpi_nfit_init(acpi_desc, nfit_test->nfit_size);
+	if (rc) {
+		nd_bus_unregister(acpi_desc->nd_bus);
+		return rc;
+	}
+
+	return 0;
+}
+
+static int nfit_test_remove(struct platform_device *pdev)
+{
+	struct nfit_test *nfit_test = to_nfit_test(&pdev->dev);
+	struct acpi_nfit_desc *acpi_desc = &nfit_test->acpi_desc;
+
+	nd_bus_unregister(acpi_desc->nd_bus);
+
+	return 0;
+}
+
+static void nfit_test_release(struct device *dev)
+{
+	struct nfit_test *nfit_test = to_nfit_test(dev);
+
+	kfree(nfit_test);
+}
+
+static const struct platform_device_id nfit_test_id[] = {
+	{ KBUILD_MODNAME },
+	{ },
+};
+
+static struct platform_driver nfit_test_driver = {
+	.probe = nfit_test_probe,
+	.remove = nfit_test_remove,
+	.driver = {
+		.name = KBUILD_MODNAME,
+	},
+	.id_table = nfit_test_id,
+};
+
+#ifdef CONFIG_CMA_SIZE_MBYTES
+#define CMA_SIZE_MBYTES CONFIG_CMA_SIZE_MBYTES
+#else
+#define CMA_SIZE_MBYTES 0
+#endif
+
+static __init int nfit_test_init(void)
+{
+	int rc, i;
+
+	nfit_test_setup(nfit_test_lookup);
+
+	for (i = 0; i < NUM_NFITS; i++) {
+		struct nfit_test *nfit_test;
+		struct platform_device *pdev;
+		static int once;
+
+		nfit_test = kzalloc(sizeof(*nfit_test), GFP_KERNEL);
+		if (!nfit_test) {
+			rc = -ENOMEM;
+			goto err_register;
+		}
+		INIT_LIST_HEAD(&nfit_test->resources);
+		switch (i) {
+		case 0:
+			nfit_test->num_pm = NUM_PM;
+			nfit_test->num_dcr = NUM_DCR;
+			nfit_test->alloc = nfit_test0_alloc;
+			nfit_test->setup = nfit_test0_setup;
+			break;
+		case 1:
+			nfit_test->num_pm = 1;
+			nfit_test->alloc = nfit_test1_alloc;
+			nfit_test->setup = nfit_test1_setup;
+			break;
+		default:
+			rc = -EINVAL;
+			goto err_register;
+		}
+		pdev = &nfit_test->pdev;
+		pdev->name = KBUILD_MODNAME;
+		pdev->id = i;
+		pdev->dev.release = nfit_test_release;
+		rc = platform_device_register(pdev);
+		if (rc) {
+			put_device(&pdev->dev);
+			goto err_register;
+		}
+
+		rc = dma_coerce_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
+		if (rc)
+			goto err_register;
+
+		instances[i] = nfit_test;
+
+		if (!once++) {
+			dma_addr_t dma;
+			void *buf;
+
+			buf = dma_alloc_coherent(&pdev->dev, SZ_128M, &dma,
+					GFP_KERNEL);
+			if (!buf) {
+				rc = -ENOMEM;
+				dev_warn(&pdev->dev, "need 128M of free cma\n");
+				goto err_register;
+			}
+			dma_free_coherent(&pdev->dev, SZ_128M, buf, dma);
+		}
+	}
+
+	rc = platform_driver_register(&nfit_test_driver);
+	if (rc)
+		goto err_register;
+	return 0;
+
+ err_register:
+	for (i = 0; i < NUM_NFITS; i++)
+		if (instances[i])
+			platform_device_unregister(&instances[i]->pdev);
+	nfit_test_teardown();
+	return rc;
+}
+
+static __exit void nfit_test_exit(void)
+{
+	int i;
+
+	platform_driver_unregister(&nfit_test_driver);
+	for (i = 0; i < NUM_NFITS; i++)
+		platform_device_unregister(&instances[i]->pdev);
+	nfit_test_teardown();
+}
+
+module_init(nfit_test_init);
+module_exit(nfit_test_exit);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Intel Corporation");
diff --git a/drivers/block/nd/test/nfit_test.h b/drivers/block/nd/test/nfit_test.h
new file mode 100644
index 000000000000..4a1215ec45c0
--- /dev/null
+++ b/drivers/block/nd/test/nfit_test.h
@@ -0,0 +1,28 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#ifndef __NFIT_TEST_H__
+#define __NFIT_TEST_H__
+
+struct nfit_test_resource {
+	struct list_head list;
+	struct resource *res;
+	struct device *dev;
+	void *buf;
+};
+
+typedef struct nfit_test_resource *(*nfit_test_lookup_fn)(resource_size_t);
+void __iomem *__wrap_ioremap_nocache(resource_size_t offset, unsigned long size);
+void __wrap_iounmap(volatile void __iomem *addr);
+void nfit_test_setup(nfit_test_lookup_fn lookup);
+void nfit_test_teardown(void);
+#endif



* [PATCH v3 20/21] nfit-test: manufactured NFITs for interface development
@ 2015-05-20 20:58   ` Dan Williams
  0 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2015-05-20 20:58 UTC (permalink / raw)
  To: axboe
  Cc: mingo, linux-nvdimm, neilb, gregkh, Rafael J. Wysocki,
	linux-kernel, Robert Moore, linux-acpi, jmoyer, Lv Zheng, hch

Manually create and register NFITs to describe 2 topologies.  Topology1
is an advanced plausible configuration for BLK/PMEM aliased NVDIMMs.
Topology2 is an example configuration for current platforms that only
ship with a persistent address range.

 Kernel provider "nfit_test.0" produces an NFIT with the following attributes:

                              (a)               (b)           DIMM   BLK-REGION
           +-------------------+--------+--------+--------+
 +------+  |       pm0.0       | blk2.0 | pm1.0  | blk2.1 |    0      region2
 | imc0 +--+- - - region0- - - +--------+        +--------+
 +--+---+  |       pm0.0       | blk3.0 | pm1.0  | blk3.1 |    1      region3
    |      +-------------------+--------v        v--------+
 +--+---+                               |                 |
 | cpu0 |                                     region1
 +--+---+                               |                 |
    |      +----------------------------^        ^--------+
 +--+---+  |           blk4.0           | pm1.0  | blk4.0 |    2      region4
 | imc1 +--+----------------------------|        +--------+
 +------+  |           blk5.0           | pm1.0  | blk5.0 |    3      region5
           +----------------------------+--------+--------+

 *) In this layout we have four dimms and two memory controllers in one
    socket.  Each unique interface ("block" or "pmem") to DPA space
    is identified by a region device with a dynamically assigned id.

 *) The first portions of dimm0 and dimm1 are interleaved as REGION0.
    A single "pmem" namespace is created in the REGION0-"spa"-range
    that spans dimm0 and dimm1 with a user-specified name of "pm0.0".
    Some of that interleaved "spa" range is reclaimed as "bdw"
    accessed space starting at offset (a) into each dimm.  In that
    reclaimed space we create two "bdw" "namespaces" from REGION2 and
    REGION3 where "blk2.0" and "blk3.0" are just human readable names
    that could be set to any user-desired name in the label.

 *) In the last portion of dimm0 and dimm1 we have an interleaved
    "spa" range, REGION1, that spans those two dimms as well as dimm2
    and dimm3.  Some of REGION1 is allocated to a "pmem" namespace named
    "pm1.0"; the rest is reclaimed in 4 "bdw" namespaces (one for each
    dimm in the interleave set), "blk2.1", "blk3.1", "blk4.0", and
    "blk5.0".

 *) The portions of dimm2 and dimm3 that do not participate in the
    REGION1 interleaved "spa" range (i.e. the DPA addresses below
    offset (b)) are also included in the "blk4.0" and "blk5.0"
    namespaces.  Note that this example shows that "bdw" namespaces
    don't need to be contiguous in DPA-space.
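
As a rough illustration of how an interleaved "spa" range such as
REGION1 is described per dimm, the test NFIT gives every dimm in the
set an equal share of the range and a distinct starting offset.  The
sketch below assumes that equal split; struct ex_mapping and
ex_fill_set() are hypothetical names used only for illustration, and
this says nothing about the interleave granularity real hardware uses.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of one dimm's slice of an interleave set. */
struct ex_mapping {
	unsigned int dimm;      /* index of the backing dimm */
	uint64_t region_size;   /* bytes this dimm contributes */
	uint64_t region_offset; /* where this dimm's slice starts */
};

/*
 * Split a range of @spa_size bytes starting at @base evenly across
 * @ways dimms, filling one mapping per dimm.  @map must have room for
 * @ways entries.
 */
static void ex_fill_set(struct ex_mapping *map, uint64_t base,
		uint64_t spa_size, unsigned int ways)
{
	uint64_t slice;
	unsigned int i;

	assert(ways > 0 && spa_size % ways == 0);
	slice = spa_size / ways;
	for (i = 0; i < ways; i++) {
		map[i].dimm = i;
		map[i].region_size = slice;
		map[i].region_offset = base + i * slice;
	}
}
```

Calling ex_fill_set() with ways set to 4 would yield four equal
slices, one per dimm, in the same shape as the four memory-map entries
the test table uses for its second "spa" range.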

 Kernel provider "nfit_test.1" produces an NFIT with the following attributes:

 region2
 +---------------------+
 |---------------------|
 ||       pm2.0       ||
 |---------------------|
 +---------------------+

 *) Describes a simple system-physical-address range with no backing
    dimm or interleave description.
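
Both providers build their NFIT the same way: a table header followed
by sub-tables placed back to back, each one starting with a type and a
length.  A minimal sketch of walking such a buffer follows, assuming a
simplified two-field header.  struct ex_header and ex_count_tables()
are hypothetical names for illustration; they are not the ACPICA
definitions or the driver's parser.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for an NFIT sub-table header (assumption). */
struct ex_header {
	uint16_t type;
	uint16_t length; /* size of the whole sub-table, header included */
};

/*
 * Count the sub-tables of @type found in the @size bytes at @buf.
 * Returns -1 if a sub-table is truncated or reports a length smaller
 * than its own header.
 */
static int ex_count_tables(const void *buf, size_t size, uint16_t type)
{
	const unsigned char *p = buf;
	size_t offset = 0;
	int count = 0;

	assert(buf != NULL);
	while (offset < size) {
		struct ex_header hdr;

		if (size - offset < sizeof(hdr))
			return -1;
		/* copy out the header to avoid unaligned access */
		memcpy(&hdr, p + offset, sizeof(hdr));
		if (hdr.length < sizeof(hdr) || hdr.length > size - offset)
			return -1;
		if (hdr.type == type)
			count++;
		offset += hdr.length;
	}
	return count;
}
```

Advancing by the reported length rather than by a fixed structure size
is what would let a walker step over sub-table types it does not
recognize.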

Cc: <linux-acpi@vger.kernel.org>
Cc: Lv Zheng <lv.zheng@intel.com>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/acpi/nfit.c               |    6 
 drivers/acpi/nfit.h               |   12 
 drivers/block/nd/Kconfig          |   22 +
 drivers/block/nd/Makefile         |    9 
 drivers/block/nd/test/Makefile    |    5 
 drivers/block/nd/test/iomap.c     |  151 +++++
 drivers/block/nd/test/nfit.c      | 1171 +++++++++++++++++++++++++++++++++++++
 drivers/block/nd/test/nfit_test.h |   28 +
 8 files changed, 1402 insertions(+), 2 deletions(-)
 create mode 100644 drivers/block/nd/test/Makefile
 create mode 100644 drivers/block/nd/test/iomap.c
 create mode 100644 drivers/block/nd/test/nfit.c
 create mode 100644 drivers/block/nd/test/nfit_test.h

diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index c4ce498da9eb..1efbd01d4860 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -29,10 +29,11 @@ MODULE_PARM_DESC(force_enable_dimms, "Ignore _STA (ACPI DIMM device) status");
 
 static u8 nfit_uuid[NFIT_UUID_MAX][16];
 
-static const u8 *to_nfit_uuid(enum nfit_uuids id)
+const u8 *to_nfit_uuid(enum nfit_uuids id)
 {
 	return nfit_uuid[id];
 }
+EXPORT_SYMBOL(to_nfit_uuid);
 
 static struct acpi_nfit_desc *to_acpi_nfit_desc(struct nd_bus_descriptor *nd_desc)
 {
@@ -1330,7 +1331,7 @@ static int acpi_nfit_register_regions(struct acpi_nfit_desc *acpi_desc)
 	return 0;
 }
 
-static int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
+int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
 {
 	struct device *dev = acpi_desc->dev;
 	const void *end;
@@ -1369,6 +1370,7 @@ static int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
 
 	return acpi_nfit_register_regions(acpi_desc);
 }
+EXPORT_SYMBOL_GPL(acpi_nfit_init);
 
 static int acpi_nfit_add(struct acpi_device *adev)
 {
diff --git a/drivers/acpi/nfit.h b/drivers/acpi/nfit.h
index 1fc49cc51d4a..eedbd3d79e02 100644
--- a/drivers/acpi/nfit.h
+++ b/drivers/acpi/nfit.h
@@ -37,6 +37,15 @@ enum nfit_uuids {
 	NFIT_UUID_MAX,
 };
 
+#define NFIT_DIMM_HANDLE(node, socket, imc, chan, dimm) \
+       (((node & 0xfff) << 16) | ((socket & 0xf) << 12) \
+        | ((imc & 0xf) << 8) | ((chan & 0xf) << 4) | (dimm & 0xf))
+#define NFIT_DIMM_NODE(handle) ((handle) >> 16 & 0xfff)
+#define NFIT_DIMM_SOCKET(handle) ((handle) >> 12 & 0xf)
+#define NFIT_DIMM_IMC(handle) ((handle) >> 8 & 0xf)
+#define NFIT_DIMM_CHAN(handle) ((handle) >> 4 & 0xf)
+#define NFIT_DIMM_DIMM(handle) ((handle) & 0xf)
+
 struct nfit_spa {
 	struct acpi_nfit_system_address *spa;
 	struct list_head list;
@@ -145,4 +154,7 @@ static inline struct acpi_nfit_desc *to_acpi_desc(struct nd_bus_descriptor *nd_d
 {
 	return container_of(nd_desc, struct acpi_nfit_desc, nd_desc);
 }
+
+const u8 *to_nfit_uuid(enum nfit_uuids id);
+int acpi_nfit_init(struct acpi_nfit_desc *nfit, acpi_size sz);
 #endif /* __NFIT_H__ */
diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
index f97bf0db6519..5f2935aefd41 100644
--- a/drivers/block/nd/Kconfig
+++ b/drivers/block/nd/Kconfig
@@ -17,6 +17,28 @@ if ND_DEVICES
 config LIBND
 	tristate
 
+config NFIT_TEST
+	tristate "NFIT TEST: Manufactured NFIT for interface testing"
+	default n
+	depends on EXPERT
+	depends on DMA_CMA
+	depends on LIBND=m
+	depends on ACPI_NFIT
+	depends on m
+	help
+	  For development purposes register a manufactured
+	  NFIT table to verify the resulting device model topology.
+	  Note, this module arranges for ioremap_cache() to be
+	  overridden locally to allow simulation of system-memory as an
+	  io-memory-resource.
+
+	  Note, this test expects to be able to find at least 256MB of
+	  contiguous DMA space (CONFIG_CMA_SIZE_MBYTES, cma=) or it
+	  will fail to load.  This much contiguous memory is needed to
+	  properly simulate a DAX capable memory region.
+
+	  Say N unless you are doing development of the 'libnd' subsystem.
+
 config BLK_DEV_PMEM
 	tristate "PMEM: Persistent memory block device support"
 	depends on LIBND
diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index 29a797686429..e1e0f01ae960 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -1,4 +1,13 @@
+ifdef CONFIG_NFIT_TEST
+ldflags-y += --wrap=ioremap_cache
+ldflags-y += --wrap=ioremap_nocache
+ldflags-y += --wrap=iounmap
+ldflags-y += --wrap=__request_region
+ldflags-y += --wrap=__release_region
+endif
+
 obj-$(CONFIG_LIBND) += libnd.o
+obj-$(CONFIG_NFIT_TEST) += test/
 obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
 obj-$(CONFIG_ND_BTT) += nd_btt.o
 obj-$(CONFIG_ND_BLK) += nd_blk.o
diff --git a/drivers/block/nd/test/Makefile b/drivers/block/nd/test/Makefile
new file mode 100644
index 000000000000..c7f319cbd082
--- /dev/null
+++ b/drivers/block/nd/test/Makefile
@@ -0,0 +1,5 @@
+obj-$(CONFIG_NFIT_TEST) += nfit_test.o
+obj-$(CONFIG_NFIT_TEST) += nfit_test_iomap.o
+
+nfit_test-y := nfit.o
+nfit_test_iomap-y := iomap.o
diff --git a/drivers/block/nd/test/iomap.c b/drivers/block/nd/test/iomap.c
new file mode 100644
index 000000000000..c85a6f6ba559
--- /dev/null
+++ b/drivers/block/nd/test/iomap.c
@@ -0,0 +1,151 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/rculist.h>
+#include <linux/export.h>
+#include <linux/ioport.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/io.h>
+#include "nfit_test.h"
+
+static LIST_HEAD(iomap_head);
+
+static struct iomap_ops {
+	nfit_test_lookup_fn nfit_test_lookup;
+	struct list_head list;
+} iomap_ops = {
+	.list = LIST_HEAD_INIT(iomap_ops.list),
+};
+
+void nfit_test_setup(nfit_test_lookup_fn lookup)
+{
+	iomap_ops.nfit_test_lookup = lookup;
+	list_add_rcu(&iomap_ops.list, &iomap_head);
+}
+EXPORT_SYMBOL(nfit_test_setup);
+
+void nfit_test_teardown(void)
+{
+	list_del_rcu(&iomap_ops.list);
+	synchronize_rcu();
+}
+EXPORT_SYMBOL(nfit_test_teardown);
+
+static struct nfit_test_resource *get_nfit_res(resource_size_t resource)
+{
+	struct iomap_ops *ops;
+
+	ops = list_first_or_null_rcu(&iomap_head, typeof(*ops), list);
+	if (ops)
+		return ops->nfit_test_lookup(resource);
+	return NULL;
+}
+
+void __iomem *__nfit_test_ioremap(resource_size_t offset, unsigned long size,
+		void __iomem *(*fallback_fn)(resource_size_t, unsigned long))
+{
+	struct nfit_test_resource *nfit_res;
+
+	rcu_read_lock();
+	nfit_res = get_nfit_res(offset);
+	rcu_read_unlock();
+	if (nfit_res)
+		return (void __iomem *) nfit_res->buf + offset
+			- nfit_res->res->start;
+	return fallback_fn(offset, size);
+}
+
+void __iomem *__wrap_ioremap_cache(resource_size_t offset, unsigned long size)
+{
+	return __nfit_test_ioremap(offset, size, ioremap_cache);
+}
+EXPORT_SYMBOL(__wrap_ioremap_cache);
+
+void __iomem *__wrap_ioremap_nocache(resource_size_t offset, unsigned long size)
+{
+	return __nfit_test_ioremap(offset, size, ioremap_nocache);
+}
+EXPORT_SYMBOL(__wrap_ioremap_nocache);
+
+void __wrap_iounmap(volatile void __iomem *addr)
+{
+	struct nfit_test_resource *nfit_res;
+
+	rcu_read_lock();
+	nfit_res = get_nfit_res((unsigned long) addr);
+	rcu_read_unlock();
+	if (nfit_res)
+		return;
+	return iounmap(addr);
+}
+EXPORT_SYMBOL(__wrap_iounmap);
+
+struct resource *__wrap___request_region(struct resource *parent,
+		resource_size_t start, resource_size_t n, const char *name,
+		int flags)
+{
+	struct nfit_test_resource *nfit_res;
+
+	if (parent == &iomem_resource) {
+		rcu_read_lock();
+		nfit_res = get_nfit_res(start);
+		rcu_read_unlock();
+		if (nfit_res) {
+			struct resource *res = nfit_res->res + 1;
+
+			if (start + n > nfit_res->res->start
+					+ resource_size(nfit_res->res)) {
+				pr_debug("%s: start: %llx n: %llx overflow: %pr\n",
+						__func__, start, n,
+						nfit_res->res);
+				return NULL;
+			}
+
+			res->start = start;
+			res->end = start + n - 1;
+			res->name = name;
+			res->flags = resource_type(parent);
+			res->flags |= IORESOURCE_BUSY | flags;
+			pr_debug("%s: %pr\n", __func__, res);
+			return res;
+		}
+	}
+	return __request_region(parent, start, n, name, flags);
+}
+EXPORT_SYMBOL(__wrap___request_region);
+
+void __wrap___release_region(struct resource *parent, resource_size_t start,
+				resource_size_t n)
+{
+	struct nfit_test_resource *nfit_res;
+
+	if (parent == &iomem_resource) {
+		rcu_read_lock();
+		nfit_res = get_nfit_res(start);
+		rcu_read_unlock();
+		if (nfit_res) {
+			struct resource *res = nfit_res->res + 1;
+
+			if (start != res->start || resource_size(res) != n)
+				pr_info("%s: start: %llx n: %llx mismatch: %pr\n",
+						__func__, start, n, res);
+			else
+				memset(res, 0, sizeof(*res));
+			return;
+		}
+	}
+	__release_region(parent, start, n);
+}
+EXPORT_SYMBOL(__wrap___release_region);
+
+MODULE_LICENSE("GPL v2");
diff --git a/drivers/block/nd/test/nfit.c b/drivers/block/nd/test/nfit.c
new file mode 100644
index 000000000000..973e46c06abc
--- /dev/null
+++ b/drivers/block/nd/test/nfit.c
@@ -0,0 +1,1171 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/platform_device.h>
+#include <linux/dma-mapping.h>
+#include <linux/device.h>
+#include <linux/module.h>
+#include <linux/libnd.h>
+#include <linux/ndctl.h>
+#include <linux/sizes.h>
+#include <linux/slab.h>
+#include "../../../acpi/nfit.h"
+#include "nfit_test.h"
+#include "../nd.h"
+
+/*
+ * Generate an NFIT table to describe the following topology:
+ *
+ * BUS0: Interleaved PMEM regions, and aliasing with BLK regions
+ *
+ *                     (a)                       (b)            DIMM   BLK-REGION
+ *           +----------+--------------+----------+---------+
+ * +------+  |  blk2.0  |     pm0.0    |  blk2.1  |  pm1.0  |    0      region2
+ * | imc0 +--+- - - - - region0 - - - -+----------+         +
+ * +--+---+  |  blk3.0  |     pm0.0    |  blk3.1  |  pm1.0  |    1      region3
+ *    |      +----------+--------------v----------v         v
+ * +--+---+                            |                    |
+ * | cpu0 |                                    region1
+ * +--+---+                            |                    |
+ *    |      +-------------------------^----------^         ^
+ * +--+---+  |                 blk4.0             |  pm1.0  |    2      region4
+ * | imc1 +--+-------------------------+----------+         +
+ * +------+  |                 blk5.0             |  pm1.0  |    3      region5
+ *           +-------------------------+----------+-+-------+
+ *
+ * *) In this layout we have four dimms and two memory controllers in one
+ *    socket.  Each unique interface (BLK or PMEM) to DPA space
+ *    is identified by a region device with a dynamically assigned id.
+ *
+ * *) The first portion of dimm0 and dimm1 are interleaved as REGION0.
+ *    A single PMEM namespace "pm0.0" is created using half of the
+ *    REGION0 SPA-range.  REGION0 spans dimm0 and dimm1.  PMEM namespaces
+ *    allocate from the bottom of a region.  The unallocated
+ *    portion of REGION0 aliases with REGION2 and REGION3.  That
+ *    unallocated capacity is reclaimed as BLK namespaces ("blk2.0" and
+ *    "blk3.0") starting at the base of each DIMM to offset (a) in those
+ *    DIMMs.  "pm0.0", "blk2.0" and "blk3.0" are free-form readable
+ *    names that can be assigned to a namespace.
+ *
+ * *) In the last portion of dimm0 and dimm1 we have an interleaved
+ *    SPA range, REGION1, that spans those two dimms as well as dimm2
+ *    and dimm3.  Some of REGION1 is allocated to a PMEM namespace named
+ *    "pm1.0"; the rest is reclaimed in 4 BLK namespaces (one for each
+ *    dimm in the interleave set), "blk2.1", "blk3.1", "blk4.0", and
+ *    "blk5.0".
+ *
+ * *) The portions of dimm2 and dimm3 that do not participate in the
+ *    REGION1 interleaved SPA range (i.e. the DPA addresses below offset
+ *    (b)) are also included in the "blk4.0" and "blk5.0" namespaces.
+ *    Note that BLK namespaces need not be contiguous in DPA-space, and
+ *    can consume aliased capacity from multiple interleave sets.
+ *
+ * BUS1: Legacy NVDIMM (single contiguous range)
+ *
+ *  region2
+ * +---------------------+
+ * |---------------------|
+ * ||       pm2.0       ||
+ * |---------------------|
+ * +---------------------+
+ *
+ * *) An NFIT-table may describe a simple system-physical-address range
+ *    with no BLK aliasing.  This type of region may optionally
+ *    reference an NVDIMM.
+ */
+enum {
+	NUM_PM  = 2,
+	NUM_DCR = 4,
+	NUM_BDW = NUM_DCR,
+	NUM_SPA = NUM_PM + NUM_DCR + NUM_BDW,
+	NUM_MEM = NUM_DCR + NUM_BDW + 2 /* spa0 iset */ + 4 /* spa1 iset */,
+	DIMM_SIZE = SZ_32M,
+	LABEL_SIZE = SZ_128K,
+	SPA0_SIZE = DIMM_SIZE,
+	SPA1_SIZE = DIMM_SIZE*2,
+	SPA2_SIZE = DIMM_SIZE,
+	BDW_SIZE = 64 << 8,
+	DCR_SIZE = 12,
+	NUM_NFITS = 2, /* permit testing multiple NFITs per system */
+};
+
+struct nfit_test_dcr {
+	__le64 bdw_addr;
+	__le32 bdw_status;
+	__u8 aperature[BDW_SIZE];
+};
+
+static u32 handle[NUM_DCR] = {
+	[0] = NFIT_DIMM_HANDLE(0, 0, 0, 0, 0),
+	[1] = NFIT_DIMM_HANDLE(0, 0, 0, 0, 1),
+	[2] = NFIT_DIMM_HANDLE(0, 0, 1, 0, 0),
+	[3] = NFIT_DIMM_HANDLE(0, 0, 1, 0, 1),
+};
+
+struct nfit_test {
+	struct acpi_nfit_desc acpi_desc;
+	struct platform_device pdev;
+	struct list_head resources;
+	void *nfit_buf;
+	dma_addr_t nfit_dma;
+	size_t nfit_size;
+	int num_dcr;
+	int num_pm;
+	void **dimm;
+	dma_addr_t *dimm_dma;
+	void **label;
+	dma_addr_t *label_dma;
+	void **spa_set;
+	dma_addr_t *spa_set_dma;
+	struct nfit_test_dcr **dcr;
+	dma_addr_t *dcr_dma;
+	int (*alloc)(struct nfit_test *t);
+	void (*setup)(struct nfit_test *t);
+};
+
+static struct nfit_test *to_nfit_test(struct device *dev)
+{
+	struct platform_device *pdev = to_platform_device(dev);
+
+	return container_of(pdev, struct nfit_test, pdev);
+}
+
+static int nfit_test_ctl(struct nd_bus_descriptor *nd_desc,
+		struct nd_dimm *nd_dimm, unsigned int cmd, void *buf,
+		unsigned int buf_len)
+{
+	struct acpi_nfit_desc *acpi_desc = to_acpi_desc(nd_desc);
+	struct nfit_test *t = container_of(acpi_desc, typeof(*t), acpi_desc);
+	struct nfit_mem *nfit_mem = nd_dimm_provider_data(nd_dimm);
+	int i, rc;
+
+	if (!nfit_mem || !test_bit(cmd, &nfit_mem->dsm_mask))
+		return -ENXIO;
+
+	/* lookup label space for the given dimm */
+	for (i = 0; i < ARRAY_SIZE(handle); i++)
+		if (__to_nfit_memdev(nfit_mem)->device_handle == handle[i])
+			break;
+	if (i >= ARRAY_SIZE(handle))
+		return -ENXIO;
+
+	switch (cmd) {
+	case ND_CMD_GET_CONFIG_SIZE: {
+		struct nd_cmd_get_config_size *nd_cmd = buf;
+
+		if (buf_len < sizeof(*nd_cmd))
+			return -EINVAL;
+		nd_cmd->status = 0;
+		nd_cmd->config_size = LABEL_SIZE;
+		nd_cmd->max_xfer = SZ_4K;
+		rc = 0;
+		break;
+	}
+	case ND_CMD_GET_CONFIG_DATA: {
+		struct nd_cmd_get_config_data_hdr *nd_cmd = buf;
+		unsigned int len, offset = nd_cmd->in_offset;
+
+		if (buf_len < sizeof(*nd_cmd))
+			return -EINVAL;
+		if (offset >= LABEL_SIZE)
+			return -EINVAL;
+		if (nd_cmd->in_length + sizeof(*nd_cmd) > buf_len)
+			return -EINVAL;
+
+		nd_cmd->status = 0;
+		len = min(nd_cmd->in_length, LABEL_SIZE - offset);
+		memcpy(nd_cmd->out_buf, t->label[i] + offset, len);
+		rc = buf_len - sizeof(*nd_cmd) - len;
+		break;
+	}
+	case ND_CMD_SET_CONFIG_DATA: {
+		struct nd_cmd_set_config_hdr *nd_cmd = buf;
+		unsigned int len, offset = nd_cmd->in_offset;
+		u32 *status;
+
+		if (buf_len < sizeof(*nd_cmd))
+			return -EINVAL;
+		if (offset >= LABEL_SIZE)
+			return -EINVAL;
+		if (nd_cmd->in_length + sizeof(*nd_cmd) + 4 > buf_len)
+			return -EINVAL;
+
+		status = buf + nd_cmd->in_length + sizeof(*nd_cmd);
+		*status = 0;
+		len = min(nd_cmd->in_length, LABEL_SIZE - offset);
+		memcpy(t->label[i] + offset, nd_cmd->in_buf, len);
+		rc = buf_len - sizeof(*nd_cmd) - (len + 4);
+		break;
+	}
+	default:
+		return -ENOTTY;
+	}
+
+	return rc;
+}
+
+static DEFINE_SPINLOCK(nfit_test_lock);
+static struct nfit_test *instances[NUM_NFITS];
+
+static void release_nfit_res(void *data)
+{
+	struct nfit_test_resource *nfit_res = data;
+	struct resource *res = nfit_res->res;
+
+	spin_lock(&nfit_test_lock);
+	list_del(&nfit_res->list);
+	spin_unlock(&nfit_test_lock);
+
+	if (is_vmalloc_addr(nfit_res->buf))
+		vfree(nfit_res->buf);
+	else
+		dma_free_coherent(nfit_res->dev, resource_size(res),
+				nfit_res->buf, res->start);
+	kfree(res);
+	kfree(nfit_res);
+}
+
+static void *__test_alloc(struct nfit_test *t, size_t size, dma_addr_t *dma,
+		void *buf)
+{
+	struct device *dev = &t->pdev.dev;
+	struct resource *res = kzalloc(sizeof(*res) * 2, GFP_KERNEL);
+	struct nfit_test_resource *nfit_res = kzalloc(sizeof(*nfit_res),
+			GFP_KERNEL);
+	int rc;
+
+	if (!res || !buf || !nfit_res)
+		goto err;
+	rc = devm_add_action(dev, release_nfit_res, nfit_res);
+	if (rc)
+		goto err;
+	INIT_LIST_HEAD(&nfit_res->list);
+	memset(buf, 0, size);
+	nfit_res->dev = dev;
+	nfit_res->buf = buf;
+	nfit_res->res = res;
+	res->start = *dma;
+	res->end = *dma + size - 1;
+	res->name = "NFIT";
+	spin_lock(&nfit_test_lock);
+	list_add(&nfit_res->list, &t->resources);
+	spin_unlock(&nfit_test_lock);
+
+	return nfit_res->buf;
+ err:
+	if (buf && !is_vmalloc_addr(buf))
+		dma_free_coherent(dev, size, buf, *dma);
+	else if (buf)
+		vfree(buf);
+	kfree(res);
+	kfree(nfit_res);
+	return NULL;
+}
+
+static void *test_alloc(struct nfit_test *t, size_t size, dma_addr_t *dma)
+{
+	void *buf = vmalloc(size);
+
+	*dma = (unsigned long) buf;
+	return __test_alloc(t, size, dma, buf);
+}
+
+static void *test_alloc_coherent(struct nfit_test *t, size_t size, dma_addr_t *dma)
+{
+	struct device *dev = &t->pdev.dev;
+	void *buf = dma_alloc_coherent(dev, size, dma, GFP_KERNEL);
+
+	return __test_alloc(t, size, dma, buf);
+}
+
+static struct nfit_test_resource *nfit_test_lookup(resource_size_t addr)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(instances); i++) {
+		struct nfit_test_resource *n, *nfit_res = NULL;
+		struct nfit_test *t = instances[i];
+
+		if (!t)
+			continue;
+		spin_lock(&nfit_test_lock);
+		list_for_each_entry(n, &t->resources, list) {
+			if (addr >= n->res->start && (addr < n->res->start
+						+ resource_size(n->res))) {
+				nfit_res = n;
+				break;
+			} else if (addr >= (unsigned long) n->buf
+					&& (addr < (unsigned long) n->buf
+						+ resource_size(n->res))) {
+				nfit_res = n;
+				break;
+			}
+		}
+		spin_unlock(&nfit_test_lock);
+		if (nfit_res)
+			return nfit_res;
+	}
+
+	return NULL;
+}
+
+static int nfit_test0_alloc(struct nfit_test *t)
+{
+	size_t nfit_size = sizeof(struct acpi_table_nfit)
+			+ sizeof(struct acpi_nfit_system_address) * NUM_SPA
+			+ sizeof(struct acpi_nfit_memory_map) * NUM_MEM
+			+ sizeof(struct acpi_nfit_control_region) * NUM_DCR
+			+ sizeof(struct acpi_nfit_data_region) * NUM_BDW;
+	int i;
+
+	t->nfit_buf = test_alloc(t, nfit_size, &t->nfit_dma);
+	if (!t->nfit_buf)
+		return -ENOMEM;
+	t->nfit_size = nfit_size;
+
+	t->spa_set[0] = test_alloc_coherent(t, SPA0_SIZE, &t->spa_set_dma[0]);
+	if (!t->spa_set[0])
+		return -ENOMEM;
+
+	t->spa_set[1] = test_alloc_coherent(t, SPA1_SIZE, &t->spa_set_dma[1]);
+	if (!t->spa_set[1])
+		return -ENOMEM;
+
+	for (i = 0; i < NUM_DCR; i++) {
+		t->dimm[i] = test_alloc(t, DIMM_SIZE, &t->dimm_dma[i]);
+		if (!t->dimm[i])
+			return -ENOMEM;
+
+		t->label[i] = test_alloc(t, LABEL_SIZE, &t->label_dma[i]);
+		if (!t->label[i])
+			return -ENOMEM;
+		sprintf(t->label[i], "label%d", i);
+	}
+
+	for (i = 0; i < NUM_DCR; i++) {
+		t->dcr[i] = test_alloc(t, LABEL_SIZE, &t->dcr_dma[i]);
+		if (!t->dcr[i])
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static int nfit_test1_alloc(struct nfit_test *t)
+{
+	size_t nfit_size = sizeof(struct acpi_table_nfit)
+		+ sizeof(struct acpi_nfit_system_address) + sizeof(struct acpi_nfit_memory_map)
+		+ sizeof(struct acpi_nfit_control_region);
+
+	t->nfit_buf = test_alloc(t, nfit_size, &t->nfit_dma);
+	if (!t->nfit_buf)
+		return -ENOMEM;
+	t->nfit_size = nfit_size;
+
+	t->spa_set[0] = test_alloc_coherent(t, SPA2_SIZE, &t->spa_set_dma[0]);
+	if (!t->spa_set[0])
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void nfit_test_init_header(struct acpi_table_nfit *nfit, size_t size)
+{
+	memcpy(nfit->header.signature, ACPI_SIG_NFIT, 4);
+	nfit->header.length = size;
+	nfit->header.revision = 1;
+	memcpy(nfit->header.oem_id, "LIBND", 6);
+	memcpy(nfit->header.oem_table_id, "TEST", 5);
+	nfit->header.oem_revision = 1;
+	memcpy(nfit->header.asl_compiler_id, "TST", 4);
+	nfit->header.asl_compiler_revision = 1;
+}
+
+static void nfit_test0_setup(struct nfit_test *t)
+{
+	struct nd_bus_descriptor *nd_desc;
+	struct acpi_nfit_desc *acpi_desc;
+	struct acpi_nfit_memory_map *memdev;
+	void *nfit_buf = t->nfit_buf;
+	size_t size = t->nfit_size;
+	struct acpi_nfit_system_address *spa;
+	struct acpi_nfit_control_region *dcr;
+	struct acpi_nfit_data_region *bdw;
+	unsigned int offset;
+
+	nfit_test_init_header(nfit_buf, size);
+
+	/*
+	 * spa0 (interleave first half of dimm0 and dimm1, note storage
+	 * does not actually alias the related block-data-window
+	 * regions)
+	 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit);
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_PM), 16);
+	spa->range_index = 0+1;
+	spa->address = t->spa_set_dma[0];
+	spa->length = SPA0_SIZE;
+
+	/*
+	 * spa1 (interleave last half of the 4 DIMMS, note storage
+	 * does not actually alias the related block-data-window
+	 * regions)
+	 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa);
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_PM), 16);
+	spa->range_index = 1+1;
+	spa->address = t->spa_set_dma[1];
+	spa->length = SPA1_SIZE;
+
+	/* spa2 (dcr0) dimm0 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 2;
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_DCR), 16);
+	spa->range_index = 2+1;
+	spa->address = t->dcr_dma[0];
+	spa->length = DCR_SIZE;
+
+	/* spa3 (dcr1) dimm1 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 3;
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_DCR), 16);
+	spa->range_index = 3+1;
+	spa->address = t->dcr_dma[1];
+	spa->length = DCR_SIZE;
+
+	/* spa4 (dcr2) dimm2 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 4;
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_DCR), 16);
+	spa->range_index = 4+1;
+	spa->address = t->dcr_dma[2];
+	spa->length = DCR_SIZE;
+
+	/* spa5 (dcr3) dimm3 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 5;
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_DCR), 16);
+	spa->range_index = 5+1;
+	spa->address = t->dcr_dma[3];
+	spa->length = DCR_SIZE;
+
+	/* spa6 (bdw for dcr0) dimm0 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 6;
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_BDW), 16);
+	spa->range_index = 6+1;
+	spa->address = t->dimm_dma[0];
+	spa->length = DIMM_SIZE;
+
+	/* spa7 (bdw for dcr1) dimm1 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 7;
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_BDW), 16);
+	spa->range_index = 7+1;
+	spa->address = t->dimm_dma[1];
+	spa->length = DIMM_SIZE;
+
+	/* spa8 (bdw for dcr2) dimm2 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 8;
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_BDW), 16);
+	spa->range_index = 8+1;
+	spa->address = t->dimm_dma[2];
+	spa->length = DIMM_SIZE;
+
+	/* spa9 (bdw for dcr3) dimm3 */
+	spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 9;
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_BDW), 16);
+	spa->range_index = 9+1;
+	spa->address = t->dimm_dma[3];
+	spa->length = DIMM_SIZE;
+
+	offset = sizeof(struct acpi_table_nfit) + sizeof(*spa) * 10;
+	/* mem-region0 (spa0, dimm0) */
+	memdev = nfit_buf + offset;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[0];
+	memdev->physical_id = 0;
+	memdev->region_id = 0;
+	memdev->range_index = 0+1;
+	memdev->region_index = 0+1;
+	memdev->region_size = SPA0_SIZE/2;
+	memdev->region_offset = t->spa_set_dma[0];
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 2;
+
+	/* mem-region1 (spa0, dimm1) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map);
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[1];
+	memdev->physical_id = 1;
+	memdev->region_id = 0;
+	memdev->range_index = 0+1;
+	memdev->region_index = 1+1;
+	memdev->region_size = SPA0_SIZE/2;
+	memdev->region_offset = t->spa_set_dma[0] + SPA0_SIZE/2;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 2;
+
+	/* mem-region2 (spa1, dimm0) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 2;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[0];
+	memdev->physical_id = 0;
+	memdev->region_id = 1;
+	memdev->range_index = 1+1;
+	memdev->region_index = 0+1;
+	memdev->region_size = SPA1_SIZE/4;
+	memdev->region_offset = t->spa_set_dma[1];
+	memdev->address = SPA0_SIZE/2;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 4;
+
+	/* mem-region3 (spa1, dimm1) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 3;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[1];
+	memdev->physical_id = 1;
+	memdev->region_id = 1;
+	memdev->range_index = 1+1;
+	memdev->region_index = 1+1;
+	memdev->region_size = SPA1_SIZE/4;
+	memdev->region_offset = t->spa_set_dma[1] + SPA1_SIZE/4;
+	memdev->address = SPA0_SIZE/2;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 4;
+
+	/* mem-region4 (spa1, dimm2) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 4;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[2];
+	memdev->physical_id = 2;
+	memdev->region_id = 0;
+	memdev->range_index = 1+1;
+	memdev->region_index = 2+1;
+	memdev->region_size = SPA1_SIZE/4;
+	memdev->region_offset = t->spa_set_dma[1] + 2*SPA1_SIZE/4;
+	memdev->address = SPA0_SIZE/2;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 4;
+
+	/* mem-region5 (spa1, dimm3) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 5;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[3];
+	memdev->physical_id = 3;
+	memdev->region_id = 0;
+	memdev->range_index = 1+1;
+	memdev->region_index = 3+1;
+	memdev->region_size = SPA1_SIZE/4;
+	memdev->region_offset = t->spa_set_dma[1] + 3*SPA1_SIZE/4;
+	memdev->address = SPA0_SIZE/2;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 4;
+
+	/* mem-region6 (spa/dcr0, dimm0) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 6;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[0];
+	memdev->physical_id = 0;
+	memdev->region_id = 0;
+	memdev->range_index = 2+1;
+	memdev->region_index = 0+1;
+	memdev->region_size = 0;
+	memdev->region_offset = 0;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 1;
+
+	/* mem-region7 (spa/dcr1, dimm1) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 7;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[1];
+	memdev->physical_id = 1;
+	memdev->region_id = 0;
+	memdev->range_index = 3+1;
+	memdev->region_index = 1+1;
+	memdev->region_size = 0;
+	memdev->region_offset = 0;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 1;
+
+	/* mem-region8 (spa/dcr2, dimm2) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 8;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[2];
+	memdev->physical_id = 2;
+	memdev->region_id = 0;
+	memdev->range_index = 4+1;
+	memdev->region_index = 2+1;
+	memdev->region_size = 0;
+	memdev->region_offset = 0;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 1;
+
+	/* mem-region9 (spa/dcr3, dimm3) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 9;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[3];
+	memdev->physical_id = 3;
+	memdev->region_id = 0;
+	memdev->range_index = 5+1;
+	memdev->region_index = 3+1;
+	memdev->region_size = 0;
+	memdev->region_offset = 0;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 1;
+
+	/* mem-region10 (spa/bdw0, dimm0) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 10;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[0];
+	memdev->physical_id = 0;
+	memdev->region_id = 0;
+	memdev->range_index = 6+1;
+	memdev->region_index = 0+1;
+	memdev->region_size = 0;
+	memdev->region_offset = 0;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 1;
+
+	/* mem-region11 (spa/bdw1, dimm1) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 11;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[1];
+	memdev->physical_id = 1;
+	memdev->region_id = 0;
+	memdev->range_index = 7+1;
+	memdev->region_index = 1+1;
+	memdev->region_size = 0;
+	memdev->region_offset = 0;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 1;
+
+	/* mem-region12 (spa/bdw2, dimm2) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 12;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[2];
+	memdev->physical_id = 2;
+	memdev->region_id = 0;
+	memdev->range_index = 8+1;
+	memdev->region_index = 2+1;
+	memdev->region_size = 0;
+	memdev->region_offset = 0;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 1;
+
+	/* mem-region13 (spa/bdw3, dimm3) */
+	memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 13;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = handle[3];
+	memdev->physical_id = 3;
+	memdev->region_id = 0;
+	memdev->range_index = 9+1;
+	memdev->region_index = 3+1;
+	memdev->region_size = 0;
+	memdev->region_offset = 0;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 1;
+
+	offset = offset + sizeof(struct acpi_nfit_memory_map) * 14;
+	/* dcr-descriptor0 */
+	dcr = nfit_buf + offset;
+	dcr->header.type = ACPI_NFIT_TYPE_CONTROL_REGION;
+	dcr->header.length = sizeof(struct acpi_nfit_control_region);
+	dcr->region_index = 0+1;
+	dcr->vendor_id = 0xabcd;
+	dcr->device_id = 0;
+	dcr->revision_id = 1;
+	dcr->serial_number = ~handle[0];
+	dcr->windows = 1;
+	dcr->window_size = DCR_SIZE;
+	dcr->command_offset = 0;
+	dcr->command_size = 8;
+	dcr->status_offset = 8;
+	dcr->status_size = 4;
+
+	/* dcr-descriptor1 */
+	dcr = nfit_buf + offset + sizeof(struct acpi_nfit_control_region);
+	dcr->header.type = ACPI_NFIT_TYPE_CONTROL_REGION;
+	dcr->header.length = sizeof(struct acpi_nfit_control_region);
+	dcr->region_index = 1+1;
+	dcr->vendor_id = 0xabcd;
+	dcr->device_id = 0;
+	dcr->revision_id = 1;
+	dcr->serial_number = ~handle[1];
+	dcr->windows = 1;
+	dcr->window_size = DCR_SIZE;
+	dcr->command_offset = 0;
+	dcr->command_size = 8;
+	dcr->status_offset = 8;
+	dcr->status_size = 4;
+
+	/* dcr-descriptor2 */
+	dcr = nfit_buf + offset + sizeof(struct acpi_nfit_control_region) * 2;
+	dcr->header.type = ACPI_NFIT_TYPE_CONTROL_REGION;
+	dcr->header.length = sizeof(struct acpi_nfit_control_region);
+	dcr->region_index = 2+1;
+	dcr->vendor_id = 0xabcd;
+	dcr->device_id = 0;
+	dcr->revision_id = 1;
+	dcr->serial_number = ~handle[2];
+	dcr->windows = 1;
+	dcr->window_size = DCR_SIZE;
+	dcr->command_offset = 0;
+	dcr->command_size = 8;
+	dcr->status_offset = 8;
+	dcr->status_size = 4;
+
+	/* dcr-descriptor3 */
+	dcr = nfit_buf + offset + sizeof(struct acpi_nfit_control_region) * 3;
+	dcr->header.type = ACPI_NFIT_TYPE_CONTROL_REGION;
+	dcr->header.length = sizeof(struct acpi_nfit_control_region);
+	dcr->region_index = 3+1;
+	dcr->vendor_id = 0xabcd;
+	dcr->device_id = 0;
+	dcr->revision_id = 1;
+	dcr->serial_number = ~handle[3];
+	dcr->windows = 1;
+	dcr->window_size = DCR_SIZE;
+	dcr->command_offset = 0;
+	dcr->command_size = 8;
+	dcr->status_offset = 8;
+	dcr->status_size = 4;
+
+	offset = offset + sizeof(struct acpi_nfit_control_region) * 4;
+	/* bdw0 (spa/dcr0, dimm0) */
+	bdw = nfit_buf + offset;
+	bdw->header.type = ACPI_NFIT_TYPE_DATA_REGION;
+	bdw->header.length = sizeof(struct acpi_nfit_data_region);
+	bdw->region_index = 0+1;
+	bdw->windows = 1;
+	bdw->offset = 0;
+	bdw->size = BDW_SIZE;
+	bdw->capacity = DIMM_SIZE;
+	bdw->start_address = 0;
+
+	/* bdw1 (spa/dcr1, dimm1) */
+	bdw = nfit_buf + offset + sizeof(struct acpi_nfit_data_region);
+	bdw->header.type = ACPI_NFIT_TYPE_DATA_REGION;
+	bdw->header.length = sizeof(struct acpi_nfit_data_region);
+	bdw->region_index = 1+1;
+	bdw->windows = 1;
+	bdw->offset = 0;
+	bdw->size = BDW_SIZE;
+	bdw->capacity = DIMM_SIZE;
+	bdw->start_address = 0;
+
+	/* bdw2 (spa/dcr2, dimm2) */
+	bdw = nfit_buf + offset + sizeof(struct acpi_nfit_data_region) * 2;
+	bdw->header.type = ACPI_NFIT_TYPE_DATA_REGION;
+	bdw->header.length = sizeof(struct acpi_nfit_data_region);
+	bdw->region_index = 2+1;
+	bdw->windows = 1;
+	bdw->offset = 0;
+	bdw->size = BDW_SIZE;
+	bdw->capacity = DIMM_SIZE;
+	bdw->start_address = 0;
+
+	/* bdw3 (spa/dcr3, dimm3) */
+	bdw = nfit_buf + offset + sizeof(struct acpi_nfit_data_region) * 3;
+	bdw->header.type = ACPI_NFIT_TYPE_DATA_REGION;
+	bdw->header.length = sizeof(struct acpi_nfit_data_region);
+	bdw->region_index = 3+1;
+	bdw->windows = 1;
+	bdw->offset = 0;
+	bdw->size = BDW_SIZE;
+	bdw->capacity = DIMM_SIZE;
+	bdw->start_address = 0;
+
+	acpi_desc = &t->acpi_desc;
+	set_bit(ND_CMD_GET_CONFIG_SIZE, &acpi_desc->dimm_dsm_force_en);
+	set_bit(ND_CMD_GET_CONFIG_DATA, &acpi_desc->dimm_dsm_force_en);
+	set_bit(ND_CMD_SET_CONFIG_DATA, &acpi_desc->dimm_dsm_force_en);
+	nd_desc = &acpi_desc->nd_desc;
+	nd_desc->ndctl = nfit_test_ctl;
+}
+
+static void nfit_test1_setup(struct nfit_test *t)
+{
+	size_t size = t->nfit_size, offset;
+	void *nfit_buf = t->nfit_buf;
+	struct acpi_nfit_memory_map *memdev;
+	struct acpi_nfit_control_region *dcr;
+	struct acpi_nfit_system_address *spa;
+
+	nfit_test_init_header(nfit_buf, size);
+
+	offset = sizeof(struct acpi_table_nfit);
+	/* spa0 (flat range with no bdw aliasing) */
+	spa = nfit_buf + offset;
+	spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+	spa->header.length = sizeof(*spa);
+	memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_PM), 16);
+	spa->range_index = 0+1;
+	spa->address = t->spa_set_dma[0];
+	spa->length = SPA2_SIZE;
+
+	offset += sizeof(*spa);
+	/* mem-region0 (spa0, dimm0) */
+	memdev = nfit_buf + offset;
+	memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+	memdev->header.length = sizeof(*memdev);
+	memdev->device_handle = 0;
+	memdev->physical_id = 0;
+	memdev->region_id = 0;
+	memdev->range_index = 0+1;
+	memdev->region_index = 0+1;
+	memdev->region_size = SPA2_SIZE;
+	memdev->region_offset = 0;
+	memdev->address = 0;
+	memdev->interleave_index = 0;
+	memdev->interleave_ways = 1;
+
+	offset += sizeof(*memdev);
+	/* dcr-descriptor0 */
+	dcr = nfit_buf + offset;
+	dcr->header.type = ACPI_NFIT_TYPE_CONTROL_REGION;
+	dcr->header.length = sizeof(struct acpi_nfit_control_region);
+	dcr->region_index = 0+1;
+	dcr->vendor_id = 0xabcd;
+	dcr->device_id = 0;
+	dcr->revision_id = 1;
+	dcr->serial_number = ~0;
+	dcr->code = 0x201;
+	dcr->windows = 0;
+	dcr->window_size = 0;
+	dcr->command_offset = 0;
+	dcr->command_size = 0;
+	dcr->status_offset = 0;
+	dcr->status_size = 0;
+}
+
+static int nfit_test_blk_region_enable(struct nd_bus *nd_bus, struct device *dev)
+{
+	struct nd_blk_region *ndbr = to_nd_blk_region(dev);
+	struct acpi_nfit_system_address *spa_bdw;
+	struct nfit_blk_mmio *mmio;
+	struct nfit_blk *nfit_blk;
+	struct nfit_mem *nfit_mem;
+	struct nd_dimm *nd_dimm;
+
+	nd_dimm = nd_blk_region_to_dimm(ndbr);
+	nfit_mem = nd_dimm_provider_data(nd_dimm);
+	if (!nfit_mem || !nfit_mem->dcr || !nfit_mem->bdw) {
+		dev_dbg(dev, "%s: missing%s%s%s\n", __func__,
+				nfit_mem ? "" : " nfit_mem",
+				(nfit_mem && nfit_mem->dcr) ? "" : " dcr",
+				(nfit_mem && nfit_mem->bdw) ? "" : " bdw");
+		return -ENXIO;
+	}
+
+	nfit_blk = devm_kzalloc(dev, sizeof(*nfit_blk), GFP_KERNEL);
+	if (!nfit_blk)
+		return -ENOMEM;
+	nd_blk_region_set_provider_data(ndbr, nfit_blk);
+	nfit_blk->nd_region = to_nd_region(dev);
+
+	/* block aperture memory is all we use in nfit_test */
+	nfit_blk->bdw_offset = nfit_mem->bdw->offset;
+	mmio = &nfit_blk->mmio[BDW];
+	spa_bdw = nfit_mem->spa_bdw;
+	mmio->base = __wrap_ioremap_nocache(spa_bdw->address, spa_bdw->length);
+	if (!mmio->base) {
+		release_mem_region(spa_bdw->address, spa_bdw->length);
+		dev_dbg(dev, "%s: %s failed to map bdw\n", __func__,
+				nd_dimm_name(nd_dimm));
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static void nfit_test_blk_region_disable(struct nd_bus *nd_bus, struct device *dev)
+{
+	struct nd_blk_region *ndbr = to_nd_blk_region(dev);
+	struct nfit_blk *nfit_blk = nd_blk_region_provider_data(ndbr);
+	struct acpi_nfit_system_address *spa_bdw;
+	struct nfit_blk_mmio *mmio;
+	struct nfit_mem *nfit_mem;
+	struct nd_dimm *nd_dimm;
+
+	if (!nfit_blk)
+		return; /* never enabled */
+
+	nd_dimm = nd_blk_region_to_dimm(ndbr);
+	nfit_mem = nd_dimm_provider_data(nd_dimm);
+	spa_bdw = nfit_mem->spa_bdw;
+	mmio = &nfit_blk->mmio[BDW];
+	__wrap_iounmap(mmio->base);
+	nd_blk_region_set_provider_data(ndbr, NULL);
+}
+
+static int nfit_test_blk_do_io(struct nd_blk_region *ndbr, void *iobuf,
+		u64 len, int rw, resource_size_t dpa)
+{
+	struct nfit_blk *nfit_blk = ndbr->blk_provider_data;
+	struct nfit_blk_mmio *mmio = &nfit_blk->mmio[BDW];
+	struct nd_region *nd_region = &ndbr->nd_region;
+	struct nfit_test_resource *nfit_res;
+	unsigned int bw;
+
+	nfit_res = nfit_test_lookup((unsigned long) mmio->base);
+	if (!nfit_res) {
+		dev_WARN_ONCE(&nd_region->dev, 1, "no test resource\n");
+		return -EIO;
+	}
+	dev_vdbg(&nd_region->dev, "%s: base: %p offset: %pa\n",
+			__func__, mmio->base, &dpa);
+	bw = nd_region_acquire_lane(nd_region);
+	if (rw)
+		memcpy(nfit_res->buf + dpa, iobuf, len);
+	else
+		memcpy(iobuf, nfit_res->buf + dpa, len);
+	nd_region_release_lane(nd_region, bw);
+
+	return 0;
+}
+
+extern const struct attribute_group *acpi_nfit_attribute_groups[];
+
+static int nfit_test_probe(struct platform_device *pdev)
+{
+	struct nd_bus_descriptor *nd_desc;
+	struct acpi_nfit_desc *acpi_desc;
+	struct device *dev = &pdev->dev;
+	struct nfit_test *nfit_test;
+	int rc;
+
+	nfit_test = to_nfit_test(&pdev->dev);
+
+	/* common alloc */
+	if (nfit_test->num_dcr) {
+		int num = nfit_test->num_dcr;
+
+		nfit_test->dimm = devm_kcalloc(dev, num, sizeof(void *), GFP_KERNEL);
+		nfit_test->dimm_dma = devm_kcalloc(dev, num, sizeof(dma_addr_t), GFP_KERNEL);
+		nfit_test->label = devm_kcalloc(dev, num, sizeof(void *), GFP_KERNEL);
+		nfit_test->label_dma = devm_kcalloc(dev, num, sizeof(dma_addr_t), GFP_KERNEL);
+		nfit_test->dcr = devm_kcalloc(dev, num, sizeof(struct nfit_test_dcr *), GFP_KERNEL);
+		nfit_test->dcr_dma = devm_kcalloc(dev, num, sizeof(dma_addr_t), GFP_KERNEL);
+		if (nfit_test->dimm && nfit_test->dimm_dma && nfit_test->label
+				&& nfit_test->label_dma && nfit_test->dcr
+				&& nfit_test->dcr_dma)
+			/* pass */;
+		else
+			return -ENOMEM;
+	}
+
+	if (nfit_test->num_pm) {
+		int num = nfit_test->num_pm;
+
+		nfit_test->spa_set = devm_kcalloc(dev, num, sizeof(void *), GFP_KERNEL);
+		nfit_test->spa_set_dma = devm_kcalloc(dev, num,
+				sizeof(dma_addr_t), GFP_KERNEL);
+		if (nfit_test->spa_set && nfit_test->spa_set_dma)
+			/* pass */;
+		else
+			return -ENOMEM;
+	}
+
+	/* per-nfit specific alloc */
+	if (nfit_test->alloc(nfit_test))
+		return -ENOMEM;
+
+	nfit_test->setup(nfit_test);
+	acpi_desc = &nfit_test->acpi_desc;
+	acpi_desc->dev = &pdev->dev;
+	acpi_desc->nfit = nfit_test->nfit_buf;
+	acpi_desc->blk_enable = nfit_test_blk_region_enable;
+	acpi_desc->blk_disable = nfit_test_blk_region_disable;
+	acpi_desc->blk_do_io = nfit_test_blk_do_io;
+	nd_desc = &acpi_desc->nd_desc;
+	nd_desc->attr_groups = acpi_nfit_attribute_groups;
+	acpi_desc->nd_bus = nd_bus_register(&pdev->dev, nd_desc);
+	if (!acpi_desc->nd_bus)
+		return -ENXIO;
+
+	rc = acpi_nfit_init(acpi_desc, nfit_test->nfit_size);
+	if (rc) {
+		nd_bus_unregister(acpi_desc->nd_bus);
+		return rc;
+	}
+
+	return 0;
+}
+
+static int nfit_test_remove(struct platform_device *pdev)
+{
+	struct nfit_test *nfit_test = to_nfit_test(&pdev->dev);
+	struct acpi_nfit_desc *acpi_desc = &nfit_test->acpi_desc;
+
+	nd_bus_unregister(acpi_desc->nd_bus);
+
+	return 0;
+}
+
+static void nfit_test_release(struct device *dev)
+{
+	struct nfit_test *nfit_test = to_nfit_test(dev);
+
+	kfree(nfit_test);
+}
+
+static const struct platform_device_id nfit_test_id[] = {
+	{ KBUILD_MODNAME },
+	{ },
+};
+
+static struct platform_driver nfit_test_driver = {
+	.probe = nfit_test_probe,
+	.remove = nfit_test_remove,
+	.driver = {
+		.name = KBUILD_MODNAME,
+	},
+	.id_table = nfit_test_id,
+};
+
+#ifdef CONFIG_CMA_SIZE_MBYTES
+#define CMA_SIZE_MBYTES CONFIG_CMA_SIZE_MBYTES
+#else
+#define CMA_SIZE_MBYTES 0
+#endif
+
+static __init int nfit_test_init(void)
+{
+	int rc, i;
+
+	nfit_test_setup(nfit_test_lookup);
+
+	for (i = 0; i < NUM_NFITS; i++) {
+		struct nfit_test *nfit_test;
+		struct platform_device *pdev;
+		static int once;
+
+		nfit_test = kzalloc(sizeof(*nfit_test), GFP_KERNEL);
+		if (!nfit_test) {
+			rc = -ENOMEM;
+			goto err_register;
+		}
+		INIT_LIST_HEAD(&nfit_test->resources);
+		switch (i) {
+		case 0:
+			nfit_test->num_pm = NUM_PM;
+			nfit_test->num_dcr = NUM_DCR;
+			nfit_test->alloc = nfit_test0_alloc;
+			nfit_test->setup = nfit_test0_setup;
+			break;
+		case 1:
+			nfit_test->num_pm = 1;
+			nfit_test->alloc = nfit_test1_alloc;
+			nfit_test->setup = nfit_test1_setup;
+			break;
+		default:
+			rc = -EINVAL;
+			goto err_register;
+		}
+		pdev = &nfit_test->pdev;
+		pdev->name = KBUILD_MODNAME;
+		pdev->id = i;
+		pdev->dev.release = nfit_test_release;
+		rc = platform_device_register(pdev);
+		if (rc) {
+			put_device(&pdev->dev);
+			goto err_register;
+		}
+
+		instances[i] = nfit_test;
+
+		rc = dma_coerce_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
+		if (rc)
+			goto err_register;
+
+		if (!once++) {
+			dma_addr_t dma;
+			void *buf;
+
+			buf = dma_alloc_coherent(&pdev->dev, SZ_128M, &dma,
+					GFP_KERNEL);
+			if (!buf) {
+				rc = -ENOMEM;
+				dev_warn(&pdev->dev, "need 128M of free cma\n");
+				goto err_register;
+			}
+			dma_free_coherent(&pdev->dev, SZ_128M, buf, dma);
+		}
+	}
+
+	rc = platform_driver_register(&nfit_test_driver);
+	if (rc)
+		goto err_register;
+	return 0;
+
+ err_register:
+	for (i = 0; i < NUM_NFITS; i++)
+		if (instances[i])
+			platform_device_unregister(&instances[i]->pdev);
+	nfit_test_teardown();
+	return rc;
+}
+
+static __exit void nfit_test_exit(void)
+{
+	int i;
+
+	platform_driver_unregister(&nfit_test_driver);
+	for (i = 0; i < NUM_NFITS; i++)
+		platform_device_unregister(&instances[i]->pdev);
+	nfit_test_teardown();
+}
+
+module_init(nfit_test_init);
+module_exit(nfit_test_exit);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Intel Corporation");
diff --git a/drivers/block/nd/test/nfit_test.h b/drivers/block/nd/test/nfit_test.h
new file mode 100644
index 000000000000..4a1215ec45c0
--- /dev/null
+++ b/drivers/block/nd/test/nfit_test.h
@@ -0,0 +1,28 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#ifndef __NFIT_TEST_H__
+#define __NFIT_TEST_H__
+
+struct nfit_test_resource {
+	struct list_head list;
+	struct resource *res;
+	struct device *dev;
+	void *buf;
+};
+
+typedef struct nfit_test_resource *(*nfit_test_lookup_fn)(resource_size_t);
+void __iomem *__wrap_ioremap_nocache(resource_size_t offset, unsigned long size);
+void __wrap_iounmap(volatile void __iomem *addr);
+void nfit_test_setup(nfit_test_lookup_fn lookup);
+void nfit_test_teardown(void);
+#endif
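
Reader's note on the lookup hook declared in nfit_test.h: the header only
shows the callback type and the setup/teardown entry points, so the
following is a minimal userspace sketch of the registration pattern, not
the nfit_test implementation.  Every name in it (struct test_resource,
test_setup(), test_teardown(), table_lookup(), wrapped_map()) is a
hypothetical stand-in, and returning NULL when no hook or resource
matches is an assumption; how the real __wrap_ioremap_nocache() falls
back is not visible in this hunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for struct nfit_test_resource. */
struct test_resource {
	uint64_t start;	/* first emulated physical address */
	uint64_t size;	/* length of the emulated range in bytes */
	void *buf;	/* ordinary memory backing the range */
};

/* Same shape as nfit_test_lookup_fn: address in, resource (or NULL) out. */
typedef struct test_resource *(*lookup_fn)(uint64_t addr);

static lookup_fn registered_lookup;

/* Install the hook; passing NULL is treated as a caller bug. */
static void test_setup(lookup_fn lookup)
{
	assert(lookup != NULL);
	registered_lookup = lookup;
}

/* Remove the hook so later translations find nothing. */
static void test_teardown(void)
{
	registered_lookup = NULL;
}

static char backing[2][64];
static struct test_resource resources[] = {
	{ 0x1000, sizeof(backing[0]), backing[0] },
	{ 0x2000, sizeof(backing[1]), backing[1] },
};

/* Linear search for the resource whose range contains addr. */
static struct test_resource *table_lookup(uint64_t addr)
{
	size_t i;

	for (i = 0; i < sizeof(resources) / sizeof(resources[0]); i++) {
		struct test_resource *res = &resources[i];

		if (addr >= res->start && addr - res->start < res->size)
			return res;
	}
	return NULL;
}

/*
 * Translate an emulated address into a pointer inside the backing
 * buffer.  Returns NULL when no hook is installed or nothing matches
 * (an assumption of this sketch).
 */
static void *wrapped_map(uint64_t addr)
{
	struct test_resource *res;

	if (!registered_lookup)
		return NULL;
	res = registered_lookup(addr);
	if (!res)
		return NULL;
	return (char *)res->buf + (addr - res->start);
}
```

After test_setup(table_lookup), wrapped_map(0x2010) points 0x10 bytes
into the second backing buffer, and after test_teardown() every
translation returns NULL.  Keeping the hook behind a single function
pointer lets one module own the resource list while another owns the
wrappers.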



* [PATCH v3 21/21] libnd: Non-Volatile Devices
  2015-05-20 20:56 ` Dan Williams
@ 2015-05-20 20:58   ` Dan Williams
  -1 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2015-05-20 20:58 UTC (permalink / raw)
  To: axboe
  Cc: Boaz Harrosh, linux-nvdimm, neilb, gregkh, linux-kernel,
	Andy Lutomirski, Jens Axboe, linux-acpi, jmoyer, H. Peter Anvin,
	hch, mingo

Maintainer information and documentation for drivers/block/nd/

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Neil Brown <neilb@suse.de>
Cc: Greg KH <gregkh@linuxfoundation.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 Documentation/blockdev/libnd.txt |  804 ++++++++++++++++++++++++++++++++++++++
 MAINTAINERS                      |   39 ++
 2 files changed, 837 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/blockdev/libnd.txt
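
Reader's note on the "nfit_handle" bit layout listed in the DIMM
enumeration section of libnd.txt below: the layout can be exercised
standalone.  This is a minimal sketch under stated assumptions, not part
of libndctl; struct nfit_handle_fields, nfit_handle_encode() and
nfit_handle_decode() are hypothetical names introduced here.  Unlike the
DIMM_HANDLE() macro in the documentation, which masks its arguments, the
encoder in this sketch asserts that each field fits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the fields of an NFIT device handle. */
struct nfit_handle_fields {
	unsigned int node;	/* bits 27:16, node controller ID */
	unsigned int socket;	/* bits 15:12, socket ID */
	unsigned int imc;	/* bits 11:8, memory controller ID */
	unsigned int channel;	/* bits 7:4, memory channel number */
	unsigned int dimm;	/* bits 3:0, DIMM number within the channel */
};

/* Pack the fields into a handle; an out-of-range field is a caller bug. */
static uint32_t nfit_handle_encode(struct nfit_handle_fields f)
{
	assert(f.node <= 0xfff && f.socket <= 0xf && f.imc <= 0xf
			&& f.channel <= 0xf && f.dimm <= 0xf);
	return ((uint32_t)f.node << 16) | ((uint32_t)f.socket << 12)
		| ((uint32_t)f.imc << 8) | ((uint32_t)f.channel << 4)
		| (uint32_t)f.dimm;
}

/* Unpack a handle; the reserved bits 31:28 are ignored. */
static struct nfit_handle_fields nfit_handle_decode(uint32_t handle)
{
	struct nfit_handle_fields f = {
		.node = (handle >> 16) & 0xfff,
		.socket = (handle >> 12) & 0xf,
		.imc = (handle >> 8) & 0xf,
		.channel = (handle >> 4) & 0xf,
		.dimm = handle & 0xf,
	};

	return f;
}
```

For instance, encoding { .imc = 1, .dimm = 1 } with all other fields zero
yields 0x101, and decoding 0x101 recovers the same fields.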

diff --git a/Documentation/blockdev/libnd.txt b/Documentation/blockdev/libnd.txt
new file mode 100644
index 000000000000..c074a23f369a
--- /dev/null
+++ b/Documentation/blockdev/libnd.txt
@@ -0,0 +1,804 @@
+			  LIBND: Non-Volatile Devices
+	      libnd - kernel / libndctl - userspace helper library
+			   linux-nvdimm@lists.01.org
+				      v11
+
+
+	Glossary
+	Overview
+	    Supporting Documents
+	    Git Trees
+	LIBND PMEM and BLK
+	Why BLK?
+	    PMEM vs BLK
+	        BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX
+	Example NVDIMM Platform
+	LIBND Kernel Device Model and LIBNDCTL Userspace API
+	    LIBNDCTL: Context
+	        libndctl: instantiate a new library context example
+	    LIBND/LIBNDCTL: Bus
+	        libnd: control class device in /sys/class
+	        libnd: bus
+	        libndctl: bus enumeration example
+	    LIBND/LIBNDCTL: DIMM (NMEM)
+	        libnd: DIMM (NMEM)
+	        libndctl: DIMM enumeration example
+	    LIBND/LIBNDCTL: Region
+	        libnd: region
+	        libndctl: region enumeration example
+	        Why Not Encode the Region Type into the Region Name?
+	        How Do I Determine the Major Type of a Region?
+	    LIBND/LIBNDCTL: Namespace
+	        libnd: namespace
+	        libndctl: namespace enumeration example
+	        libndctl: namespace creation example
+	        Why the Term "namespace"?
+	    LIBND/LIBNDCTL: Block Translation Table "btt"
+	        libnd: btt layout
+	        libndctl: btt creation example
+	Summary LIBNDCTL Diagram
+
+
+Glossary
+--------
+
+PMEM: A system-physical-address range where writes are persistent.  A
+block device composed of PMEM is capable of DAX.  A PMEM address range
+may span an interleave of several DIMMs.
+
+BLK: A set of one or more programmable memory mapped apertures provided
+by a DIMM to access its media.  This indirection precludes the
+performance benefit of interleaving, but enables DIMM-bounded failure
+modes.
+
+DPA: DIMM Physical Address, is a DIMM-relative offset.  With one DIMM in
+the system there would be a 1:1 system-physical-address:DPA association.
+Once more DIMMs are added a memory controller interleave must be
+decoded to determine the DPA associated with a given
+system-physical-address.  BLK capacity always has a 1:1 relationship
+with a single-DIMM's DPA range.
+
+DAX: File system extensions to bypass the page cache and block layer to
+mmap persistent memory, from a PMEM block device, directly into a
+process address space.
+
+BTT: Block Translation Table: Persistent memory is byte addressable.
+Existing software may have an expectation that the power-fail-atomicity
+of writes is at least one sector, 512 bytes.  The BTT is an indirection
+table with atomic update semantics to front a PMEM/BLK block device
+driver and present arbitrary atomic sector sizes.
+
+LABEL: Metadata stored on a DIMM device that partitions and identifies
+(persistently names) storage between PMEM and BLK.  It also partitions
+BLK storage to host BTTs with different parameters per BLK-partition.
+Note that traditional partition tables, GPT/MBR, are layered on top of a
+BLK or PMEM device.
+
+
+Overview
+--------
+
+The LIBND subsystem provides support for three types of NVDIMMs, namely,
+PMEM, BLK, and NVDIMM devices that can simultaneously support both PMEM
+and BLK mode access.  These three modes of operation are described by
+the "NVDIMM Firmware Interface Table" (NFIT) in ACPI 6.  While the LIBND
+implementation is generic and supports pre-NFIT platforms, it was guided
+by the superset of capabilities needed to support this ACPI 6 definition
+for NVDIMM resources.  The bulk of the kernel implementation is in place
+to handle the case where DPA accessible via PMEM is aliased with DPA
+accessible via BLK.  When that occurs a LABEL is needed to reserve DPA
+for exclusive access via one mode at a time.
+
+Supporting Documents
+ACPI 6: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
+NVDIMM Namespace: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
+DSM Interface Example: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
+Driver Writer's Guide: http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
+
+Git Trees
+LIBND: https://git.kernel.org/cgit/linux/kernel/git/djbw/nvdimm.git
+LIBNDCTL: https://github.com/pmem/ndctl.git
+PMEM: https://github.com/01org/prd
+
+
+LIBND PMEM and BLK
+------------------
+
+Prior to the arrival of the NFIT, non-volatile memory was described to a
+system in various ad-hoc ways.  Usually only the bare minimum was
+provided, namely, a single system-physical-address range where writes
+are expected to be durable after a system power loss.  Now, the NFIT
+specification standardizes not only the description of PMEM, but also
+BLK and platform message-passing entry points for control and
+configuration.
+
+For each NVDIMM access method (PMEM, BLK), LIBND provides a block device driver:
+
+    1. PMEM (nd_pmem.ko): Drives a system-physical-address range.  This
+    range is contiguous in system memory and may be interleaved (hardware
+    memory controller striped) across multiple DIMMs.  When interleaved the
+    platform may optionally provide details of which DIMMs are participating
+    in the interleave.
+
+    Note that while LIBND describes system-physical-address ranges that may
+    alias with BLK access as ND_NAMESPACE_PMEM ranges and those without
+    alias as ND_NAMESPACE_IO ranges, to the nd_pmem driver there is no
+    distinction.  The different device-types are an implementation detail
+    that userspace can exploit to implement policies like "only interface
+    with address ranges from certain DIMMs".  It is worth noting that when
+    aliasing is present and a DIMM lacks a label, then no block device can
+    be created by default as userspace needs to do at least one allocation
+    of DPA to the PMEM range.  In contrast ND_NAMESPACE_IO ranges, once
+    registered, can be immediately attached to nd_pmem.
+
+    2. BLK (nd_blk.ko): This driver performs I/O using a set of platform
+    defined apertures.  A set of apertures will all access just one DIMM.
+    Multiple windows allow multiple concurrent accesses, much like
+    tagged-command-queuing, and would likely be used by different threads or
+    different CPUs.
+
+    The NFIT specification defines a standard format for a BLK-aperture, but
+    the spec also allows for vendor specific layouts, and non-NFIT BLK
+    implementations may use other designs for BLK I/O.  For this reason "nd_blk"
+    calls back into platform-specific code to perform the I/O.  One such
+    implementation is defined in the "Driver Writer's Guide" and "DSM
+    Interface Example".
+
+
+Why BLK?
+--------
+
+While PMEM provides direct byte-addressable CPU-load/store access to
+NVDIMM storage, it does not provide the best system RAS (reliability,
+availability, and serviceability) model.  An access to a corrupted
+system-physical-address causes a CPU exception, while an access to a
+corrupted address through a BLK-aperture causes that block window to
+raise an error status in a register.  The latter is more aligned with
+the standard error model that host-bus-adapter attached disks present.
+Also, if an administrator ever wants to replace a DIMM, it is easier to
+service a system at DIMM module boundaries.  Compare this to PMEM where
+data could be interleaved in an opaque hardware specific manner across
+several DIMMs.
+
+PMEM vs BLK
+BLK-apertures solve this RAS problem, but their presence is also the
+major contributing factor to the complexity of the ND subsystem.  They
+complicate the implementation because PMEM and BLK alias in DPA space.
+Any given DIMM's DPA-range may contribute to one or more
+system-physical-address sets of interleaved DIMMs, *and* may also be
+accessed in its entirety through its BLK-aperture.  Accessing a DPA
+through a system-physical-address while simultaneously accessing the
+same DPA through a BLK-aperture has undefined results.  For this reason,
+DIMMs with this dual interface configuration include a DSM function to
+store/retrieve a LABEL.  The LABEL effectively partitions the DPA-space
+into exclusive system-physical-address and BLK-aperture accessible
+regions.  For simplicity a DIMM is allowed one PMEM "region" per
+interleave set in which it is a member.  The remaining DPA space can be
+carved into an arbitrary number of BLK devices with discontiguous
+extents.
+
+BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX
+--------------------------------------------------
+
+One of the few
+reasons to allow multiple BLK namespaces per REGION is so that each
+BLK-namespace can be configured with a BTT with unique atomic sector
+sizes.  While a PMEM device can host a BTT, the LABEL specification does
+not provide for a sector size to be specified for a PMEM namespace.
+This is due to the expectation that the primary usage model for PMEM is
+via DAX, and the BTT is incompatible with DAX.  However, for the cases
+where an application or filesystem still needs atomic sector update
+guarantees it can register a BTT on a PMEM device or partition.  See
+LIBND/LIBNDCTL: Block Translation Table "btt".
+
+
+Example NVDIMM Platform
+-----------------------
+
+For the remainder of this document the following diagram will be
+referenced for any example sysfs layouts.
+
+
+                             (a)               (b)           DIMM   BLK-REGION
+          +-------------------+--------+--------+--------+
++------+  |       pm0.0       | blk2.0 | pm1.0  | blk2.1 |    0      region2
+| imc0 +--+- - - region0- - - +--------+        +--------+
++--+---+  |       pm0.0       | blk3.0 | pm1.0  | blk3.1 |    1      region3
+   |      +-------------------+--------v        v--------+
++--+---+                               |                 |
+| cpu0 |                                     region1
++--+---+                               |                 |
+   |      +----------------------------^        ^--------+
++--+---+  |           blk4.0           | pm1.0  | blk4.0 |    2      region4
+| imc1 +--+----------------------------|        +--------+
++------+  |           blk5.0           | pm1.0  | blk5.0 |    3      region5
+          +----------------------------+--------+--------+
+
+In this platform we have four DIMMs and two memory controllers in one
+socket.  Each unique interface (BLK or PMEM) to DPA space is identified
+by a region device with a dynamically assigned id (REGION0 - REGION5).
+
+    1. The first portion of DIMM0 and DIMM1 are interleaved as REGION0. A
+    single PMEM namespace is created in the REGION0-SPA-range that spans
+    DIMM0 and DIMM1 with a user-specified name of "pm0.0". Some of that
+    interleaved system-physical-address range is reclaimed as BLK-aperture
+    accessed space starting at DPA-offset (a) into each DIMM.  In that
+    reclaimed space we create two BLK-aperture "namespaces" from REGION2 and
+    REGION3 where "blk2.0" and "blk3.0" are just human readable names that
+    could be set to any user-desired name in the LABEL.
+
+    2. In the last portion of DIMM0 and DIMM1 we have an interleaved
+    system-physical-address range, REGION1, that spans those two DIMMs as
+    well as DIMM2 and DIMM3.  Some of REGION1 is allocated to a PMEM
+    namespace named "pm1.0"; the rest is reclaimed in 4 BLK-aperture
+    namespaces (one for each DIMM in the interleave set), "blk2.1",
+    "blk3.1", "blk4.0", and "blk5.0".
+
+    3. The portions of DIMM2 and DIMM3 that do not participate in the REGION1
+    interleaved system-physical-address range (i.e. the DPA addresses below
+    offset (b)) are also included in the "blk4.0" and "blk5.0" namespaces.
+    Note that this example shows that BLK-aperture namespaces don't need to
+    be contiguous in DPA-space.
+
+    This bus is provided by the kernel under the device
+    /sys/devices/platform/nfit_test.0 when CONFIG_NFIT_TEST is enabled and
+    the nfit_test.ko module is loaded.  This tests not only LIBND but the
+    acpi_nfit.ko driver as well.
+
+
+LIBND Kernel Device Model and LIBNDCTL Userspace API
+----------------------------------------------------
+
+What follows is a description of the LIBND sysfs layout and a
+corresponding object hierarchy diagram as viewed through the LIBNDCTL
+api.  The example sysfs paths and diagrams are relative to the Example
+NVDIMM Platform which is also the LIBND bus used in the LIBNDCTL unit
+test.
+
+LIBNDCTL: Context
+Every api call in the LIBNDCTL library requires a context that holds the
+logging parameters and other library instance state.  The library is
+based on the libabc template:
+https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git/
+
+LIBNDCTL: instantiate a new library context example
+
+	struct ndctl_ctx *ctx;
+
+	if (ndctl_new(&ctx) == 0)
+		return ctx;
+	else
+		return NULL;
+
+LIBND/LIBNDCTL: Bus
+-------------------
+
+A bus has a 1:1 relationship with an NFIT.  The current expectation for
+ACPI based systems is that there is only ever one platform-global NFIT.
+That said, it is trivial to register multiple NFITs; the specification
+does not preclude it.  The infrastructure supports multiple busses and
+we use this capability to test multiple NFIT configurations in the
+unit test.
+
+LIBND: control class device in /sys/class
+
+This character device accepts DSM messages to be passed to a DIMM
+identified by its NFIT handle.
+
+	/sys/class/nd/ndctl0
+	|-- dev
+	|-- device -> ../../../ndbus0
+	|-- subsystem -> ../../../../../../../class/nd
+
+
+
+LIBND: bus
+
+	struct nd_bus *nd_bus_register(struct device *parent,
+	       struct nd_bus_descriptor *nfit_desc);
+
+	/sys/devices/platform/nfit_test.0/ndbus0
+	|-- btt0
+	|-- btt_seed
+	|-- commands
+	|-- nd
+	|-- nfit
+	|-- nmem0
+	|-- nmem1
+	|-- nmem2
+	|-- nmem3
+	|-- power
+	|-- provider
+	|-- region0
+	|-- region1
+	|-- region2
+	|-- region3
+	|-- region4
+	|-- region5
+	|-- uevent
+	`-- wait_probe
+
+LIBNDCTL: bus enumeration example
+Find the bus handle that describes the bus from Example NVDIMM Platform
+
+	static struct ndctl_bus *get_bus_by_provider(struct ndctl_ctx *ctx,
+			const char *provider)
+	{
+		struct ndctl_bus *bus;
+
+		ndctl_bus_foreach(ctx, bus)
+			if (strcmp(provider, ndctl_bus_get_provider(bus)) == 0)
+				return bus;
+
+		return NULL;
+	}
+
+	bus = get_bus_by_provider(ctx, "nfit_test.0");
+
+
+LIBND/LIBNDCTL: DIMM (NMEM)
+---------------------------
+
+The DIMM device provides a character device for sending commands to
+hardware, and it is a container for LABELs.  If the DIMM is defined by
+NFIT then an optional 'nfit' attribute sub-directory is available to add
+NFIT-specifics.
+
+Note that the kernel device name for "DIMMs" is "nmemX".  The NFIT
+describes these devices via "Memory Device to System Physical Address
+Range Mapping Structure", and there is no requirement that they actually
+be physical DIMMs, so we use a more generic name.
+
+LIBND: DIMM (NMEM)
+
+	struct nd_dimm *nd_dimm_create(struct nd_bus *nd_bus, void *provider_data,
+			const struct attribute_group **groups, unsigned long flags,
+			unsigned long *dsm_mask);
+
+	/sys/devices/platform/nfit_test.0/ndbus0
+	|-- nmem0
+	|   |-- available_slots
+	|   |-- commands
+	|   |-- dev
+	|   |-- devtype
+	|   |-- driver -> ../../../../../bus/nd/drivers/nd_dimm
+	|   |-- modalias
+	|   |-- nfit
+	|   |   |-- device
+	|   |   |-- format
+	|   |   |-- handle
+	|   |   |-- phys_id
+	|   |   |-- rev_id
+	|   |   |-- serial
+	|   |   `-- vendor
+	|   |-- state
+	|   |-- subsystem -> ../../../../../bus/nd
+	|   `-- uevent
+	|-- nmem1
+	[..]
+
+
+LIBNDCTL: DIMM enumeration example
+
+Note, in this example we are assuming NFIT-defined DIMMs which are
+identified by an "nfit_handle", a 32-bit value where:
+Bit 3:0 DIMM number within the memory channel
+Bit 7:4 memory channel number
+Bit 11:8 memory controller ID
+Bit 15:12 socket ID (within scope of a Node controller if node controller is present)
+Bit 27:16 Node Controller ID
+Bit 31:28 Reserved
+
+	static struct ndctl_dimm *get_dimm_by_handle(struct ndctl_bus *bus,
+	       unsigned int handle)
+	{
+		struct ndctl_dimm *dimm;
+
+		ndctl_dimm_foreach(bus, dimm)
+			if (ndctl_dimm_get_handle(dimm) == handle)
+				return dimm;
+
+		return NULL;
+	}
+
+	#define DIMM_HANDLE(n, s, i, c, d) \
+		((((n) & 0xfff) << 16) | (((s) & 0xf) << 12) | (((i) & 0xf) << 8) \
+		 | (((c) & 0xf) << 4) | ((d) & 0xf))
+
+	dimm = get_dimm_by_handle(bus, DIMM_HANDLE(0, 0, 0, 0, 0));
+
+LIBND/LIBNDCTL: Region
+----------------------
+
+A generic REGION device is registered for each PMEM range or BLK-aperture
+set.  Per the example there are 6 regions: 2 PMEM and 4 BLK-aperture
+sets on the "nfit_test.0" bus.  The primary role of a region is to be a
+container of "mappings".  A mapping is a tuple of <DIMM,
+DPA-start-offset, length>.
+
+LIBND provides a built-in driver for these REGION devices.  This driver
+is responsible for reconciling the aliased DPA mappings across all
+regions, parsing the LABEL, if present, and then emitting NAMESPACE
+devices with the resolved/exclusive DPA-boundaries for the nd_pmem or
+nd_blk device driver to consume.
+
+In addition to the generic attributes of "mapping"s, "interleave_ways"
+and "size" the REGION device also exports some convenience attributes.
+"nstype" indicates the integer type of namespace-device this region
+emits, "devtype" duplicates the DEVTYPE variable stored by udev at the
+'add' event, "modalias" duplicates the MODALIAS variable stored by udev
+at the 'add' event, and finally, the optional "spa_index" is provided in
+the case where the region is defined by a SPA.
+
+LIBND: region
+
+	struct nd_region *nd_pmem_region_create(struct nd_bus *nd_bus,
+			struct nd_region_desc *ndr_desc);
+	struct nd_region *nd_blk_region_create(struct nd_bus *nd_bus,
+			struct nd_region_desc *ndr_desc);
+
+	/sys/devices/platform/nfit_test.0/ndbus0
+	|-- region0
+	|   |-- available_size
+	|   |-- devtype
+	|   |-- driver -> ../../../../../bus/nd/drivers/nd_region
+	|   |-- init_namespaces
+	|   |-- mapping0
+	|   |-- mapping1
+	|   |-- mappings
+	|   |-- modalias
+	|   |-- namespace0.0
+	|   |-- namespace_seed
+	|   |-- nfit
+	|   |   `-- spa_index
+	|   |-- nstype
+	|   |-- set_cookie
+	|   |-- size
+	|   |-- subsystem -> ../../../../../bus/nd
+	|   `-- uevent
+	|-- region1
+	[..]
+
+LIBNDCTL: region enumeration example
+
+Sample region retrieval routines based on NFIT-unique data like
+"spa_index" (interleave set id) for PMEM and "nfit_handle" (dimm id) for
+BLK.
+
+	static struct ndctl_region *get_pmem_region_by_spa_index(struct ndctl_bus *bus,
+			unsigned int spa_index)
+	{
+		struct ndctl_region *region;
+
+		ndctl_region_foreach(bus, region) {
+			if (ndctl_region_get_type(region) != ND_DEVICE_REGION_PMEM)
+				continue;
+			if (ndctl_region_get_spa_index(region) == spa_index)
+				return region;
+		}
+		return NULL;
+	}
+
+	static struct ndctl_region *get_blk_region_by_dimm_handle(struct ndctl_bus *bus,
+			unsigned int handle)
+	{
+		struct ndctl_region *region;
+
+		ndctl_region_foreach(bus, region) {
+			struct ndctl_mapping *map;
+
+			if (ndctl_region_get_type(region) != ND_DEVICE_REGION_BLOCK)
+				continue;
+			ndctl_mapping_foreach(region, map) {
+				struct ndctl_dimm *dimm = ndctl_mapping_get_dimm(map);
+
+				if (ndctl_dimm_get_handle(dimm) == handle)
+					return region;
+			}
+		}
+		return NULL;
+	}
+
+
+Why Not Encode the Region Type into the Region Name?
+----------------------------------------------------
+
+At first glance it seems since NFIT defines just PMEM and BLK interface
+types that we should simply name REGION devices with something derived
+from those type names.  However, the ND subsystem explicitly keeps the
+REGION name generic and expects userspace to always consider the
+region-attributes for 4 reasons:
+
+    1. There are already more than two REGION and "namespace" types.  For
+    PMEM there are two subtypes.  As mentioned previously we have PMEM where
+    the constituent DIMM devices are known and anonymous PMEM.  For BLK
+    regions the NFIT specification already anticipates vendor specific
+    implementations.  The exact distinction of what a region contains is in
+    the region-attributes not the region-name or the region-devtype.
+
+    2. A region with zero child-namespaces is a possible configuration.  For
+    example, the NFIT allows for a DCR to be published without a
+    corresponding BLK-aperture.  This equates to a DIMM that can only accept
+    control/configuration messages, but no i/o through a descendant block
+    device.  Again, this "type" is advertised in the attributes ('mappings'
+    == 0) and the name does not tell you much.
+
+    3. What if a third major interface type arises in the future?  Outside
+    of vendor specific implementations, it's not difficult to envision a
+    third class of interface type beyond BLK and PMEM.  With a generic name
+    for the REGION level of the device-hierarchy old userspace
+    implementations can still make sense of new kernel advertised
+    region-types.  Userspace can always rely on the generic region
+    attributes like "mappings", "size", etc and the expected child devices
+    named "namespace".  This generic format of the device-model hierarchy
+    allows the LIBND and LIBNDCTL implementations to be more uniform and
+    future-proof.
+
+    4. There are more robust mechanisms for determining the major type of a
+    region than a device name.  See the next section, How Do I Determine the
+    Major Type of a Region?
+
+How Do I Determine the Major Type of a Region?
+----------------------------------------------
+
+Outside of the blanket recommendation of "use libndctl", or simply
+looking at the kernel header (/usr/include/linux/ndctl.h) to decode the
+"nstype" integer attribute, here are some other options.
+
+    1. module alias lookup:
+
+    The whole point of region/namespace device type differentiation is to
+    decide which block-device driver will attach to a given LIBND namespace.
+    One can simply use the modalias to look up the resulting module.  It's
+    important to note that this method is robust in the presence of a
+    vendor-specific driver down the road.  If a vendor-specific
+    implementation wants to supplant the standard nd_blk driver it can with
+    minimal impact to the rest of LIBND.
+
+    In fact, a vendor may also want to have a vendor-specific region-driver
+    (outside of nd_region).  For example, if a vendor defined its own LABEL
+    format it would need its own region driver to parse that LABEL and emit
+    the resulting namespaces.  The output from module resolution is more
+    accurate than a region-name or region-devtype.
+
+    2. udev:
+
+    The kernel "devtype" is registered in the udev database
+    # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region0
+    P: /devices/platform/nfit_test.0/ndbus0/region0
+    E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region0
+    E: DEVTYPE=nd_pmem
+    E: MODALIAS=nd:t2
+    E: SUBSYSTEM=nd
+
+    # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region4
+    P: /devices/platform/nfit_test.0/ndbus0/region4
+    E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region4
+    E: DEVTYPE=nd_blk
+    E: MODALIAS=nd:t3
+    E: SUBSYSTEM=nd
+
+    ...and is available as a region attribute, but keep in mind that the
+    "devtype" does not indicate sub-type variations and scripts should
+    really examine the other attributes.
+
+    3. type specific attributes:
+
+    As it currently stands a BLK-aperture region will never have a
+    "nfit/spa_index" attribute, but neither will a non-NFIT PMEM region.  A
+    BLK region with a "mappings" value of 0 is, as mentioned above, a DIMM
+    that does not allow I/O.  A PMEM region with a "mappings" value of zero
+    is a simple system-physical-address range.
+
+
+LIBND/LIBNDCTL: Namespace
+-------------------------
+
+A REGION, after resolving DPA aliasing and LABEL specified boundaries,
+surfaces one or more "namespace" devices.  The arrival of a "namespace"
+device currently triggers either the nd_blk or nd_pmem driver to load
+and register a disk/block device.
+
+LIBND: namespace
+Here is a sample layout from the three major types of NAMESPACE where
+namespace0.0 represents DIMM-info-backed PMEM (note that it has a 'uuid'
+attribute), namespace2.0 represents a BLK namespace (note that it has a
+'sector_size' attribute), and namespace6.0 represents an anonymous PMEM
+namespace (note that it has no 'uuid' attribute because it does not
+support a LABEL).
+
+	/sys/devices/platform/nfit_test.0/ndbus0/region0/namespace0.0
+	|-- alt_name
+	|-- devtype
+	|-- dpa_extents
+	|-- modalias
+	|-- resource
+	|-- size
+	|-- subsystem -> ../../../../../../bus/nd
+	|-- type
+	|-- uevent
+	`-- uuid
+	/sys/devices/platform/nfit_test.0/ndbus0/region2/namespace2.0
+	|-- alt_name
+	|-- devtype
+	|-- dpa_extents
+	|-- modalias
+	|-- sector_size
+	|-- size
+	|-- subsystem -> ../../../../../../bus/nd
+	|-- type
+	|-- uevent
+	`-- uuid
+	/sys/devices/platform/nfit_test.1/ndbus1/region6/namespace6.0
+	|-- block
+	|   `-- pmem0
+	|-- devtype
+	|-- driver -> ../../../../../../bus/nd/drivers/pmem
+	|-- modalias
+	|-- resource
+	|-- size
+	|-- subsystem -> ../../../../../../bus/nd
+	|-- type
+	`-- uevent
+
+LIBNDCTL: namespace enumeration example
+Namespaces are indexed relative to their parent region, example below.
+These indexes are mostly static from boot to boot, but the subsystem makes
+no guarantees in this regard.  For a static namespace identifier use its
+'uuid' attribute.
+
+static struct ndctl_namespace *get_namespace_by_id(struct ndctl_region *region,
+                unsigned int id)
+{
+        struct ndctl_namespace *ndns;
+
+        ndctl_namespace_foreach(region, ndns)
+                if (ndctl_namespace_get_id(ndns) == id)
+                        return ndns;
+
+        return NULL;
+}
+
+LIBNDCTL: namespace creation example
+Idle namespaces are automatically created by the kernel if a given
+region has enough available capacity to create a new namespace.
+Namespace instantiation involves finding an idle namespace and
+configuring it.  For the most part namespace attributes can be set in
+any order; the only constraint is that 'uuid' must be set before
+'size'.  This enables the kernel to track DPA allocations internally
+with a static identifier.
+
+static int configure_namespace(struct ndctl_region *region,
+                struct ndctl_namespace *ndns,
+                struct namespace_parameters *parameters)
+{
+        char devname[50];
+
+        snprintf(devname, sizeof(devname), "namespace%d.%d",
+                        ndctl_region_get_id(region), parameters->id);
+
+        ndctl_namespace_set_alt_name(ndns, devname);
+        /* 'uuid' must be set prior to setting size! */
+        ndctl_namespace_set_uuid(ndns, parameters->uuid);
+        ndctl_namespace_set_size(ndns, parameters->size);
+        /* unlike pmem namespaces, blk namespaces have a sector size */
+        if (parameters->lbasize)
+                ndctl_namespace_set_sector_size(ndns, parameters->lbasize);
+        return ndctl_namespace_enable(ndns);
+}
+
+
+Why the Term "namespace"?
+
+    1. Why not "volume" for instance?  "volume" ran the risk of confusing ND
+    as a volume manager like device-mapper.
+
+    2. The term originated to describe the sub-devices that can be created
+    within an NVME controller (see the nvme specification:
+    http://www.nvmexpress.org/specifications/), and NFIT namespaces are
+    meant to parallel the capabilities and configurability of
+    NVME-namespaces.
+
+
+LIBND/LIBNDCTL: Block Translation Table "btt"
+---------------------------------------------
+
+A BTT (design document: http://pmem.io/2014/09/23/btt.html) is a stacked
+block device driver that fronts either the whole block device or a
+partition of a block device emitted by either a PMEM or BLK NAMESPACE.
+
+LIBND: btt layout
+Every bus will start out with at least one BTT device which is the seed
+device.  To activate it set the "backing_dev", "uuid", and "sector_size"
+attributes and then bind the device to the nd_btt driver.
+
+	/sys/devices/platform/nfit_test.1/ndbus0/btt0/
+	|-- backing_dev
+	|-- delete
+	|-- devtype
+	|-- modalias
+	|-- sector_size
+	|-- subsystem -> ../../../../../bus/nd
+	|-- uevent
+	`-- uuid
+
+LIBNDCTL: btt creation example
+Similar to namespaces, an idle BTT device is automatically created per
+bus.  Each time this "seed" btt device is configured and enabled a new
+seed is created.  Creating a BTT configuration involves two steps:
+finding an idle BTT and assigning it to front a PMEM or BLK namespace.
+
+	static struct ndctl_btt *get_idle_btt(struct ndctl_bus *bus)
+	{
+		struct ndctl_btt *btt;
+
+		ndctl_btt_foreach(bus, btt)
+			if (!ndctl_btt_is_enabled(btt) && !ndctl_btt_is_configured(btt))
+				return btt;
+
+		return NULL;
+	}
+
+	static int configure_btt(struct ndctl_bus *bus, struct btt_parameters *parameters)
+	{
+		struct ndctl_btt *btt = get_idle_btt(bus);
+		char bdevpath[50];
+		snprintf(bdevpath, sizeof(bdevpath), "/dev/%s",
+				ndctl_namespace_get_block_device(parameters->ndns));
+		ndctl_btt_set_uuid(btt, parameters->uuid);
+		ndctl_btt_set_sector_size(btt, parameters->sector_size);
+		ndctl_btt_set_backing_dev(btt, bdevpath);
+		return ndctl_btt_enable(btt);
+	}
+
+Once instantiated, an "nd_btt" link will be created under the
+"backing_dev" (pmem0) block device:
+
+	/sys/block/pmem0/
+	|-- alignment_offset
+	|-- bdi -> ../../../../../../../virtual/bdi/259:0
+	|-- capability
+	|-- dev
+	|-- device -> ../../../namespace0.0
+	|-- discard_alignment
+	|-- ext_range
+	|-- holders
+	|-- inflight
+	|-- nd_btt -> ../../../../btt0
+
+...and a new inactive seed device will appear on the bus.
+
+Once a "backing_dev" is disabled its associated BTT will be
+automatically deleted.  This deletion is only at the device model level.
+In order to destroy a BTT the "info block" needs to be destroyed.
+
+
+Summary LIBNDCTL Diagram
+------------------------
+
+For the given example above, here is the view of the objects as seen by the LIBNDCTL api:
+            +---+
+            |CTX|    +---------+   +--------------+  +---------------+
+            +-+-+  +-> REGION0 +---> NAMESPACE0.0 +--> PMEM8 "pm0.0" |
+              |    | +---------+   +--------------+  +---------------+
++-------+     |    | +---------+   +--------------+  +---------------+
+| DIMM0 <-+   |    +-> REGION1 +---> NAMESPACE1.0 +--> PMEM6 "pm1.0" |
++-------+ |   |    | +---------+   +--------------+  +---------------+
+| DIMM1 <-+ +-v--+ | +---------+   +--------------+  +---------------+
++-------+ +-+BUS0+---> REGION2 +-+-> NAMESPACE2.0 +--> ND6  "blk2.0" |
+| DIMM2 <-+ +----+ | +---------+ | +--------------+  +----------------------+
++-------+ |        |             +-> NAMESPACE2.1 +--> ND5  "blk2.1" | BTT2 |
+| DIMM3 <-+        |               +--------------+  +----------------------+
++-------+          | +---------+   +--------------+  +---------------+
+                   +-> REGION3 +-+-> NAMESPACE3.0 +--> ND4  "blk3.0" |
+                   | +---------+ | +--------------+  +----------------------+
+                   |             +-> NAMESPACE3.1 +--> ND3  "blk3.1" | BTT1 |
+                   |               +--------------+  +----------------------+
+                   | +---------+   +--------------+  +---------------+
+                   +-> REGION4 +---> NAMESPACE4.0 +--> ND2  "blk4.0" |
+                   | +---------+   +--------------+  +---------------+
+                   | +---------+   +--------------+  +----------------------+
+                   +-> REGION5 +---> NAMESPACE5.0 +--> ND1  "blk5.0" | BTT0 |
+                     +---------+   +--------------+  +---------------+------+
+
+
diff --git a/MAINTAINERS b/MAINTAINERS
index 4517613dc638..edb72fcc158e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5854,6 +5854,39 @@ M:	Sasha Levin <sasha.levin@oracle.com>
 S:	Maintained
 F:	tools/lib/lockdep/
 
+LIBND: NON-VOLATILE MEMORY DEVICE SUBSYSTEM
+M:	Dan Williams <dan.j.williams@intel.com>
+L:	linux-nvdimm@lists.01.org
+Q:	https://patchwork.kernel.org/project/linux-nvdimm/list/
+S:	Supported
+F:	drivers/block/nd/*
+F:	include/linux/nd.h
+F:	include/linux/libnd.h
+F:	include/uapi/linux/ndctl.h
+
+LIBND BLK: MMIO-APERTURE DRIVER
+M:	Ross Zwisler <ross.zwisler@linux.intel.com>
+L:	linux-nvdimm@lists.01.org
+Q:	https://patchwork.kernel.org/project/linux-nvdimm/list/
+S:	Supported
+F:	drivers/block/nd/blk.c
+F:	drivers/block/nd/region_devs.c
+F:	drivers/acpi/nfit*
+
+LIBND BTT: BLOCK TRANSLATION TABLE
+M:	Vishal Verma <vishal.verma@linux.intel.com>
+L:	linux-nvdimm@lists.01.org
+Q:	https://patchwork.kernel.org/project/linux-nvdimm/list/
+S:	Supported
+F:	drivers/block/nd/btt*
+
+LIBND PMEM: PERSISTENT MEMORY DRIVER
+M:	Ross Zwisler <ross.zwisler@linux.intel.com>
+L:	linux-nvdimm@lists.01.org
+Q:	https://patchwork.kernel.org/project/linux-nvdimm/list/
+S:	Supported
+F:	drivers/block/nd/pmem.c
+
 LINUX FOR IBM pSERIES (RS/6000)
 M:	Paul Mackerras <paulus@au.ibm.com>
 W:	http://www.ibm.com/linux/ltc/projects/ppc
@@ -8071,12 +8104,6 @@ S:	Maintained
 F:	Documentation/blockdev/ramdisk.txt
 F:	drivers/block/brd.c
 
-PERSISTENT MEMORY DRIVER
-M:	Ross Zwisler <ross.zwisler@linux.intel.com>
-L:	linux-nvdimm@lists.01.org
-S:	Supported
-F:	drivers/block/pmem.c
-
 RANDOM NUMBER DRIVER
 M:	"Theodore Ts'o" <tytso@mit.edu>
 S:	Maintained

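A note on the namespace enumeration section of libnd.txt above: device
names follow the "namespaceX.Y" pattern, where X is the parent region id
and Y is the region-relative index.  What follows is a minimal sketch,
not part of libnd or libndctl, of splitting such a name into its two ids.
The helper name parse_namespace_name() is hypothetical.  As the document
notes, these indexes are not guaranteed to be stable from boot to boot,
so 'uuid' remains the identifier to persist.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: parse "namespaceX.Y" into the parent region id X
 * and the region-relative namespace id Y.
 *
 * Returns 0 on success and -1 if the name does not match the pattern.
 * On failure the output values may have been partially written.
 */
int parse_namespace_name(const char *name, unsigned int *region_id,
		unsigned int *ns_id)
{
	char trailing;

	assert(name && region_id && ns_id);

	/* a third successful conversion would mean garbage after Y */
	if (sscanf(name, "namespace%u.%u%c", region_id, ns_id, &trailing) != 2)
		return -1;

	return 0;
}
```

With this sketch "namespace2.0" yields region id 2 and namespace id 0,
while names such as "region2" or "namespace2.0x" are rejected.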


* [PATCH v3 21/21] libnd: Non-Volatile Devices
@ 2015-05-20 20:58   ` Dan Williams
From: Dan Williams @ 2015-05-20 20:58 UTC
  To: axboe
  Cc: Boaz Harrosh, linux-nvdimm, neilb, gregkh, linux-kernel,
	Andy Lutomirski, Jens Axboe, linux-acpi, jmoyer, H. Peter Anvin,
	hch, mingo

Maintainer information and documentation for drivers/block/nd/

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Neil Brown <neilb@suse.de>
Cc: Greg KH <gregkh@linuxfoundation.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
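Note on the DIMM enumeration example in libnd.txt: the document lists the
bit layout of the 32-bit "nfit_handle" and provides a DIMM_HANDLE() macro
to build one.  The sketch below shows the inverse direction as well,
assuming only that documented bit layout.  The struct and function names
are hypothetical and are not part of libnd or libndctl.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the fields of an nfit_handle. */
struct nfit_handle_fields {
	unsigned int node;       /* bits 27:16, node controller ID */
	unsigned int socket;     /* bits 15:12, socket ID */
	unsigned int controller; /* bits 11:8, memory controller ID */
	unsigned int channel;    /* bits 7:4, memory channel number */
	unsigned int dimm;       /* bits 3:0, DIMM number within the channel */
};

/* Pack the fields into a handle, mirroring DIMM_HANDLE() in the document. */
uint32_t nfit_handle_pack(const struct nfit_handle_fields *f)
{
	assert(f);

	return ((uint32_t)(f->node & 0xfff) << 16)
		| ((uint32_t)(f->socket & 0xf) << 12)
		| ((uint32_t)(f->controller & 0xf) << 8)
		| ((uint32_t)(f->channel & 0xf) << 4)
		| (uint32_t)(f->dimm & 0xf);
}

/* Split a handle into its fields; reserved bits 31:28 are ignored. */
struct nfit_handle_fields nfit_handle_unpack(uint32_t handle)
{
	struct nfit_handle_fields f = {
		.node = (handle >> 16) & 0xfff,
		.socket = (handle >> 12) & 0xf,
		.controller = (handle >> 8) & 0xf,
		.channel = (handle >> 4) & 0xf,
		.dimm = handle & 0xf,
	};

	return f;
}
```

Packing node 1, socket 2, controller 3, channel 4, DIMM 5 gives the
handle 0x00012345, and unpacking that value returns the same five fields.
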
 Documentation/blockdev/libnd.txt |  804 ++++++++++++++++++++++++++++++++++++++
 MAINTAINERS                      |   39 ++
 2 files changed, 837 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/blockdev/libnd.txt

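Note on the "How Do I Determine the Major Type of a Region?" section of
libnd.txt: the udev output there shows MODALIAS=nd:t2 for a region with
DEVTYPE=nd_pmem and MODALIAS=nd:t3 for one with DEVTYPE=nd_blk.  The
sketch below decodes that integer from a modalias string.  It assumes
only those two values from the sample output; the enum and function
names are hypothetical, and real code would more likely use libndctl or
the constants in the ndctl.h kernel header.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical local names for the two types seen in the udev output. */
enum region_major_type {
	REGION_TYPE_UNKNOWN = 0,
	REGION_TYPE_PMEM = 2,	/* MODALIAS=nd:t2, DEVTYPE=nd_pmem */
	REGION_TYPE_BLK = 3,	/* MODALIAS=nd:t3, DEVTYPE=nd_blk */
};

/*
 * Extract N from a string of the form "nd:tN", optionally followed by a
 * single newline as when the sysfs attribute is read directly.  Any
 * other input, or any other N, yields REGION_TYPE_UNKNOWN.
 */
enum region_major_type region_type_from_modalias(const char *modalias)
{
	unsigned int type;
	char trailing;
	int n;

	assert(modalias);

	n = sscanf(modalias, "nd:t%u%c", &type, &trailing);
	if (n < 1 || (n == 2 && trailing != '\n'))
		return REGION_TYPE_UNKNOWN;

	switch (type) {
	case REGION_TYPE_PMEM:
		return REGION_TYPE_PMEM;
	case REGION_TYPE_BLK:
		return REGION_TYPE_BLK;
	default:
		return REGION_TYPE_UNKNOWN;
	}
}
```

Unknown values are reported rather than guessed, in keeping with the
document's point that a third interface type may appear later.
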
diff --git a/Documentation/blockdev/libnd.txt b/Documentation/blockdev/libnd.txt
new file mode 100644
index 000000000000..c074a23f369a
--- /dev/null
+++ b/Documentation/blockdev/libnd.txt
@@ -0,0 +1,804 @@
+			  LIBND: Non-Volatile Devices
+	      libnd - kernel / libndctl - userspace helper library
+			   linux-nvdimm@lists.01.org
+				      v11
+
+
+	Glossary
+	Overview
+	    Supporting Documents
+	    Git Trees
+	LIBND PMEM and BLK
+	Why BLK?
+	    PMEM vs BLK
+	        BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX
+	Example NVDIMM Platform
+	LIBND Kernel Device Model and LIBNDCTL Userspace API
+	    LIBNDCTL: Context
+	        libndctl: instantiate a new library context example
+	    LIBND/LIBNDCTL: Bus
+	        libnd: control class device in /sys/class
+	        libnd: bus
+	        libndctl: bus enumeration example
+	    LIBND/LIBNDCTL: DIMM (NMEM)
+	        libnd: DIMM (NMEM)
+	        libndctl: DIMM enumeration example
+	    LIBND/LIBNDCTL: Region
+	        libnd: region
+	        libndctl: region enumeration example
+	        Why Not Encode the Region Type into the Region Name?
+	        How Do I Determine the Major Type of a Region?
+	    LIBND/LIBNDCTL: Namespace
+	        libnd: namespace
+	        libndctl: namespace enumeration example
+	        libndctl: namespace creation example
+	        Why the Term "namespace"?
+	    LIBND/LIBNDCTL: Block Translation Table "btt"
+	        libnd: btt layout
+	        libndctl: btt creation example
+	Summary LIBNDCTL Diagram
+
+
+Glossary
+--------
+
+PMEM: A system-physical-address range where writes are persistent.  A
+block device composed of PMEM is capable of DAX.  A PMEM address range
+may span an interleave of several DIMMs.
+
+BLK: A set of one or more programmable memory mapped apertures provided
+by a DIMM to access its media.  This indirection precludes the
+performance benefit of interleaving, but enables DIMM-bounded failure
+modes.
+
+DPA: DIMM Physical Address, is a DIMM-relative offset.  With one DIMM in
+the system there would be a 1:1 system-physical-address:DPA association.
+Once more DIMMs are added a memory controller interleave must be
+decoded to determine the DPA associated with a given
+system-physical-address.  BLK capacity always has a 1:1 relationship
+with a single-DIMM's DPA range.
+
+DAX: File system extensions to bypass the page cache and block layer to
+mmap persistent memory, from a PMEM block device, directly into a
+process address space.
+
+BTT: Block Translation Table: Persistent memory is byte addressable.
+Existing software may have an expectation that the power-fail-atomicity
+of writes is at least one sector, 512 bytes.  The BTT is an indirection
+table with atomic update semantics to front a PMEM/BLK block device
+driver and present arbitrary atomic sector sizes.
+
+LABEL: Metadata stored on a DIMM device that partitions and identifies
+(persistently names) storage between PMEM and BLK.  It also partitions
+BLK storage to host BTTs with different parameters per BLK-partition.
+Note that traditional partition tables, GPT/MBR, are layered on top of a
+BLK or PMEM device.
+
+
+Overview
+--------
+
+The LIBND subsystem provides support for three types of NVDIMMs, namely,
+PMEM, BLK, and NVDIMM devices that can simultaneously support both PMEM
+and BLK mode access.  These three modes of operation are described by
+the "NVDIMM Firmware Interface Table" (NFIT) in ACPI 6.  While the LIBND
+implementation is generic and supports pre-NFIT platforms, it was guided
+by the superset of capabilities needed to support this ACPI 6 definition
+for NVDIMM resources.  The bulk of the kernel implementation is in place
+to handle the case where DPA accessible via PMEM is aliased with DPA
+accessible via BLK.  When that occurs a LABEL is needed to reserve DPA
+for exclusive access via one mode at a time.
+
+Supporting Documents
+ACPI 6: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
+NVDIMM Namespace: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
+DSM Interface Example: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
+Driver Writer's Guide: http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
+
+Git Trees
+LIBND: https://git.kernel.org/cgit/linux/kernel/git/djbw/nvdimm.git
+LIBNDCTL: https://github.com/pmem/ndctl.git
+PMEM: https://github.com/01org/prd
+
+
+LIBND PMEM and BLK
+------------------
+
+Prior to the arrival of the NFIT, non-volatile memory was described to a
+system in various ad-hoc ways.  Usually only the bare minimum was
+provided, namely, a single system-physical-address range where writes
+are expected to be durable after a system power loss.  Now, the NFIT
+specification standardizes not only the description of PMEM, but also
+BLK and platform message-passing entry points for control and
+configuration.
+
+For each NVDIMM access method (PMEM, BLK), LIBND provides a block device driver:
+
+    1. PMEM (nd_pmem.ko): Drives a system-physical-address range.  This
+    range is contiguous in system memory and may be interleaved (hardware
+    memory controller striped) across multiple DIMMs.  When interleaved the
+    platform may optionally provide details of which DIMMs are participating
+    in the interleave.
+
+    Note that while LIBND describes system-physical-address ranges that may
+    alias with BLK access as ND_NAMESPACE_PMEM ranges and those without
+    alias as ND_NAMESPACE_IO ranges, to the nd_pmem driver there is no
+    distinction.  The different device-types are an implementation detail
+    that userspace can exploit to implement policies like "only interface
+    with address ranges from certain DIMMs".  It is worth noting that when
+    aliasing is present and a DIMM lacks a label, then no block device can
+    be created by default as userspace needs to do at least one allocation
+    of DPA to the PMEM range.  In contrast ND_NAMESPACE_IO ranges, once
+    registered, can be immediately attached to nd_pmem.
+
+    2. BLK (nd_blk.ko): This driver performs I/O using a set of platform
+    defined apertures.  A set of apertures will all access just one DIMM.
+    Multiple windows allow multiple concurrent accesses, much like
+    tagged-command-queuing, and would likely be used by different threads or
+    different CPUs.
+
+    The NFIT specification defines a standard format for a BLK-aperture, but
+    the spec also allows for vendor specific layouts, and non-NFIT BLK
+    implementations may have other designs for BLK I/O.  For this reason
+    "nd_blk" calls back into platform-specific code to perform the I/O.  One
+    such implementation is defined in the "Driver Writer's Guide" and "DSM
+    Interface Example".
+
+
+Why BLK?
+--------
+
+While PMEM provides direct byte-addressable CPU-load/store access to
+NVDIMM storage, it does not provide the best system RAS (reliability,
+availability, and serviceability) model.  An access to a corrupted
+system-physical-address causes a cpu exception, while an access to a
+corrupted address through a BLK-aperture causes that block window to
+raise an error status in a register.  The latter is more aligned with
+the standard error model that host-bus-adapter attached disks present.
+Also, if an administrator ever wants to replace a memory module it is
+easier to service a system at DIMM module boundaries.  Compare this to
+PMEM where data could be interleaved in an opaque hardware specific
+manner across several DIMMs.
+
+PMEM vs BLK
+BLK-apertures solve this RAS problem, but their presence is also the
+major contributing factor to the complexity of the ND subsystem.  They
+complicate the implementation because PMEM and BLK alias in DPA space.
+Any given DIMM's DPA-range may contribute to one or more
+system-physical-address sets of interleaved DIMMs, *and* may also be
+accessed in its entirety through its BLK-aperture.  Accessing a DPA
+through a system-physical-address while simultaneously accessing the
+same DPA through a BLK-aperture has undefined results.  For this reason,
+DIMMs with this dual interface configuration include a DSM function to
+store/retrieve a LABEL.  The LABEL effectively partitions the DPA-space
+into exclusive system-physical-address and BLK-aperture accessible
+regions.  For simplicity a DIMM is allowed a PMEM "region" per each
+interleave set in which it is a member.  The remaining DPA space can be
+carved into an arbitrary number of BLK devices with discontiguous
+extents.
+
+BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX
+--------------------------------------------------
+
+One of the few reasons to allow multiple BLK namespaces per REGION is
+so that each BLK-namespace can be configured with a BTT with unique
+atomic sector sizes.  While a PMEM device can host a BTT, the LABEL
+specification does not provide for a sector size to be specified for
+a PMEM namespace.  This is due to the expectation that the primary
+usage model for PMEM is via DAX, and the BTT is incompatible with
+DAX.  However, for the cases where an application or filesystem
+still needs atomic sector update guarantees it can register a BTT on
+a PMEM device or partition.  See the section
+LIBND/LIBNDCTL: Block Translation Table "btt".
+
+
+Example NVDIMM Platform
+-----------------------
+
+For the remainder of this document the following diagram will be
+referenced for any example sysfs layouts.
+
+
+                             (a)               (b)           DIMM   BLK-REGION
+          +-------------------+--------+--------+--------+
++------+  |       pm0.0       | blk2.0 | pm1.0  | blk2.1 |    0      region2
+| imc0 +--+- - - region0- - - +--------+        +--------+
++--+---+  |       pm0.0       | blk3.0 | pm1.0  | blk3.1 |    1      region3
+   |      +-------------------+--------v        v--------+
++--+---+                               |                 |
+| cpu0 |                                     region1
++--+---+                               |                 |
+   |      +----------------------------^        ^--------+
++--+---+  |           blk4.0           | pm1.0  | blk4.0 |    2      region4
+| imc1 +--+----------------------------|        +--------+
++------+  |           blk5.0           | pm1.0  | blk5.0 |    3      region5
+          +----------------------------+--------+--------+
+
+In this platform we have four DIMMs and two memory controllers in one
+socket.  Each unique interface (BLK or PMEM) to DPA space is identified
+by a region device with a dynamically assigned id (REGION0 - REGION5).
+
+    1. The first portion of DIMM0 and DIMM1 are interleaved as REGION0. A
+    single PMEM namespace is created in the REGION0-SPA-range that spans
+    DIMM0 and DIMM1 with a user-specified name of "pm0.0". Some of that
+    interleaved system-physical-address range is reclaimed as BLK-aperture
+    accessed space starting at DPA-offset (a) into each DIMM.  In that
+    reclaimed space we create two BLK-aperture "namespaces" from REGION2 and
+    REGION3 where "blk2.0" and "blk3.0" are just human readable names that
+    could be set to any user-desired name in the LABEL.
+
+    2. In the last portion of DIMM0 and DIMM1 we have an interleaved
+    system-physical-address range, REGION1, that spans those two DIMMs as
+    well as DIMM2 and DIMM3.  Some of REGION1 is allocated to a PMEM
+    namespace named "pm1.0"; the rest is reclaimed in 4 BLK-aperture
+    namespaces (one for each DIMM in the interleave set), "blk2.1",
+    "blk3.1", "blk4.0", and "blk5.0".
+
+    3. The portions of DIMM2 and DIMM3 that do not participate in the
+    REGION1 interleaved system-physical-address range (i.e. the DPA
+    addresses below offset (b)) are also included in the "blk4.0" and
+    "blk5.0" namespaces.  Note that this example shows that BLK-aperture
+    namespaces don't need to be contiguous in DPA-space.
+
+    This bus is provided by the kernel under the device
+    /sys/devices/platform/nfit_test.0 when CONFIG_NFIT_TEST is enabled and
+    the nfit_test.ko module is loaded.  This not only tests LIBND but the
+    acpi_nfit.ko driver as well.
+
+
+LIBND Kernel Device Model and LIBNDCTL Userspace API
+----------------------------------------------------
+
+What follows is a description of the LIBND sysfs layout and a
+corresponding object hierarchy diagram as viewed through the LIBNDCTL
+api.  The example sysfs paths and diagrams are relative to the Example
+NVDIMM Platform which is also the LIBND bus used in the LIBNDCTL unit
+test.
+
+LIBNDCTL: Context
+Every api call in the LIBNDCTL library requires a context that holds the
+logging parameters and other library instance state.  The library is
+based on the libabc template:
+https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git/
+
+LIBNDCTL: instantiate a new library context example
+
+	struct ndctl_ctx *ctx;
+
+	if (ndctl_new(&ctx) == 0)
+		return ctx;
+	else
+		return NULL;
+
+LIBND/LIBNDCTL: Bus
+-------------------
+
+A bus has a 1:1 relationship with an NFIT.  The current expectation for
+ACPI based systems is that there is only ever one platform-global NFIT.
+That said, it is trivial to register multiple NFITs; the specification
+does not preclude it.  The infrastructure supports multiple busses and
+we use this capability to test multiple NFIT configurations in the unit
+test.
+
+LIBND: control class device in /sys/class
+
+This character device accepts DSM messages to be passed to a DIMM
+identified by its NFIT handle.
+
+	/sys/class/nd/ndctl0
+	|-- dev
+	|-- device -> ../../../ndbus0
+	|-- subsystem -> ../../../../../../../class/nd
+
+
+
+LIBND: bus
+
+	struct nd_bus *nd_bus_register(struct device *parent,
+	       struct nd_bus_descriptor *nfit_desc);
+
+	/sys/devices/platform/nfit_test.0/ndbus0
+	|-- btt0
+	|-- btt_seed
+	|-- commands
+	|-- nd
+	|-- nfit
+	|-- nmem0
+	|-- nmem1
+	|-- nmem2
+	|-- nmem3
+	|-- power
+	|-- provider
+	|-- region0
+	|-- region1
+	|-- region2
+	|-- region3
+	|-- region4
+	|-- region5
+	|-- uevent
+	`-- wait_probe
+
+LIBNDCTL: bus enumeration example
+Find the bus handle that describes the bus from Example NVDIMM Platform
+
+	static struct ndctl_bus *get_bus_by_provider(struct ndctl_ctx *ctx,
+			const char *provider)
+	{
+		struct ndctl_bus *bus;
+
+		ndctl_bus_foreach(ctx, bus)
+			if (strcmp(provider, ndctl_bus_get_provider(bus)) == 0)
+				return bus;
+
+		return NULL;
+	}
+
+	bus = get_bus_by_provider(ctx, "nfit_test.0");
+
+
+LIBND/LIBNDCTL: DIMM (NMEM)
+---------------------------
+
+The DIMM device provides a character device for sending commands to
+hardware, and it is a container for LABELs.  If the DIMM is defined by
+NFIT then an optional 'nfit' attribute sub-directory is available to add
+NFIT-specifics.
+
+Note that the kernel device name for "DIMMs" is "nmemX".  The NFIT
+describes these devices via "Memory Device to System Physical Address
+Range Mapping Structure", and there is no requirement that they actually
+be physical DIMMs, so we use a more generic name.
+
+LIBND: DIMM (NMEM)
+
+	struct nd_dimm *nd_dimm_create(struct nd_bus *nd_bus, void *provider_data,
+			const struct attribute_group **groups, unsigned long flags,
+			unsigned long *dsm_mask);
+
+	/sys/devices/platform/nfit_test.0/ndbus0
+	|-- nmem0
+	|   |-- available_slots
+	|   |-- commands
+	|   |-- dev
+	|   |-- devtype
+	|   |-- driver -> ../../../../../bus/nd/drivers/nd_dimm
+	|   |-- modalias
+	|   |-- nfit
+	|   |   |-- device
+	|   |   |-- format
+	|   |   |-- handle
+	|   |   |-- phys_id
+	|   |   |-- rev_id
+	|   |   |-- serial
+	|   |   `-- vendor
+	|   |-- state
+	|   |-- subsystem -> ../../../../../bus/nd
+	|   `-- uevent
+	|-- nmem1
+	[..]
+
+
+LIBNDCTL: DIMM enumeration example
+
+Note, in this example we are assuming NFIT-defined DIMMs which are
+identified by an "nfit_handle", a 32-bit value where:
+Bit 3:0 DIMM number within the memory channel
+Bit 7:4 memory channel number
+Bit 11:8 memory controller ID
+Bit 15:12 socket ID (within scope of a Node controller if node controller is present)
+Bit 27:16 Node Controller ID
+Bit 31:28 Reserved
+
+	static struct ndctl_dimm *get_dimm_by_handle(struct ndctl_bus *bus,
+	       unsigned int handle)
+	{
+		struct ndctl_dimm *dimm;
+
+		ndctl_dimm_foreach(bus, dimm)
+			if (ndctl_dimm_get_handle(dimm) == handle)
+				return dimm;
+
+		return NULL;
+	}
+
+	#define DIMM_HANDLE(n, s, i, c, d) \
+		((((n) & 0xfff) << 16) | (((s) & 0xf) << 12) | (((i) & 0xf) << 8) \
+		 | (((c) & 0xf) << 4) | ((d) & 0xf))
+
+	dimm = get_dimm_by_handle(bus, DIMM_HANDLE(0, 0, 0, 0, 0));
+
+LIBND/LIBNDCTL: Region
+----------------------
+
+A generic REGION device is registered for each PMEM range or BLK-aperture
+set.  Per the example there are 6 regions: 2 PMEM and 4 BLK-aperture
+sets on the "nfit_test.0" bus.  The primary role of regions is to be a
+container of "mappings".  A mapping is a tuple of <DIMM,
+DPA-start-offset, length>.
+
+LIBND provides a built-in driver for these REGION devices.  This driver
+is responsible for reconciling the aliased DPA mappings across all
+regions, parsing the LABEL, if present, and then emitting NAMESPACE
+devices with the resolved/exclusive DPA-boundaries for the nd_pmem or
+nd_blk device driver to consume.
+
+In addition to the generic attributes of "mapping"s, "interleave_ways"
+and "size" the REGION device also exports some convenience attributes.
+"nstype" indicates the integer type of namespace-device this region
+emits, "devtype" duplicates the DEVTYPE variable stored by udev at the
+'add' event, "modalias" duplicates the MODALIAS variable stored by udev
+at the 'add' event, and finally, the optional "spa_index" is provided in
+the case where the region is defined by a SPA.
+
+LIBND: region
+
+	struct nd_region *nd_pmem_region_create(struct nd_bus *nd_bus,
+			struct nd_region_desc *ndr_desc);
+	struct nd_region *nd_blk_region_create(struct nd_bus *nd_bus,
+			struct nd_region_desc *ndr_desc);
+
+	/sys/devices/platform/nfit_test.0/ndbus0
+	|-- region0
+	|   |-- available_size
+	|   |-- devtype
+	|   |-- driver -> ../../../../../bus/nd/drivers/nd_region
+	|   |-- init_namespaces
+	|   |-- mapping0
+	|   |-- mapping1
+	|   |-- mappings
+	|   |-- modalias
+	|   |-- namespace0.0
+	|   |-- namespace_seed
+	|   |-- nfit
+	|   |   `-- spa_index
+	|   |-- nstype
+	|   |-- set_cookie
+	|   |-- size
+	|   |-- subsystem -> ../../../../../bus/nd
+	|   `-- uevent
+	|-- region1
+	[..]
+
+LIBNDCTL: region enumeration example
+
+Sample region retrieval routines based on NFIT-unique data like
+"spa_index" (interleave set id) for PMEM and "nfit_handle" (dimm id) for
+BLK.
+
+	static struct ndctl_region *get_pmem_region_by_spa_index(struct ndctl_bus *bus,
+			unsigned int spa_index)
+	{
+		struct ndctl_region *region;
+
+		ndctl_region_foreach(bus, region) {
+			if (ndctl_region_get_type(region) != ND_DEVICE_REGION_PMEM)
+				continue;
+			if (ndctl_region_get_spa_index(region) == spa_index)
+				return region;
+		}
+		return NULL;
+	}
+
+	static struct ndctl_region *get_blk_region_by_dimm_handle(struct ndctl_bus *bus,
+			unsigned int handle)
+	{
+		struct ndctl_region *region;
+
+		ndctl_region_foreach(bus, region) {
+			struct ndctl_mapping *map;
+
+			if (ndctl_region_get_type(region) != ND_DEVICE_REGION_BLOCK)
+				continue;
+			ndctl_mapping_foreach(region, map) {
+				struct ndctl_dimm *dimm = ndctl_mapping_get_dimm(map);
+
+				if (ndctl_dimm_get_handle(dimm) == handle)
+					return region;
+			}
+		}
+		return NULL;
+	}
+
+
+Why Not Encode the Region Type into the Region Name?
+----------------------------------------------------
+
+At first glance it seems since NFIT defines just PMEM and BLK interface
+types that we should simply name REGION devices with something derived
+from those type names.  However, the ND subsystem explicitly keeps the
+REGION name generic and expects userspace to always consider the
+region-attributes for 4 reasons:
+
+    1. There are already more than two REGION and "namespace" types.  For
+    PMEM there are two subtypes.  As mentioned previously we have PMEM where
+    the constituent DIMM devices are known and anonymous PMEM.  For BLK
+    regions the NFIT specification already anticipates vendor specific
+    implementations.  The exact distinction of what a region contains is in
+    the region-attributes not the region-name or the region-devtype.
+
+    2. A region with zero child-namespaces is a possible configuration.  For
+    example, the NFIT allows for a DCR to be published without a
+    corresponding BLK-aperture.  This equates to a DIMM that can only accept
+    control/configuration messages, but no i/o through a descendant block
+    device.  Again, this "type" is advertised in the attributes ('mappings'
+    == 0) and the name does not tell you much.
+
+    3. What if a third major interface type arises in the future?  Outside
+    of vendor specific implementations, it's not difficult to envision a
+    third class of interface type beyond BLK and PMEM.  With a generic name
+    for the REGION level of the device-hierarchy old userspace
+    implementations can still make sense of new kernel advertised
+    region-types.  Userspace can always rely on the generic region
+    attributes like "mappings", "size", etc and the expected child devices
+    named "namespace".  This generic format of the device-model hierarchy
+    allows the LIBND and LIBNDCTL implementations to be more uniform and
+    future-proof.
+
+    4. There are more robust mechanisms for determining the major type of a
+    region than a device name.  See the next section, How Do I Determine the
+    Major Type of a Region?
+
+How Do I Determine the Major Type of a Region?
+----------------------------------------------
+
+Outside of the blanket recommendation of "use libndctl", or simply
+looking at the kernel header (/usr/include/linux/ndctl.h) to decode the
+"nstype" integer attribute, here are some other options.
+
+    1. module alias lookup:
+
+    The whole point of region/namespace device type differentiation is to
+    decide which block-device driver will attach to a given LIBND namespace.
+    One can simply use the modalias to look up the resulting module.  It's
+    important to note that this method is robust in the presence of a
+    vendor-specific driver down the road.  If a vendor-specific
+    implementation wants to supplant the standard nd_blk driver it can with
+    minimal impact to the rest of LIBND.
+
+    In fact, a vendor may also want to have a vendor-specific region-driver
+    (outside of nd_region).  For example, if a vendor defined its own LABEL
+    format it would need its own region driver to parse that LABEL and emit
+    the resulting namespaces.  The output from module resolution is more
+    accurate than a region-name or region-devtype.
+
+    2. udev:
+
+    The kernel "devtype" is registered in the udev database
+    # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region0
+    P: /devices/platform/nfit_test.0/ndbus0/region0
+    E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region0
+    E: DEVTYPE=nd_pmem
+    E: MODALIAS=nd:t2
+    E: SUBSYSTEM=nd
+
+    # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region4
+    P: /devices/platform/nfit_test.0/ndbus0/region4
+    E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region4
+    E: DEVTYPE=nd_blk
+    E: MODALIAS=nd:t3
+    E: SUBSYSTEM=nd
+
+    ...and is available as a region attribute, but keep in mind that the
+    "devtype" does not indicate sub-type variations, so scripts should
+    also inspect the other attributes.
+
+    3. type specific attributes:
+
+    As it currently stands a BLK-aperture region will never have a
+    "nfit/spa_index" attribute, but neither will a non-NFIT PMEM region.  A
+    BLK region with a "mappings" value of 0 is, as mentioned above, a DIMM
+    that does not allow I/O.  A PMEM region with a "mappings" value of zero
+    is a simple system-physical-address range.
+
+
+LIBND/LIBNDCTL: Namespace
+-------------------------
+
+A REGION, after resolving DPA aliasing and LABEL specified boundaries,
+surfaces one or more "namespace" devices.  The arrival of a "namespace"
+device currently triggers either the nd_blk or nd_pmem driver to load
+and register a disk/block device.
+
+LIBND: namespace
+Here is a sample layout from the three major types of NAMESPACE where
+namespace0.0 represents DIMM-info-backed PMEM (note that it has a 'uuid'
+attribute), namespace2.0 represents a BLK namespace (note that it has a
+'sector_size' attribute), and namespace6.0 represents an anonymous PMEM
+namespace (note that it has no 'uuid' attribute because it does not
+support a LABEL).
+
+	/sys/devices/platform/nfit_test.0/ndbus0/region0/namespace0.0
+	|-- alt_name
+	|-- devtype
+	|-- dpa_extents
+	|-- modalias
+	|-- resource
+	|-- size
+	|-- subsystem -> ../../../../../../bus/nd
+	|-- type
+	|-- uevent
+	`-- uuid
+	/sys/devices/platform/nfit_test.0/ndbus0/region2/namespace2.0
+	|-- alt_name
+	|-- devtype
+	|-- dpa_extents
+	|-- modalias
+	|-- sector_size
+	|-- size
+	|-- subsystem -> ../../../../../../bus/nd
+	|-- type
+	|-- uevent
+	`-- uuid
+	/sys/devices/platform/nfit_test.1/ndbus1/region6/namespace6.0
+	|-- block
+	|   `-- pmem0
+	|-- devtype
+	|-- driver -> ../../../../../../bus/nd/drivers/pmem
+	|-- modalias
+	|-- resource
+	|-- size
+	|-- subsystem -> ../../../../../../bus/nd
+	|-- type
+	`-- uevent
+
+LIBNDCTL: namespace enumeration example
+Namespaces are indexed relative to their parent region, as in the example
+below.  These indexes are mostly static from boot to boot, but the
+subsystem makes no guarantees in this regard.  For a static namespace
+identifier use its 'uuid' attribute.
+
+static struct ndctl_namespace *get_namespace_by_id(struct ndctl_region *region,
+                unsigned int id)
+{
+        struct ndctl_namespace *ndns;
+
+        ndctl_namespace_foreach(region, ndns)
+                if (ndctl_namespace_get_id(ndns) == id)
+                        return ndns;
+
+        return NULL;
+}
+
+LIBNDCTL: namespace creation example
+Idle namespaces are automatically created by the kernel if a given
+region has enough available capacity to create a new namespace.
+Namespace instantiation involves finding an idle namespace and
+configuring it.  For the most part the setting of namespace attributes
+can occur in any order; the only constraint is that 'uuid' must be set
+before 'size'.  This enables the kernel to track DPA allocations
+internally with a static identifier.
+
+static int configure_namespace(struct ndctl_region *region,
+                struct ndctl_namespace *ndns,
+                struct namespace_parameters *parameters)
+{
+        char devname[50];
+
+        snprintf(devname, sizeof(devname), "namespace%d.%d",
+                        ndctl_region_get_id(region), parameters->id);
+
+        ndctl_namespace_set_alt_name(ndns, devname);
+        /* 'uuid' must be set prior to setting size! */
+        ndctl_namespace_set_uuid(ndns, parameters->uuid);
+        ndctl_namespace_set_size(ndns, parameters->size);
+        /* unlike pmem namespaces, blk namespaces have a sector size */
+        if (parameters->lbasize)
+                ndctl_namespace_set_sector_size(ndns, parameters->lbasize);
+        /* report whether the namespace could be enabled */
+        return ndctl_namespace_enable(ndns);
+}
+
+
+Why the Term "namespace"?
+
+    1. Why not "volume" for instance?  "volume" ran the risk of confusing ND
+    with a volume manager like device-mapper.
+
+    2. The term originated to describe the sub-devices that can be created
+    within an NVME controller (see the nvme specification:
+    http://www.nvmexpress.org/specifications/), and NFIT namespaces are
+    meant to parallel the capabilities and configurability of
+    NVME-namespaces.
+
+
+LIBND/LIBNDCTL: Block Translation Table "btt"
+---------------------------------------------
+
+A BTT (design document: http://pmem.io/2014/09/23/btt.html) is a stacked
+block device driver that fronts either the whole block device or a
+partition of a block device emitted by either a PMEM or BLK NAMESPACE.
+
+LIBND: btt layout
+Every bus will start out with at least one BTT device which is the seed
+device.  To activate it set the "backing_dev", "uuid", and "sector_size"
+attributes and then bind the device to the nd_btt driver.
+
+	/sys/devices/platform/nfit_test.1/ndbus0/btt0/
+	|-- backing_dev
+	|-- delete
+	|-- devtype
+	|-- modalias
+	|-- sector_size
+	|-- subsystem -> ../../../../../bus/nd
+	|-- uevent
+	`-- uuid
+
+LIBNDCTL: btt creation example
+Similar to namespaces, an idle BTT device is automatically created per
+bus.  Each time this "seed" btt device is configured and enabled a new
+seed is created.  Creating a BTT configuration involves two steps:
+finding an idle BTT and assigning it to front a PMEM or BLK namespace.
+
+	static struct ndctl_btt *get_idle_btt(struct ndctl_bus *bus)
+	{
+		struct ndctl_btt *btt;
+
+		ndctl_btt_foreach(bus, btt)
+			if (!ndctl_btt_is_enabled(btt) && !ndctl_btt_is_configured(btt))
+				return btt;
+
+		return NULL;
+	}
+
+	static int configure_btt(struct ndctl_bus *bus, struct btt_parameters *parameters)
+	{
+		struct ndctl_btt *btt = get_idle_btt(bus);
+		char bdevpath[50];
+
+		/* no idle seed device is available on this bus */
+		if (!btt)
+			return -ENXIO;
+
+		snprintf(bdevpath, sizeof(bdevpath), "/dev/%s",
+				ndctl_namespace_get_block_device(parameters->ndns));
+		ndctl_btt_set_uuid(btt, parameters->uuid);
+		ndctl_btt_set_sector_size(btt, parameters->sector_size);
+		ndctl_btt_set_backing_dev(btt, bdevpath);
+		return ndctl_btt_enable(btt);
+	}
+
+Once instantiated, an "nd_btt" link will be created under the
+"backing_dev" (pmem0) block device:
+
+	/sys/block/pmem0/
+	|-- alignment_offset
+	|-- bdi -> ../../../../../../../virtual/bdi/259:0
+	|-- capability
+	|-- dev
+	|-- device -> ../../../namespace0.0
+	|-- discard_alignment
+	|-- ext_range
+	|-- holders
+	|-- inflight
+	|-- nd_btt -> ../../../../btt0
+
+...and a new inactive seed device will appear on the bus.
+
+Once a "backing_dev" is disabled its associated BTT will be
+automatically deleted.  This deletion is only at the device model level.
+In order to destroy a BTT the "info block" needs to be destroyed.
+
+
+Summary LIBNDCTL Diagram
+------------------------
+
+For the given example above, here is the view of the objects as seen by the LIBNDCTL api:
+            +---+
+            |CTX|    +---------+   +--------------+  +---------------+
+            +-+-+  +-> REGION0 +---> NAMESPACE0.0 +--> PMEM8 "pm0.0" |
+              |    | +---------+   +--------------+  +---------------+
++-------+     |    | +---------+   +--------------+  +---------------+
+| DIMM0 <-+   |    +-> REGION1 +---> NAMESPACE1.0 +--> PMEM6 "pm1.0" |
++-------+ |   |    | +---------+   +--------------+  +---------------+
+| DIMM1 <-+ +-v--+ | +---------+   +--------------+  +---------------+
++-------+ +-+BUS0+---> REGION2 +-+-> NAMESPACE2.0 +--> ND6  "blk2.0" |
+| DIMM2 <-+ +----+ | +---------+ | +--------------+  +----------------------+
++-------+ |        |             +-> NAMESPACE2.1 +--> ND5  "blk2.1" | BTT2 |
+| DIMM3 <-+        |               +--------------+  +----------------------+
++-------+          | +---------+   +--------------+  +---------------+
+                   +-> REGION3 +-+-> NAMESPACE3.0 +--> ND4  "blk3.0" |
+                   | +---------+ | +--------------+  +----------------------+
+                   |             +-> NAMESPACE3.1 +--> ND3  "blk3.1" | BTT1 |
+                   |               +--------------+  +----------------------+
+                   | +---------+   +--------------+  +---------------+
+                   +-> REGION4 +---> NAMESPACE4.0 +--> ND2  "blk4.0" |
+                   | +---------+   +--------------+  +---------------+
+                   | +---------+   +--------------+  +----------------------+
+                   +-> REGION5 +---> NAMESPACE5.0 +--> ND1  "blk5.0" | BTT0 |
+                     +---------+   +--------------+  +---------------+------+
+
+
diff --git a/MAINTAINERS b/MAINTAINERS
index 4517613dc638..edb72fcc158e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5854,6 +5854,39 @@ M:	Sasha Levin <sasha.levin@oracle.com>
 S:	Maintained
 F:	tools/lib/lockdep/
 
+LIBND: NON-VOLATILE MEMORY DEVICE SUBSYSTEM
+M:	Dan Williams <dan.j.williams@intel.com>
+L:	linux-nvdimm@lists.01.org
+Q:	https://patchwork.kernel.org/project/linux-nvdimm/list/
+S:	Supported
+F:	drivers/block/nd/*
+F:	include/linux/nd.h
+F:	include/linux/libnd.h
+F:	include/uapi/linux/ndctl.h
+
+LIBND BLK: MMIO-APERTURE DRIVER
+M:	Ross Zwisler <ross.zwisler@linux.intel.com>
+L:	linux-nvdimm@lists.01.org
+Q:	https://patchwork.kernel.org/project/linux-nvdimm/list/
+S:	Supported
+F:	drivers/block/nd/blk.c
+F:	drivers/block/nd/region_devs.c
+F:	drivers/acpi/nfit*
+
+LIBND BTT: BLOCK TRANSLATION TABLE
+M:	Vishal Verma <vishal.verma@linux.intel.com>
+L:	linux-nvdimm@lists.01.org
+Q:	https://patchwork.kernel.org/project/linux-nvdimm/list/
+S:	Supported
+F:	drivers/block/nd/btt*
+
+LIBND PMEM: PERSISTENT MEMORY DRIVER
+M:	Ross Zwisler <ross.zwisler@linux.intel.com>
+L:	linux-nvdimm@lists.01.org
+Q:	https://patchwork.kernel.org/project/linux-nvdimm/list/
+S:	Supported
+F:	drivers/block/nd/pmem.c
+
 LINUX FOR IBM pSERIES (RS/6000)
 M:	Paul Mackerras <paulus@au.ibm.com>
 W:	http://www.ibm.com/linux/ltc/projects/ppc
@@ -8071,12 +8104,6 @@ S:	Maintained
 F:	Documentation/blockdev/ramdisk.txt
 F:	drivers/block/brd.c
 
-PERSISTENT MEMORY DRIVER
-M:	Ross Zwisler <ross.zwisler@linux.intel.com>
-L:	linux-nvdimm@lists.01.org
-S:	Supported
-F:	drivers/block/pmem.c
-
 RANDOM NUMBER DRIVER
 M:	"Theodore Ts'o" <tytso@mit.edu>
 S:	Maintained


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* Re: [PATCH v3 02/21] libnd, nfit: initial libnd infrastructure and NFIT support
  2015-05-20 20:56   ` Dan Williams
@ 2015-05-21 13:55     ` Toshi Kani
  -1 siblings, 0 replies; 89+ messages in thread
From: Toshi Kani @ 2015-05-21 13:55 UTC (permalink / raw)
  To: Dan Williams
  Cc: axboe, linux-nvdimm, neilb, gregkh, Rafael J. Wysocki,
	linux-kernel, Robert Moore, linux-acpi, Lv Zheng, hch, mingo

On Wed, 2015-05-20 at 16:56 -0400, Dan Williams wrote:
 :
> +/* NVDIMM - NFIT table */
> +
> +#define UUID_VOLATILE_MEMORY            "4f940573-dafd-e344-b16c-3f22d252e5d0"
> +#define UUID_PERSISTENT_MEMORY          "79d3f066-f3b4-7440-ac43-0d3318b78cdb"
> +#define UUID_CONTROL_REGION             "f601f792-b413-5d40-910b-299367e8234c"
> +#define UUID_DATA_REGION                "3005af91-865d-0e47-a6b0-0a2db9408249"
> +#define UUID_VOLATILE_VIRTUAL_DISK      "5a53ab77-fc45-4b62-5560-f7b281d1f96e"
> +#define UUID_VOLATILE_VIRTUAL_CD        "30bd5a3d-7541-ce87-6d64-d2ade523c4bb"
> +#define UUID_PERSISTENT_VIRTUAL_DISK    "c902ea5c-074d-69d3-269f-4496fbe096f9"
> +#define UUID_PERSISTENT_VIRTUAL_CD      "88810108-cd42-48bb-100f-5387d53ded3d"

acpi_str_to_uuid() performs little-endian byte-swapping, so the UUID
strings here need to be actual values.

For instance, UUID_PERSISTENT_MEMORY should be:
#define UUID_PERSISTENT_MEMORY   "66f0d379-b4f3-4074-ac43-0d3318b78cdb"

Thanks,
-Toshi


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v3 02/21] libnd, nfit: initial libnd infrastructure and NFIT support
  2015-05-21 13:55     ` Toshi Kani
@ 2015-05-21 15:56       ` Dan Williams
  -1 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2015-05-21 15:56 UTC (permalink / raw)
  To: Toshi Kani
  Cc: Jens Axboe, linux-nvdimm, Neil Brown, Greg KH, Rafael J. Wysocki,
	linux-kernel, Robert Moore, Linux ACPI, Lv Zheng,
	Christoph Hellwig, Ingo Molnar

On Thu, May 21, 2015 at 6:55 AM, Toshi Kani <toshi.kani@hp.com> wrote:
> On Wed, 2015-05-20 at 16:56 -0400, Dan Williams wrote:
>  :
>> +/* NVDIMM - NFIT table */
>> +
>> +#define UUID_VOLATILE_MEMORY            "4f940573-dafd-e344-b16c-3f22d252e5d0"
>> +#define UUID_PERSISTENT_MEMORY          "79d3f066-f3b4-7440-ac43-0d3318b78cdb"
>> +#define UUID_CONTROL_REGION             "f601f792-b413-5d40-910b-299367e8234c"
>> +#define UUID_DATA_REGION                "3005af91-865d-0e47-a6b0-0a2db9408249"
>> +#define UUID_VOLATILE_VIRTUAL_DISK      "5a53ab77-fc45-4b62-5560-f7b281d1f96e"
>> +#define UUID_VOLATILE_VIRTUAL_CD        "30bd5a3d-7541-ce87-6d64-d2ade523c4bb"
>> +#define UUID_PERSISTENT_VIRTUAL_DISK    "c902ea5c-074d-69d3-269f-4496fbe096f9"
>> +#define UUID_PERSISTENT_VIRTUAL_CD      "88810108-cd42-48bb-100f-5387d53ded3d"
>
> acpi_str_to_uuid() performs little-endian byte-swapping, so the UUID
> strings here need to be actual values.
>
> For instance, UUID_PERSISTENT_MEMORY should be:
> #define UUID_PERSISTENT_MEMORY   "66f0d379-b4f3-4074-ac43-0d3318b78cdb"
>

No, the spec defines the GUID for persistent memory as:

{ 0x66F0D379, 0xB4F3, 0x4074, 0xAC, 0x43, 0x0D, 0x33, 0x18, 0xB7, 0x8C, 0xDB }

The byte encoding for that GUID is the following (all fields stored
big endian: https://en.wikipedia.org/wiki/Globally_unique_identifier#Binary_encoding)

{ 0x66, 0xF0, 0xD3, 0x79, 0xB4, 0xF3, 0x40,0x74, 0xAC, 0x43, 0x0D,
0x33, 0x18, 0xB7, 0x8C, 0xDB }

The reverse ACPI string translation of a UUID buffer according to
"ACPI 6 - 19.6.136 ToUUID (Convert String to UUID Macro)"

{ dd, cc, bb, aa, ff, ee, hh, gg, ii, jj, kk, ll, mm, nn, oo, pp }

"aabbccdd-eeff-gghh-iijj-kkllmmnnoopp"

"79d3f066-f3b4-7440-ac43-0d3318b78cdb"

Indeed, v2 of this patchset got this wrong.  Thanks to the sharp eyes
of Bob Moore on the ACPICA team, he caught this discrepancy.  It seems
the ACPI spec uses the terms "GUID" and "UUID" interchangeably.
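
As a hedged illustration of the reverse translation described above, the
sketch below renders a 16-byte buffer as text using the byte order quoted
from the ToUUID description.  The names uuid_text_order and
uuid_buf_to_str are made up for this example; they are not ACPICA or
kernel functions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * For each byte of the text form, read left to right, the index of the
 * buffer byte that supplies it: "aabbccdd" comes from buf[3..0],
 * "eeff" from buf[5..4], "gghh" from buf[7..6], the rest in order.
 */
static const int uuid_text_order[16] = {
	3, 2, 1, 0, 5, 4, 7, 6, 8, 9, 10, 11, 12, 13, 14, 15
};

/* Hypothetical helper: render a 16-byte buffer as a 36-character string. */
static void uuid_buf_to_str(const uint8_t buf[16], char out[37])
{
	char *p = out;
	int i;

	assert(buf && out);
	for (i = 0; i < 16; i++) {
		/* dashes separate the 4-2-2-2-6 byte groups */
		if (i == 4 || i == 6 || i == 8 || i == 10)
			*p++ = '-';
		snprintf(p, 3, "%02x", buf[uuid_text_order[i]]);
		p += 2;
	}
	*p = '\0';
}
```

Fed the byte encoding listed above (66 f0 d3 79 b4 f3 40 74 ac 43 ...),
this yields the string used in the header,
"79d3f066-f3b4-7440-ac43-0d3318b78cdb".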

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v3 02/21] libnd, nfit: initial libnd infrastructure and NFIT support
  2015-05-21 15:56       ` Dan Williams
@ 2015-05-21 17:25         ` Toshi Kani
  -1 siblings, 0 replies; 89+ messages in thread
From: Toshi Kani @ 2015-05-21 17:25 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jens Axboe, linux-nvdimm, Neil Brown, Greg KH, Rafael J. Wysocki,
	linux-kernel, Robert Moore, Linux ACPI, Lv Zheng,
	Christoph Hellwig, Ingo Molnar

On Thu, 2015-05-21 at 08:56 -0700, Dan Williams wrote:
> On Thu, May 21, 2015 at 6:55 AM, Toshi Kani <toshi.kani@hp.com> wrote:
> > On Wed, 2015-05-20 at 16:56 -0400, Dan Williams wrote:
> >  :
> >> +/* NVDIMM - NFIT table */
> >> +
> >> +#define UUID_VOLATILE_MEMORY            "4f940573-dafd-e344-b16c-3f22d252e5d0"
> >> +#define UUID_PERSISTENT_MEMORY          "79d3f066-f3b4-7440-ac43-0d3318b78cdb"
> >> +#define UUID_CONTROL_REGION             "f601f792-b413-5d40-910b-299367e8234c"
> >> +#define UUID_DATA_REGION                "3005af91-865d-0e47-a6b0-0a2db9408249"
> >> +#define UUID_VOLATILE_VIRTUAL_DISK      "5a53ab77-fc45-4b62-5560-f7b281d1f96e"
> >> +#define UUID_VOLATILE_VIRTUAL_CD        "30bd5a3d-7541-ce87-6d64-d2ade523c4bb"
> >> +#define UUID_PERSISTENT_VIRTUAL_DISK    "c902ea5c-074d-69d3-269f-4496fbe096f9"
> >> +#define UUID_PERSISTENT_VIRTUAL_CD      "88810108-cd42-48bb-100f-5387d53ded3d"
> >
> > acpi_str_to_uuid() performs little-endian byte-swapping, so the UUID
> > strings here need to be actual values.
> >
> > For instance, UUID_PERSISTENT_MEMORY should be:
> > #define UUID_PERSISTENT_MEMORY   "66f0d379-b4f3-4074-ac43-0d3318b78cdb"
> >
> 
> No, the spec defines the GUID for persistent memory as:
> 
> { 0x66F0D379, 0xB4F3, 0x4074, 0xAC, 0x43, 0x0D, 0x33, 0x18, 0xB7, 0x8C, 0xDB }
> 
> The byte encoding for that GUID is the following (all fields stored
> big endian: https://en.wikipedia.org/wiki/Globally_unique_identifier#Binary_encoding)
> 
> { 0x66, 0xF0, 0xD3, 0x79, 0xB4, 0xF3, 0x40,0x74, 0xAC, 0x43, 0x0D,
> 0x33, 0x18, 0xB7, 0x8C, 0xDB }
> 
> The reverse ACPI string translation of a UUID buffer according to
> "ACPI 6 - 19.6.136 ToUUID (Convert String to UUID Macro)"
> 
> { dd, cc, bb, aa, ff, ee, hh, gg, ii, jj, kk, ll, mm, nn, oo, pp }
> 
> "aabbccdd-eeff-gghh-iijj-kkllmmnnoopp"
> 
> "79d3f066-f3b4-7440-ac43-0d3318b78cdb"
> 
> Indeed, v2 of this patchset got this wrong.  Thanks to the sharp eyes
> of Bob Moore on the ACPICA team, he caught this discrepancy.  It seems
> the ACPI spec uses the terms "GUID" and "UUID" interchangeably.

I agree that this thing is confusing...

The Wiki page you pointed states that:
===
Byte encoding
 :
This endianness applies only to the way in which a GUID is stored, and
not to the way in which it is represented in text. GUIDs and RFC 4122
UUIDs should be identical when displayed textually.

Text encoding
 :
For the first three fields, the most significant digit is on the left. 
===

Wiki page of UUID below also states that:
http://en.wikipedia.org/wiki/Universally_unique_identifier
===
Definition
 :
The first 3 sequences are interpreted as complete hexadecimal numbers,
while the final 2 as a plain sequence of bytes. The byte order is "most
significant byte first (known as network byte order)
===

So, the text encoding of GUID represents actual value; no endianness
applies here.  So, the following GUID definition:

{ 0x66F0D379, 0xB4F3, 0x4074, 0xAC, 0x43, 0x0D, 0x33, 0x18, 0xB7, 0x8C,
0xDB }

Should be text encoded as:

"66f0d379-b4f3-4074-ac43-0d3318b78cdb"

Now, byte-encoding is confusing.  While the Wiki page you pointed states
that GUID has big endian per Microsoft definition, EFI spec defines
differently.  Please look at EFI 2.5 "Appendix A GUID and Time Formats".

The EFI spec states that:
===
All EFI GUIDs (Globally Unique Identifiers) have the format described in
RFC 4122 and comply with the referenced algorithms for generating GUIDs.
It should be noted that TimeLow, TimeMid, TimeHighAndVersion fields in
the EFI are encoded as little endian.
===

Table 212 defines how text representation of the GUID is stored in
Buffer, which is little endian format.  This table also states that the
most significant byte is the first in text encoding, which is consistent
with the Wiki pages.

The ACPI spec, ToUUID, is consistent with EFI spec Table 212 as well.

Thanks,
-Toshi
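
To make the distinction between the text form and the stored form
concrete, here is a minimal sketch, assuming the EFI convention quoted
above (first three fields stored little endian, remaining eight bytes
stored in order).  struct guid_fields, guid_to_text and guid_to_efi_bytes
are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Field layout of a GUID written as { 0x66F0D379, 0xB4F3, 0x4074, ... } */
struct guid_fields {
	uint32_t time_low;
	uint16_t time_mid;
	uint16_t time_hi_and_version;
	uint8_t rest[8];
};

/* Text form: each field is printed most significant digit first. */
static void guid_to_text(const struct guid_fields *g, char out[37])
{
	assert(g && out);
	snprintf(out, 37,
		 "%08x-%04x-%04x-%02x%02x-%02x%02x%02x%02x%02x%02x",
		 (unsigned int)g->time_low, (unsigned int)g->time_mid,
		 (unsigned int)g->time_hi_and_version,
		 (unsigned int)g->rest[0], (unsigned int)g->rest[1],
		 (unsigned int)g->rest[2], (unsigned int)g->rest[3],
		 (unsigned int)g->rest[4], (unsigned int)g->rest[5],
		 (unsigned int)g->rest[6], (unsigned int)g->rest[7]);
}

/* Stored form under the assumed EFI rule: first three fields little endian. */
static void guid_to_efi_bytes(const struct guid_fields *g, uint8_t buf[16])
{
	assert(g && buf);
	buf[0] = (uint8_t)(g->time_low & 0xff);
	buf[1] = (uint8_t)((g->time_low >> 8) & 0xff);
	buf[2] = (uint8_t)((g->time_low >> 16) & 0xff);
	buf[3] = (uint8_t)((g->time_low >> 24) & 0xff);
	buf[4] = (uint8_t)(g->time_mid & 0xff);
	buf[5] = (uint8_t)(g->time_mid >> 8);
	buf[6] = (uint8_t)(g->time_hi_and_version & 0xff);
	buf[7] = (uint8_t)(g->time_hi_and_version >> 8);
	memcpy(buf + 8, g->rest, 8);
}
```

For the persistent memory GUID above, guid_to_text() gives
"66f0d379-b4f3-4074-ac43-0d3318b78cdb", while guid_to_efi_bytes() stores
79 d3 f0 66 f3 b4 74 40 followed by ac 43 0d 33 18 b7 8c db.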


^ permalink raw reply	[flat|nested] 89+ messages in thread

* RE: [PATCH v3 02/21] libnd, nfit: initial libnd infrastructure and NFIT support
  2015-05-21 17:25         ` Toshi Kani
@ 2015-05-21 17:49           ` Moore, Robert
  -1 siblings, 0 replies; 89+ messages in thread
From: Moore, Robert @ 2015-05-21 17:49 UTC (permalink / raw)
  To: Toshi Kani, Williams, Dan J
  Cc: Jens Axboe, linux-nvdimm, Neil Brown, Greg KH, Wysocki, Rafael J,
	linux-kernel, Linux ACPI, Zheng, Lv, Christoph Hellwig,
	Ingo Molnar

What ACPICA has done here is to define these values consistently with the ToUUID ASL macro:


Byte encoding of UUID/GUID strings into ACPI Buffer objects (use ToUUID from ASL):

   Input UUID/GUID String format : aabbccdd-eeff-gghh-iijj-kkllmmnnoopp
     Expected output ACPI buffer : dd,cc,bb,aa, ff,ee, hh,gg, ii,jj, kk,ll,mm,nn,oo,pp
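
The mapping above can be sketched in C as follows.  This is only an
illustration of the stated input/output format, not the ACPICA
implementation; hex_val and uuid_str_to_buf are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Value of one hex digit, or -1 if the character is not a hex digit. */
static int hex_val(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/*
 * Convert "aabbccdd-eeff-gghh-iijj-kkllmmnnoopp" into the buffer
 * dd,cc,bb,aa, ff,ee, hh,gg, ii,jj, kk,ll,mm,nn,oo,pp.
 * Returns 0 on success, -1 if the string is malformed.
 */
static int uuid_str_to_buf(const char *str, uint8_t buf[16])
{
	/* destination buffer index for each text byte, left to right */
	static const int dest[16] = {
		3, 2, 1, 0, 5, 4, 7, 6, 8, 9, 10, 11, 12, 13, 14, 15
	};
	int i, hi, lo;

	assert(str && buf);
	if (strlen(str) != 36)
		return -1;

	for (i = 0; i < 16; i++) {
		/* a dash must precede text bytes 4, 6, 8 and 10 */
		if (i == 4 || i == 6 || i == 8 || i == 10) {
			if (*str != '-')
				return -1;
			str++;
		}
		hi = hex_val(str[0]);
		lo = hex_val(str[1]);
		if (hi < 0 || lo < 0)
			return -1;
		buf[dest[i]] = (uint8_t)((hi << 4) | lo);
		str += 2;
	}
	return 0;
}
```

With this mapping the header string "79d3f066-f3b4-7440-ac43-0d3318b78cdb"
produces a buffer that starts 66 f0 d3 79, whereas
"66f0d379-b4f3-4074-ac43-0d3318b78cdb" produces one that starts
79 d3 f0 66.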




> -----Original Message-----
> From: Toshi Kani [mailto:toshi.kani@hp.com]
> Sent: Thursday, May 21, 2015 10:25 AM
> To: Williams, Dan J
> Cc: Jens Axboe; linux-nvdimm@lists.01.org; Neil Brown; Greg KH; Wysocki,
> Rafael J; linux-kernel@vger.kernel.org; Moore, Robert; Linux ACPI; Zheng,
> Lv; Christoph Hellwig; Ingo Molnar
> Subject: Re: [PATCH v3 02/21] libnd, nfit: initial libnd infrastructure
> and NFIT support
> 
> On Thu, 2015-05-21 at 08:56 -0700, Dan Williams wrote:
> > On Thu, May 21, 2015 at 6:55 AM, Toshi Kani <toshi.kani@hp.com> wrote:
> > > On Wed, 2015-05-20 at 16:56 -0400, Dan Williams wrote:
> > >  :
> > >> +/* NVDIMM - NFIT table */
> > >> +
> > >> +#define UUID_VOLATILE_MEMORY            "4f940573-dafd-e344-b16c-
> 3f22d252e5d0"
> > >> +#define UUID_PERSISTENT_MEMORY          "79d3f066-f3b4-7440-ac43-
> 0d3318b78cdb"
> > >> +#define UUID_CONTROL_REGION             "f601f792-b413-5d40-910b-
> 299367e8234c"
> > >> +#define UUID_DATA_REGION                "3005af91-865d-0e47-a6b0-
> 0a2db9408249"
> > >> +#define UUID_VOLATILE_VIRTUAL_DISK      "5a53ab77-fc45-4b62-5560-
> f7b281d1f96e"
> > >> +#define UUID_VOLATILE_VIRTUAL_CD        "30bd5a3d-7541-ce87-6d64-
> d2ade523c4bb"
> > >> +#define UUID_PERSISTENT_VIRTUAL_DISK    "c902ea5c-074d-69d3-269f-
> 4496fbe096f9"
> > >> +#define UUID_PERSISTENT_VIRTUAL_CD      "88810108-cd42-48bb-100f-
> 5387d53ded3d"
> > >
> > > acpi_str_to_uuid() performs little-endian byte-swapping, so the UUID
> > > strings here need to be actual values.
> > >
> > > For instance, UUID_PERSISTENT_MEMORY should be:
> > > #define UUID_PERSISTENT_MEMORY   "66f0d379-b4f3-4074-ac43-
> 0d3318b78cdb"
> > >
> >
> > No, the spec defines the GUID for persistent memory as:
> >
> > { 0x66F0D379, 0xB4F3, 0x4074, 0xAC, 0x43, 0x0D, 0x33, 0x18, 0xB7,
> > 0x8C, 0xDB }
> >
> > The byte encoding for that GUID is the following (all fields stored
> > big endian:
> > https://en.wikipedia.org/wiki/Globally_unique_identifier#Binary_encodi
> > ng)
> >
> > { 0x66, 0xF0, 0xD3, 0x79, 0xB4, 0xF3, 0x40,0x74, 0xAC, 0x43, 0x0D,
> > 0x33, 0x18, 0xB7, 0x8C, 0xDB }
> >
> > The reverse ACPI string translation of a UUID buffer according to
> > "ACPI 6 - 19.6.136 ToUUID (Convert String to UUID Macro)"
> >
> > { dd, cc, bb, aa, ff, ee, hh, gg, ii, jj, kk, ll, mm, nn, oo, pp }
> >
> > "aabbccdd-eeff-gghh-iijj-kkllmmnnoopp"
> >
> > "79d3f066-f3b4-7440-ac43-0d3318b78cdb"
> >
> > Indeed, v2 of this patchset got this wrong.  Thanks to the sharp eyes
> > of Bob Moore on the ACPICA team, he caught this discrepancy.  It seems
> > the ACPI spec uses the terms "GUID" and "UUID" interchangeably.
> 
> I agree that this thing is confusing...
> 
> The Wiki page you pointed states that:
> ===
> Byte encoding
>  :
> This endianness applies only to the way in which a GUID is stored, and not
> to the way in which it is represented in text. GUIDs and RFC 4122 UUIDs
> should be identical when displayed textually.
> 
> Text encoding
>  :
> For the first three fields, the most significant digit is on the left.
> ===
> 
> Wiki page of UUID below also states that:
> http://en.wikipedia.org/wiki/Universally_unique_identifier
> ===
> Definition
>  :
> The first 3 sequences are interpreted as complete hexadecimal numbers,
> while the final 2 as a plain sequence of bytes. The byte order is "most
> significant byte first (known as network byte order) ===
> 
> So, the text encoding of GUID represents actual value; no endianness
> applies here.  So, the following GUID definition:
> 
> { 0x66F0D379, 0xB4F3, 0x4074, 0xAC, 0x43, 0x0D, 0x33, 0x18, 0xB7, 0x8C,
> 0xDB }
> 
> Should be text encoded as:
> 
> "66f0d379-b4f3-4074-ac43-0d3318b78cdb"
> 
> Now, byte-encoding is confusing.  While the Wiki page you pointed states
> that GUID has big endian per Microsoft definition, EFI spec defines
> differently.  Please look at EFI 2.5 "Appendix A GUID and Time Formats".
> 
> The EFI spec states that:
> ===
> All EFI GUIDs (Globally Unique Identifiers) have the format described in
> RFC 4122 and comply with the referenced algorithms for generating GUIDs.
> It should be noted that TimeLow, TimeMid, TimeHighAndVersion fields in the
> EFI are encoded as little endian.
> ===
> 
> Table 212 defines how text representation of the GUID is stored in Buffer,
> which is little endian format.  This table also states that the most
> significant byte is the first in text encoding, which is consistent with
> the Wiki pages.
> 
> The ACPI spec, ToUUID, is consistent with EFI spec Table 212 as well.
> 
> Thanks,
> -Toshi


^ permalink raw reply	[flat|nested] 89+ messages in thread

* RE: [PATCH v3 02/21] libnd, nfit: initial libnd infrastructure and NFIT support
@ 2015-05-21 17:49           ` Moore, Robert
  0 siblings, 0 replies; 89+ messages in thread
From: Moore, Robert @ 2015-05-21 17:49 UTC (permalink / raw)
  To: Toshi Kani, Williams, Dan J
  Cc: Jens Axboe, linux-nvdimm@lists.01.org, Neil Brown, Greg KH,
	Wysocki, Rafael J, linux-kernel, Linux ACPI, Zheng, Lv,
	Christoph Hellwig, Ingo Molnar

What ACPICA has done here is to define these values consistently with the ToUUID ASL macro:


Byte encoding of UUID/GUID strings into ACPI Buffer objects (use ToUUID from ASL):

   Input UUID/GUID String format : aabbccdd-eeff-gghh-iijj-kkllmmnnoopp
     Expected output ACPI buffer : dd,cc,bb,aa, ff,ee, hh,gg, ii,jj, kk,ll,mm,nn,oo,pp
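
The mapping above can be sketched in C as follows. This is only an illustration under stated assumptions, not the ACPICA implementation: the names uuid_str_offset, hex_val and uuid_str_to_acpi_buf are hypothetical, and the offset table is derived directly from the "Expected output ACPI buffer" line.

```c
#include <stdint.h>
#include <string.h>

/*
 * String offset of the hex pair that feeds each output byte, in the
 * order dd,cc,bb,aa, ff,ee, hh,gg, ii,jj, kk,ll,mm,nn,oo,pp.
 * (Hypothetical table, derived from the mapping quoted above.)
 */
static const uint8_t uuid_str_offset[16] = {
    6, 4, 2, 0, 11, 9, 16, 14, 19, 21, 24, 26, 28, 30, 32, 34
};

/* Convert one hex digit to its value, or -1 if it is not a hex digit. */
static int hex_val(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    if (c >= 'A' && c <= 'F')
        return c - 'A' + 10;
    return -1;
}

/* Returns 0 on success, -1 if s is not a well-formed 36-character UUID. */
static int uuid_str_to_acpi_buf(const char *s, uint8_t out[16])
{
    size_t i;

    if (strlen(s) != 36 || s[8] != '-' || s[13] != '-' ||
        s[18] != '-' || s[23] != '-')
        return -1;

    for (i = 0; i < 16; i++) {
        int hi = hex_val(s[uuid_str_offset[i]]);
        int lo = hex_val(s[uuid_str_offset[i] + 1]);

        if (hi < 0 || lo < 0)
            return -1;
        out[i] = (uint8_t)((hi << 4) | lo);
    }
    return 0;
}
```

With this sketch, the string "79d3f066-f3b4-7440-ac43-0d3318b78cdb" produces a buffer that starts with 66 f0 d3 79, which is the byte sequence quoted further down in this thread.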




> -----Original Message-----
> From: Toshi Kani [mailto:toshi.kani@hp.com]
> Sent: Thursday, May 21, 2015 10:25 AM
> To: Williams, Dan J
> Cc: Jens Axboe; linux-nvdimm@lists.01.org; Neil Brown; Greg KH; Wysocki,
> Rafael J; linux-kernel@vger.kernel.org; Moore, Robert; Linux ACPI; Zheng,
> Lv; Christoph Hellwig; Ingo Molnar
> Subject: Re: [PATCH v3 02/21] libnd, nfit: initial libnd infrastructure
> and NFIT support
> 
> On Thu, 2015-05-21 at 08:56 -0700, Dan Williams wrote:
> > On Thu, May 21, 2015 at 6:55 AM, Toshi Kani <toshi.kani@hp.com> wrote:
> > > On Wed, 2015-05-20 at 16:56 -0400, Dan Williams wrote:
> > >  :
> > >> +/* NVDIMM - NFIT table */
> > >> +
> > >> +#define UUID_VOLATILE_MEMORY            "4f940573-dafd-e344-b16c-3f22d252e5d0"
> > >> +#define UUID_PERSISTENT_MEMORY          "79d3f066-f3b4-7440-ac43-0d3318b78cdb"
> > >> +#define UUID_CONTROL_REGION             "f601f792-b413-5d40-910b-299367e8234c"
> > >> +#define UUID_DATA_REGION                "3005af91-865d-0e47-a6b0-0a2db9408249"
> > >> +#define UUID_VOLATILE_VIRTUAL_DISK      "5a53ab77-fc45-4b62-5560-f7b281d1f96e"
> > >> +#define UUID_VOLATILE_VIRTUAL_CD        "30bd5a3d-7541-ce87-6d64-d2ade523c4bb"
> > >> +#define UUID_PERSISTENT_VIRTUAL_DISK    "c902ea5c-074d-69d3-269f-4496fbe096f9"
> > >> +#define UUID_PERSISTENT_VIRTUAL_CD      "88810108-cd42-48bb-100f-5387d53ded3d"
> > >
> > > acpi_str_to_uuid() performs little-endian byte-swapping, so the UUID
> > > strings here need to be actual values.
> > >
> > > For instance, UUID_PERSISTENT_MEMORY should be:
> > > #define UUID_PERSISTENT_MEMORY   "66f0d379-b4f3-4074-ac43-0d3318b78cdb"
> > >
> >
> > No, the spec defines the GUID for persistent memory as:
> >
> > { 0x66F0D379, 0xB4F3, 0x4074, 0xAC, 0x43, 0x0D, 0x33, 0x18, 0xB7,
> > 0x8C, 0xDB }
> >
> > The byte encoding for that GUID is the following (all fields stored
> > big endian:
> > https://en.wikipedia.org/wiki/Globally_unique_identifier#Binary_encoding)
> >
> > { 0x66, 0xF0, 0xD3, 0x79, 0xB4, 0xF3, 0x40,0x74, 0xAC, 0x43, 0x0D,
> > 0x33, 0x18, 0xB7, 0x8C, 0xDB }
> >
> > The reverse ACPI string translation of a UUID buffer according to
> > "ACPI 6 - 19.6.136 ToUUID (Convert String to UUID Macro)"
> >
> > { dd, cc, bb, aa, ff, ee, hh, gg, ii, jj, kk, ll, mm, nn, oo, pp }
> >
> > "aabbccdd-eeff-gghh-iijj-kkllmmnnoopp"
> >
> > "79d3f066-f3b4-7440-ac43-0d3318b78cdb"
> >
> > Indeed, v2 of this patchset got this wrong.  Thanks to the sharp eyes
> > of Bob Moore on the ACPICA team, he caught this discrepancy.  It seems
> > the ACPI spec uses the terms "GUID" and "UUID" interchangeably.
> 
> I agree that this thing is confusing...
> 
> The Wiki page you pointed states that:
> ===
> Byte encoding
>  :
> This endianness applies only to the way in which a GUID is stored, and not
> to the way in which it is represented in text. GUIDs and RFC 4122 UUIDs
> should be identical when displayed textually.
> 
> Text encoding
>  :
> For the first three fields, the most significant digit is on the left.
> ===
> 
> Wiki page of UUID below also states that:
> http://en.wikipedia.org/wiki/Universally_unique_identifier
> ===
> Definition
>  :
> The first 3 sequences are interpreted as complete hexadecimal numbers,
> while the final 2 as a plain sequence of bytes. The byte order is "most
> significant byte first (known as network byte order) ===
> 
> So, the text encoding of GUID represents actual value; no endianness
> applies here.  So, the following GUID definition:
> 
> { 0x66F0D379, 0xB4F3, 0x4074, 0xAC, 0x43, 0x0D, 0x33, 0x18, 0xB7, 0x8C,
> 0xDB }
> 
> Should be text encoded as:
> 
> "66f0d379-b4f3-4074-ac43-0d3318b78cdb"
> 
> Now, byte-encoding is confusing.  While the Wiki page you pointed states
> that GUID has big endian per Microsoft definition, EFI spec defines
> differently.  Please look at EFI 2.5 "Appendix A GUID and Time Formats".
> 
> The EFI spec states that:
> ===
> All EFI GUIDs (Globally Unique Identifiers) have the format described in
> RFC 4122 and comply with the referenced algorithms for generating GUIDs.
> It should be noted that TimeLow, TimeMid, TimeHighAndVersion fields in the
> EFI are encoded as little endian.
> ===
> 
> Table 212 defines how text representation of the GUID is stored in Buffer,
> which is little endian format.  This table also states that the most
> significant byte is the first in text encoding, which is consistent with
> the Wiki pages.
> 
> The ACPI spec, ToUUID, is consistent with EFI spec Table 212 as well.
> 
> Thanks,
> -Toshi

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v3 02/21] libnd, nfit: initial libnd infrastructure and NFIT support
  2015-05-21 17:49           ` Moore, Robert
@ 2015-05-21 18:01             ` Toshi Kani
  -1 siblings, 0 replies; 89+ messages in thread
From: Toshi Kani @ 2015-05-21 18:01 UTC (permalink / raw)
  To: Moore, Robert
  Cc: Williams, Dan J, Jens Axboe, linux-nvdimm, Neil Brown, Greg KH,
	Wysocki, Rafael J, linux-kernel, Linux ACPI, Zheng, Lv,
	Christoph Hellwig, Ingo Molnar

On Thu, 2015-05-21 at 17:49 +0000, Moore, Robert wrote:
> What ACPICA has done here is to define these values consistently with the ToUUID ASL macro:
> 
> 
> Byte encoding of UUID/GUID strings into ACPI Buffer objects (use ToUUID from ASL):
> 
>    Input UUID/GUID String format : aabbccdd-eeff-gghh-iijj-kkllmmnnoopp
>      Expected output ACPI buffer : dd,cc,bb,aa, ff,ee, hh,gg, ii,jj, kk,ll,mm,nn,oo,pp
> 

I do not see any issue in this conversion, which is consistent with
ToUUID defined in ACPI spec.

My point is that the string format of GUID is endian-neutral.  Wiki
pages and EFI spec agree on it.  EFI 2.5 spec, Table 225 (sorry not
Table 212, which is v2.4), is also clear about how String and Buffer are
related with actual values of GUID.
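
To illustrate what "endian-neutral" means for the string form, here is a minimal sketch that formats the text directly from the field values, with no byte buffer involved. The struct guid_fields and the function guid_fields_to_text are hypothetical names used only for this example.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical field-wise view of a GUID (Data1..Data4 convention). */
struct guid_fields {
    uint32_t data1;
    uint16_t data2;
    uint16_t data3;
    uint8_t  data4[8];
};

/*
 * Write the 36-character text form plus the terminating NUL into out.
 * Each numeric field is printed most significant digit first, so the
 * result does not depend on how the fields are stored in memory.
 * Returns 0 on success, -1 on a formatting failure.
 */
static int guid_fields_to_text(const struct guid_fields *g, char out[37])
{
    int n = snprintf(out, 37,
        "%08lx-%04x-%04x-%02x%02x-%02x%02x%02x%02x%02x%02x",
        (unsigned long)g->data1, (unsigned)g->data2, (unsigned)g->data3,
        (unsigned)g->data4[0], (unsigned)g->data4[1],
        (unsigned)g->data4[2], (unsigned)g->data4[3],
        (unsigned)g->data4[4], (unsigned)g->data4[5],
        (unsigned)g->data4[6], (unsigned)g->data4[7]);

    return n == 36 ? 0 : -1;
}
```

Given the fields { 0x66F0D379, 0xB4F3, 0x4074, 0xAC, 0x43, ... } from the spec definition quoted below, this prints "66f0d379-b4f3-4074-ac43-0d3318b78cdb".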

Thanks,
-Toshi


> 
> 
> > -----Original Message-----
> > From: Toshi Kani [mailto:toshi.kani@hp.com]
> > Sent: Thursday, May 21, 2015 10:25 AM
> > To: Williams, Dan J
> > Cc: Jens Axboe; linux-nvdimm@lists.01.org; Neil Brown; Greg KH; Wysocki,
> > Rafael J; linux-kernel@vger.kernel.org; Moore, Robert; Linux ACPI; Zheng,
> > Lv; Christoph Hellwig; Ingo Molnar
> > Subject: Re: [PATCH v3 02/21] libnd, nfit: initial libnd infrastructure
> > and NFIT support
> > 
> > On Thu, 2015-05-21 at 08:56 -0700, Dan Williams wrote:
> > > On Thu, May 21, 2015 at 6:55 AM, Toshi Kani <toshi.kani@hp.com> wrote:
> > > > On Wed, 2015-05-20 at 16:56 -0400, Dan Williams wrote:
> > > >  :
> > > >> +/* NVDIMM - NFIT table */
> > > >> +
> > > >> +#define UUID_VOLATILE_MEMORY            "4f940573-dafd-e344-b16c-3f22d252e5d0"
> > > >> +#define UUID_PERSISTENT_MEMORY          "79d3f066-f3b4-7440-ac43-0d3318b78cdb"
> > > >> +#define UUID_CONTROL_REGION             "f601f792-b413-5d40-910b-299367e8234c"
> > > >> +#define UUID_DATA_REGION                "3005af91-865d-0e47-a6b0-0a2db9408249"
> > > >> +#define UUID_VOLATILE_VIRTUAL_DISK      "5a53ab77-fc45-4b62-5560-f7b281d1f96e"
> > > >> +#define UUID_VOLATILE_VIRTUAL_CD        "30bd5a3d-7541-ce87-6d64-d2ade523c4bb"
> > > >> +#define UUID_PERSISTENT_VIRTUAL_DISK    "c902ea5c-074d-69d3-269f-4496fbe096f9"
> > > >> +#define UUID_PERSISTENT_VIRTUAL_CD      "88810108-cd42-48bb-100f-5387d53ded3d"
> > > >
> > > > acpi_str_to_uuid() performs little-endian byte-swapping, so the UUID
> > > > strings here need to be actual values.
> > > >
> > > > For instance, UUID_PERSISTENT_MEMORY should be:
> > > > #define UUID_PERSISTENT_MEMORY   "66f0d379-b4f3-4074-ac43-0d3318b78cdb"
> > > >
> > >
> > > No, the spec defines the GUID for persistent memory as:
> > >
> > > { 0x66F0D379, 0xB4F3, 0x4074, 0xAC, 0x43, 0x0D, 0x33, 0x18, 0xB7,
> > > 0x8C, 0xDB }
> > >
> > > The byte encoding for that GUID is the following (all fields stored
> > > big endian:
> > > https://en.wikipedia.org/wiki/Globally_unique_identifier#Binary_encoding)
> > >
> > > { 0x66, 0xF0, 0xD3, 0x79, 0xB4, 0xF3, 0x40,0x74, 0xAC, 0x43, 0x0D,
> > > 0x33, 0x18, 0xB7, 0x8C, 0xDB }
> > >
> > > The reverse ACPI string translation of a UUID buffer according to
> > > "ACPI 6 - 19.6.136 ToUUID (Convert String to UUID Macro)"
> > >
> > > { dd, cc, bb, aa, ff, ee, hh, gg, ii, jj, kk, ll, mm, nn, oo, pp }
> > >
> > > "aabbccdd-eeff-gghh-iijj-kkllmmnnoopp"
> > >
> > > "79d3f066-f3b4-7440-ac43-0d3318b78cdb"
> > >
> > > Indeed, v2 of this patchset got this wrong.  Thanks to the sharp eyes
> > > of Bob Moore on the ACPICA team, he caught this discrepancy.  It seems
> > > the ACPI spec uses the terms "GUID" and "UUID" interchangeably.
> > 
> > I agree that this thing is confusing...
> > 
> > The Wiki page you pointed states that:
> > ===
> > Byte encoding
> >  :
> > This endianness applies only to the way in which a GUID is stored, and not
> > to the way in which it is represented in text. GUIDs and RFC 4122 UUIDs
> > should be identical when displayed textually.
> > 
> > Text encoding
> >  :
> > For the first three fields, the most significant digit is on the left.
> > ===
> > 
> > Wiki page of UUID below also states that:
> > http://en.wikipedia.org/wiki/Universally_unique_identifier
> > ===
> > Definition
> >  :
> > The first 3 sequences are interpreted as complete hexadecimal numbers,
> > while the final 2 as a plain sequence of bytes. The byte order is "most
> > significant byte first (known as network byte order) ===
> > 
> > So, the text encoding of GUID represents actual value; no endianness
> > applies here.  So, the following GUID definition:
> > 
> > { 0x66F0D379, 0xB4F3, 0x4074, 0xAC, 0x43, 0x0D, 0x33, 0x18, 0xB7, 0x8C,
> > 0xDB }
> > 
> > Should be text encoded as:
> > 
> > "66f0d379-b4f3-4074-ac43-0d3318b78cdb"
> > 
> > Now, byte-encoding is confusing.  While the Wiki page you pointed states
> > that GUID has big endian per Microsoft definition, EFI spec defines
> > differently.  Please look at EFI 2.5 "Appendix A GUID and Time Formats".
> > 
> > The EFI spec states that:
> > ===
> > All EFI GUIDs (Globally Unique Identifiers) have the format described in
> > RFC 4122 and comply with the referenced algorithms for generating GUIDs.
> > It should be noted that TimeLow, TimeMid, TimeHighAndVersion fields in the
> > EFI are encoded as little endian.
> > ===
> > 
> > Table 212 defines how text representation of the GUID is stored in Buffer,
> > which is little endian format.  This table also states that the most
> > significant byte is the first in text encoding, which is consistent with
> > the Wiki pages.
> > 
> > The ACPI spec, ToUUID, is consistent with EFI spec Table 212 as well.
> > 
> > Thanks,
> > -Toshi
> 



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v3 02/21] libnd, nfit: initial libnd infrastructure and NFIT support
  2015-05-21 18:01             ` Toshi Kani
@ 2015-05-21 19:06               ` Dan Williams
  -1 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2015-05-21 19:06 UTC (permalink / raw)
  To: Toshi Kani
  Cc: Moore, Robert, Jens Axboe, linux-nvdimm, Neil Brown, Greg KH,
	Wysocki, Rafael J, linux-kernel, Linux ACPI, Zheng, Lv,
	Christoph Hellwig, Ingo Molnar

On Thu, May 21, 2015 at 11:01 AM, Toshi Kani <toshi.kani@hp.com> wrote:
> On Thu, 2015-05-21 at 17:49 +0000, Moore, Robert wrote:
>> What ACPICA has done here is to define these values consistently with the ToUUID ASL macro:
>>
>>
>> Byte encoding of UUID/GUID strings into ACPI Buffer objects (use ToUUID from ASL):
>>
>>    Input UUID/GUID String format : aabbccdd-eeff-gghh-iijj-kkllmmnnoopp
>>      Expected output ACPI buffer : dd,cc,bb,aa, ff,ee, hh,gg, ii,jj, kk,ll,mm,nn,oo,pp
>>
>
> I do not see any issue in this conversion, which is consistent with
> ToUUID defined in ACPI spec.
>
> My point is that the string format of GUID is endian-neutral.  Wiki
> pages and EFI spec agree on it.  EFI 2.5 spec, Table 225 (sorry not
> Table 212, which is v2.4), is also clear about how String and Buffer are
> related with actual values of GUID.

I think the critical point from the UEFI spec is the "It should also
be noted that TimeLow, TimeMid, TimeHighAndVersion fields in the EFI
are encoded as little endian".  That would imply the byte encoding
of...

{ 0x92F701F6, 0x13B4, 0x405D, 0x91, 0x0B, 0x29, 0x93, 0x67, 0xE8, 0x23, 0x4C }

...should be:

{ f6,01,f7,92,b4,13,5d,40,91,0b,29,93,67,e8,23,4c }

Which implies the text conversion should be:

"92f701f6-13b4-405d-910b-299367e8234c"

...not

> +#define UUID_CONTROL_REGION             "f601f792-b413-5d40-910b-299367e8234c"

I think ACPICA has the right order for a standard RFC 4122 id, but it
seems EFI is explicitly clarifying that the encoding is little endian
for the initial fields.  I think the EFI definition applies due to
this note in the NFIT section of the ACPI spec: "The Address Range
Type GUID values used in the ACPI NFIT must match the corresponding
values in the Disk Type GUID of the RAM Disk device path that describe
the same RAM Disk Type. Refer to the UEFI specification for details."

In hindsight it would have been nice if the NFIT spec had used an
unambiguous text encoding to define these values.
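
As a sketch of the EFI byte encoding described above (the first three fields stored little endian, the remaining eight bytes stored in order), assuming a hypothetical helper named efi_guid_encode:

```c
#include <stdint.h>
#include <string.h>

/*
 * Serialize GUID field values into a 16-byte buffer using the EFI
 * convention: data1, data2 and data3 are written least significant
 * byte first, data4 is copied as-is. (Hypothetical helper.)
 */
static void efi_guid_encode(uint32_t data1, uint16_t data2, uint16_t data3,
                            const uint8_t data4[8], uint8_t out[16])
{
    out[0] = (uint8_t)(data1 & 0xff);
    out[1] = (uint8_t)((data1 >> 8) & 0xff);
    out[2] = (uint8_t)((data1 >> 16) & 0xff);
    out[3] = (uint8_t)((data1 >> 24) & 0xff);
    out[4] = (uint8_t)(data2 & 0xff);
    out[5] = (uint8_t)((data2 >> 8) & 0xff);
    out[6] = (uint8_t)(data3 & 0xff);
    out[7] = (uint8_t)((data3 >> 8) & 0xff);
    memcpy(&out[8], data4, 8);
}
```

For the field values 0x92F701F6, 0x13B4, 0x405D and the trailing bytes 91 0b 29 93 67 e8 23 4c, this yields the buffer f6 01 f7 92 b4 13 5d 40 91 0b 29 93 67 e8 23 4c given above.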

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v3 02/21] libnd, nfit: initial libnd infrastructure and NFIT support
  2015-05-21 19:06               ` Dan Williams
  (?)
@ 2015-05-21 19:44                 ` Toshi Kani
  -1 siblings, 0 replies; 89+ messages in thread
From: Toshi Kani @ 2015-05-21 19:44 UTC (permalink / raw)
  To: Dan Williams
  Cc: Moore, Robert, Jens Axboe, linux-nvdimm, Neil Brown, Greg KH,
	Wysocki, Rafael J, linux-kernel, Linux ACPI, Zheng, Lv,
	Christoph Hellwig, Ingo Molnar

On Thu, 2015-05-21 at 12:06 -0700, Dan Williams wrote:
> On Thu, May 21, 2015 at 11:01 AM, Toshi Kani <toshi.kani@hp.com> wrote:
> > On Thu, 2015-05-21 at 17:49 +0000, Moore, Robert wrote:
> >> What ACPICA has done here is to define these values consistently with the ToUUID ASL macro:
> >>
> >>
> >> Byte encoding of UUID/GUID strings into ACPI Buffer objects (use ToUUID from ASL):
> >>
> >>    Input UUID/GUID String format : aabbccdd-eeff-gghh-iijj-kkllmmnnoopp
> >>      Expected output ACPI buffer : dd,cc,bb,aa, ff,ee, hh,gg, ii,jj, kk,ll,mm,nn,oo,pp
> >>
> >
> > I do not see any issue in this conversion, which is consistent with
> > ToUUID defined in ACPI spec.
> >
> > My point is that the string format of GUID is endian-neutral.  Wiki
> > pages and EFI spec agree on it.  EFI 2.5 spec, Table 225 (sorry not
> > Table 212, which is v2.4), is also clear about how String and Buffer are
> > related with actual values of GUID.
> 
> I think the critical point from the UEFI spec is the "It should also
> be noted that TimeLow, TimeMid, TimeHighAndVersion fields in the EFI
> are encoded as little endian".  That would imply the byte encoding
> of...
> 
> { 0x92F701F6, 0x13B4, 0x405D, 0x91, 0x0B, 0x29, 0x93, 0x67, 0xE8, 0x23, 0x4C }
> 
> ...should be:
> 
> { f6,01,f7,92,b4,13,5d,40,91,0b,29,93,67,e8,23,4c }

The above NFIT GUID as data values means:

EFI_GUID(0x92F701F6, 0x13B4, 0x405D, 0x91, 0x0B, 0x29, 0x93, 0x67, 0xE8,
0x23, 0x4C)

> Which implies the text conversion should be:
> 
> "92f701f6-13b4-405d-910b-299367e8234c"

Nope.

EFI 2.5 spec, Appendix A "GUID and Time Formats" defines that:
(NOTE, I simplified the table 225 to fit in this email)
===
This specification also defines a standard text representation of the
GUID. This format is also sometimes called the “registry format”. It
consists of 36 characters, as follows:

aabbccdd-eeff-gghh-iijj-kkllmmnnoopp
 :

Table 225. Text representation relationships
String	Offset In Buffer   EFI_GUID
aa	3                  Data1[24:31]
bb      2                  Data1[16:23]
cc      1                  Data1[8:15]
dd      0                  Data1[0:7]
 :
===

Therefore:

aa = Data1[24:31] = 92
bb = Data1[16:23] = F7
cc = Data1[8:15]  = 01
dd = Data1[0:7]   = F6

> ...not
> 
> > +#define UUID_CONTROL_REGION             "f601f792-b413-5d40-910b-299367e8234c"

Hence, the above string is correct.

ToUUID then stores the given string to Buffer according to "Offset In
Buffer" in the above table.

Another example, EFI 2.5 spec defines GPT partition GUID:

===
Table 19. Defined GPT Partition Entry - Partition Type GUIDs
EFI System Partition C12A7328-F81F-11D2-BA4B-00A0C93EC93B
===

The kernel defines it as:
#define PARTITION_SYSTEM_GUID \
    EFI_GUID( 0xC12A7328, 0xF81F, 0x11d2, \
              0xBA, 0x4B, 0x00, 0xA0, 0xC9, 0x3E, 0xC9, 0x3B)
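
The buffer-to-text direction can be sketched as follows. The function name guid_buf_to_text is hypothetical. Only the aa..dd rows of Table 225 are quoted above, so the offsets used here for the remaining characters are an assumption taken from the ToUUID mapping quoted earlier in this thread (ff,ee, hh,gg, then ii..pp in order).

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Produce the 36-character text form from a 16-byte buffer.
 * The first four bytes, then the two 2-byte groups, are read in
 * reverse order; the last eight bytes are read in order.
 * Returns 0 on success, -1 on a formatting failure.
 */
static int guid_buf_to_text(const uint8_t b[16], char out[37])
{
    int n = snprintf(out, 37,
        "%02x%02x%02x%02x-%02x%02x-%02x%02x-"
        "%02x%02x-%02x%02x%02x%02x%02x%02x",
        (unsigned)b[3], (unsigned)b[2], (unsigned)b[1], (unsigned)b[0],
        (unsigned)b[5], (unsigned)b[4],
        (unsigned)b[7], (unsigned)b[6],
        (unsigned)b[8], (unsigned)b[9],
        (unsigned)b[10], (unsigned)b[11], (unsigned)b[12],
        (unsigned)b[13], (unsigned)b[14], (unsigned)b[15]);

    return n == 36 ? 0 : -1;
}
```

For a buffer holding 28 73 2a c1 1f f8 d2 11 ba 4b 00 a0 c9 3e c9 3b, this gives "c12a7328-f81f-11d2-ba4b-00a0c93ec93b", the EFI System Partition string from Table 19 in lower case.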

Thanks,
-Toshi


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v3 02/21] libnd, nfit: initial libnd infrastructure and NFIT support
@ 2015-05-21 19:44                 ` Toshi Kani
  0 siblings, 0 replies; 89+ messages in thread
From: Toshi Kani @ 2015-05-21 19:44 UTC (permalink / raw)
  To: Dan Williams
  Cc: Moore, Robert, Jens Axboe, linux-nvdimm, Neil Brown, Greg KH,
	Wysocki, Rafael J, linux-kernel, Linux ACPI, Zheng, Lv,
	Christoph Hellwig, Ingo Molnar

On Thu, 2015-05-21 at 12:06 -0700, Dan Williams wrote:
> On Thu, May 21, 2015 at 11:01 AM, Toshi Kani <toshi.kani@hp.com> wrote:
> > On Thu, 2015-05-21 at 17:49 +0000, Moore, Robert wrote:
> >> What ACPICA has done here is to define these values consistently with the ToUUID ASL macro:
> >>
> >>
> >> Byte encoding of UUID/GUID strings into ACPI Buffer objects (use ToUUID from ASL):
> >>
> >>    Input UUID/GUID String format : aabbccdd-eeff-gghh-iijj-kkllmmnnoopp
> >>      Expected output ACPI buffer : dd,cc,bb,aa, ff,ee, hh,gg, ii,jj, kk,ll,mm,nn,oo,pp
> >>
> >
> > I do not see any issue in this conversion, which is consistent with
> > ToUUID defined in ACPI spec.
> >
> > My point is that the string format of GUID is endian-neutral.  Wiki
> > pages and EFI spec agree on it.  EFI 2.5 spec, Table 225 (sorry not
> > Table 212, which is v2.4), is also clear about how String and Buffer are
> > related with actual values of GUID.
> 
> I think the critical point from the UEFI spec is the "It should also
> be noted that TimeLow, TimeMid, TimeHighAndVersion fields in the EFI
> are encoded as little endian".  That would imply the byte encoding
> of...
> 
> { 0x92F701F6, 0x13B4, 0x405D, 0x91, 0x0B, 0x29, 0x93, 0x67, 0xE8, 0x23, 0x4C }
> 
> ...should be:
> 
> { f6,01,f7,92,b4,13,5d,40,91,0b,29,93,67,e8,23,4c }

The above NFIT GUID as data values means:

EFI_GUID(0x92F701F6, 0x13B4, 0x405D, 0x91, 0x0B, 0x29, 0x93, 0x67, 0xE8,
0x23, 0x4C)

> Which implies the text conversion should be:
> 
> "92f701f6-13b4-405d-910b-299367e8234c"

Nope.

EFI 2.5 spec, Appendix A "GUID and Time Formats" defines that:
(NOTE, I simplified the table 225 to fit in this email)
==
This specification also defines a standard text representation of the
GUID. This format is also sometimes called the “registry format”. It
consists of 36 characters, as follows:

aabbccdd-eeff-gghh-iijj-kkllmmnnoopp
 :

Table 225. Text representation relationships
String	Offset In Buffer   EFI_GUID
aa	3                  Data1[24:31]
bb      2                  Data1[16:23]
cc      1                  Data1[8:15]
dd      0                  Data1[0:7]
 :
===

Therefore:

aa = Data1[21:31] = 92
bb = Data1[16:23] = F7
cc = Data1[8:15]  = 01
dd = Data1[0:7]   = F6

> ...not
> 
> > +#define UUID_CONTROL_REGION             "f601f792-b413-5d40-910b-299367e8234c"

Hence, the above string is correct.

ToUUD then stores the given string to Buffer according to "Offset In
Buffer" in the above table.

Another example, EFI 2.5 spec defines GPT partition GUID:

===
Table 19. Defined GPT Partition Entry - Partition Type GUIDs
EFI System Partition C12A7328-F81F-11D2-BA4B-00A0C93EC93B
===

The kernel defines it as:
#define PARTITION_SYSTEM_GUID \
    EFI_GUID( 0xC12A7328, 0xF81F, 0x11d2, \
              0xBA, 0x4B, 0x00, 0xA0, 0xC9, 0x3E, 0xC9, 0x3B)

Thanks,
-Toshi

--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v3 02/21] libnd, nfit: initial libnd infrastructure and NFIT support
@ 2015-05-21 19:44                 ` Toshi Kani
  0 siblings, 0 replies; 89+ messages in thread
From: Toshi Kani @ 2015-05-21 19:44 UTC (permalink / raw)
  To: Dan Williams
  Cc: Moore, Robert, Jens Axboe, linux-nvdimm@lists.01.org, Neil Brown,
	Greg KH, Wysocki, Rafael J, linux-kernel, Linux ACPI, Zheng, Lv,
	Christoph Hellwig, Ingo Molnar

On Thu, 2015-05-21 at 12:06 -0700, Dan Williams wrote:
> On Thu, May 21, 2015 at 11:01 AM, Toshi Kani <toshi.kani@hp.com> wrote:
> > On Thu, 2015-05-21 at 17:49 +0000, Moore, Robert wrote:
> >> What ACPICA has done here is to define these values consistently with the ToUUID ASL macro:
> >>
> >>
> >> Byte encoding of UUID/GUID strings into ACPI Buffer objects (use ToUUID from ASL):
> >>
> >>    Input UUID/GUID String format : aabbccdd-eeff-gghh-iijj-kkllmmnnoopp
> >>      Expected output ACPI buffer : dd,cc,bb,aa, ff,ee, hh,gg, ii,jj, kk,ll,mm,nn,oo,pp
> >>
> >
> > I do not see any issue in this conversion, which is consistent with
> > ToUUID defined in ACPI spec.
> >
> > My point is that the string format of GUID is endian-neutral.  Wiki
> > pages and EFI spec agree on it.  EFI 2.5 spec, Table 225 (sorry not
> > Table 212, which is v2.4), is also clear about how String and Buffer are
> > related with actual values of GUID.
> 
> I think the critical point from the UEFI spec is the "It should also
> be noted that TimeLow, TimeMid, TimeHighAndVersion fields in the EFI
> are encoded as little endian".  That would imply the byte encoding
> of...
> 
> { 0x92F701F6, 0x13B4, 0x405D, 0x91, 0x0B, 0x29, 0x93, 0x67, 0xE8, 0x23, 0x4C }
> 
> ...should be:
> 
> { f6,01,f7,92,b4,13,5d,40,91,0b,29,93,67,e8,23,4c }

The above NFIT GUID as data values means:

EFI_GUID(0x92F701F6, 0x13B4, 0x405D, 0x91, 0x0B, 0x29, 0x93, 0x67, 0xE8,
0x23, 0x4C)

> Which implies the text conversion should be:
> 
> "92f701f6-13b4-405d-910b-299367e8234c"

Nope.

EFI 2.5 spec, Appendix A "GUID and Time Formats" defines that:
(NOTE, I simplified the table 225 to fit in this email)
==
This specification also defines a standard text representation of the
GUID. This format is also sometimes called the “registry format”. It
consists of 36 characters, as follows:

aabbccdd-eeff-gghh-iijj-kkllmmnnoopp
 :

Table 225. Text representation relationships
String  Offset In Buffer   EFI_GUID
aa      3                  Data1[24:31]
bb      2                  Data1[16:23]
cc      1                  Data1[8:15]
dd      0                  Data1[0:7]
 :
===

Therefore:

aa = Data1[24:31] = 92
bb = Data1[16:23] = F7
cc = Data1[8:15]  = 01
dd = Data1[0:7]   = F6

> ...not
> 
> > +#define UUID_CONTROL_REGION             "f601f792-b413-5d40-910b-299367e8234c"

Hence, the above string is correct.

ToUUID then stores the given string to Buffer according to "Offset In
Buffer" in the above table.
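
As a rough illustration of the "Offset In Buffer" mapping (a sketch
only, not the ACPICA or iasl implementation; uuid_str_to_buf() and
str_index[] are hypothetical names made up for this example), the
string-to-Buffer conversion could look like this in C:

```c
#include <ctype.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical lookup table: for each Buffer offset, the index within
 * "aabbccdd-eeff-gghh-iijj-kkllmmnnoopp" of the first hex digit of the
 * byte stored there (dd,cc,bb,aa, ff,ee, hh,gg, ii,jj, kk..pp).
 */
static const uint8_t str_index[16] = {
	6, 4, 2, 0, 11, 9, 16, 14, 19, 21, 24, 26, 28, 30, 32, 34
};

/* Return the value of one hex digit, or -1 if c is not a hex digit. */
static int hex_val(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	c = (char)tolower((unsigned char)c);
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	return -1;
}

/*
 * Convert a 36-character GUID string into the 16-byte Buffer layout
 * described above. Returns 0 on success, -1 on a malformed string.
 */
int uuid_str_to_buf(const char *s, uint8_t out[16])
{
	int i;

	if (strlen(s) != 36 || s[8] != '-' || s[13] != '-' ||
	    s[18] != '-' || s[23] != '-')
		return -1;

	for (i = 0; i < 16; i++) {
		int hi = hex_val(s[str_index[i]]);
		int lo = hex_val(s[str_index[i] + 1]);

		if (hi < 0 || lo < 0)
			return -1;
		out[i] = (uint8_t)((hi << 4) | lo);
	}
	return 0;
}
```

Feeding it "92f701f6-13b4-405d-910b-299367e8234c" should yield the
bytes f6,01,f7,92,b4,13,5d,40,91,0b,29,93,67,e8,23,4c.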

Another example, EFI 2.5 spec defines GPT partition GUID:

===
Table 19. Defined GPT Partition Entry - Partition Type GUIDs
EFI System Partition C12A7328-F81F-11D2-BA4B-00A0C93EC93B
===

The kernel defines it as:
#define PARTITION_SYSTEM_GUID \
    EFI_GUID( 0xC12A7328, 0xF81F, 0x11d2, \
              0xBA, 0x4B, 0x00, 0xA0, 0xC9, 0x3E, 0xC9, 0x3B)

Thanks,
-Toshi


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v3 02/21] libnd, nfit: initial libnd infrastructure and NFIT support
  2015-05-21 19:44                 ` Toshi Kani
  (?)
@ 2015-05-21 19:59                   ` Toshi Kani
  -1 siblings, 0 replies; 89+ messages in thread
From: Toshi Kani @ 2015-05-21 19:59 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jens Axboe, linux-nvdimm, Neil Brown, Greg KH, Wysocki, Rafael J,
	Moore, Robert, linux-kernel, Linux ACPI, Ingo Molnar, Zheng, Lv,
	Christoph Hellwig

On Thu, 2015-05-21 at 13:44 -0600, Toshi Kani wrote:
> On Thu, 2015-05-21 at 12:06 -0700, Dan Williams wrote:
> > On Thu, May 21, 2015 at 11:01 AM, Toshi Kani <toshi.kani@hp.com> wrote:
> > > On Thu, 2015-05-21 at 17:49 +0000, Moore, Robert wrote:
> > >> What ACPICA has done here is to define these values consistently with the ToUUID ASL macro:
> > >>
> > >>
> > >> Byte encoding of UUID/GUID strings into ACPI Buffer objects (use ToUUID from ASL):
> > >>
> > >>    Input UUID/GUID String format : aabbccdd-eeff-gghh-iijj-kkllmmnnoopp
> > >>      Expected output ACPI buffer : dd,cc,bb,aa, ff,ee, hh,gg, ii,jj, kk,ll,mm,nn,oo,pp
> > >>
> > >
> > > I do not see any issue in this conversion, which is consistent with
> > > ToUUID defined in ACPI spec.
> > >
> > > My point is that the string format of GUID is endian-neutral.  Wiki
> > > pages and EFI spec agree on it.  EFI 2.5 spec, Table 225 (sorry not
> > > Table 212, which is v2.4), is also clear about how String and Buffer are
> > > related with actual values of GUID.
> > 
> > I think the critical point from the UEFI spec is the "It should also
> > be noted that TimeLow, TimeMid, TimeHighAndVersion fields in the EFI
> > are encoded as little endian".  That would imply the byte encoding
> > of...
> > 
> > { 0x92F701F6, 0x13B4, 0x405D, 0x91, 0x0B, 0x29, 0x93, 0x67, 0xE8, 0x23, 0x4C }
> > 
> > ...should be:
> > 
> > { f6,01,f7,92,b4,13,5d,40,91,0b,29,93,67,e8,23,4c }
> 
> The above NFIT GUID as data values means:
> 
> EFI_GUID(0x92F701F6, 0x13B4, 0x405D, 0x91, 0x0B, 0x29, 0x93, 0x67, 0xE8,
> 0x23, 0x4C)
> 
> > Which implies the text conversion should be:
> > 
> > "92f701f6-13b4-405d-910b-299367e8234c"
> 
> Nope.

Oops! Sorry, I misread your email... The above string is correct,
although I do not think you need such conversion. 

> EFI 2.5 spec, Appendix A "GUID and Time Formats" defines that:
> (NOTE, I simplified the table 225 to fit in this email)
> ==
> This specification also defines a standard text representation of the
> GUID. This format is also sometimes called the “registry format”. It
> consists of 36 characters, as follows:
> 
> aabbccdd-eeff-gghh-iijj-kkllmmnnoopp
>  :
> 
> Table 225. Text representation relationships
> String  Offset In Buffer   EFI_GUID
> aa      3                  Data1[24:31]
> bb      2                  Data1[16:23]
> cc      1                  Data1[8:15]
> dd      0                  Data1[0:7]
>  :
> ===
> 
> Therefore:
> 
> aa = Data1[24:31] = 92
> bb = Data1[16:23] = F7
> cc = Data1[8:15]  = 01
> dd = Data1[0:7]   = F6
> 
> > ...not
> > 
> > > +#define UUID_CONTROL_REGION             "f601f792-b413-5d40-910b-299367e8234c"
> 
> Hence, the above string is correct.

Misread again... Right, the above string is NOT correct.

I think we are on the same page that the GUID strings in this patch need
to be changed.

{ 0x92F701F6, 0x13B4, 0x405D, 0x91, 0x0B, 0x29, 0x93, 0x67, 0xE8, 0x23,
0x4C }

should be defined as:

"92f701f6-13b4-405d-910b-299367e8234c"
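
To make the relationship concrete, here is a minimal C sketch under
stated assumptions: struct guid_fields, guid_to_text() and
guid_to_buf() are hypothetical names for this example and are not the
kernel's efi_guid_t or any ACPICA interface. It prints the data values
in the text format and also lays them out as the little-endian Buffer:

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative stand-in for the Data1..Data4 fields of an EFI GUID. */
struct guid_fields {
	uint32_t data1;
	uint16_t data2;
	uint16_t data3;
	uint8_t  data4[8];
};

/* Write the 36-character text format plus NUL into out (37 bytes). */
void guid_to_text(const struct guid_fields *g, char out[37])
{
	snprintf(out, 37,
		 "%08x-%04x-%04x-%02x%02x-%02x%02x%02x%02x%02x%02x",
		 (unsigned)g->data1, (unsigned)g->data2, (unsigned)g->data3,
		 (unsigned)g->data4[0], (unsigned)g->data4[1],
		 (unsigned)g->data4[2], (unsigned)g->data4[3],
		 (unsigned)g->data4[4], (unsigned)g->data4[5],
		 (unsigned)g->data4[6], (unsigned)g->data4[7]);
}

/* Store Data1..Data3 as little endian, Data4 byte for byte. */
void guid_to_buf(const struct guid_fields *g, uint8_t out[16])
{
	int i;

	out[0] = (uint8_t)(g->data1 & 0xff);
	out[1] = (uint8_t)((g->data1 >> 8) & 0xff);
	out[2] = (uint8_t)((g->data1 >> 16) & 0xff);
	out[3] = (uint8_t)((g->data1 >> 24) & 0xff);
	out[4] = (uint8_t)(g->data2 & 0xff);
	out[5] = (uint8_t)((g->data2 >> 8) & 0xff);
	out[6] = (uint8_t)(g->data3 & 0xff);
	out[7] = (uint8_t)((g->data3 >> 8) & 0xff);
	for (i = 0; i < 8; i++)
		out[8 + i] = g->data4[i];
}
```

The text form follows the field values directly, so endianness only
shows up in guid_to_buf().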

Thanks,
-Toshi


> ToUUID then stores the given string to Buffer according to "Offset In
> Buffer" in the above table.
> 
> Another example, EFI 2.5 spec defines GPT partition GUID:
> 
> ===
> Table 19. Defined GPT Partition Entry - Partition Type GUIDs
> EFI System Partition C12A7328-F81F-11D2-BA4B-00A0C93EC93B
> ===
> 
> The kernel defines it as:
> #define PARTITION_SYSTEM_GUID \
>     EFI_GUID( 0xC12A7328, 0xF81F, 0x11d2, \
>               0xBA, 0x4B, 0x00, 0xA0, 0xC9, 0x3E, 0xC9, 0x3B)
> 
> Thanks,
> -Toshi



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v3 02/21] libnd, nfit: initial libnd infrastructure and NFIT support
  2015-05-21 19:59                   ` Toshi Kani
  (?)
@ 2015-05-21 20:59                     ` Linda Knippers
  -1 siblings, 0 replies; 89+ messages in thread
From: Linda Knippers @ 2015-05-21 20:59 UTC (permalink / raw)
  To: Toshi Kani, Dan Williams
  Cc: Jens Axboe, linux-nvdimm, Neil Brown, Greg KH, Wysocki, Rafael J,
	Moore, Robert, linux-kernel, Linux ACPI, Ingo Molnar, Zheng, Lv,
	Christoph Hellwig

On 05/21/2015 03:59 PM, Toshi Kani wrote:
> On Thu, 2015-05-21 at 13:44 -0600, Toshi Kani wrote:
>> On Thu, 2015-05-21 at 12:06 -0700, Dan Williams wrote:
>>> On Thu, May 21, 2015 at 11:01 AM, Toshi Kani <toshi.kani@hp.com> wrote:
>>>> On Thu, 2015-05-21 at 17:49 +0000, Moore, Robert wrote:
>>>>> What ACPICA has done here is to define these values consistently with the ToUUID ASL macro:
>>>>>
>>>>>
>>>>> Byte encoding of UUID/GUID strings into ACPI Buffer objects (use ToUUID from ASL):
>>>>>
>>>>>    Input UUID/GUID String format : aabbccdd-eeff-gghh-iijj-kkllmmnnoopp
>>>>>      Expected output ACPI buffer : dd,cc,bb,aa, ff,ee, hh,gg, ii,jj, kk,ll,mm,nn,oo,pp
>>>>>
>>>>
>>>> I do not see any issue in this conversion, which is consistent with
>>>> ToUUID defined in ACPI spec.
>>>>
>>>> My point is that the string format of GUID is endian-neutral.  Wiki
>>>> pages and EFI spec agree on it.  EFI 2.5 spec, Table 225 (sorry not
>>>> Table 212, which is v2.4), is also clear about how String and Buffer are
>>>> related with actual values of GUID.
>>>
>>> I think the critical point from the UEFI spec is the "It should also
>>> be noted that TimeLow, TimeMid, TimeHighAndVersion fields in the EFI
>>> are encoded as little endian".  That would imply the byte encoding
>>> of...
>>>
>>> { 0x92F701F6, 0x13B4, 0x405D, 0x91, 0x0B, 0x29, 0x93, 0x67, 0xE8, 0x23, 0x4C }
>>>
>>> ...should be:
>>>
>>> { f6,01,f7,92,b4,13,5d,40,91,0b,29,93,67,e8,23,4c }
>>
>> The above NFIT GUID as data values means:
>>
>> EFI_GUID(0x92F701F6, 0x13B4, 0x405D, 0x91, 0x0B, 0x29, 0x93, 0x67, 0xE8,
>> 0x23, 0x4C)
>>
>>> Which implies the text conversion should be:
>>>
>>> "92f701f6-13b4-405d-910b-299367e8234c"
>>
>> Nope.
> 
> Oops! Sorry, I misread your email... The above string is correct,
> although I do not think you need such conversion. 
> 
>> EFI 2.5 spec, Appendix A "GUID and Time Formats" defines that:
>> (NOTE, I simplified the table 225 to fit in this email)
>> ==
>> This specification also defines a standard text representation of the
>> GUID. This format is also sometimes called the “registry format”. It
>> consists of 36 characters, as follows:
>>
>> aabbccdd-eeff-gghh-iijj-kkllmmnnoopp
>>  :
>>
>> Table 225. Text representation relationships
>> String  Offset In Buffer   EFI_GUID
>> aa      3                  Data1[24:31]
>> bb      2                  Data1[16:23]
>> cc      1                  Data1[8:15]
>> dd      0                  Data1[0:7]
>>  :
>> ===
>>
>> Therefore:
>>
>> aa = Data1[24:31] = 92
>> bb = Data1[16:23] = F7
>> cc = Data1[8:15]  = 01
>> dd = Data1[0:7]   = F6
>>
>>> ...not
>>>
>>>> +#define UUID_CONTROL_REGION             "f601f792-b413-5d40-910b-299367e8234c"
>>
>> Hence, the above string is correct.
> 
> Misread again... Right, the above string is NOT correct.
> 
> I think we are on the same page that the GUID strings in this patch need
> to be changed.
> 
> { 0x92F701F6, 0x13B4, 0x405D, 0x91, 0x0B, 0x29, 0x93, 0x67, 0xE8, 0x23,
> 0x4C }
> 
> should be defined as:
> 
> "92f701f6-13b4-405d-910b-299367e8234c"

I've lost track of the right answer but should we be discussing
it in the context of this patch too?

http://www.spinics.net/lists/linux-acpi/msg57825.html
[PATCH 18/19] ACPICA: ACPI 6.0: Add support for NFIT table.

Dan's version of the file has lots of other UUIDs too, beyond NFIT.

-- ljk

> 
> Thanks,
> -Toshi
> 
> 
>> ToUUID then stores the given string to Buffer according to "Offset In
>> Buffer" in the above table.
>>
>> Another example, EFI 2.5 spec defines GPT partition GUID:
>>
>> ===
>> Table 19. Defined GPT Partition Entry - Partition Type GUIDs
>> EFI System Partition C12A7328-F81F-11D2-BA4B-00A0C93EC93B
>> ===
>>
>> The kernel defines it as:
>> #define PARTITION_SYSTEM_GUID \
>>     EFI_GUID( 0xC12A7328, 0xF81F, 0x11d2, \
>>               0xBA, 0x4B, 0x00, 0xA0, 0xC9, 0x3E, 0xC9, 0x3B)
>>
>> Thanks,
>> -Toshi


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v3 02/21] libnd, nfit: initial libnd infrastructure and NFIT support
  2015-05-21 20:59                     ` Linda Knippers
  (?)
@ 2015-05-21 21:34                       ` Dan Williams
  -1 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2015-05-21 21:34 UTC (permalink / raw)
  To: Linda Knippers
  Cc: Toshi Kani, Jens Axboe, linux-nvdimm, Neil Brown, Greg KH,
	Wysocki, Rafael J, Moore, Robert, linux-kernel, Linux ACPI,
	Ingo Molnar, Zheng, Lv, Christoph Hellwig

On Thu, May 21, 2015 at 1:59 PM, Linda Knippers <linda.knippers@hp.com> wrote:
> On 05/21/2015 03:59 PM, Toshi Kani wrote:
>> On Thu, 2015-05-21 at 13:44 -0600, Toshi Kani wrote:
>>> On Thu, 2015-05-21 at 12:06 -0700, Dan Williams wrote:
>>>> On Thu, May 21, 2015 at 11:01 AM, Toshi Kani <toshi.kani@hp.com> wrote:
>>>>> On Thu, 2015-05-21 at 17:49 +0000, Moore, Robert wrote:
>>>>>> What ACPICA has done here is to define these values consistently with the ToUUID ASL macro:
>>>>>>
>>>>>>
>>>>>> Byte encoding of UUID/GUID strings into ACPI Buffer objects (use ToUUID from ASL):
>>>>>>
>>>>>>    Input UUID/GUID String format : aabbccdd-eeff-gghh-iijj-kkllmmnnoopp
>>>>>>      Expected output ACPI buffer : dd,cc,bb,aa, ff,ee, hh,gg, ii,jj, kk,ll,mm,nn,oo,pp
>>>>>>
>>>>>
>>>>> I do not see any issue in this conversion, which is consistent with
>>>>> ToUUID defined in ACPI spec.
>>>>>
>>>>> My point is that the string format of GUID is endian-neutral.  Wiki
>>>>> pages and EFI spec agree on it.  EFI 2.5 spec, Table 225 (sorry not
>>>>> Table 212, which is v2.4), is also clear about how String and Buffer are
>>>>> related with actual values of GUID.
>>>>
>>>> I think the critical point from the UEFI spec is the "It should also
>>>> be noted that TimeLow, TimeMid, TimeHighAndVersion fields in the EFI
>>>> are encoded as little endian".  That would imply the byte encoding
>>>> of...
>>>>
>>>> { 0x92F701F6, 0x13B4, 0x405D, 0x91, 0x0B, 0x29, 0x93, 0x67, 0xE8, 0x23, 0x4C }
>>>>
>>>> ...should be:
>>>>
>>>> { f6,01,f7,92,b4,13,5d,40,91,0b,29,93,67,e8,23,4c }
>>>
>>> The above NFIT GUID as data values means:
>>>
>>> EFI_GUID(0x92F701F6, 0x13B4, 0x405D, 0x91, 0x0B, 0x29, 0x93, 0x67, 0xE8,
>>> 0x23, 0x4C)
>>>
>>>> Which implies the text conversion should be:
>>>>
>>>> "92f701f6-13b4-405d-910b-299367e8234c"
>>>
>>> Nope.
>>
>> Oops! Sorry, I misread your email... The above string is correct,
>> although I do not think you need such conversion.
>>
>>> EFI 2.5 spec, Appendix A "GUID and Time Formats" defines that:
>>> (NOTE, I simplified the table 225 to fit in this email)
>>> ==
>>> This specification also defines a standard text representation of the
>>> GUID. This format is also sometimes called the “registry format”. It
>>> consists of 36 characters, as follows:
>>>
>>> aabbccdd-eeff-gghh-iijj-kkllmmnnoopp
>>>  :
>>>
>>> Table 225. Text representation relationships
>>> String  Offset In Buffer   EFI_GUID
>>> aa      3                  Data1[24:31]
>>> bb      2                  Data1[16:23]
>>> cc      1                  Data1[8:15]
>>> dd      0                  Data1[0:7]
>>>  :
>>> ===
>>>
>>> Therefore:
>>>
>>> aa = Data1[24:31] = 92
>>> bb = Data1[16:23] = F7
>>> cc = Data1[8:15]  = 01
>>> dd = Data1[0:7]   = F6
>>>
>>>> ...not
>>>>
>>>>> +#define UUID_CONTROL_REGION             "f601f792-b413-5d40-910b-299367e8234c"
>>>
>>> Hence, the above string is correct.
>>
>> Misread again... Right, the above string is NOT correct.
>>
>> I think we are on the same page that the GUID strings in this patch need
>> to be changed.
>>
>> { 0x92F701F6, 0x13B4, 0x405D, 0x91, 0x0B, 0x29, 0x93, 0x67, 0xE8, 0x23,
>> 0x4C }
>>
>> should be defined as:
>>
>> "92f701f6-13b4-405d-910b-299367e8234c"
>
> I've lost track of the right answer but should we be discussing
> it in the context of this patch too?
>
> http://www.spinics.net/lists/linux-acpi/msg57825.html
> [PATCH 18/19] ACPICA: ACPI 6.0: Add support for NFIT table.
>
> Dan's version of the file has lots of other UUIDs too, beyond NFIT.

Yeah, it's not clear whether those other GUIDs are actually GUIDs or
these byte-swapped "EFI GUID"s.  At least for NFIT it seems that the
intent was EFI GUID ordering due to the note about needing to match
the "Disk Type GUID" format from the EFI spec.

I'll circle back with the ACPICA folks.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v3 02/21] libnd, nfit: initial libnd infrastructure and NFIT support
  2015-05-21 21:34                       ` Dan Williams
@ 2015-05-21 22:11                         ` Toshi Kani
  -1 siblings, 0 replies; 89+ messages in thread
From: Toshi Kani @ 2015-05-21 22:11 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linda Knippers, Jens Axboe, linux-nvdimm, Neil Brown, Greg KH,
	Wysocki, Rafael J, Moore, Robert, linux-kernel, Linux ACPI,
	Ingo Molnar, Zheng, Lv, Christoph Hellwig

On Thu, 2015-05-21 at 14:34 -0700, Dan Williams wrote:
> On Thu, May 21, 2015 at 1:59 PM, Linda Knippers <linda.knippers@hp.com> wrote:
> > On 05/21/2015 03:59 PM, Toshi Kani wrote:
:
> >
> > I've lost track of the right answer but should we be discussing
> > it in the context of this patch too?
> >
> > http://www.spinics.net/lists/linux-acpi/msg57825.html
> > [PATCH 18/19] ACPICA: ACPI 6.0: Add support for NFIT table.
> >
> > Dan's version of the file has lots of other UUIDs too, beyond NFIT.
> 
> Yeah, it's not clear whether those other GUIDs are actually GUIDs or
> these byte-swapped "EFI GUID"s.  At least for NFIT it seems that the
> intent was EFI GUID ordering due to the note about needing to match
> the "Disk Type GUID" format from the EFI spec.
> 
> > I'll circle back with the ACPICA folks.

Endianness only matters when you store GUID data into memory (or read it
from memory).  The data values themselves are independent of
endianness.  GUIDs, EFI GUIDs, and their text strings all represent
actual data values, and therefore no swapping is necessary.

When storing an EFI GUID or a text string into memory or a Buffer, the
EFI spec defines that it is stored in little-endian format.  This is
handled by the EFI_GUID() macro for EFI GUID values, and by ToUUID /
acpi_str_to_uuid() for a string.
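
To make the value-versus-storage distinction concrete, here is a minimal
user-space C sketch.  The names (struct guid_fields, guid_store_le,
guid_to_text) are invented for this illustration and are not the kernel's
or ACPICA's API; the only assumption is the field layout discussed in
this thread (one 32-bit field, two 16-bit fields, then eight bytes).

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical value-level view of a GUID: no byte order is implied. */
struct guid_fields {
	uint32_t data1;
	uint16_t data2;
	uint16_t data3;
	uint8_t  data4[8];
};

/*
 * Store the GUID into a 16-byte buffer: the first three fields are
 * written least-significant byte first, the last eight bytes as-is.
 */
static void guid_store_le(const struct guid_fields *g, uint8_t out[16])
{
	out[0] = (uint8_t)(g->data1);
	out[1] = (uint8_t)(g->data1 >> 8);
	out[2] = (uint8_t)(g->data1 >> 16);
	out[3] = (uint8_t)(g->data1 >> 24);
	out[4] = (uint8_t)(g->data2);
	out[5] = (uint8_t)(g->data2 >> 8);
	out[6] = (uint8_t)(g->data3);
	out[7] = (uint8_t)(g->data3 >> 8);
	memcpy(&out[8], g->data4, 8);
}

/* The text form is built from the values, so nothing is swapped here. */
static void guid_to_text(const struct guid_fields *g, char out[37])
{
	snprintf(out, 37,
		 "%08x-%04x-%04x-%02x%02x-%02x%02x%02x%02x%02x%02x",
		 (unsigned int)g->data1, (unsigned int)g->data2,
		 (unsigned int)g->data3,
		 g->data4[0], g->data4[1], g->data4[2], g->data4[3],
		 g->data4[4], g->data4[5], g->data4[6], g->data4[7]);
}
```

With the NFIT control region values from earlier in the thread, the
buffer starts with f6,01,f7,92 while the text still starts with
"92f701f6"; only the storage step cares about byte order.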

Thanks,
-Toshi


^ permalink raw reply	[flat|nested] 89+ messages in thread

* RE: [PATCH v3 02/21] libnd, nfit: initial libnd infrastructure and NFIT support
  2015-05-21 21:34                       ` Dan Williams
@ 2015-05-22 14:58                         ` Moore, Robert
  -1 siblings, 0 replies; 89+ messages in thread
From: Moore, Robert @ 2015-05-22 14:58 UTC (permalink / raw)
  To: Williams, Dan J, Linda Knippers
  Cc: Toshi Kani, Jens Axboe, linux-nvdimm, Neil Brown, Greg KH,
	Wysocki, Rafael J, linux-kernel, Linux ACPI, Ingo Molnar, Zheng,
	Lv, Christoph Hellwig

It looks to me that you are correct and I screwed up when I made those strings. The odd thing is that we had discussed this whole issue internally for a few days -- then I went ahead and messed up the strings. I think my brain was going around in circles.

Anyway, here is the latest info, please have a look:


Below is the GUID for volatile memory region directly from the ACPI spec:


{ 0x7305944F, 0xFDDA, 0x44E3, 0xB1, 0x6C, 0x3F, 0x22, 0xD2, 0x52, 0xE5, 0xD0 }

Here is an example of ToUUID using a corrected version of the GUID string. Note that the ordering of the string is identical to the version in the ACPI spec:

      11:      Name (UUID, ToUUID ("7305944F-FDDA-44E3-B16C-3F22D252E5D0"))


Here is the AML output of the ToUUID macro. Note that the bytes of the first three fields are reversed, while the rest of the string is left as-is (as per the ToUUID definition):

00000024:  08 55 55 49 44 .........    ".UUID"
00000029:  11 13 0A 10 4F 94 05 73     "....O..s"
00000031:  DA FD E3 44 B1 6C 3F 22     "...D.l?""
00000039:  D2 52 E5 D0 ............    ".R.."

This is the important part:

           4F 94 05 73     "....O..s"
00000031:  DA FD E3 44 B1 6C 3F 22     "...D.l?""
00000039:  D2 52 E5 D0


I believe that this is correct.
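
For anyone who wants to check the byte order without running iasl, the
string-to-buffer conversion can be sketched in plain C.  This is only an
illustration of the mapping (dd,cc,bb,aa, ff,ee, hh,gg, then the rest in
order); hex_val and uuid_text_to_buffer are made-up names, not the ACPICA
implementation.

```c
#include <ctype.h>
#include <stdint.h>
#include <string.h>

/* Convert one hex digit to its value, or -1 if it is not a hex digit. */
static int hex_val(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	c = (char)tolower((unsigned char)c);
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	return -1;
}

/*
 * Hypothetical helper: convert "aabbccdd-eeff-gghh-iijj-kkllmmnnoopp"
 * into a 16-byte buffer laid out as ToUUID does.
 * Returns 0 on success, -1 on malformed input.
 */
static int uuid_text_to_buffer(const char *s, uint8_t out[16])
{
	/* out[i] takes the byte found at text position order[i]. */
	static const int order[16] = {
		3, 2, 1, 0, 5, 4, 7, 6, 8, 9, 10, 11, 12, 13, 14, 15
	};
	uint8_t t[16];
	size_t i, pos = 0;

	if (strlen(s) != 36)
		return -1;

	for (i = 0; i < 16; i++) {
		int hi, lo;

		/* Hyphens separate the fields at fixed offsets. */
		if (pos == 8 || pos == 13 || pos == 18 || pos == 23) {
			if (s[pos] != '-')
				return -1;
			pos++;
		}
		hi = hex_val(s[pos]);
		lo = hex_val(s[pos + 1]);
		if (hi < 0 || lo < 0)
			return -1;
		t[i] = (uint8_t)((hi << 4) | lo);
		pos += 2;
	}

	for (i = 0; i < 16; i++)
		out[i] = t[order[i]];
	return 0;
}
```

Feeding it "7305944F-FDDA-44E3-B16C-3F22D252E5D0" should give the same
16 bytes as the AML dump above, starting with 4F 94 05 73.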

Thanks for your help,
Bob




> -----Original Message-----
> From: Dan Williams [mailto:dan.j.williams@intel.com]
> Sent: Thursday, May 21, 2015 2:35 PM
> To: Linda Knippers
> Cc: Toshi Kani; Jens Axboe; linux-nvdimm@lists.01.org; Neil Brown; Greg
> KH; Wysocki, Rafael J; Moore, Robert; linux-kernel@vger.kernel.org; Linux
> ACPI; Ingo Molnar; Zheng, Lv; Christoph Hellwig
> Subject: Re: [PATCH v3 02/21] libnd, nfit: initial libnd infrastructure
> and NFIT support
> 
> On Thu, May 21, 2015 at 1:59 PM, Linda Knippers <linda.knippers@hp.com>
> wrote:
> > On 05/21/2015 03:59 PM, Toshi Kani wrote:
> >> On Thu, 2015-05-21 at 13:44 -0600, Toshi Kani wrote:
> >>> On Thu, 2015-05-21 at 12:06 -0700, Dan Williams wrote:
> >>>> On Thu, May 21, 2015 at 11:01 AM, Toshi Kani <toshi.kani@hp.com> wrote:
> >>>>> On Thu, 2015-05-21 at 17:49 +0000, Moore, Robert wrote:
> >>>>>> What ACPICA has done here is to define these values consistently
> >>>>>> with the ToUUID ASL macro:
> >>>>>>
> >>>>>>
> >>>>>> Byte encoding of UUID/GUID strings into ACPI Buffer objects (use
> >>>>>> ToUUID from ASL):
> >>>>>>
> >>>>>>    Input UUID/GUID String format : aabbccdd-eeff-gghh-iijj-kkllmmnnoopp
> >>>>>>      Expected output ACPI buffer : dd,cc,bb,aa, ff,ee, hh,gg, ii,jj, kk,ll,mm,nn,oo,pp
> >>>>>>
> >>>>>
> >>>>> I do not see any issue in this conversion, which is consistent
> >>>>> with ToUUID defined in ACPI spec.
> >>>>>
> >>>>> My point is that the string format of GUID is endian-neutral.
> >>>>> Wiki pages and EFI spec agree on it.  EFI 2.5 spec, Table 225
> >>>>> (sorry not Table 212, which is v2.4), is also clear about how
> >>>>> String and Buffer are related with actual values of GUID.
> >>>>
> >>>> I think the critical point from the UEFI spec is the "It should
> >>>> also be noted that TimeLow, TimeMid, TimeHighAndVersion fields in
> >>>> the EFI are encoded as little endian".  That would imply the byte
> >>>> encoding of...
> >>>>
> >>>> { 0x92F701F6, 0x13B4, 0x405D, 0x91, 0x0B, 0x29, 0x93, 0x67, 0xE8,
> >>>> 0x23, 0x4C }
> >>>>
> >>>> ...should be:
> >>>>
> >>>> { f6,01,f7,92,b4,13,5d,40,91,0b,29,93,67,e8,23,4c }
> >>>
> >>> The above NFIT GUID as data values means:
> >>>
> >>> EFI_GUID(0x92F701F6, 0x13B4, 0x405D, 0x91, 0x0B, 0x29, 0x93, 0x67,
> >>> 0xE8, 0x23, 0x4C)
> >>>
> >>>> Which implies the text conversion should be:
> >>>>
> >>>> "92f701f6-13b4-405d-910b-299367e8234c"
> >>>
> >>> Nope.
> >>
> >> Oops! Sorry, I misread your email... The above string is correct,
> >> although I do not think you need such conversion.
> >>
> >>> EFI 2.5 spec, Appendix A "GUID and Time Formats" defines that:
> >>> (NOTE, I simplified the table 225 to fit in this email)
> >>> ==
> >>> This specification also defines a standard text representation of the
> >>> GUID. This format is also sometimes called the “registry format”. It
> >>> consists of 36 characters, as follows:
> >>>
> >>> aabbccdd-eeff-gghh-iijj-kkllmmnnoopp
> >>>  :
> >>>
> >>> Table 225. Text representation relationships
> >>> String  Offset In Buffer   EFI_GUID
> >>> aa      3                  Data1[24:31]
> >>> bb      2                  Data1[16:23]
> >>> cc      1                  Data1[8:15]
> >>> dd      0                  Data1[0:7]
> >>>  :
> >>> ===
> >>>
> >>> Therefore:
> >>>
> >>> aa = Data1[24:31] = 92
> >>> bb = Data1[16:23] = F7
> >>> cc = Data1[8:15]  = 01
> >>> dd = Data1[0:7]   = F6
> >>>
> >>>> ...not
> >>>>
> >>>>> +#define UUID_CONTROL_REGION             "f601f792-b413-5d40-910b-299367e8234c"
> >>>
> >>> Hence, the above string is correct.
> >>
> >> Misread again... Right, the above string is NOT correct.
> >>
> >> I think we are on the same page that the GUID strings in this patch
> >> need to be changed.
> >>
> >> { 0x92F701F6, 0x13B4, 0x405D, 0x91, 0x0B, 0x29, 0x93, 0x67, 0xE8,
> >> 0x23, 0x4C }
> >>
> >> should be defined as:
> >>
> >> "92f701f6-13b4-405d-910b-299367e8234c"
> >
> > I've lost track of the right answer but should we be discussing it in
> > the context of this patch too?
> >
> > http://www.spinics.net/lists/linux-acpi/msg57825.html
> > [PATCH 18/19] ACPICA: ACPI 6.0: Add support for NFIT table.
> >
> > Dan's version of the file has lots of other UUIDs too, beyond NFIT.
> 
> Yeah, it's not clear whether those other GUIDs are actually GUIDs or these
> byte-swapped "EFI GUID"s.  At least for NFIT it seems that the intent was
> EFI GUID ordering due to the note about needing to match the "Disk Type
> GUID" format from the EFI spec.
> 
> I'll circle back with the ACPICA folks.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v3 02/21] libnd, nfit: initial libnd infrastructure and NFIT support
  2015-05-22 14:58                         ` Moore, Robert
@ 2015-05-22 15:21                           ` Toshi Kani
  -1 siblings, 0 replies; 89+ messages in thread
From: Toshi Kani @ 2015-05-22 15:21 UTC (permalink / raw)
  To: Moore, Robert
  Cc: Williams, Dan J, Linda Knippers, Jens Axboe, linux-nvdimm,
	Neil Brown, Greg KH, Wysocki, Rafael J, linux-kernel, Linux ACPI,
	Ingo Molnar, Zheng, Lv, Christoph Hellwig

On Fri, 2015-05-22 at 14:58 +0000, Moore, Robert wrote:
> It looks to me that you are correct and I screwed up when I made those strings.
> The odd thing is that we had discussed this whole issue internally for a few days
>  -- then I went ahead and messed up the strings. I think my brain was going
> around in circles.

Yes, endianness is always fun... :-)

> Anyway, here is the latest info, please have a look:
> 
> 
> Below is the GUID for volatile memory region directly from the ACPI spec:
> 
> 
> { 0x7305944F, 0xFDDA, 0x44E3, 0xB1, 0x6C, 0x3F, 0x22, 0xD2, 0x52, 0xE5, 0xD0 }
> 
> Here is an example of ToUUID using a corrected version of the GUID string. Note that the ordering of the string is identical to the version in the ACPI spec:
> 
>       11:      Name (UUID, ToUUID ("7305944F-FDDA-44E3-B16C-3F22D252E5D0"))
> 
> 
> Here is the AML output of the ToUUID macro. Note that the first three fields are reversed, the rest of the string is left as-is (as per the ToUUID definition):
> 
> 00000024:  08 55 55 49 44 .........    ".UUID"
> 00000029:  11 13 0A 10 4F 94 05 73     "....O..s"
> 00000031:  DA FD E3 44 B1 6C 3F 22     "...D.l?""
> 00000039:  D2 52 E5 D0 ............    ".R.."
> 
> This is the important part:
> 
>            4F 94 05 73     "....O..s"
> 00000031:  DA FD E3 44 B1 6C 3F 22     "...D.l?""
> 00000039:  D2 52 E5 D0
> 
> 
> I believe that this is correct.

Looks good!

Thanks,
-Toshi



^ permalink raw reply	[flat|nested] 89+ messages in thread

* RE: [PATCH v3 02/21] libnd, nfit: initial libnd infrastructure and NFIT support
  2015-05-22 15:21                           ` Toshi Kani
@ 2015-05-22 16:12                             ` Moore, Robert
  -1 siblings, 0 replies; 89+ messages in thread
From: Moore, Robert @ 2015-05-22 16:12 UTC (permalink / raw)
  To: Toshi Kani
  Cc: Williams, Dan J, Linda Knippers, Jens Axboe, linux-nvdimm,
	Neil Brown, Greg KH, Wysocki, Rafael J, linux-kernel, Linux ACPI,
	Ingo Molnar, Zheng, Lv, Christoph Hellwig

Here are the corrected strings:

/* NVDIMM - NFIT table */

#define UUID_VOLATILE_MEMORY            "7305944f-fdda-44e3-b16c-3f22d252e5d0"
#define UUID_PERSISTENT_MEMORY          "66f0d379-b4f3-4074-ac43-0d3318b78cdb"
#define UUID_CONTROL_REGION             "92f701f6-13b4-405d-910b-299367e8234c"
#define UUID_DATA_REGION                "91af0530-5d86-470e-a6b0-0a2db9408249"
#define UUID_VOLATILE_VIRTUAL_DISK      "77ab535a-45fc-624b-5560-f7b281d1f96e"
#define UUID_VOLATILE_VIRTUAL_CD        "3d5abd30-4175-87ce-6d64-d2ade523c4bb"
#define UUID_PERSISTENT_VIRTUAL_DISK    "5cea02c9-4d07-69d3-269f-4496fbe096f9"
#define UUID_PERSISTENT_VIRTUAL_CD      "08018188-42cd-bb48-100f-5387d53ded3d"
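
As a sanity check on the corrected strings, the reverse direction (table
bytes back to text) can be sketched like this.  guid_buffer_to_text is a
made-up helper for illustration, not an ACPICA or kernel function, and it
assumes the buffer uses the layout discussed in this thread.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: format a 16-byte GUID buffer as text.
 * The first three fields are read back least-significant byte first;
 * the last eight bytes are printed in memory order.
 */
static void guid_buffer_to_text(const uint8_t b[16], char out[37])
{
	snprintf(out, 37,
		 "%02x%02x%02x%02x-%02x%02x-%02x%02x-"
		 "%02x%02x-%02x%02x%02x%02x%02x%02x",
		 b[3], b[2], b[1], b[0],	/* Data1 */
		 b[5], b[4],			/* Data2 */
		 b[7], b[6],			/* Data3 */
		 b[8], b[9],			/* Data4[0..1] */
		 b[10], b[11], b[12], b[13], b[14], b[15]);
}
```

Given the control region bytes f6,01,f7,92,b4,13,5d,40,91,0b,... this
yields "92f701f6-13b4-405d-910b-299367e8234c", which matches
UUID_CONTROL_REGION above rather than the earlier "f601f792-..." string.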



> -----Original Message-----
> From: Toshi Kani [mailto:toshi.kani@hp.com]
> Sent: Friday, May 22, 2015 8:21 AM
> To: Moore, Robert
> Cc: Williams, Dan J; Linda Knippers; Jens Axboe;
> linux-nvdimm@lists.01.org; Neil Brown; Greg KH; Wysocki, Rafael J;
> linux-kernel@vger.kernel.org; Linux ACPI; Ingo Molnar; Zheng, Lv;
> Christoph Hellwig
> Subject: Re: [PATCH v3 02/21] libnd, nfit: initial libnd infrastructure
> and NFIT support
> 
> On Fri, 2015-05-22 at 14:58 +0000, Moore, Robert wrote:
> > It looks to me that you are correct and I screwed up when I made those
> strings.
> > The odd thing is that we had discussed this whole issue internally for
> > a few days
> >  -- then I went ahead and messed up the strings. I think my brain was
> > going around in circles.
> 
> Yes, endianness is always fun... :-)
> 
> > Anyway, here is the latest info, please have a look:
> >
> >
> > Below is the GUID for volatile memory region directly from the ACPI
> spec:
> >
> >
> > { 0x7305944F, 0xFDDA, 0x44E3, 0xB1, 0x6C, 0x3F, 0x22, 0xD2, 0x52,
> > 0xE5, 0xD0 }
> >
> > Here is an example of ToUUID using a corrected version of the GUID
> string. Note that the ordering of the string is identical to the version
> in the ACPI spec:
> >
> >       11:      Name (UUID, ToUUID ("7305944F-FDDA-44E3-B16C-
> 3F22D252E5D0"))
> >
> >
> > Here is the AML output of the ToUUID macro. Note that the first three
> fields are reversed, the rest of the string is left as-is (as per the
> ToUUID definition):
> >
> > 00000024:  08 55 55 49 44 .........    ".UUID"
> > 00000029:  11 13 0A 10 4F 94 05 73     "....O..s"
> > 00000031:  DA FD E3 44 B1 6C 3F 22     "...D.l?""
> > 00000039:  D2 52 E5 D0 ............    ".R.."
> >
> > This is the important part:
> >
> >            4F 94 05 73     "....O..s"
> > 00000031:  DA FD E3 44 B1 6C 3F 22     "...D.l?""
> > 00000039:  D2 52 E5 D0
> >
> >
> > I believe that this is correct.
> 
> Looks good!
> 
> Thanks,
> -Toshi
> 


^ permalink raw reply	[flat|nested] 89+ messages in thread

* RE: [PATCH v3 14/21] libnd: blk labels and namespace instantiation
  2015-05-20 20:57   ` Dan Williams
  (?)
@ 2015-05-22 18:37     ` Elliott, Robert (Server Storage)
  -1 siblings, 0 replies; 89+ messages in thread
From: Elliott, Robert (Server Storage) @ 2015-05-22 18:37 UTC (permalink / raw)
  To: Dan Williams, axboe
  Cc: linux-nvdimm, neilb, gregkh, linux-kernel, hch, linux-acpi, mingo


> -----Original Message-----
> From: Linux-nvdimm [mailto:linux-nvdimm-bounces@lists.01.org] On Behalf Of
> Dan Williams
> Sent: Wednesday, May 20, 2015 3:57 PM
> To: axboe@kernel.dk
> Cc: linux-nvdimm@lists.01.org; neilb@suse.de; gregkh@linuxfoundation.org;
> linux-kernel@vger.kernel.org; hch@lst.de; linux-acpi@vger.kernel.org;
> mingo@kernel.org
> Subject: [PATCH v3 14/21] libnd: blk labels and namespace instantiation
> 
...
> @@ -1029,6 +1244,173 @@ static struct device **create_namespace_pmem(struct
> nd_region *nd_region)
>  	return NULL;
>  }
> 
> +struct resource *nsblk_add_resource(struct nd_region *nd_region,
> +		struct nd_dimm_drvdata *ndd, struct nd_namespace_blk *nsblk,
> +		resource_size_t start)
> +{
> +	struct nd_label_id label_id;
> +	struct resource *res;
> +
> +	nd_label_gen_id(&label_id, nsblk->uuid, NSLABEL_FLAG_LOCAL);
> +	nsblk->res = krealloc(nsblk->res,
> +			sizeof(void *) * (nsblk->num_resources + 1),
> +			GFP_KERNEL);
> +	if (!nsblk->res)
> +		return NULL;

scripts/checkpatch.pl doesn't like that:
WARNING: Reusing the krealloc arg is almost always a bug
#1411: FILE: drivers/block/nd/namespace_devs.c:1411:
+       nsblk->res = krealloc(nsblk->res,

The reasoning (https://lkml.org/lkml/2013/3/14/558) is:

"If krealloc() returns NULL, it *doesn't* free the original. So any 
code of the form 'foo = krealloc(foo, …);' is almost certainly a bug."
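
A minimal sketch of the usual way to avoid that pattern: take the result in a temporary and only overwrite the stored pointer on success. It uses userspace realloc(), which also leaves the original block allocated when it returns NULL, as a stand-in for krealloc(). struct res_table and res_table_add() are hypothetical names for illustration, not the fix that went into the series.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the pointer array kept in the namespace. */
struct res_table {
	void **res;
	size_t num_resources;
};

/*
 * Grow the array by one slot and store entry in it.
 * Returns 0 on success, -1 on allocation failure. On failure t->res
 * still points at the original array, so the caller can keep using it
 * or free it later.
 */
static int res_table_add(struct res_table *t, void *entry)
{
	void **tmp;

	tmp = realloc(t->res, sizeof(*tmp) * (t->num_resources + 1));
	if (!tmp)
		return -1;	/* original allocation is neither lost nor freed */

	t->res = tmp;
	t->res[t->num_resources++] = entry;
	return 0;
}
```

The same shape applies with krealloc(ptr, new_size, GFP_KERNEL) in kernel code: assign to a local first, check it, then update the struct member.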


---
Robert Elliott, HP Server Storage

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v3 14/21] libnd: blk labels and namespace instantiation
  2015-05-22 18:37     ` Elliott, Robert (Server Storage)
@ 2015-05-22 18:51       ` Dan Williams
  -1 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2015-05-22 18:51 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage)
  Cc: axboe, linux-nvdimm, neilb, gregkh, linux-kernel, hch, linux-acpi, mingo

On Fri, May 22, 2015 at 11:37 AM, Elliott, Robert (Server Storage)
<Elliott@hp.com> wrote:
>
>> -----Original Message-----
>> From: Linux-nvdimm [mailto:linux-nvdimm-bounces@lists.01.org] On Behalf Of
>> Dan Williams
>> Sent: Wednesday, May 20, 2015 3:57 PM
>> To: axboe@kernel.dk
>> Cc: linux-nvdimm@lists.01.org; neilb@suse.de; gregkh@linuxfoundation.org;
>> linux-kernel@vger.kernel.org; hch@lst.de; linux-acpi@vger.kernel.org;
>> mingo@kernel.org
>> Subject: [PATCH v3 14/21] libnd: blk labels and namespace instantiation
>>
> ...
>> @@ -1029,6 +1244,173 @@ static struct device **create_namespace_pmem(struct
>> nd_region *nd_region)
>>       return NULL;
>>  }
>>
>> +struct resource *nsblk_add_resource(struct nd_region *nd_region,
>> +             struct nd_dimm_drvdata *ndd, struct nd_namespace_blk *nsblk,
>> +             resource_size_t start)
>> +{
>> +     struct nd_label_id label_id;
>> +     struct resource *res;
>> +
>> +     nd_label_gen_id(&label_id, nsblk->uuid, NSLABEL_FLAG_LOCAL);
>> +     nsblk->res = krealloc(nsblk->res,
>> +                     sizeof(void *) * (nsblk->num_resources + 1),
>> +                     GFP_KERNEL);
>> +     if (!nsblk->res)
>> +             return NULL;
>
> scripts/checkpatch.pl doesn't like that:
> WARNING: Reusing the krealloc arg is almost always a bug
> #1411: FILE: drivers/block/nd/namespace_devs.c:1411:
> +       nsblk->res = krealloc(nsblk->res,
>
> The reasoning (https://lkml.org/lkml/2013/3/14/558) is:
>
> "If krealloc() returns NULL, it *doesn't* free the original. So any
> code of the form 'foo = krealloc(foo, …);' is almost certainly a bug."
>

Ok, will fix that up.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* RE: [PATCH v3 18/21] nd_btt: atomic sector updates
  2015-05-20 20:57   ` Dan Williams
@ 2015-05-22 21:16     ` Elliott, Robert (Server Storage)
  -1 siblings, 0 replies; 89+ messages in thread
From: Elliott, Robert (Server Storage) @ 2015-05-22 21:16 UTC (permalink / raw)
  To: Dan Williams, axboe
  Cc: mingo, linux-nvdimm, neilb, gregkh, Dave Chinner, linux-kernel,
	Andy Lutomirski, Jens Axboe, linux-acpi, H. Peter Anvin, hch



> -----Original Message-----
> From: Linux-nvdimm [mailto:linux-nvdimm-bounces@lists.01.org] On Behalf Of
> Dan Williams
> Sent: Wednesday, May 20, 2015 3:58 PM
> To: axboe@kernel.dk
> Cc: mingo@kernel.org; linux-nvdimm@lists.01.org; neilb@suse.de;
> gregkh@linuxfoundation.org; Dave Chinner; linux-kernel@vger.kernel.org; Andy
> Lutomirski; Jens Axboe; linux-acpi@vger.kernel.org; H. Peter Anvin;
> hch@lst.de
> Subject: [PATCH v3 18/21] nd_btt: atomic sector updates
> 
> From: Vishal Verma <vishal.l.verma@linux.intel.com>
> 
...
> diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
> index 00d9afe9475e..2b169806eac5 100644
> --- a/drivers/block/nd/Kconfig
> +++ b/drivers/block/nd/Kconfig
> @@ -32,9 +32,25 @@ config BLK_DEV_PMEM
>  	  capable of DAX (direct-access) file system mappings.  See
>  	  Documentation/blockdev/nd.txt for more details.
> 
> -	  Say Y if you want to use a NVDIMM described by NFIT
> +	  Say Y if you want to use a NVDIMM described by ACPI, E820, etc...
> 
>  config ND_BTT_DEVS
> -	def_bool y
> +	bool
> +
> +config ND_BTT
> +	tristate "BTT: Block Translation Table (atomic sector updates)"
> +	depends on LIBND
> +	default LIBND
> +	select ND_BTT_DEVS

The ND_BTT option, which is presented during a kernel build,
is missing help text. So is E820_PMEM in patch 3/21.

---
Robert Elliott, HP Server Storage

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v3 18/21] nd_btt: atomic sector updates
  2015-05-22 21:16     ` Elliott, Robert (Server Storage)
@ 2015-05-22 21:39       ` Dan Williams
  -1 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2015-05-22 21:39 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage)
  Cc: axboe, mingo, linux-nvdimm, neilb, gregkh, Dave Chinner,
	linux-kernel, Andy Lutomirski, Jens Axboe, linux-acpi,
	H. Peter Anvin, hch

On Fri, May 22, 2015 at 2:16 PM, Elliott, Robert (Server Storage)
<Elliott@hp.com> wrote:
>
>
>> -----Original Message-----
>> From: Linux-nvdimm [mailto:linux-nvdimm-bounces@lists.01.org] On Behalf Of
>> Dan Williams
>> Sent: Wednesday, May 20, 2015 3:58 PM
>> To: axboe@kernel.dk
>> Cc: mingo@kernel.org; linux-nvdimm@lists.01.org; neilb@suse.de;
>> gregkh@linuxfoundation.org; Dave Chinner; linux-kernel@vger.kernel.org; Andy
>> Lutomirski; Jens Axboe; linux-acpi@vger.kernel.org; H. Peter Anvin;
>> hch@lst.de
>> Subject: [PATCH v3 18/21] nd_btt: atomic sector updates
>>
>> From: Vishal Verma <vishal.l.verma@linux.intel.com>
>>
> ...
>> diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
>> index 00d9afe9475e..2b169806eac5 100644
>> --- a/drivers/block/nd/Kconfig
>> +++ b/drivers/block/nd/Kconfig
>> @@ -32,9 +32,25 @@ config BLK_DEV_PMEM
>>         capable of DAX (direct-access) file system mappings.  See
>>         Documentation/blockdev/nd.txt for more details.
>>
>> -       Say Y if you want to use a NVDIMM described by NFIT
>> +       Say Y if you want to use a NVDIMM described by ACPI, E820, etc...
>>
>>  config ND_BTT_DEVS
>> -     def_bool y
>> +     bool
>> +
>> +config ND_BTT
>> +     tristate "BTT: Block Translation Table (atomic sector updates)"
>> +     depends on LIBND
>> +     default LIBND
>> +     select ND_BTT_DEVS
>
> The ND_BTT option, which is presented during a kernel build,
> is missing help text. So is E820_PMEM in patch 3/21.
>

Right, but another alternative is hiding the ability to configure it
altogether.  Perhaps just build them in always.  I made them
configurable for those kernel size folks that like the ability to
throw away things they don't need, but this may be a degree of freedom
too far.  E820_PMEM is a bit more straightforward, and could use a
help comment like "you've already gone through the trouble to turn on
X86_PMEM_LEGACY, you had better turn on the driver too, otherwise
what's the point" :).

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v3 09/21] libnd, nd_pmem: add libnd support to the pmem driver
  2015-05-20 20:57   ` Dan Williams
  (?)
@ 2015-05-23 14:39   ` Christoph Hellwig
  2015-05-23 16:59     ` Dan Williams
  -1 siblings, 1 reply; 89+ messages in thread
From: Christoph Hellwig @ 2015-05-23 14:39 UTC (permalink / raw)
  To: Dan Williams
  Cc: axboe, linux-nvdimm, neilb, gregkh, linux-kernel,
	Andy Lutomirski, Jens Axboe, linux-acpi, H. Peter Anvin, hch,
	mingo

On Wed, May 20, 2015 at 04:57:00PM -0400, Dan Williams wrote:
> nd_pmem attaches to persistent memory regions and namespaces emitted by
> the libnd subsystem, and, same as the original pmem driver, presents the
> system-physical-address range as a block device.
> 
> The existing e820-type-12 to pmem setup is converted to a full libnd bus
> that emits an nd_namespace_io device.

This looks completely bonkers.  If you want to pretend the legacy
e820 NVDIMMs fit into your new world do that directly in
arch/x86/kernel/pmem.c instead of splitting it over two files.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v3 09/21] libnd, nd_pmem: add libnd support to the pmem driver
  2015-05-23 14:39   ` Christoph Hellwig
@ 2015-05-23 16:59     ` Dan Williams
  0 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2015-05-23 16:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Neil Brown, linux-nvdimm, H. Peter Anvin,
	Linux Kernel Mailing List, Andy Lutomirski, Jens Axboe,
	linux-acpi, Ingo Molnar, Greg Kroah-Hartman, Christoph Hellwig

On Sat, May 23, 2015 at 7:39 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Wed, May 20, 2015 at 04:57:00PM -0400, Dan Williams wrote:
>> nd_pmem attaches to persistent memory regions and namespaces emitted by
>> the libnd subsystem, and, same as the original pmem driver, presents the
>> system-physical-address range as a block device.
>>
>> The existing e820-type-12 to pmem setup is converted to a full libnd bus
>> that emits an nd_namespace_io device.
>
> This looks completely bonkers.  If you want to pretend the legacy
> e820 NVDIMMs fit into your new world do that directly in
> arch/x86/kernel/pmem.c instead of splitting it over two files.

I was looking to preserve the ability to keep libnd as a module, but
it doesn't really matter given the small number of systems that will
end up caring about X86_PMEM_LEGACY in the near term.  I'll skip the
platform device infrastructure and just register the pmem regions
directly from arch/x86/kernel/pmem.c.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* RE: [PATCH v3 20/21] nfit-test: manufactured NFITs for interface development
  2015-05-20 20:58   ` Dan Williams
@ 2015-05-25  7:02     ` Elliott, Robert (Server Storage)
  -1 siblings, 0 replies; 89+ messages in thread
From: Elliott, Robert (Server Storage) @ 2015-05-25  7:02 UTC (permalink / raw)
  To: Dan Williams, axboe
  Cc: linux-nvdimm, neilb, gregkh, Rafael J. Wysocki, linux-kernel,
	Robert Moore, linux-acpi, Lv Zheng, hch, mingo, Kani, Toshimitsu

[-- Attachment #1: Type: text/plain, Size: 667 bytes --]

> -----Original Message-----
> From: Linux-nvdimm [mailto:linux-nvdimm-bounces@lists.01.org] On Behalf
> Of Dan Williams
> Sent: Wednesday, May 20, 2015 3:58 PM
> To: axboe@kernel.dk
> Subject: [PATCH v3 20/21] nfit-test: manufactured NFITs for interface
> development
...

Attached is some experimental code to try pmem with different 
cache types (UC, WB, WC, and WT) and memcpy functions using x86 
AVX non-temporal load and store instructions.

It depends on Toshi's WT patch series:
	https://lkml.org/lkml/2015/5/13/866

If you don't have that, you can just comment out the lines related
to ioremap_wt.
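
For anyone who wants to try the non-temporal copy idea outside the kernel first, here is a rough userspace sketch. It is not the attached patch: it uses the 16-byte SSE2 intrinsics (_mm_load_si128, _mm_stream_si128) rather than the 32-byte AVX instructions, it falls back to memcpy() when SSE2 is not available, and nt_copy_64() is a hypothetical name. It returns an error on bad alignment or size instead of using BUG_ON.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>	/* _mm_load_si128, _mm_stream_si128, _mm_sfence */
#endif

/*
 * Copy size bytes in 64-byte chunks using normal loads and
 * non-temporal stores. Both pointers must be 64-byte aligned and
 * size must be a multiple of 64.
 * Returns 0 on success, -1 if the alignment or size rules are not met.
 */
static int nt_copy_64(void *to, const void *from, size_t size)
{
	if (size % 64 || (uintptr_t)to % 64 || (uintptr_t)from % 64)
		return -1;

#if defined(__SSE2__)
	{
		__m128i *dst = to;
		const __m128i *src = from;
		size_t i;

		/* Four 16-byte registers per iteration = 64 bytes. */
		for (i = 0; i < size / 16; i += 4) {
			__m128i r0 = _mm_load_si128(src + i);
			__m128i r1 = _mm_load_si128(src + i + 1);
			__m128i r2 = _mm_load_si128(src + i + 2);
			__m128i r3 = _mm_load_si128(src + i + 3);

			_mm_stream_si128(dst + i, r0);
			_mm_stream_si128(dst + i + 1, r1);
			_mm_stream_si128(dst + i + 2, r2);
			_mm_stream_si128(dst + i + 3, r3);
		}
		/* Order the streaming stores before any later stores. */
		_mm_sfence();
	}
#else
	/* No SSE2: plain copy, without the non-temporal behavior. */
	memcpy(to, from, size);
#endif
	return 0;
}
```

The trailing sfence mirrors the one at the end of each copy routine in the patch.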

---
Rob Elliott, HP Server Storage


[-- Attachment #2: 0001-pmem-cache-type --]
[-- Type: application/octet-stream, Size: 19027 bytes --]

From 18e75a7134e0130b925fffab13f41c1ffc4d9f05 Mon Sep 17 00:00:00 2001
From: Robert Elliott <elliott@hp.com>
Date: Fri, 22 May 2015 16:46:21 -0500
Subject: [PATCH] pmem cache type patch

Author: Robert Elliott <elliott@hp.com>
Date:   Tue Apr 28 19:14:53 2015 -0500

    pmem: cache_type, non-temporal memcpy experiments

    WARNING: Not for inclusion in the kernel - just for experimentation.

    Add modparams to select cache_type and various kinds of
    memcpy with non-temporal loads and stores.  Parameters
    are printed to the kernel serial log at module load time.

    Example usage:
    modprobe pmem pmem_cachetype=2 pmem_readscan=2 pmem_ntw=1 pmem_ntr=1

    x86 offers several non-temporal instructions:
    *  8 byte: movnti (store) from normal registers
    * 16 byte: movntdq (store) and movntdqa (load) using xmm registers (SSE)
    * 32 byte: vmovntdq and vmovntdqa using ymm registers (AVX)
    * 64 byte: vmovntdq and vmovntdqa using zmm registers (AVX512)

    The 32-byte AVX instructions are used by this patch.

    Normal memcpy is used for unaligned pmem_rw_bytes accesses,
    so those accesses are unsafe in WB mode.

    Module parameters
    =================
    pmem_cachetype=n	(default 0)
    	Select the cache type (which ioremap function to use to
    	map the NVDIMM memory)
    	0 = UC (uncacheable) - slow writes, slow reads
    	1 = WB (writeback) - fast unsafe writes, fast reads
    	2 = WC (write combining) - fast writes, slow reads
    	3 = WT (writethrough) - slow writes, fast reads

    	WB writes are safe if:
    	* non-temporal stores are exclusively used
    	* clflush instructions are added

    pmem_readscan=n		(default 0)
    	0 = no read scan
    	1 = read the entire memory range, looking to trigger
    	UC memory errors

    	The rate is also printed, serving as a quick performance
    	check (uses a 64 byte loop with NT loads).

    pmem_clean=n		(default 0)
    	0 = no clean
    	1 = overwrite the entire memory range, possibly
    	clearing UC memory errors (dangerous, destroys
    	all data)

    	The rate is also printed, serving as a quick performance
    	check (uses a 64 byte loop with NT stores).

    pmem_ntw=n		(default 3)
    	Use non-temporal stores when writing persistent memory

    	0 = memcpy (unsafe for WB)
    	1 = 64 byte loop with NT stores
    	2 = 128 byte loop with NT stores
    	3 = 64 byte loop with NT stores, plus use NT loads from
    	  normal memory (may be better cache usage)
    	4 = 128 byte loop with NT stores, plus use NT loads from
    	  normal memory
    	5 = __copy_from_user (existing kernel function with
    	  8 byte NT instructions)
    	6 = no write at all (nop)(dangerous)
    	7 = 64-byte loop, store only (write garbage)(dangerous)

    pmem_ntr=n		(default 3)
    	Use non-temporal loads when reading persistent memory

    	0 = memcpy
    	1 = 64 byte loop with NT loads
    	2 = 128 byte loop with NT loads
    	3 = 64 byte loop with NT loads, plus use NT stores to
    	  normal memory
    	4 = 128 byte loop with NT loads, plus use NT stores to
    	  normal memory
    	5 = memcpy
    	6 = no load at all (nop)(dangerous)
    	7 = 64-byte loop, load only (return garbage)(dangerous)

    pmem_ntw=6 pmem_ntr=6 exhibits the block layer IOPS limits.

    Signed-off-by: Robert Elliott <elliott@hp.com>
---
 drivers/block/nd/pmem.c | 550 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 539 insertions(+), 11 deletions(-)

diff --git a/drivers/block/nd/pmem.c b/drivers/block/nd/pmem.c
index 7b5cedf1f2a4..f378ef81733f 100644
--- a/drivers/block/nd/pmem.c
+++ b/drivers/block/nd/pmem.c
@@ -26,6 +26,382 @@
 #include <linux/nd.h>
 #include "nd.h"
 
+static int pmem_cachetype;	/* default UC */
+module_param(pmem_cachetype, int, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(pmem_cachetype,
+	"Select cache attribute for pmem driver (0=UC, 1=WB 2=WC 3=WT)");
+
+static int pmem_readscan;
+module_param(pmem_readscan, int, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(pmem_readscan,
+	"Read scan pmem device upon init (trigger ECC errors)");
+
+static int pmem_clean;
+module_param(pmem_clean, int, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(pmem_clean,
+	"Clean pmem device upon init (write garbage, but cleans the ECC)");
+
+static int pmem_ntw = 3;
+module_param(pmem_ntw, int, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(pmem_ntw,
+	"Use non-temporal stores for block writes in pmem (0=memcpy, 1=64 byte NT, 2=128 byte NT, 3=64 dual NT, 4=128 dual NT, 5=copy_from_user, 6=nop, 7=64-byte NT-store only)");
+
+static int pmem_ntr = 3;
+module_param(pmem_ntr, int, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(pmem_ntr,
+	"Use non-temporal loads for block reads in pmem (0=memcpy, 1=64 byte NT, 2=128 byte NT, 3=64 dual NT, 4=128 dual NT, 5=memcpy, 6=nop, 7=64-byte NT-load only)");
+
+/* load: normal, store: non-temporal, loop: 64 bytes */
+static void memcpy_lt_snt_64(void *to, const void *from, size_t size)
+{
+	u64 bs = 64;
+	int i;
+
+	BUG_ON(!IS_ALIGNED(size, bs));
+	BUG_ON(!IS_ALIGNED((u64)to, bs));
+	BUG_ON(!IS_ALIGNED((u64)from, bs));
+
+	for (i = 0; i < size; i += bs) {
+		__asm__ __volatile__ (
+#if 0
+		/* 16-byte SSE instructions */
+		"movdqa (%0), %%xmm0\n"
+		"movdqa 16(%0), %%xmm1\n"
+		"movdqa 32(%0), %%xmm2\n"
+		"movdqa 48(%0), %%xmm3\n"
+		"movntdq %%xmm0, (%1)\n"
+		"movntdq %%xmm1, 16(%1)\n"
+		"movntdq %%xmm2, 32(%1)\n"
+		"movntdq %%xmm3, 48(%1)\n"
+#endif
+		/* 32-byte AVX instructions */
+		"vmovdqa (%0), %%ymm0\n"
+		"vmovdqa 32(%0), %%ymm1\n"
+		"vmovntdq %%ymm0, (%1)\n"
+		"vmovntdq %%ymm1, 32(%1)\n"
+		:
+		: "r" (from), "r" (to)
+		: "memory");
+
+		to += bs;
+		from += bs;
+	}
+
+	__asm__ __volatile__ (
+		" sfence\n" : :
+	);
+}
+
+/* load: skip, store: non-temporal, loop: 64 bytes */
+static void memcpy_lskip_snt_64(void *to, const void *from, size_t size)
+{
+	u64 bs = 64;
+	int i;
+
+	BUG_ON(!IS_ALIGNED(size, bs));
+	BUG_ON(!IS_ALIGNED((u64)to, bs));
+	BUG_ON(!IS_ALIGNED((u64)from, bs));
+
+	for (i = 0; i < size; i += bs) {
+		__asm__ __volatile__ (
+#if 0
+		"movntdq %%xmm0, (%1)\n"
+		"movntdq %%xmm1, 16(%1)\n"
+		"movntdq %%xmm2, 32(%1)\n"
+		"movntdq %%xmm3, 48(%1)\n"
+#endif
+		"vmovntdq %%ymm0, (%1)\n"
+		"vmovntdq %%ymm1, 32(%1)\n"
+		:
+		: "r" (from), "r" (to)
+		: "memory");
+
+		to += bs;
+		from += bs;
+	}
+
+	__asm__ __volatile__ (
+		" sfence\n" : :
+	);
+}
+
+/* load: non-temporal, store: non-temporal, loop: 64 bytes */
+static void memcpy_lnt_snt_64(void *to, const void *from, size_t size)
+{
+	u64 bs = 64;
+	int i;
+
+	BUG_ON(!IS_ALIGNED(size, bs));
+	BUG_ON(!IS_ALIGNED((u64)to, bs));
+	BUG_ON(!IS_ALIGNED((u64)from, bs));
+
+	for (i = 0; i < size; i += bs) {
+		__asm__ __volatile__ (
+#if 0
+		"movntdqa (%0), %%xmm0\n"
+		"movntdqa 16(%0), %%xmm1\n"
+		"movntdqa 32(%0), %%xmm2\n"
+		"movntdqa 48(%0), %%xmm3\n"
+		"movntdq %%xmm0, (%1)\n"
+		"movntdq %%xmm1, 16(%1)\n"
+		"movntdq %%xmm2, 32(%1)\n"
+		"movntdq %%xmm3, 48(%1)\n"
+#endif
+		"vmovntdqa (%0), %%ymm0\n"
+		"vmovntdqa 32(%0), %%ymm1\n"
+		"vmovntdq %%ymm0, (%1)\n"
+		"vmovntdq %%ymm1, 32(%1)\n"
+		:
+		: "r" (from), "r" (to)
+		: "memory");
+
+		to += bs;
+		from += bs;
+	}
+
+	__asm__ __volatile__ (
+		" sfence\n" : :
+	);
+}
+
+/* load: normal, store: non-temporal, loop: 128 bytes */
+static void memcpy_lt_snt_128(void *to, const void *from, size_t size)
+{
+	u64 bs = 128;
+	int i;
+
+	BUG_ON(!IS_ALIGNED(size, bs));
+	BUG_ON(!IS_ALIGNED((u64)to, bs));
+	BUG_ON(!IS_ALIGNED((u64)from, bs));
+
+	for (i = 0; i < size; i += bs) {
+		__asm__ __volatile__ (
+#if 0
+		/* hard to use prefetch effectively */
+		"prefetchnta 128(%0)\n"
+		"prefetchnta 192(%0)\n"
+#endif
+#if 0
+		"movdqa (%0), %%xmm0\n"
+		"movdqa 16(%0), %%xmm1\n"
+		"movdqa 32(%0), %%xmm2\n"
+		"movdqa 48(%0), %%xmm3\n"
+		"movdqa 64(%0), %%xmm4\n"
+		"movdqa 80(%0), %%xmm5\n"
+		"movdqa 96(%0), %%xmm6\n"
+		"movdqa 112(%0), %%xmm7\n"
+		"movntdq %%xmm0, (%1)\n"
+		"movntdq %%xmm1, 16(%1)\n"
+		"movntdq %%xmm2, 32(%1)\n"
+		"movntdq %%xmm3, 48(%1)\n"
+		"movntdq %%xmm4, 64(%1)\n"
+		"movntdq %%xmm5, 80(%1)\n"
+		"movntdq %%xmm6, 96(%1)\n"
+		"movntdq %%xmm7, 112(%1)\n"
+#endif
+		"vmovdqa (%0), %%ymm0\n"
+		"vmovdqa 32(%0), %%ymm1\n"
+		"vmovdqa 64(%0), %%ymm2\n"
+		"vmovdqa 96(%0), %%ymm3\n"
+		"vmovntdq %%ymm0, (%1)\n"
+		"vmovntdq %%ymm1, 32(%1)\n"
+		"vmovntdq %%ymm2, 64(%1)\n"
+		"vmovntdq %%ymm3, 96(%1)\n"
+		:
+		: "r" (from), "r" (to)
+		: "memory");
+
+		to += bs;
+		from += bs;
+	}
+
+	__asm__ __volatile__ (
+		" sfence\n" : :
+	);
+}
+
+/* load: non-temporal, store: non-temporal, loop: 128 bytes */
+static void memcpy_lnt_snt_128(void *to, const void *from, size_t size)
+{
+	u64 bs = 128;
+	int i;
+
+	BUG_ON(!IS_ALIGNED(size, bs));
+	BUG_ON(!IS_ALIGNED((u64)to, bs));
+	BUG_ON(!IS_ALIGNED((u64)from, bs));
+
+	for (i = 0; i < size; i += bs) {
+		__asm__ __volatile__ (
+#if 0
+		"prefetchnta 128(%0)\n"
+		"prefetchnta 192(%0)\n"
+#endif
+#if 0
+		"movntdqa (%0), %%xmm0\n"
+		"movntdqa 16(%0), %%xmm1\n"
+		"movntdqa 32(%0), %%xmm2\n"
+		"movntdqa 48(%0), %%xmm3\n"
+		"movntdqa 64(%0), %%xmm4\n"
+		"movntdqa 80(%0), %%xmm5\n"
+		"movntdqa 96(%0), %%xmm6\n"
+		"movntdqa 112(%0), %%xmm7\n"
+		"movntdq %%xmm0, (%1)\n"
+		"movntdq %%xmm1, 16(%1)\n"
+		"movntdq %%xmm2, 32(%1)\n"
+		"movntdq %%xmm3, 48(%1)\n"
+		"movntdq %%xmm4, 64(%1)\n"
+		"movntdq %%xmm5, 80(%1)\n"
+		"movntdq %%xmm6, 96(%1)\n"
+		"movntdq %%xmm7, 112(%1)\n"
+#endif
+		"vmovntdqa (%0), %%ymm0\n"
+		"vmovntdqa 32(%0), %%ymm1\n"
+		"vmovntdqa 64(%0), %%ymm2\n"
+		"vmovntdqa 96(%0), %%ymm3\n"
+		"vmovntdq %%ymm0, (%1)\n"
+		"vmovntdq %%ymm1, 32(%1)\n"
+		"vmovntdq %%ymm2, 64(%1)\n"
+		"vmovntdq %%ymm3, 96(%1)\n"
+		:
+		: "r" (from), "r" (to)
+		: "memory");
+
+		to += bs;
+		from += bs;
+	}
+
+	__asm__ __volatile__ (
+		" sfence\n" : :
+	);
+}
+
+/* load: non-temporal, store: normal, loop: 64 bytes */
+static void memcpy_lnt_st_64(void *to, const void *from, size_t size)
+{
+	u64 bs = 64;
+	int i;
+
+	BUG_ON(!IS_ALIGNED(size, bs));
+	BUG_ON(!IS_ALIGNED((u64)to, bs));
+	BUG_ON(!IS_ALIGNED((u64)from, bs));
+
+	for (i = 0; i < size; i += bs) {
+		__asm__ __volatile__ (
+#if 0
+		"movntdqa (%0), %%xmm0\n"
+		"movntdqa 16(%0), %%xmm1\n"
+		"movntdqa 32(%0), %%xmm2\n"
+		"movntdqa 48(%0), %%xmm3\n"
+		"movdqa %%xmm0, (%1)\n"
+		"movdqa %%xmm1, 16(%1)\n"
+		"movdqa %%xmm2, 32(%1)\n"
+		"movdqa %%xmm3, 48(%1)\n"
+#endif
+		"vmovntdqa (%0), %%ymm0\n"
+		"vmovntdqa 32(%0), %%ymm1\n"
+		"vmovdqa %%ymm0, (%1)\n"
+		"vmovdqa %%ymm1, 32(%1)\n"
+		:
+		: "r" (from), "r" (to)
+		: "memory");
+
+		to += bs;
+		from += bs;
+	}
+
+	__asm__ __volatile__ (
+		" sfence\n" : :
+	);
+}
+
+/* load: non-temporal, store: skip, loop: 64 bytes */
+static void memcpy_lnt_sskip_64(void *to, const void *from, size_t size)
+{
+	u64 bs = 64;
+	int i;
+
+	BUG_ON(!IS_ALIGNED(size, bs));
+	BUG_ON(!IS_ALIGNED((u64)to, bs));
+	BUG_ON(!IS_ALIGNED((u64)from, bs));
+
+	for (i = 0; i < size; i += bs) {
+		__asm__ __volatile__ (
+#if 0
+		"movntdqa (%0), %%xmm0\n"
+		"movntdqa 16(%0), %%xmm1\n"
+		"movntdqa 32(%0), %%xmm2\n"
+		"movntdqa 48(%0), %%xmm3\n"
+#endif
+		"vmovntdqa (%0), %%ymm0\n"
+		"vmovntdqa 32(%0), %%ymm1\n"
+		:
+		: "r" (from), "r" (to)
+		: "memory");
+
+		to += bs;
+		from += bs;
+	}
+
+	__asm__ __volatile__ (
+		" sfence\n" : :
+	);
+}
+
+/* load: non-temporal, store: normal, loop: 128 bytes */
+static void memcpy_lnt_st_128(void *to, const void *from, size_t size)
+{
+	u64 bs = 128;
+	int i;
+
+	BUG_ON(!IS_ALIGNED(size, bs));
+	BUG_ON(!IS_ALIGNED((u64)to, bs));
+	BUG_ON(!IS_ALIGNED((u64)from, bs));
+
+	for (i = 0; i < size; i += bs) {
+		__asm__ __volatile__ (
+#if 0
+		"prefetchnta 128(%0)\n"
+		"prefetchnta 192(%0)\n"
+#endif
+#if 0
+		"movntdqa (%0), %%xmm0\n"
+		"movntdqa 16(%0), %%xmm1\n"
+		"movntdqa 32(%0), %%xmm2\n"
+		"movntdqa 48(%0), %%xmm3\n"
+		"movntdqa 64(%0), %%xmm4\n"
+		"movntdqa 80(%0), %%xmm5\n"
+		"movntdqa 96(%0), %%xmm6\n"
+		"movntdqa 112(%0), %%xmm7\n"
+		"movdqa %%xmm0, (%1)\n"
+		"movdqa %%xmm1, 16(%1)\n"
+		"movdqa %%xmm2, 32(%1)\n"
+		"movdqa %%xmm3, 48(%1)\n"
+		"movdqa %%xmm4, 64(%1)\n"
+		"movdqa %%xmm5, 80(%1)\n"
+		"movdqa %%xmm6, 96(%1)\n"
+		"movdqa %%xmm7, 112(%1)\n"
+#endif
+		"vmovntdqa (%0), %%ymm0\n"
+		"vmovntdqa 32(%0), %%ymm1\n"
+		"vmovntdqa 64(%0), %%ymm2\n"
+		"vmovntdqa 96(%0), %%ymm3\n"
+		"vmovdqa %%ymm0, (%1)\n"
+		"vmovdqa %%ymm1, 32(%1)\n"
+		"vmovdqa %%ymm2, 64(%1)\n"
+		"vmovdqa %%ymm3, 96(%1)\n"
+		:
+		: "r" (from), "r" (to)
+		: "memory");
+
+		to += bs;
+		from += bs;
+	}
+
+	__asm__ __volatile__ (
+		" sfence\n" : :
+	);
+}
+
 struct pmem_device {
 	struct request_queue	*pmem_queue;
 	struct gendisk		*pmem_disk;
@@ -37,6 +413,81 @@ struct pmem_device {
 	size_t			size;
 };
 
+/* pick the type of memcpy for a read from NVDIMMs */
+static void memcpy_ntr(void *to, const void *from, size_t size)
+{
+	switch (pmem_ntr) {
+	case 1:
+		memcpy_lnt_st_64(to, from, size);
+		break;
+	case 2:
+		memcpy_lnt_st_128(to, from, size);
+		break;
+	case 3:
+		memcpy_lnt_snt_64(to, from, size);
+		break;
+	case 4:
+		memcpy_lnt_snt_128(to, from, size);
+		break;
+	case 6:
+		/* nop */
+		break;
+	case 7:
+		memcpy_lnt_sskip_64(to, from, size);
+		break;
+	default:
+		memcpy(to, from, size);
+		break;
+	}
+}
+
+/* pick the type of memcpy for a write to NVDIMMs */
+static void memcpy_ntw(void *to, const void *from, size_t size)
+{
+	int ret;
+	switch (pmem_ntw) {
+	case 1:
+		memcpy_lt_snt_64(to, from, size);
+		ret = 0;
+		break;
+	case 2:
+		memcpy_lt_snt_128(to, from, size);
+		ret = 0;
+		break;
+	case 3:
+		memcpy_lnt_snt_64(to, from, size);
+		ret = 0;
+		break;
+	case 4:
+		memcpy_lnt_snt_128(to, from, size);
+		ret = 0;
+		break;
+	case 5:
+		ret = __copy_from_user(to, from, size);
+		if (ret)
+			goto exit;
+	case 6:
+		/* nop */
+		ret = 0;
+		break;
+	case 7:
+		memcpy_lskip_snt_64(to, from, size);
+		ret = 0;
+		break;
+	default:
+		memcpy(to, from, size);
+		ret = 0;
+		break;
+	}
+exit:
+	/* if __copy_from_user or other memcpy functions with return
+	 * values are used, the return value should really be
+	 * propagated upstream. Since most memcpys assume success,
+	 * forgo this for now
+	 */
+	return;
+}
+
 static int pmem_major;
 
 static void pmem_do_bvec(struct pmem_device *pmem, struct page *page,
@@ -47,11 +498,11 @@ static void pmem_do_bvec(struct pmem_device *pmem, struct page *page,
 	size_t pmem_off = sector << 9;
 
 	if (rw == READ) {
-		memcpy(mem + off, pmem->virt_addr + pmem_off, len);
+		memcpy_ntr(mem + off, pmem->virt_addr + pmem_off, len);
 		flush_dcache_page(page);
 	} else {
 		flush_dcache_page(page);
-		memcpy(pmem->virt_addr + pmem_off, mem + off, len);
+		memcpy_ntw(pmem->virt_addr + pmem_off, mem + off, len);
 	}
 
 	kunmap_atomic(mem);
@@ -109,10 +560,26 @@ static int pmem_rw_bytes(struct nd_io *ndio, void *buf, size_t offset,
 		return -EFAULT;
 	}
 
-	if (rw == READ)
-		memcpy(buf, pmem->virt_addr + offset, n);
-	else
-		memcpy(pmem->virt_addr + offset, buf, n);
+	/* NOTE: Plain memcpy is used for unaligned accesses, meaning
+	 * this is not safe for WB mode.
+	 *
+	 * All btt accesses come through here; many are not aligned.
+	 */
+	if (rw == READ) {
+		if (IS_ALIGNED((u64) buf, 64) &&
+		    IS_ALIGNED((u64) pmem->virt_addr + offset, 64) &&
+		    IS_ALIGNED(n, 64))
+			memcpy_ntr(buf, pmem->virt_addr + offset, n);
+		else
+			memcpy(buf, pmem->virt_addr + offset, n);
+	} else {
+		if (IS_ALIGNED((u64) buf, 64) &&
+		    IS_ALIGNED((u64) pmem->virt_addr + offset, 64) &&
+		    IS_ALIGNED(n, 64))
+			memcpy_ntw(pmem->virt_addr + offset, buf, n);
+		else
+			memcpy(pmem->virt_addr + offset, buf, n);
+	}
 
 	return 0;
 }
@@ -143,6 +610,7 @@ static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res,
 	struct pmem_device *pmem;
 	struct gendisk *disk;
 	int err;
+	u64 ts, te;
 
 	err = -ENOMEM;
 	pmem = kzalloc(sizeof(*pmem), GFP_KERNEL);
@@ -152,21 +620,78 @@ static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res,
 	pmem->phys_addr = res->start;
 	pmem->size = resource_size(res);
 
+	dev_info(dev,
+		"mapping phys=0x%llx (%lld GiB) size=0x%zx (%ld GiB)\n",
+		pmem->phys_addr, pmem->phys_addr / (1024*1024*1024),
+		pmem->size, pmem->size / (1024*1024*1024));
+
 	err = -EINVAL;
 	if (!request_mem_region(pmem->phys_addr, pmem->size, "pmem")) {
 		dev_warn(dev, "could not reserve region [0x%pa:0x%zx]\n", &pmem->phys_addr, pmem->size);
 		goto out_free_dev;
 	}
 
-	/*
-	 * Map the memory as non-cachable, as we can't write back the contents
-	 * of the CPU caches in case of a crash.
-	 */
 	err = -ENOMEM;
-	pmem->virt_addr = ioremap_nocache(pmem->phys_addr, pmem->size);
+	switch (pmem_cachetype) {
+	case 0: /* UC */
+		pmem->virt_addr = ioremap_nocache(pmem->phys_addr, pmem->size);
+		break;
+	case 1: /* WB */
+		/* WB is unsafe unless system flushes caches on power loss */
+		pmem->virt_addr = ioremap_cache(pmem->phys_addr, pmem->size);
+		break;
+	case 2: /* WC */
+		/* WC is unsafe unless system flushes buffers on power loss */
+		pmem->virt_addr = ioremap_wc(pmem->phys_addr, pmem->size);
+		break;
+	case 3: /* WT */
+	default:
+		pmem->virt_addr = ioremap_wt(pmem->phys_addr, pmem->size);
+		break;
+	}
+
+	dev_info(dev,
+		"mapped: cache_type=%d virt=0x%p phys=0x%llx (%lld GiB) size=0x%zx (%ld GiB)\n",
+		pmem_cachetype,
+		pmem->virt_addr,
+		pmem->phys_addr, pmem->phys_addr / (1024*1024*1024),
+		pmem->size, pmem->size / (1024*1024*1024));
+
 	if (!pmem->virt_addr)
 		goto out_release_region;
 
+	if (pmem_clean) {
+		/* write all of NVDIMM memory to clear any ECC errors */
+		dev_info(dev,
+			"write clean starting: virt=0x%p phys=0x%llx (%lld GiB) size=0x%zx (%ld GiB)\n",
+			pmem->virt_addr,
+			pmem->phys_addr, pmem->phys_addr / (1024*1024*1024),
+			pmem->size, pmem->size / (1024*1024*1024));
+		ts = local_clock();
+		memcpy_lskip_snt_64(pmem->virt_addr, NULL, pmem->size);
+		te = local_clock();
+		dev_info(dev,
+			"write clean complete: ct=%d in %lld GB/s\n",
+			pmem_cachetype,
+			pmem->size / (te - ts));	/* B/ns equals GB/s */
+	}
+
+	/* read all of NVDIMM memory to trigger any ECC errors now */
+	if (pmem_readscan) {
+		dev_info(dev,
+			"read scan starting: virt=0x%p phys=0x%llx (%lld GiB) size=0x%zx (%ld GiB)\n",
+			pmem->virt_addr,
+			pmem->phys_addr, pmem->phys_addr / (1024*1024*1024),
+			pmem->size, pmem->size / (1024*1024*1024));
+		ts = local_clock();
+		memcpy_lnt_sskip_64(0, pmem->virt_addr, pmem->size);
+		te = local_clock();
+		dev_info(dev,
+			"read scan complete: ct=%d in %lld GB/s\n",
+			pmem_cachetype,
+			pmem->size / (te - ts));	/* B/ns equals GB/s */
+	}
+
 	pmem->pmem_queue = blk_alloc_queue(GFP_KERNEL);
 	if (!pmem->pmem_queue)
 		goto out_unmap;
@@ -276,6 +801,9 @@ static int __init pmem_init(void)
 {
 	int error;
 
+	pr_info("pmem loading with pmem_readscan=%d pmem_clean=%d pmem_cachetype=%d pmem_ntw=%d pmem_ntr=%d\n",
+		pmem_readscan, pmem_clean, pmem_cachetype, pmem_ntw, pmem_ntr);
+
 	pmem_major = register_blkdev(0, "pmem");
 	if (pmem_major < 0)
 		return pmem_major;
-- 
1.8.3.1
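
A side note on the rate lines printed by the patch above: they divide
the region size in bytes by the elapsed local_clock() time in
nanoseconds. One byte per nanosecond is 10^9 bytes per second, i.e. one
decimal GB/s, and because the division is done on integers the result
is truncated to whole GB/s. The user-space sketch below works through
that arithmetic. The names rate_gb_per_s and rate_mb_per_s are made up
for illustration and do not appear in the patch; the MB/s variant is
only a suggestion for getting finer resolution.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not in the patch): bytes / ns is numerically
 * equal to decimal GB/s. Integer division truncates, so anything
 * below 1 GB/s reports as 0.
 */
uint64_t rate_gb_per_s(uint64_t bytes, uint64_t ns)
{
	assert(ns != 0);	/* a zero interval would divide by zero */
	return bytes / ns;
}

/*
 * Hypothetical finer-grained variant: scale by 1000 first to get
 * decimal MB/s. The multiplication overflows a u64 only for sizes
 * above roughly 16 PiB, which is assumed not to matter here.
 */
uint64_t rate_mb_per_s(uint64_t bytes, uint64_t ns)
{
	assert(ns != 0);
	return bytes * 1000 / ns;
}
```

For example, 8 GiB (8589934592 bytes) covered in 4 s (4000000000 ns)
gives 2 from the GB/s form and 2147 from the MB/s form.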


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* RE: [PATCH v3 20/21] nfit-test: manufactured NFITs for interface development
@ 2015-05-25  7:02     ` Elliott, Robert (Server Storage)
  0 siblings, 0 replies; 89+ messages in thread
From: Elliott, Robert (Server Storage) @ 2015-05-25  7:02 UTC (permalink / raw)
  To: Dan Williams, axboe
  Cc: linux-nvdimm@lists.01.org, neilb, gregkh, Rafael J. Wysocki,
	linux-kernel, Robert Moore, linux-acpi, Lv Zheng, hch, mingo,
	Kani, Toshimitsu, Christoph Hellwig,
	Boaz Harrosh (boaz@plexistor.com)

[-- Attachment #1: Type: text/plain, Size: 667 bytes --]

> -----Original Message-----
> From: Linux-nvdimm [mailto:linux-nvdimm-bounces@lists.01.org] On Behalf
> Of Dan Williams
> Sent: Wednesday, May 20, 2015 3:58 PM
> To: axboe@kernel.dk
> Subject: [PATCH v3 20/21] nfit-test: manufactured NFITs for interface
> development
...

Attached is some experimental code to try pmem with different 
cache types (UC, WB, WC, and WT) and memcpy functions using x86 
AVX non-temporal load and store instructions.

It depends on Toshi's WT patch series:
	https://lkml.org/lkml/2015/5/13/866

If you don't have that, you can just comment out the lines related
to ioremap_wt.
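
To illustrate the alignment gate that the attached patch adds to
pmem_rw_bytes: the 64-byte loop routines BUG_ON() unless destination,
source, and length are all multiples of 64, so only such requests are
routed to them and everything else falls back to plain memcpy. Below is
a minimal user-space sketch of that check. It is not the kernel code:
copy_blocks_64 is a hypothetical stand-in that copies each block with
ordinary memcpy instead of AVX non-temporal instructions, and copy_gated
and its return value are invented here to make the chosen path visible.
The loop index is a size_t in this sketch so that it also handles
lengths above INT_MAX.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 64u

/* True if the pointer value is a multiple of the 64-byte block size. */
static int is_aligned64(const void *p)
{
	return ((uintptr_t)p % BLOCK) == 0;
}

/*
 * Stand-in for the patch's 64-byte loop routines. A real version
 * would issue non-temporal loads/stores; this one uses memcpy per
 * block so it runs anywhere.
 */
void copy_blocks_64(void *to, const void *from, size_t size)
{
	unsigned char *d = to;
	const unsigned char *s = from;
	size_t i;

	assert(size % BLOCK == 0);
	assert(is_aligned64(to) && is_aligned64(from));

	for (i = 0; i < size; i += BLOCK)
		memcpy(d + i, s + i, BLOCK);
}

/*
 * Use the block routine only when both pointers and the length are
 * 64-byte aligned; otherwise fall back to plain memcpy.
 * Returns 1 if the block path was taken, 0 for the fallback.
 */
int copy_gated(void *to, const void *from, size_t n)
{
	if (is_aligned64(to) && is_aligned64(from) && n % BLOCK == 0) {
		copy_blocks_64(to, from, n);
		return 1;
	}
	memcpy(to, from, n);
	return 0;
}
```

With two 64-byte-aligned 128-byte buffers, copy_gated(dst, src, 128)
takes the block path, while copy_gated(dst + 1, src, 64) or a length of
100 takes the memcpy fallback.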

---
Rob Elliott, HP Server Storage


[-- Attachment #2: 0001-pmem-cache-type --]
[-- Type: application/octet-stream, Size: 19027 bytes --]

From 18e75a7134e0130b925fffab13f41c1ffc4d9f05 Mon Sep 17 00:00:00 2001
From: Robert Elliott <elliott@hp.com>
Date: Fri, 22 May 2015 16:46:21 -0500
Subject: [PATCH] pmem cache type patch

Author: Robert Elliott <elliott@hp.com>
Date:   Tue Apr 28 19:14:53 2015 -0500

    pmem: cache_type, non-temporal memcpy experiments

    WARNING: Not for inclusion in the kernel - just for experimentation.

    Add modparams to select cache_type and various kinds of
    memcpy with non-temporal loads and stores.  Parameters
    are printed to the kernel serial log at module load time.

    Example usage:
    modprobe pmem pmem_cachetype=2 pmem_readscan=1 pmem_ntw=1 pmem_ntr=1

    x86 offers several non-temporal instructions:
    *  8 byte: movnti (store) from normal registers
    * 16 byte: movntdq (store) and movntdqa (load) using xmm registers (SSE)
    * 32 byte: vmovntdq and vmovntdqa using ymm registers (AVX)
    * 64 byte: vmovntdq and vmovntdqa using zmm registers (AVX512)

    The 32-byte AVX instructions are used by this patch.

    Normal memcpy is used for unaligned pmem_rw_bytes accesses,
    so is unsafe for WB mode.

    Module parameters
    =================
    pmem_cachetype=n	(default 0)
    	Select the cache type (which ioremap function to use to
    	map the NVDIMM memory)
    	0 = UC (uncacheable) - slow writes, slow reads
    	1 = WB (writeback) - fast unsafe writes, fast reads
    	2 = WC (write combining) - fast writes, slow reads
    	3 = WT (writethrough) - slow writes, fast reads

    	WB writes are safe if:
    	* non-temporal stores are exclusively used
    	* clflush instructions are added

    pmem_readscan=n		(default 0)
    	0 = no read scan
    	1 = read the entire memory range, looking to trigger
    	  uncorrectable memory errors

    	The rate is also printed, serving as a quick performance
    	check (uses a 64 byte loop with NT loads).

    pmem_clean=n		(default 0)
    	0 = no clean
    	1 = overwrite the entire memory range, possibly
    	  clearing uncorrectable memory errors (dangerous,
    	  destroys all data)

    	The rate is also printed, serving as a quick performance
    	check (uses a 64 byte loop with NT stores).

    pmem_ntw=n		(default 3)
    	Use non-temporal stores when writing persistent memory

    	0 = memcpy (unsafe for WB)
    	1 = 64 byte loop with NT stores
    	2 = 128 byte loop with NT stores
    	3 = 64 byte loop with NT stores, plus use NT loads from
    	  normal memory (may be better cache usage)
    	4 = 128 byte loop with NT stores, plus use NT loads from
    	  normal memory
    	5 = __copy_from_user (existing kernel function with
    	  8 byte NT instructions)
    	6 = no write at all (nop)(dangerous)
    	7 = 64-byte loop, store only (write garbage)(dangerous)

    pmem_ntr=n		(default 3)
    	Use non-temporal loads when reading persistent memory

    	0 = memcpy
    	1 = 64 byte loop with NT loads
    	2 = 128 byte loop with NT loads
    	3 = 64 byte loop with NT loads, plus use NT stores to
    	  normal memory
    	4 = 128 byte loop with NT loads, plus use NT stores to
    	  normal memory
    	5 = memcpy
    	6 = no load at all (nop)(dangerous)
    	7 = 64-byte loop, load only (return garbage)(dangerous)

    pmem_ntw=6 pmem_ntr=6 exhibits the block layer IOPS limits.

    Signed-off-by: Robert Elliott <elliott@hp.com>
---
 drivers/block/nd/pmem.c | 550 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 539 insertions(+), 11 deletions(-)

diff --git a/drivers/block/nd/pmem.c b/drivers/block/nd/pmem.c
index 7b5cedf1f2a4..f378ef81733f 100644
--- a/drivers/block/nd/pmem.c
+++ b/drivers/block/nd/pmem.c
@@ -26,6 +26,382 @@
 #include <linux/nd.h>
 #include "nd.h"
 
+static int pmem_cachetype;	/* default UC */
+module_param(pmem_cachetype, int, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(pmem_cachetype,
+	"Select cache attribute for pmem driver (0=UC, 1=WB 2=WC 3=WT)");
+
+static int pmem_readscan;
+module_param(pmem_readscan, int, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(pmem_readscan,
+	"Read scan pmem device upon init (trigger ECC errors)");
+
+static int pmem_clean;
+module_param(pmem_clean, int, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(pmem_clean,
+	"Clean pmem device upon init (write garbage, but cleans the ECC)");
+
+static int pmem_ntw = 3;
+module_param(pmem_ntw, int, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(pmem_ntw,
+	"Use non-temporal stores for block writes in pmem (0=memcpy, 1=64 byte NT, 2=128 byte NT, 3=64 dual NT, 4=128 dual NT, 5=copy_from_user, 6=nop, 7=64-byte NT-store only)");
+
+static int pmem_ntr = 3;
+module_param(pmem_ntr, int, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(pmem_ntr,
+	"Use non-temporal loads for block reads in pmem (0=memcpy, 1=64 byte NT, 2=128 byte NT, 3=64 dual NT, 4=128 dual NT, 5=memcpy, 6=nop, 7=64-byte NT-load only)");
+
+/* load: normal, store: non-temporal, loop: 64 bytes */
+static void memcpy_lt_snt_64(void *to, const void *from, size_t size)
+{
+	u64 bs = 64;
+	int i;
+
+	BUG_ON(!IS_ALIGNED(size, bs));
+	BUG_ON(!IS_ALIGNED((u64)to, bs));
+	BUG_ON(!IS_ALIGNED((u64)from, bs));
+
+	for (i = 0; i < size; i += bs) {
+		__asm__ __volatile__ (
+#if 0
+		/* 16-byte SSE instructions */
+		"movdqa (%0), %%xmm0\n"
+		"movdqa 16(%0), %%xmm1\n"
+		"movdqa 32(%0), %%xmm2\n"
+		"movdqa 48(%0), %%xmm3\n"
+		"movntdq %%xmm0, (%1)\n"
+		"movntdq %%xmm1, 16(%1)\n"
+		"movntdq %%xmm2, 32(%1)\n"
+		"movntdq %%xmm3, 48(%1)\n"
+#endif
+		/* 32-byte AVX instructions */
+		"vmovdqa (%0), %%ymm0\n"
+		"vmovdqa 32(%0), %%ymm1\n"
+		"vmovntdq %%ymm0, (%1)\n"
+		"vmovntdq %%ymm1, 32(%1)\n"
+		:
+		: "r" (from), "r" (to)
+		: "memory");
+
+		to += bs;
+		from += bs;
+	}
+
+	__asm__ __volatile__ (
+		" sfence\n" : :
+	);
+}
+
+/* load: skip, store: non-temporal, loop: 64 bytes */
+static void memcpy_lskip_snt_64(void *to, const void *from, size_t size)
+{
+	u64 bs = 64;
+	int i;
+
+	BUG_ON(!IS_ALIGNED(size, bs));
+	BUG_ON(!IS_ALIGNED((u64)to, bs));
+	BUG_ON(!IS_ALIGNED((u64)from, bs));
+
+	for (i = 0; i < size; i += bs) {
+		__asm__ __volatile__ (
+#if 0
+		"movntdq %%xmm0, (%1)\n"
+		"movntdq %%xmm1, 16(%1)\n"
+		"movntdq %%xmm2, 32(%1)\n"
+		"movntdq %%xmm3, 48(%1)\n"
+#endif
+		"vmovntdq %%ymm0, (%1)\n"
+		"vmovntdq %%ymm1, 32(%1)\n"
+		:
+		: "r" (from), "r" (to)
+		: "memory");
+
+		to += bs;
+		from += bs;
+	}
+
+	__asm__ __volatile__ (
+		" sfence\n" : :
+	);
+}
+
+/* load: non-temporal, store: non-temporal, loop: 64 bytes */
+static void memcpy_lnt_snt_64(void *to, const void *from, size_t size)
+{
+	u64 bs = 64;
+	int i;
+
+	BUG_ON(!IS_ALIGNED(size, bs));
+	BUG_ON(!IS_ALIGNED((u64)to, bs));
+	BUG_ON(!IS_ALIGNED((u64)from, bs));
+
+	for (i = 0; i < size; i += bs) {
+		__asm__ __volatile__ (
+#if 0
+		"movntdqa (%0), %%xmm0\n"
+		"movntdqa 16(%0), %%xmm1\n"
+		"movntdqa 32(%0), %%xmm2\n"
+		"movntdqa 48(%0), %%xmm3\n"
+		"movntdq %%xmm0, (%1)\n"
+		"movntdq %%xmm1, 16(%1)\n"
+		"movntdq %%xmm2, 32(%1)\n"
+		"movntdq %%xmm3, 48(%1)\n"
+#endif
+		"vmovntdqa (%0), %%ymm0\n"
+		"vmovntdqa 32(%0), %%ymm1\n"
+		"vmovntdq %%ymm0, (%1)\n"
+		"vmovntdq %%ymm1, 32(%1)\n"
+		:
+		: "r" (from), "r" (to)
+		: "memory");
+
+		to += bs;
+		from += bs;
+	}
+
+	__asm__ __volatile__ (
+		" sfence\n" : :
+	);
+}
+
+/* load: normal, store: non-temporal, loop: 128 bytes */
+static void memcpy_lt_snt_128(void *to, const void *from, size_t size)
+{
+	u64 bs = 128;
+	int i;
+
+	BUG_ON(!IS_ALIGNED(size, bs));
+	BUG_ON(!IS_ALIGNED((u64)to, bs));
+	BUG_ON(!IS_ALIGNED((u64)from, bs));
+
+	for (i = 0; i < size; i += bs) {
+		__asm__ __volatile__ (
+#if 0
+		/* hard to use prefetch effectively */
+		"prefetchnta 128(%0)\n"
+		"prefetchnta 192(%0)\n"
+#endif
+#if 0
+		"movdqa (%0), %%xmm0\n"
+		"movdqa 16(%0), %%xmm1\n"
+		"movdqa 32(%0), %%xmm2\n"
+		"movdqa 48(%0), %%xmm3\n"
+		"movdqa 64(%0), %%xmm4\n"
+		"movdqa 80(%0), %%xmm5\n"
+		"movdqa 96(%0), %%xmm6\n"
+		"movdqa 112(%0), %%xmm7\n"
+		"movntdq %%xmm0, (%1)\n"
+		"movntdq %%xmm1, 16(%1)\n"
+		"movntdq %%xmm2, 32(%1)\n"
+		"movntdq %%xmm3, 48(%1)\n"
+		"movntdq %%xmm4, 64(%1)\n"
+		"movntdq %%xmm5, 80(%1)\n"
+		"movntdq %%xmm6, 96(%1)\n"
+		"movntdq %%xmm7, 112(%1)\n"
+#endif
+		"vmovdqa (%0), %%ymm0\n"
+		"vmovdqa 32(%0), %%ymm1\n"
+		"vmovdqa 64(%0), %%ymm2\n"
+		"vmovdqa 96(%0), %%ymm3\n"
+		"vmovntdq %%ymm0, (%1)\n"
+		"vmovntdq %%ymm1, 32(%1)\n"
+		"vmovntdq %%ymm2, 64(%1)\n"
+		"vmovntdq %%ymm3, 96(%1)\n"
+		:
+		: "r" (from), "r" (to)
+		: "memory");
+
+		to += bs;
+		from += bs;
+	}
+
+	__asm__ __volatile__ (
+		" sfence\n" : :
+	);
+}
+
+/* load: non-temporal, store: non-temporal, loop: 128 bytes */
+static void memcpy_lnt_snt_128(void *to, const void *from, size_t size)
+{
+	u64 bs = 128;
+	int i;
+
+	BUG_ON(!IS_ALIGNED(size, bs));
+	BUG_ON(!IS_ALIGNED((u64)to, bs));
+	BUG_ON(!IS_ALIGNED((u64)from, bs));
+
+	for (i = 0; i < size; i += bs) {
+		__asm__ __volatile__ (
+#if 0
+		"prefetchnta 128(%0)\n"
+		"prefetchnta 192(%0)\n"
+#endif
+#if 0
+		"movntdqa (%0), %%xmm0\n"
+		"movntdqa 16(%0), %%xmm1\n"
+		"movntdqa 32(%0), %%xmm2\n"
+		"movntdqa 48(%0), %%xmm3\n"
+		"movntdqa 64(%0), %%xmm4\n"
+		"movntdqa 80(%0), %%xmm5\n"
+		"movntdqa 96(%0), %%xmm6\n"
+		"movntdqa 112(%0), %%xmm7\n"
+		"movntdq %%xmm0, (%1)\n"
+		"movntdq %%xmm1, 16(%1)\n"
+		"movntdq %%xmm2, 32(%1)\n"
+		"movntdq %%xmm3, 48(%1)\n"
+		"movntdq %%xmm4, 64(%1)\n"
+		"movntdq %%xmm5, 80(%1)\n"
+		"movntdq %%xmm6, 96(%1)\n"
+		"movntdq %%xmm7, 112(%1)\n"
+#endif
+		"vmovntdqa (%0), %%ymm0\n"
+		"vmovntdqa 32(%0), %%ymm1\n"
+		"vmovntdqa 64(%0), %%ymm2\n"
+		"vmovntdqa 96(%0), %%ymm3\n"
+		"vmovntdq %%ymm0, (%1)\n"
+		"vmovntdq %%ymm1, 32(%1)\n"
+		"vmovntdq %%ymm2, 64(%1)\n"
+		"vmovntdq %%ymm3, 96(%1)\n"
+		:
+		: "r" (from), "r" (to)
+		: "memory");
+
+		to += bs;
+		from += bs;
+	}
+
+	__asm__ __volatile__ (
+		" sfence\n" : :
+	);
+}
+
+/* load: non-temporal, store: normal, loop: 64 bytes */
+static void memcpy_lnt_st_64(void *to, const void *from, size_t size)
+{
+	u64 bs = 64;
+	int i;
+
+	BUG_ON(!IS_ALIGNED(size, bs));
+	BUG_ON(!IS_ALIGNED((u64)to, bs));
+	BUG_ON(!IS_ALIGNED((u64)from, bs));
+
+	for (i = 0; i < size; i += bs) {
+		__asm__ __volatile__ (
+#if 0
+		"movntdqa (%0), %%xmm0\n"
+		"movntdqa 16(%0), %%xmm1\n"
+		"movntdqa 32(%0), %%xmm2\n"
+		"movntdqa 48(%0), %%xmm3\n"
+		"movdqa %%xmm0, (%1)\n"
+		"movdqa %%xmm1, 16(%1)\n"
+		"movdqa %%xmm2, 32(%1)\n"
+		"movdqa %%xmm3, 48(%1)\n"
+#endif
+		"vmovntdqa (%0), %%ymm0\n"
+		"vmovntdqa 32(%0), %%ymm1\n"
+		"vmovdqa %%ymm0, (%1)\n"
+		"vmovdqa %%ymm1, 32(%1)\n"
+		:
+		: "r" (from), "r" (to)
+		: "memory");
+
+		to += bs;
+		from += bs;
+	}
+
+	__asm__ __volatile__ (
+		" sfence\n" : :
+	);
+}
+
+/* load: non-temporal, store: skip, loop: 64 bytes */
+static void memcpy_lnt_sskip_64(void *to, const void *from, size_t size)
+{
+	u64 bs = 64;
+	int i;
+
+	BUG_ON(!IS_ALIGNED(size, bs));
+	BUG_ON(!IS_ALIGNED((u64)to, bs));
+	BUG_ON(!IS_ALIGNED((u64)from, bs));
+
+	for (i = 0; i < size; i += bs) {
+		__asm__ __volatile__ (
+#if 0
+		"movntdqa (%0), %%xmm0\n"
+		"movntdqa 16(%0), %%xmm1\n"
+		"movntdqa 32(%0), %%xmm2\n"
+		"movntdqa 48(%0), %%xmm3\n"
+#endif
+		"vmovntdqa (%0), %%ymm0\n"
+		"vmovntdqa 32(%0), %%ymm1\n"
+		:
+		: "r" (from), "r" (to)
+		: "memory");
+
+		to += bs;
+		from += bs;
+	}
+
+	__asm__ __volatile__ (
+		" sfence\n" : :
+	);
+}
+
+/* load: non-temporal, store: normal, loop: 128 bytes */
+static void memcpy_lnt_st_128(void *to, const void *from, size_t size)
+{
+	u64 bs = 128;
+	int i;
+
+	BUG_ON(!IS_ALIGNED(size, bs));
+	BUG_ON(!IS_ALIGNED((u64)to, bs));
+	BUG_ON(!IS_ALIGNED((u64)from, bs));
+
+	for (i = 0; i < size; i += bs) {
+		__asm__ __volatile__ (
+#if 0
+		"prefetchnta 128(%0)\n"
+		"prefetchnta 192(%0)\n"
+#endif
+#if 0
+		"movntdqa (%0), %%xmm0\n"
+		"movntdqa 16(%0), %%xmm1\n"
+		"movntdqa 32(%0), %%xmm2\n"
+		"movntdqa 48(%0), %%xmm3\n"
+		"movntdqa 64(%0), %%xmm4\n"
+		"movntdqa 80(%0), %%xmm5\n"
+		"movntdqa 96(%0), %%xmm6\n"
+		"movntdqa 112(%0), %%xmm7\n"
+		"movdqa %%xmm0, (%1)\n"
+		"movdqa %%xmm1, 16(%1)\n"
+		"movdqa %%xmm2, 32(%1)\n"
+		"movdqa %%xmm3, 48(%1)\n"
+		"movdqa %%xmm4, 64(%1)\n"
+		"movdqa %%xmm5, 80(%1)\n"
+		"movdqa %%xmm6, 96(%1)\n"
+		"movdqa %%xmm7, 112(%1)\n"
+#endif
+		"vmovntdqa (%0), %%ymm0\n"
+		"vmovntdqa 32(%0), %%ymm1\n"
+		"vmovntdqa 64(%0), %%ymm2\n"
+		"vmovntdqa 96(%0), %%ymm3\n"
+		"vmovdqa %%ymm0, (%1)\n"
+		"vmovdqa %%ymm1, 32(%1)\n"
+		"vmovdqa %%ymm2, 64(%1)\n"
+		"vmovdqa %%ymm3, 96(%1)\n"
+		:
+		: "r" (from), "r" (to)
+		: "memory");
+
+		to += bs;
+		from += bs;
+	}
+
+	__asm__ __volatile__ (
+		" sfence\n" : :
+	);
+}
+
 struct pmem_device {
 	struct request_queue	*pmem_queue;
 	struct gendisk		*pmem_disk;
@@ -37,6 +413,81 @@ struct pmem_device {
 	size_t			size;
 };
 
+/* pick the type of memcpy for a read from NVDIMMs */
+static void memcpy_ntr(void *to, const void *from, size_t size)
+{
+	switch (pmem_ntr) {
+	case 1:
+		memcpy_lnt_st_64(to, from, size);
+		break;
+	case 2:
+		memcpy_lnt_st_128(to, from, size);
+		break;
+	case 3:
+		memcpy_lnt_snt_64(to, from, size);
+		break;
+	case 4:
+		memcpy_lnt_snt_128(to, from, size);
+		break;
+	case 6:
+		/* nop */
+		break;
+	case 7:
+		memcpy_lnt_sskip_64(to, from, size);
+		break;
+	default:
+		memcpy(to, from, size);
+		break;
+	}
+}
+
+/* pick the type of memcpy for a write to NVDIMMs */
+static void memcpy_ntw(void *to, const void *from, size_t size)
+{
+	int ret;
+	switch (pmem_ntw) {
+	case 1:
+		memcpy_lt_snt_64(to, from, size);
+		ret = 0;
+		break;
+	case 2:
+		memcpy_lt_snt_128(to, from, size);
+		ret = 0;
+		break;
+	case 3:
+		memcpy_lnt_snt_64(to, from, size);
+		ret = 0;
+		break;
+	case 4:
+		memcpy_lnt_snt_128(to, from, size);
+		ret = 0;
+		break;
+	case 5:
+		ret = __copy_from_user(to, from, size);
+		if (ret)
+			goto exit;
+	case 6:
+		/* nop */
+		ret = 0;
+		break;
+	case 7:
+		memcpy_lskip_snt_64(to, from, size);
+		ret = 0;
+		break;
+	default:
+		memcpy(to, from, size);
+		ret = 0;
+		break;
+	}
+exit:
+	/* if __copy_from_user or other memcpy functions with return
+	 * values are used, the return value should really be
+	 * propagated upstream. Since most memcpys assume success,
+	 * forgo this for now
+	 */
+	return;
+}
+
 static int pmem_major;
 
 static void pmem_do_bvec(struct pmem_device *pmem, struct page *page,
@@ -47,11 +498,11 @@ static void pmem_do_bvec(struct pmem_device *pmem, struct page *page,
 	size_t pmem_off = sector << 9;
 
 	if (rw == READ) {
-		memcpy(mem + off, pmem->virt_addr + pmem_off, len);
+		memcpy_ntr(mem + off, pmem->virt_addr + pmem_off, len);
 		flush_dcache_page(page);
 	} else {
 		flush_dcache_page(page);
-		memcpy(pmem->virt_addr + pmem_off, mem + off, len);
+		memcpy_ntw(pmem->virt_addr + pmem_off, mem + off, len);
 	}
 
 	kunmap_atomic(mem);
@@ -109,10 +560,26 @@ static int pmem_rw_bytes(struct nd_io *ndio, void *buf, size_t offset,
 		return -EFAULT;
 	}
 
-	if (rw == READ)
-		memcpy(buf, pmem->virt_addr + offset, n);
-	else
-		memcpy(pmem->virt_addr + offset, buf, n);
+	/* NOTE: Plain memcpy is used for unaligned accesses, meaning
+	 * this is not safe for WB mode.
+	 *
+	 * All btt accesses come through here; many are not aligned.
+	 */
+	if (rw == READ) {
+		if (IS_ALIGNED((u64) buf, 64) &&
+		    IS_ALIGNED((u64) pmem->virt_addr + offset, 64) &&
+		    IS_ALIGNED(n, 64))
+			memcpy_ntr(buf, pmem->virt_addr + offset, n);
+		else
+			memcpy(buf, pmem->virt_addr + offset, n);
+	} else {
+		if (IS_ALIGNED((u64) buf, 64) &&
+		    IS_ALIGNED((u64) pmem->virt_addr + offset, 64) &&
+		    IS_ALIGNED(n, 64))
+			memcpy_ntw(pmem->virt_addr + offset, buf, n);
+		else
+			memcpy(pmem->virt_addr + offset, buf, n);
+	}
 
 	return 0;
 }
@@ -143,6 +610,7 @@ static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res,
 	struct pmem_device *pmem;
 	struct gendisk *disk;
 	int err;
+	u64 ts, te;
 
 	err = -ENOMEM;
 	pmem = kzalloc(sizeof(*pmem), GFP_KERNEL);
@@ -152,21 +620,78 @@ static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res,
 	pmem->phys_addr = res->start;
 	pmem->size = resource_size(res);
 
+	dev_info(dev,
+		"mapping phys=0x%llx (%lld GiB) size=0x%zx (%ld GiB)\n",
+		pmem->phys_addr, pmem->phys_addr / (1024*1024*1024),
+		pmem->size, pmem->size / (1024*1024*1024));
+
 	err = -EINVAL;
 	if (!request_mem_region(pmem->phys_addr, pmem->size, "pmem")) {
 		dev_warn(dev, "could not reserve region [0x%pa:0x%zx]\n", &pmem->phys_addr, pmem->size);
 		goto out_free_dev;
 	}
 
-	/*
-	 * Map the memory as non-cachable, as we can't write back the contents
-	 * of the CPU caches in case of a crash.
-	 */
 	err = -ENOMEM;
-	pmem->virt_addr = ioremap_nocache(pmem->phys_addr, pmem->size);
+	switch (pmem_cachetype) {
+	case 0: /* UC */
+		pmem->virt_addr = ioremap_nocache(pmem->phys_addr, pmem->size);
+		break;
+	case 1: /* WB */
+		/* WB is unsafe unless system flushes caches on power loss */
+		pmem->virt_addr = ioremap_cache(pmem->phys_addr, pmem->size);
+		break;
+	case 2: /* WC */
+		/* WC is unsafe unless system flushes buffers on power loss */
+		pmem->virt_addr = ioremap_wc(pmem->phys_addr, pmem->size);
+		break;
+	case 3: /* WT */
+	default:
+		pmem->virt_addr = ioremap_wt(pmem->phys_addr, pmem->size);
+		break;
+	}
+
+	dev_info(dev,
+		"mapped: cache_type=%d virt=0x%p phys=0x%llx (%lld GiB) size=0x%zx (%zu GiB)\n",
+		pmem_cachetype,
+		pmem->virt_addr,
+		pmem->phys_addr, pmem->phys_addr / (1024*1024*1024),
+		pmem->size, pmem->size / (1024*1024*1024));
+
 	if (!pmem->virt_addr)
 		goto out_release_region;
 
+	if (pmem_clean) {
+		/* write all of NVDIMM memory to clear any ECC errors */
+		dev_info(dev,
+			"write clean starting: virt=0x%p phys=0x%llx (%lld GiB) size=0x%zx (%zu GiB)\n",
+			pmem->virt_addr,
+			pmem->phys_addr, pmem->phys_addr / (1024*1024*1024),
+			pmem->size, pmem->size / (1024*1024*1024));
+		ts = local_clock();
+		memcpy_lskip_snt_64(pmem->virt_addr, NULL, pmem->size);
+		te = local_clock();
+		dev_info(dev,
+			"write clean complete: ct=%d in %lld GB/s\n",
+			pmem_cachetype,
+			pmem->size / (te - ts));	/* B/ns equals GB/s */
+	}
+
+	/* read all of NVDIMM memory to trigger any ECC errors now */
+	if (pmem_readscan) {
+		dev_info(dev,
+			"read scan starting: virt=0x%p phys=0x%llx (%lld GiB) size=0x%zx (%zu GiB)\n",
+			pmem->virt_addr,
+			pmem->phys_addr, pmem->phys_addr / (1024*1024*1024),
+			pmem->size, pmem->size / (1024*1024*1024));
+		ts = local_clock();
+		memcpy_lnt_sskip_64(NULL, pmem->virt_addr, pmem->size);
+		te = local_clock();
+		dev_info(dev,
+			"read scan complete: ct=%d in %lld GB/s\n",
+			pmem_cachetype,
+			pmem->size / (te - ts));	/* B/ns equals GB/s */
+	}
+
 	pmem->pmem_queue = blk_alloc_queue(GFP_KERNEL);
 	if (!pmem->pmem_queue)
 		goto out_unmap;
@@ -276,6 +801,9 @@ static int __init pmem_init(void)
 {
 	int error;
 
+	pr_info("pmem loading with pmem_readscan=%d pmem_clean=%d pmem_cachetype=%d pmem_ntw=%d pmem_ntr=%d\n",
+		pmem_readscan, pmem_clean, pmem_cachetype, pmem_ntw, pmem_ntr);
+
 	pmem_major = register_blkdev(0, "pmem");
 	if (pmem_major < 0)
 		return pmem_major;
-- 
1.8.3.1
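
For readers following the pmem_rw_bytes() hunk above, here is a minimal
userspace sketch of the two ideas it relies on: the three-way 64-byte
alignment test that gates the non-temporal copy path, and the "B/ns equals
GB/s" arithmetic used in the timing messages. All names below (CACHELINE,
is_cacheline_aligned, can_use_nt_copy, throughput_gb_per_s, demo_checks) are
hypothetical and exist only for illustration. memcpy_ntr() and memcpy_ntw()
are defined elsewhere in the patch and are not reproduced here. The zero
check on the elapsed time is an assumption of this sketch, not something the
patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache-line size; the patch hard-codes 64 in IS_ALIGNED(). */
#define CACHELINE 64u

/* True when x is a multiple of CACHELINE (which must be a power of two). */
bool is_cacheline_aligned(uint64_t x)
{
	return (x & (CACHELINE - 1)) == 0;
}

/*
 * Mirrors the condition in the patch: the non-temporal path is taken only
 * when the destination, the source, and the length are all 64-byte
 * aligned. Anything else falls back to plain memcpy().
 */
bool can_use_nt_copy(const void *dst, const void *src, size_t n)
{
	return is_cacheline_aligned((uintptr_t)dst) &&
	       is_cacheline_aligned((uintptr_t)src) &&
	       is_cacheline_aligned(n);
}

/*
 * Bytes per nanosecond is numerically equal to decimal GB/s
 * (10^9 bytes per 10^9 ns). Returns 0 when no time elapsed; that guard
 * is added by this sketch.
 */
uint64_t throughput_gb_per_s(uint64_t bytes, uint64_t elapsed_ns)
{
	return elapsed_ns ? bytes / elapsed_ns : 0;
}

/* Small usage example; addresses are synthetic and never dereferenced. */
void demo_checks(void)
{
	const void *aligned = (const void *)(uintptr_t)0x1000;
	const void *aligned2 = (const void *)(uintptr_t)0x2040;
	const void *unaligned = (const void *)(uintptr_t)0x2008;

	assert(can_use_nt_copy(aligned, aligned2, 128));
	assert(!can_use_nt_copy(aligned, unaligned, 128));
	assert(!can_use_nt_copy(aligned, aligned2, 100));
	assert(throughput_gb_per_s(8000000000ULL, 2000000000ULL) == 4);
}
```

Since the quotient is an integer, any rate below 1 GB/s is reported as 0.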



end of thread, other threads:[~2015-05-25  7:03 UTC | newest]

Thread overview: 89+ messages
2015-05-20 20:56 [PATCH v3 00/21] libnd: non-volatile memory device support Dan Williams
2015-05-20 20:56 ` Dan Williams
2015-05-20 20:56 ` [PATCH v3 01/21] e820, efi: add ACPI 6.0 persistent memory types Dan Williams
2015-05-20 20:56   ` Dan Williams
2015-05-20 20:56 ` [PATCH v3 02/21] libnd, nfit: initial libnd infrastructure and NFIT support Dan Williams
2015-05-20 20:56   ` Dan Williams
2015-05-21 13:55   ` Toshi Kani
2015-05-21 13:55     ` Toshi Kani
2015-05-21 15:56     ` Dan Williams
2015-05-21 15:56       ` Dan Williams
2015-05-21 17:25       ` Toshi Kani
2015-05-21 17:25         ` Toshi Kani
2015-05-21 17:49         ` Moore, Robert
2015-05-21 17:49           ` Moore, Robert
2015-05-21 18:01           ` Toshi Kani
2015-05-21 18:01             ` Toshi Kani
2015-05-21 19:06             ` Dan Williams
2015-05-21 19:06               ` Dan Williams
2015-05-21 19:44               ` Toshi Kani
2015-05-21 19:44                 ` Toshi Kani
2015-05-21 19:44                 ` Toshi Kani
2015-05-21 19:59                 ` Toshi Kani
2015-05-21 19:59                   ` Toshi Kani
2015-05-21 19:59                   ` Toshi Kani
2015-05-21 20:59                   ` Linda Knippers
2015-05-21 20:59                     ` Linda Knippers
2015-05-21 20:59                     ` Linda Knippers
2015-05-21 21:34                     ` Dan Williams
2015-05-21 21:34                       ` Dan Williams
2015-05-21 21:34                       ` Dan Williams
2015-05-21 22:11                       ` Toshi Kani
2015-05-21 22:11                         ` Toshi Kani
2015-05-22 14:58                       ` Moore, Robert
2015-05-22 14:58                         ` Moore, Robert
2015-05-22 15:21                         ` Toshi Kani
2015-05-22 15:21                           ` Toshi Kani
2015-05-22 16:12                           ` Moore, Robert
2015-05-22 16:12                             ` Moore, Robert
2015-05-20 20:56 ` [PATCH v3 03/21] libnd: control character device and libnd bus sysfs attributes Dan Williams
2015-05-20 20:56   ` Dan Williams
2015-05-20 20:56 ` [PATCH v3 04/21] libnd, nfit: dimm/memory-devices Dan Williams
2015-05-20 20:56   ` Dan Williams
2015-05-20 20:56 ` [PATCH v3 05/21] libnd: control (ioctl) messages for libnd bus and dimm devices Dan Williams
2015-05-20 20:56   ` Dan Williams
2015-05-20 20:56 ` [PATCH v3 06/21] libnd, nd_dimm: dimm driver and base libnd device-driver infrastructure Dan Williams
2015-05-20 20:56   ` Dan Williams
2015-05-20 20:56 ` [PATCH v3 07/21] libnd, nfit: regions (block-data-window, persistent memory, volatile memory) Dan Williams
2015-05-20 20:56   ` Dan Williams
2015-05-20 20:56 ` [PATCH v3 08/21] libnd: support for legacy (non-aliasing) nvdimms Dan Williams
2015-05-20 20:56   ` Dan Williams
2015-05-20 20:57 ` [PATCH v3 09/21] libnd, nd_pmem: add libnd support to the pmem driver Dan Williams
2015-05-20 20:57   ` Dan Williams
2015-05-23 14:39   ` Christoph Hellwig
2015-05-23 16:59     ` Dan Williams
2015-05-20 20:57 ` [PATCH v3 10/21] pmem: Dynamically allocate partition numbers Dan Williams
2015-05-20 20:57   ` Dan Williams
2015-05-20 20:57 ` [PATCH v3 11/21] libnd, nfit: add interleave-set state-tracking infrastructure Dan Williams
2015-05-20 20:57   ` Dan Williams
2015-05-20 20:57 ` [PATCH v3 12/21] libnd: namespace indices: read and validate Dan Williams
2015-05-20 20:57   ` Dan Williams
2015-05-20 20:57 ` [PATCH v3 13/21] libnd: pmem label sets and namespace instantiation Dan Williams
2015-05-20 20:57   ` Dan Williams
2015-05-20 20:57 ` [PATCH v3 14/21] libnd: blk labels " Dan Williams
2015-05-20 20:57   ` Dan Williams
2015-05-22 18:37   ` Elliott, Robert (Server Storage)
2015-05-22 18:37     ` Elliott, Robert (Server Storage)
2015-05-22 18:37     ` Elliott, Robert (Server Storage)
2015-05-22 18:51     ` Dan Williams
2015-05-22 18:51       ` Dan Williams
2015-05-20 20:57 ` [PATCH v3 15/21] libnd: write pmem label set Dan Williams
2015-05-20 20:57   ` Dan Williams
2015-05-20 20:57 ` [PATCH v3 16/21] libnd: write blk " Dan Williams
2015-05-20 20:57   ` Dan Williams
2015-05-20 20:57 ` [PATCH v3 17/21] libnd: infrastructure for btt devices Dan Williams
2015-05-20 20:57   ` Dan Williams
2015-05-20 20:57 ` [PATCH v3 18/21] nd_btt: atomic sector updates Dan Williams
2015-05-20 20:57   ` Dan Williams
2015-05-22 21:16   ` Elliott, Robert (Server Storage)
2015-05-22 21:16     ` Elliott, Robert (Server Storage)
2015-05-22 21:39     ` Dan Williams
2015-05-22 21:39       ` Dan Williams
2015-05-20 20:57 ` [PATCH v3 19/21] libnd, nfit, nd_blk: driver for BLK-mode access persistent memory Dan Williams
2015-05-20 20:57   ` Dan Williams
2015-05-20 20:58 ` [PATCH v3 20/21] nfit-test: manufactured NFITs for interface development Dan Williams
2015-05-20 20:58   ` Dan Williams
2015-05-25  7:02   ` Elliott, Robert (Server Storage)
2015-05-25  7:02     ` Elliott, Robert (Server Storage)
2015-05-20 20:58 ` [PATCH v3 21/21] libnd: Non-Volatile Devices Dan Williams
2015-05-20 20:58   ` Dan Williams
