* [PATCH v3 00/15] block, dax updates for 4.4
From: Dan Williams @ 2015-11-02  4:29 UTC (permalink / raw)
  To: axboe
  Cc: Jens Axboe, Dave Hansen, jack, linux-nvdimm, Richard Weinberger,
	Jeff Dike, Dave Hansen, david, linux-kernel, stable, hch,
	Jeff Moyer, Al Viro, Jan Kara, Paolo Bonzini, Andrew Morton,
	kbuild test robot, ross.zwisler, Matthew Wilcox,
	Christoffer Dall

Changes since v2: [1]

1/ Include a fix to revoke dax-mappings at device teardown time

2/ Include a blkdev_issue_flush implementation for pmem

3/ New block device ioctls to set/query dax mode, making dax opt-in (see the usage sketch below)

4/ Collect reviewed-by's

[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-October/002538.html
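
A hypothetical userspace sketch of the set/query interface from 3/ above;
the BLKDAXSET/BLKDAXGET names, ioctl numbers, and argument convention are
assumptions for illustration, not a settled ABI, and /dev/pmem0 is an
example device:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

#ifndef BLKDAXSET
#define BLKDAXSET _IO(0x12, 128)        /* assumed name and number */
#define BLKDAXGET _IO(0x12, 129)        /* assumed name and number */
#endif

int main(void)
{
        unsigned long enable = 1, mode = 0;
        int fd = open("/dev/pmem0", O_RDWR);    /* example device */

        if (fd < 0)
                return 1;
        if (ioctl(fd, BLKDAXSET, enable))       /* opt this bdev in to dax */
                perror("BLKDAXSET");
        if (ioctl(fd, BLKDAXGET, &mode) == 0)   /* query the current dax mode */
                printf("dax enabled: %lu\n", mode);
        close(fd);
        return 0;
}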

---

There were several topics developed during this cycle; some are ready
and some are not.  The items needing more review and testing are the
pmem+dax updates with mm, ext4, and xfs dependencies: the dax locking
reworks, dax get_user_pages, and write-protecting dax ptes after
fsync/msync.

Instead, this more conservative set represents the fs- and mm-independent
updates to dax and pmem:

1/ Enable dax mappings for raw block devices

2/ Use blk_queue_{enter|exit} and the availability of 'struct page' to
   fix lifetime issues and races with unloading the pmem driver.

3/ Enable blkdev_issue_flush to fix legacy apps that may dirty pmem via
   dax mmap I/O, although legacy environments should consider disabling
   dax for apps that do not expect it (a usage sketch follows below).
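
A minimal sketch of the legacy pattern 3/ refers to (illustrative only,
not part of the series; /dev/pmem0 is an example path and assumes dax is
enabled for the device).  The app stores through a dax mmap of the raw
block device and relies on fsync(), which reaches the driver via
blkdev_issue_flush():

#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
        int fd = open("/dev/pmem0", O_RDWR);
        char *buf;

        if (fd < 0)
                return 1;
        buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (buf == MAP_FAILED)
                return 1;
        memset(buf, 0xa5, 4096);  /* stores land directly in pmem via dax */
        fsync(fd);                /* reaches pmem via blkdev_issue_flush() */
        munmap(buf, 4096);
        close(fd);
        return 0;
}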

These depend on the latest 'libnvdimm-for-next' branch from nvdimm.git
and the 'for-4.4/integrity' branch from Jens' tree.  All but the last
three have been out for review previously, and I am looking to submit
them towards the back half of the merge window.

---

Dan Williams (15):
      pmem, dax: clean up clear_pmem()
      dax: increase granularity of dax_clear_blocks() operations
      block, dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()
      libnvdimm, pmem: move request_queue allocation earlier in probe
      libnvdimm, pmem: fix size trim in pmem_direct_access()
      um: kill pfn_t
      kvm: rename pfn_t to kvm_pfn_t
      mm, dax, pmem: introduce pfn_t
      block: notify queue death confirmation
      dax, pmem: introduce zone_device_revoke() and devm_memunmap_pages()
      block: introduce bdev_file_inode()
      block: enable dax for raw block devices
      block, dax: make dax mappings opt-in by default
      dax: dirty extent notification
      pmem: blkdev_issue_flush support


 arch/arm/include/asm/kvm_mmu.h        |    5 -
 arch/arm/kvm/mmu.c                    |   10 +
 arch/arm64/include/asm/kvm_mmu.h      |    3 
 arch/mips/include/asm/kvm_host.h      |    6 -
 arch/mips/kvm/emulate.c               |    2 
 arch/mips/kvm/tlb.c                   |   14 +-
 arch/powerpc/include/asm/kvm_book3s.h |    4 -
 arch/powerpc/include/asm/kvm_ppc.h    |    2 
 arch/powerpc/kvm/book3s.c             |    6 -
 arch/powerpc/kvm/book3s_32_mmu_host.c |    2 
 arch/powerpc/kvm/book3s_64_mmu_host.c |    2 
 arch/powerpc/kvm/e500.h               |    2 
 arch/powerpc/kvm/e500_mmu_host.c      |    8 +
 arch/powerpc/kvm/trace_pr.h           |    2 
 arch/powerpc/sysdev/axonram.c         |   11 +
 arch/um/include/asm/page.h            |    6 -
 arch/um/include/asm/pgtable-3level.h  |    4 -
 arch/um/include/asm/pgtable.h         |    2 
 arch/x86/include/asm/cacheflush.h     |    4 +
 arch/x86/include/asm/pmem.h           |    7 -
 arch/x86/kvm/iommu.c                  |   11 +
 arch/x86/kvm/mmu.c                    |   37 +++--
 arch/x86/kvm/mmu_audit.c              |    2 
 arch/x86/kvm/paging_tmpl.h            |    6 -
 arch/x86/kvm/vmx.c                    |    2 
 arch/x86/kvm/x86.c                    |    2 
 block/blk-core.c                      |   13 +-
 block/blk-mq.c                        |   19 ++-
 block/blk.h                           |   13 --
 block/ioctl.c                         |   42 ++++++
 drivers/block/brd.c                   |   10 +
 drivers/nvdimm/pmem.c                 |  247 ++++++++++++++++++++++++++++-----
 drivers/s390/block/dcssblk.c          |   13 +-
 fs/block_dev.c                        |  159 ++++++++++++++++++---
 fs/dax.c                              |  242 ++++++++++++++++++++++----------
 include/linux/blkdev.h                |   41 +++++
 include/linux/fs.h                    |   11 +
 include/linux/io.h                    |   17 --
 include/linux/kvm_host.h              |   37 +++--
 include/linux/kvm_types.h             |    2 
 include/linux/mm.h                    |   90 ++++++++++++
 include/linux/pfn.h                   |    9 +
 include/uapi/linux/fs.h               |    2 
 kernel/memremap.c                     |   98 +++++++++++++
 virt/kvm/kvm_main.c                   |   47 +++---
 45 files changed, 949 insertions(+), 325 deletions(-)

* [PATCH v3 01/15] pmem, dax: clean up clear_pmem()
From: Dan Williams @ 2015-11-02  4:29 UTC (permalink / raw)
  To: axboe
  Cc: jack, linux-nvdimm, Dave Hansen, david, linux-kernel, Jeff Moyer,
	ross.zwisler, hch

Both __dax_pmd_fault() and clear_pmem() were taking special steps to
clear memory a page at a time to take advantage of non-temporal
clear_page() implementations.  However, x86_64 does not use
non-temporal instructions for clear_page(), and arch_clear_pmem() was
always incurring the cost of __arch_wb_cache_pmem().

Clean up the assumption that doing clear_pmem() a page at a time is more
performant.

Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/include/asm/pmem.h |    7 +------
 fs/dax.c                    |    4 +---
 2 files changed, 2 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index d8ce3ec816ab..1544fabcd7f9 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -132,12 +132,7 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size)
 {
 	void *vaddr = (void __force *)addr;
 
-	/* TODO: implement the zeroing via non-temporal writes */
-	if (size == PAGE_SIZE && ((unsigned long)vaddr & ~PAGE_MASK) == 0)
-		clear_page(vaddr);
-	else
-		memset(vaddr, 0, size);
-
+	memset(vaddr, 0, size);
 	__arch_wb_cache_pmem(vaddr, size);
 }
 
diff --git a/fs/dax.c b/fs/dax.c
index a86d3cc2b389..5dc33d788d50 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -623,9 +623,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 			goto fallback;
 
 		if (buffer_unwritten(&bh) || buffer_new(&bh)) {
-			int i;
-			for (i = 0; i < PTRS_PER_PMD; i++)
-				clear_pmem(kaddr + i * PAGE_SIZE, PAGE_SIZE);
+			clear_pmem(kaddr, HPAGE_SIZE);
 			wmb_pmem();
 			count_vm_event(PGMAJFAULT);
 			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);


* [PATCH v3 02/15] dax: increase granularity of dax_clear_blocks() operations
From: Dan Williams @ 2015-11-02  4:29 UTC (permalink / raw)
  To: axboe
  Cc: jack, linux-nvdimm, david, linux-kernel, Jeff Moyer, Jan Kara,
	ross.zwisler, hch

dax_clear_blocks() currently performs a cond_resched() after every
PAGE_SIZE memset.  We need not check so frequently; for example, md-raid
only calls cond_resched() at stripe granularity.  Also, in preparation
for introducing a dax_map_atomic() operation that temporarily pins a dax
mapping, move the call to cond_resched() to the outer loop.

Reviewed-by: Jan Kara <jack@suse.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c |   27 ++++++++++++---------------
 1 file changed, 12 insertions(+), 15 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 5dc33d788d50..f8e543839e5c 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -28,6 +28,7 @@
 #include <linux/sched.h>
 #include <linux/uio.h>
 #include <linux/vmstat.h>
+#include <linux/sizes.h>
 
 int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 {
@@ -38,24 +39,20 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 	do {
 		void __pmem *addr;
 		unsigned long pfn;
-		long count;
+		long count, sz;
 
-		count = bdev_direct_access(bdev, sector, &addr, &pfn, size);
+		sz = min_t(long, size, SZ_1M);
+		count = bdev_direct_access(bdev, sector, &addr, &pfn, sz);
 		if (count < 0)
 			return count;
-		BUG_ON(size < count);
-		while (count > 0) {
-			unsigned pgsz = PAGE_SIZE - offset_in_page(addr);
-			if (pgsz > count)
-				pgsz = count;
-			clear_pmem(addr, pgsz);
-			addr += pgsz;
-			size -= pgsz;
-			count -= pgsz;
-			BUG_ON(pgsz & 511);
-			sector += pgsz / 512;
-			cond_resched();
-		}
+		if (count < sz)
+			sz = count;
+		clear_pmem(addr, sz);
+		addr += sz;
+		size -= sz;
+		BUG_ON(sz & 511);
+		sector += sz / 512;
+		cond_resched();
 	} while (size);
 
 	wmb_pmem();


* [PATCH v3 03/15] block, dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()
From: Dan Williams @ 2015-11-02  4:29 UTC (permalink / raw)
  To: axboe
  Cc: Jens Axboe, jack, linux-nvdimm, david, linux-kernel, Jeff Moyer,
	Jan Kara, ross.zwisler, hch

The DAX implementation needs to protect new calls to ->direct_access()
and usage of its return value against unbind of the underlying block
device.  Use blk_queue_enter()/blk_queue_exit() to either prevent
blk_cleanup_queue() from proceeding, or fail the dax_map_atomic() if the
request_queue is being torn down.

Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 block/blk.h            |    2 -
 fs/dax.c               |  165 ++++++++++++++++++++++++++++++++----------------
 include/linux/blkdev.h |    2 +
 3 files changed, 112 insertions(+), 57 deletions(-)

diff --git a/block/blk.h b/block/blk.h
index 157c93d54dc9..dc7d9411fa45 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -72,8 +72,6 @@ void blk_dequeue_request(struct request *rq);
 void __blk_queue_free_tags(struct request_queue *q);
 bool __blk_end_bidi_request(struct request *rq, int error,
 			    unsigned int nr_bytes, unsigned int bidi_bytes);
-int blk_queue_enter(struct request_queue *q, gfp_t gfp);
-void blk_queue_exit(struct request_queue *q);
 void blk_freeze_queue(struct request_queue *q);
 
 static inline void blk_queue_enter_live(struct request_queue *q)
diff --git a/fs/dax.c b/fs/dax.c
index f8e543839e5c..a480729c00ec 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -30,6 +30,40 @@
 #include <linux/vmstat.h>
 #include <linux/sizes.h>
 
+static void __pmem *__dax_map_atomic(struct block_device *bdev, sector_t sector,
+		long size, unsigned long *pfn, long *len)
+{
+	long rc;
+	void __pmem *addr;
+	struct request_queue *q = bdev->bd_queue;
+
+	if (blk_queue_enter(q, GFP_NOWAIT) != 0)
+		return (void __pmem *) ERR_PTR(-EIO);
+	rc = bdev_direct_access(bdev, sector, &addr, pfn, size);
+	if (len)
+		*len = rc;
+	if (rc < 0) {
+		blk_queue_exit(q);
+		return (void __pmem *) ERR_PTR(rc);
+	}
+	return addr;
+}
+
+static void __pmem *dax_map_atomic(struct block_device *bdev, sector_t sector,
+		long size)
+{
+	unsigned long pfn;
+
+	return __dax_map_atomic(bdev, sector, size, &pfn, NULL);
+}
+
+static void dax_unmap_atomic(struct block_device *bdev, void __pmem *addr)
+{
+	if (IS_ERR(addr))
+		return;
+	blk_queue_exit(bdev->bd_queue);
+}
+
 int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 {
 	struct block_device *bdev = inode->i_sb->s_bdev;
@@ -42,9 +76,9 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 		long count, sz;
 
 		sz = min_t(long, size, SZ_1M);
-		count = bdev_direct_access(bdev, sector, &addr, &pfn, sz);
-		if (count < 0)
-			return count;
+		addr = __dax_map_atomic(bdev, sector, size, &pfn, &count);
+		if (IS_ERR(addr))
+			return PTR_ERR(addr);
 		if (count < sz)
 			sz = count;
 		clear_pmem(addr, sz);
@@ -52,6 +86,7 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 		size -= sz;
 		BUG_ON(sz & 511);
 		sector += sz / 512;
+		dax_unmap_atomic(bdev, addr);
 		cond_resched();
 	} while (size);
 
@@ -60,14 +95,6 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 }
 EXPORT_SYMBOL_GPL(dax_clear_blocks);
 
-static long dax_get_addr(struct buffer_head *bh, void __pmem **addr,
-		unsigned blkbits)
-{
-	unsigned long pfn;
-	sector_t sector = bh->b_blocknr << (blkbits - 9);
-	return bdev_direct_access(bh->b_bdev, sector, addr, &pfn, bh->b_size);
-}
-
 /* the clear_pmem() calls are ordered by a wmb_pmem() in the caller */
 static void dax_new_buf(void __pmem *addr, unsigned size, unsigned first,
 		loff_t pos, loff_t end)
@@ -97,19 +124,30 @@ static bool buffer_size_valid(struct buffer_head *bh)
 	return bh->b_state != 0;
 }
 
+
+static sector_t to_sector(const struct buffer_head *bh,
+		const struct inode *inode)
+{
+	sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
+
+	return sector;
+}
+
 static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 		      loff_t start, loff_t end, get_block_t get_block,
 		      struct buffer_head *bh)
 {
-	ssize_t retval = 0;
-	loff_t pos = start;
-	loff_t max = start;
-	loff_t bh_max = start;
-	void __pmem *addr;
+	loff_t pos = start, max = start, bh_max = start;
+	struct block_device *bdev = NULL;
+	int rw = iov_iter_rw(iter), rc;
+	long map_len = 0;
+	unsigned long pfn;
+	void __pmem *addr = NULL;
+	void __pmem *kmap = (void __pmem *) ERR_PTR(-EIO);
 	bool hole = false;
 	bool need_wmb = false;
 
-	if (iov_iter_rw(iter) != WRITE)
+	if (rw == READ)
 		end = min(end, i_size_read(inode));
 
 	while (pos < end) {
@@ -124,13 +162,13 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 			if (pos == bh_max) {
 				bh->b_size = PAGE_ALIGN(end - pos);
 				bh->b_state = 0;
-				retval = get_block(inode, block, bh,
-						   iov_iter_rw(iter) == WRITE);
-				if (retval)
+				rc = get_block(inode, block, bh, rw == WRITE);
+				if (rc)
 					break;
 				if (!buffer_size_valid(bh))
 					bh->b_size = 1 << blkbits;
 				bh_max = pos - first + bh->b_size;
+				bdev = bh->b_bdev;
 			} else {
 				unsigned done = bh->b_size -
 						(bh_max - (pos - first));
@@ -138,21 +176,27 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 				bh->b_size -= done;
 			}
 
-			hole = iov_iter_rw(iter) != WRITE && !buffer_written(bh);
+			hole = rw == READ && !buffer_written(bh);
 			if (hole) {
 				addr = NULL;
 				size = bh->b_size - first;
 			} else {
-				retval = dax_get_addr(bh, &addr, blkbits);
-				if (retval < 0)
+				dax_unmap_atomic(bdev, kmap);
+				kmap = __dax_map_atomic(bdev,
+						to_sector(bh, inode),
+						bh->b_size, &pfn, &map_len);
+				if (IS_ERR(kmap)) {
+					rc = PTR_ERR(kmap);
 					break;
+				}
+				addr = kmap;
 				if (buffer_unwritten(bh) || buffer_new(bh)) {
-					dax_new_buf(addr, retval, first, pos,
-									end);
+					dax_new_buf(addr, map_len, first, pos,
+							end);
 					need_wmb = true;
 				}
 				addr += first;
-				size = retval - first;
+				size = map_len - first;
 			}
 			max = min(pos + size, end);
 		}
@@ -175,8 +219,9 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 
 	if (need_wmb)
 		wmb_pmem();
+	dax_unmap_atomic(bdev, kmap);
 
-	return (pos == start) ? retval : pos - start;
+	return (pos == start) ? rc : pos - start;
 }
 
 /**
@@ -265,28 +310,31 @@ static int dax_load_hole(struct address_space *mapping, struct page *page,
 	return VM_FAULT_LOCKED;
 }
 
-static int copy_user_bh(struct page *to, struct buffer_head *bh,
-			unsigned blkbits, unsigned long vaddr)
+static int copy_user_bh(struct page *to, struct inode *inode,
+		struct buffer_head *bh, unsigned long vaddr)
 {
+	struct block_device *bdev = bh->b_bdev;
 	void __pmem *vfrom;
 	void *vto;
 
-	if (dax_get_addr(bh, &vfrom, blkbits) < 0)
-		return -EIO;
+	vfrom = dax_map_atomic(bdev, to_sector(bh, inode), bh->b_size);
+	if (IS_ERR(vfrom))
+		return PTR_ERR(vfrom);
 	vto = kmap_atomic(to);
 	copy_user_page(vto, (void __force *)vfrom, vaddr, to);
 	kunmap_atomic(vto);
+	dax_unmap_atomic(bdev, vfrom);
 	return 0;
 }
 
 static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 			struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	struct address_space *mapping = inode->i_mapping;
-	sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
 	unsigned long vaddr = (unsigned long)vmf->virtual_address;
-	void __pmem *addr;
+	struct address_space *mapping = inode->i_mapping;
+	struct block_device *bdev = bh->b_bdev;
 	unsigned long pfn;
+	void __pmem *addr;
 	pgoff_t size;
 	int error;
 
@@ -305,11 +353,10 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 		goto out;
 	}
 
-	error = bdev_direct_access(bh->b_bdev, sector, &addr, &pfn, bh->b_size);
-	if (error < 0)
-		goto out;
-	if (error < PAGE_SIZE) {
-		error = -EIO;
+	addr = __dax_map_atomic(bdev, to_sector(bh, inode), bh->b_size,
+			&pfn, NULL);
+	if (IS_ERR(addr)) {
+		error = PTR_ERR(addr);
 		goto out;
 	}
 
@@ -317,6 +364,7 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 		clear_pmem(addr, PAGE_SIZE);
 		wmb_pmem();
 	}
+	dax_unmap_atomic(bdev, addr);
 
 	error = vm_insert_mixed(vma, vaddr, pfn);
 
@@ -412,7 +460,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	if (vmf->cow_page) {
 		struct page *new_page = vmf->cow_page;
 		if (buffer_written(&bh))
-			error = copy_user_bh(new_page, &bh, blkbits, vaddr);
+			error = copy_user_bh(new_page, inode, &bh, vaddr);
 		else
 			clear_user_highpage(new_page, vaddr);
 		if (error)
@@ -524,11 +572,9 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	unsigned blkbits = inode->i_blkbits;
 	unsigned long pmd_addr = address & PMD_MASK;
 	bool write = flags & FAULT_FLAG_WRITE;
-	long length;
-	void __pmem *kaddr;
+	struct block_device *bdev;
 	pgoff_t size, pgoff;
-	sector_t block, sector;
-	unsigned long pfn;
+	sector_t block;
 	int result = 0;
 
 	/* Fall back to PTEs if we're going to COW */
@@ -552,9 +598,9 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	block = (sector_t)pgoff << (PAGE_SHIFT - blkbits);
 
 	bh.b_size = PMD_SIZE;
-	length = get_block(inode, block, &bh, write);
-	if (length)
+	if (get_block(inode, block, &bh, write) != 0)
 		return VM_FAULT_SIGBUS;
+	bdev = bh.b_bdev;
 	i_mmap_lock_read(mapping);
 
 	/*
@@ -609,15 +655,20 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		result = VM_FAULT_NOPAGE;
 		spin_unlock(ptl);
 	} else {
-		sector = bh.b_blocknr << (blkbits - 9);
-		length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn,
-						bh.b_size);
-		if (length < 0) {
+		long length;
+		unsigned long pfn;
+		void __pmem *kaddr = __dax_map_atomic(bdev,
+				to_sector(&bh, inode), HPAGE_SIZE, &pfn,
+				&length);
+
+		if (IS_ERR(kaddr)) {
 			result = VM_FAULT_SIGBUS;
 			goto out;
 		}
-		if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR))
+		if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR)) {
+			dax_unmap_atomic(bdev, kaddr);
 			goto fallback;
+		}
 
 		if (buffer_unwritten(&bh) || buffer_new(&bh)) {
 			clear_pmem(kaddr, HPAGE_SIZE);
@@ -626,6 +677,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
 			result |= VM_FAULT_MAJOR;
 		}
+		dax_unmap_atomic(bdev, kaddr);
 
 		result |= vmf_insert_pfn_pmd(vma, address, pmd, pfn, write);
 	}
@@ -729,12 +781,15 @@ int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
 	if (err < 0)
 		return err;
 	if (buffer_written(&bh)) {
-		void __pmem *addr;
-		err = dax_get_addr(&bh, &addr, inode->i_blkbits);
-		if (err < 0)
-			return err;
+		struct block_device *bdev = bh.b_bdev;
+		void __pmem *addr = dax_map_atomic(bdev, to_sector(&bh, inode),
+				PAGE_CACHE_SIZE);
+
+		if (IS_ERR(addr))
+			return PTR_ERR(addr);
 		clear_pmem(addr + offset, length);
 		wmb_pmem();
+		dax_unmap_atomic(bdev, addr);
 	}
 
 	return 0;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index cf57884db4b7..59a770dad804 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -792,6 +792,8 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 			 struct scsi_ioctl_command __user *);
 
+extern int blk_queue_enter(struct request_queue *q, gfp_t gfp);
+extern void blk_queue_exit(struct request_queue *q);
 extern void blk_start_queue(struct request_queue *q);
 extern void blk_stop_queue(struct request_queue *q);
 extern void blk_sync_queue(struct request_queue *q);


* [PATCH v3 04/15] libnvdimm, pmem: move request_queue allocation earlier in probe
From: Dan Williams @ 2015-11-02  4:30 UTC (permalink / raw)
  To: axboe; +Cc: jack, linux-nvdimm, david, linux-kernel, ross.zwisler, hch

Before the dynamically allocated struct pages from devm_memremap_pages()
can be put to use outside the driver, we need a mechanism to track
whether they are still in use at teardown.  Towards that goal, reorder
the initialization sequence to allow the 'q_usage_counter' from the
request_queue to be used by the devm_memremap_pages() implementation (in
subsequent patches).

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/nvdimm/pmem.c |   37 ++++++++++++++++++++++---------------
 1 file changed, 22 insertions(+), 15 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 349f03e7ed06..e46988fbdee5 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -133,6 +133,7 @@ static struct pmem_device *pmem_alloc(struct device *dev,
 		struct resource *res, int id)
 {
 	struct pmem_device *pmem;
+	struct request_queue *q;
 
 	pmem = devm_kzalloc(dev, sizeof(*pmem), GFP_KERNEL);
 	if (!pmem)
@@ -150,16 +151,23 @@ static struct pmem_device *pmem_alloc(struct device *dev,
 		return ERR_PTR(-EBUSY);
 	}
 
-	if (pmem_should_map_pages(dev))
+	q = blk_alloc_queue_node(GFP_KERNEL, dev_to_node(dev));
+	if (!q)
+		return ERR_PTR(-ENOMEM);
+
+	if (pmem_should_map_pages(dev)) {
 		pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, res);
-	else
+	} else
 		pmem->virt_addr = (void __pmem *) devm_memremap(dev,
 				pmem->phys_addr, pmem->size,
 				ARCH_MEMREMAP_PMEM);
 
-	if (IS_ERR(pmem->virt_addr))
+	if (IS_ERR(pmem->virt_addr)) {
+		blk_cleanup_queue(q);
 		return (void __force *) pmem->virt_addr;
+	}
 
+	pmem->pmem_queue = q;
 	return pmem;
 }
 
@@ -179,10 +187,6 @@ static int pmem_attach_disk(struct device *dev,
 	int nid = dev_to_node(dev);
 	struct gendisk *disk;
 
-	pmem->pmem_queue = blk_alloc_queue_node(GFP_KERNEL, nid);
-	if (!pmem->pmem_queue)
-		return -ENOMEM;
-
 	blk_queue_make_request(pmem->pmem_queue, pmem_make_request);
 	blk_queue_physical_block_size(pmem->pmem_queue, PAGE_SIZE);
 	blk_queue_max_hw_sectors(pmem->pmem_queue, UINT_MAX);
@@ -400,19 +404,22 @@ static int nd_pmem_probe(struct device *dev)
 	dev_set_drvdata(dev, pmem);
 	ndns->rw_bytes = pmem_rw_bytes;
 
-	if (is_nd_btt(dev))
+	if (is_nd_btt(dev)) {
+		/* btt allocates its own request_queue */
+		blk_cleanup_queue(pmem->pmem_queue);
+		pmem->pmem_queue = NULL;
 		return nvdimm_namespace_attach_btt(ndns);
+	}
 
 	if (is_nd_pfn(dev))
 		return nvdimm_namespace_attach_pfn(ndns);
 
-	if (nd_btt_probe(ndns, pmem) == 0) {
-		/* we'll come back as btt-pmem */
-		return -ENXIO;
-	}
-
-	if (nd_pfn_probe(ndns, pmem) == 0) {
-		/* we'll come back as pfn-pmem */
+	if (nd_btt_probe(ndns, pmem) == 0 || nd_pfn_probe(ndns, pmem) == 0) {
+		/*
+		 * We'll come back as either btt-pmem, or pfn-pmem, so
+		 * drop the queue allocation for now.
+		 */
+		blk_cleanup_queue(pmem->pmem_queue);
 		return -ENXIO;
 	}
 


* [PATCH v3 05/15] libnvdimm, pmem: fix size trim in pmem_direct_access()
From: Dan Williams @ 2015-11-02  4:30 UTC (permalink / raw)
  To: axboe; +Cc: jack, linux-nvdimm, david, linux-kernel, stable, ross.zwisler, hch

This masking prevents access to the end of the device via dax_do_io(),
and is unnecessary as arch_add_memory() would have rejected an unaligned
allocation.

Cc: <stable@vger.kernel.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/nvdimm/pmem.c |   17 +++--------------
 1 file changed, 3 insertions(+), 14 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index e46988fbdee5..93472953e231 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -100,26 +100,15 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
 }
 
 static long pmem_direct_access(struct block_device *bdev, sector_t sector,
-		      void __pmem **kaddr, unsigned long *pfn)
+		      void __pmem **kaddr, pfn_t *pfn)
 {
 	struct pmem_device *pmem = bdev->bd_disk->private_data;
 	resource_size_t offset = sector * 512 + pmem->data_offset;
-	resource_size_t size;
 
-	if (pmem->data_offset) {
-		/*
-		 * Limit the direct_access() size to what is covered by
-		 * the memmap
-		 */
-		size = (pmem->size - offset) & ~ND_PFN_MASK;
-	} else
-		size = pmem->size - offset;
-
-	/* FIXME convert DAX to comprehend that this mapping has a lifetime */
 	*kaddr = pmem->virt_addr + offset;
-	*pfn = (pmem->phys_addr + offset) >> PAGE_SHIFT;
+	*pfn = __phys_to_pfn(pmem->phys_addr + offset, pmem->pfn_flags);
 
-	return size;
+	return pmem->size - offset;
 }
 
 static const struct block_device_operations pmem_fops = {


* [PATCH v3 06/15] um: kill pfn_t
From: Dan Williams @ 2015-11-02  4:30 UTC (permalink / raw)
  To: axboe
  Cc: Dave Hansen, jack, linux-nvdimm, Richard Weinberger, Jeff Dike,
	david, linux-kernel, ross.zwisler, hch

The core has developed a need for a "pfn_t" type [1].  Convert the usage
of pfn_t by usermode-linux to an unsigned long, and update pfn_to_phys()
to drop its expectation of a typed pfn.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html

Cc: Dave Hansen <dave@sr71.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/um/include/asm/page.h           |    6 +++---
 arch/um/include/asm/pgtable-3level.h |    4 ++--
 arch/um/include/asm/pgtable.h        |    2 +-
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/um/include/asm/page.h b/arch/um/include/asm/page.h
index 71c5d132062a..fe26a5e06268 100644
--- a/arch/um/include/asm/page.h
+++ b/arch/um/include/asm/page.h
@@ -18,6 +18,7 @@
 
 struct page;
 
+#include <linux/pfn.h>
 #include <linux/types.h>
 #include <asm/vm-flags.h>
 
@@ -76,7 +77,6 @@ typedef struct { unsigned long pmd; } pmd_t;
 #define pte_is_zero(p) (!((p).pte & ~_PAGE_NEWPAGE))
 #define pte_set_val(p, phys, prot) (p).pte = (phys | pgprot_val(prot))
 
-typedef unsigned long pfn_t;
 typedef unsigned long phys_t;
 
 #endif
@@ -109,8 +109,8 @@ extern unsigned long uml_physmem;
 #define __pa(virt) to_phys((void *) (unsigned long) (virt))
 #define __va(phys) to_virt((unsigned long) (phys))
 
-#define phys_to_pfn(p) ((pfn_t) ((p) >> PAGE_SHIFT))
-#define pfn_to_phys(pfn) ((phys_t) ((pfn) << PAGE_SHIFT))
+#define phys_to_pfn(p) ((p) >> PAGE_SHIFT)
+#define pfn_to_phys(pfn) PFN_PHYS(pfn)
 
 #define pfn_valid(pfn) ((pfn) < max_mapnr)
 #define virt_addr_valid(v) pfn_valid(phys_to_pfn(__pa(v)))
diff --git a/arch/um/include/asm/pgtable-3level.h b/arch/um/include/asm/pgtable-3level.h
index 2b4274e7c095..bae8523a162f 100644
--- a/arch/um/include/asm/pgtable-3level.h
+++ b/arch/um/include/asm/pgtable-3level.h
@@ -98,7 +98,7 @@ static inline unsigned long pte_pfn(pte_t pte)
 	return phys_to_pfn(pte_val(pte));
 }
 
-static inline pte_t pfn_pte(pfn_t page_nr, pgprot_t pgprot)
+static inline pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot)
 {
 	pte_t pte;
 	phys_t phys = pfn_to_phys(page_nr);
@@ -107,7 +107,7 @@ static inline pte_t pfn_pte(pfn_t page_nr, pgprot_t pgprot)
 	return pte;
 }
 
-static inline pmd_t pfn_pmd(pfn_t page_nr, pgprot_t pgprot)
+static inline pmd_t pfn_pmd(unsigned long page_nr, pgprot_t pgprot)
 {
 	return __pmd((page_nr << PAGE_SHIFT) | pgprot_val(pgprot));
 }
diff --git a/arch/um/include/asm/pgtable.h b/arch/um/include/asm/pgtable.h
index 18eb9924dda3..7485398d0737 100644
--- a/arch/um/include/asm/pgtable.h
+++ b/arch/um/include/asm/pgtable.h
@@ -271,7 +271,7 @@ static inline int pte_same(pte_t pte_a, pte_t pte_b)
 
 #define phys_to_page(phys) pfn_to_page(phys_to_pfn(phys))
 #define __virt_to_page(virt) phys_to_page(__pa(virt))
-#define page_to_phys(page) pfn_to_phys((pfn_t) page_to_pfn(page))
+#define page_to_phys(page) pfn_to_phys(page_to_pfn(page))
 #define virt_to_page(addr) __virt_to_page((const unsigned long) addr)
 
 #define mk_pte(page, pgprot) \


* [PATCH v3 07/15] kvm: rename pfn_t to kvm_pfn_t
  2015-11-02  4:29 ` Dan Williams
@ 2015-11-02  4:30   ` Dan Williams
  -1 siblings, 0 replies; 95+ messages in thread
From: Dan Williams @ 2015-11-02  4:30 UTC (permalink / raw)
  To: axboe
  Cc: jack, linux-nvdimm, david, linux-kernel, Paolo Bonzini,
	ross.zwisler, hch, Christoffer Dall

The core has developed a need for a "pfn_t" type [1].  Move the existing
pfn_t in KVM to kvm_pfn_t [2].

[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html
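(Illustration only, not part of the patch: a standalone sketch of why the
rename is mechanical; kvm_pfn_t keeps the same underlying 64-bit hfn_t, so
helpers such as pfn_to_hpa() only change their spelling.  PAGE_SHIFT of 12 and
plain stdint types are assumed for the example.)

/* Standalone sketch, not kernel code: the typedef rename leaves the
 * underlying 64-bit type untouched. */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12

typedef uint64_t hfn_t;		/* host frame number */
typedef hfn_t kvm_pfn_t;	/* was: typedef hfn_t pfn_t; */
typedef uint64_t hpa_t;		/* host physical address */

static inline hpa_t pfn_to_hpa(kvm_pfn_t pfn)
{
	return (hpa_t)pfn << PAGE_SHIFT;
}

int main(void)
{
	kvm_pfn_t pfn = 0x12345;

	printf("pfn 0x%llx -> hpa 0x%llx\n",
	       (unsigned long long)pfn,
	       (unsigned long long)pfn_to_hpa(pfn));
	return 0;
}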

Cc: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/arm/include/asm/kvm_mmu.h        |    5 ++--
 arch/arm/kvm/mmu.c                    |   10 ++++---
 arch/arm64/include/asm/kvm_mmu.h      |    3 +-
 arch/mips/include/asm/kvm_host.h      |    6 ++--
 arch/mips/kvm/emulate.c               |    2 +
 arch/mips/kvm/tlb.c                   |   14 +++++-----
 arch/powerpc/include/asm/kvm_book3s.h |    4 +--
 arch/powerpc/include/asm/kvm_ppc.h    |    2 +
 arch/powerpc/kvm/book3s.c             |    6 ++--
 arch/powerpc/kvm/book3s_32_mmu_host.c |    2 +
 arch/powerpc/kvm/book3s_64_mmu_host.c |    2 +
 arch/powerpc/kvm/e500.h               |    2 +
 arch/powerpc/kvm/e500_mmu_host.c      |    8 +++---
 arch/powerpc/kvm/trace_pr.h           |    2 +
 arch/x86/kvm/iommu.c                  |   11 ++++----
 arch/x86/kvm/mmu.c                    |   37 +++++++++++++-------------
 arch/x86/kvm/mmu_audit.c              |    2 +
 arch/x86/kvm/paging_tmpl.h            |    6 ++--
 arch/x86/kvm/vmx.c                    |    2 +
 arch/x86/kvm/x86.c                    |    2 +
 include/linux/kvm_host.h              |   37 +++++++++++++-------------
 include/linux/kvm_types.h             |    2 +
 virt/kvm/kvm_main.c                   |   47 +++++++++++++++++----------------
 23 files changed, 110 insertions(+), 104 deletions(-)

diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index 405aa1883307..8ebd282dfc2b 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -182,7 +182,8 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu *vcpu)
 	return (vcpu->arch.cp15[c1_SCTLR] & 0b101) == 0b101;
 }
 
-static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
+static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu,
+					       kvm_pfn_t pfn,
 					       unsigned long size,
 					       bool ipa_uncached)
 {
@@ -246,7 +247,7 @@ static inline void __kvm_flush_dcache_pte(pte_t pte)
 static inline void __kvm_flush_dcache_pmd(pmd_t pmd)
 {
 	unsigned long size = PMD_SIZE;
-	pfn_t pfn = pmd_pfn(pmd);
+	kvm_pfn_t pfn = pmd_pfn(pmd);
 
 	while (size) {
 		void *va = kmap_atomic_pfn(pfn);
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 6984342da13d..e2dcbfdc4a8c 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -988,9 +988,9 @@ out:
 	return ret;
 }
 
-static bool transparent_hugepage_adjust(pfn_t *pfnp, phys_addr_t *ipap)
+static bool transparent_hugepage_adjust(kvm_pfn_t *pfnp, phys_addr_t *ipap)
 {
-	pfn_t pfn = *pfnp;
+	kvm_pfn_t pfn = *pfnp;
 	gfn_t gfn = *ipap >> PAGE_SHIFT;
 
 	if (PageTransCompound(pfn_to_page(pfn))) {
@@ -1202,7 +1202,7 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 	kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
 }
 
-static void coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
+static void coherent_cache_guest_page(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
 				      unsigned long size, bool uncached)
 {
 	__coherent_cache_guest_page(vcpu, pfn, size, uncached);
@@ -1219,7 +1219,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	struct kvm *kvm = vcpu->kvm;
 	struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
 	struct vm_area_struct *vma;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 	pgprot_t mem_type = PAGE_S2;
 	bool fault_ipa_uncached;
 	bool logging_active = memslot_is_logging(memslot);
@@ -1347,7 +1347,7 @@ static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
 {
 	pmd_t *pmd;
 	pte_t *pte;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 	bool pfn_valid = false;
 
 	trace_kvm_access_fault(fault_ipa);
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 61505676d085..385fc8cef82d 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -230,7 +230,8 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu *vcpu)
 	return (vcpu_sys_reg(vcpu, SCTLR_EL1) & 0b101) == 0b101;
 }
 
-static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
+static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu,
+					       kvm_pfn_t pfn,
 					       unsigned long size,
 					       bool ipa_uncached)
 {
diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index 5a1a882e0a75..9c67f05a0a1b 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -101,9 +101,9 @@
 #define CAUSEF_DC			(_ULCAST_(1) << 27)
 
 extern atomic_t kvm_mips_instance;
-extern pfn_t(*kvm_mips_gfn_to_pfn) (struct kvm *kvm, gfn_t gfn);
-extern void (*kvm_mips_release_pfn_clean) (pfn_t pfn);
-extern bool(*kvm_mips_is_error_pfn) (pfn_t pfn);
+extern kvm_pfn_t (*kvm_mips_gfn_to_pfn)(struct kvm *kvm, gfn_t gfn);
+extern void (*kvm_mips_release_pfn_clean)(kvm_pfn_t pfn);
+extern bool (*kvm_mips_is_error_pfn)(kvm_pfn_t pfn);
 
 struct kvm_vm_stat {
 	u32 remote_tlb_flush;
diff --git a/arch/mips/kvm/emulate.c b/arch/mips/kvm/emulate.c
index d5fa3eaf39a1..476296cf37d3 100644
--- a/arch/mips/kvm/emulate.c
+++ b/arch/mips/kvm/emulate.c
@@ -1525,7 +1525,7 @@ int kvm_mips_sync_icache(unsigned long va, struct kvm_vcpu *vcpu)
 	struct kvm *kvm = vcpu->kvm;
 	unsigned long pa;
 	gfn_t gfn;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 
 	gfn = va >> PAGE_SHIFT;
 
diff --git a/arch/mips/kvm/tlb.c b/arch/mips/kvm/tlb.c
index aed0ac2a4972..570479c03bdc 100644
--- a/arch/mips/kvm/tlb.c
+++ b/arch/mips/kvm/tlb.c
@@ -38,13 +38,13 @@ atomic_t kvm_mips_instance;
 EXPORT_SYMBOL(kvm_mips_instance);
 
 /* These function pointers are initialized once the KVM module is loaded */
-pfn_t (*kvm_mips_gfn_to_pfn)(struct kvm *kvm, gfn_t gfn);
+kvm_pfn_t (*kvm_mips_gfn_to_pfn)(struct kvm *kvm, gfn_t gfn);
 EXPORT_SYMBOL(kvm_mips_gfn_to_pfn);
 
-void (*kvm_mips_release_pfn_clean)(pfn_t pfn);
+void (*kvm_mips_release_pfn_clean)(kvm_pfn_t pfn);
 EXPORT_SYMBOL(kvm_mips_release_pfn_clean);
 
-bool (*kvm_mips_is_error_pfn)(pfn_t pfn);
+bool (*kvm_mips_is_error_pfn)(kvm_pfn_t pfn);
 EXPORT_SYMBOL(kvm_mips_is_error_pfn);
 
 uint32_t kvm_mips_get_kernel_asid(struct kvm_vcpu *vcpu)
@@ -144,7 +144,7 @@ EXPORT_SYMBOL(kvm_mips_dump_guest_tlbs);
 static int kvm_mips_map_page(struct kvm *kvm, gfn_t gfn)
 {
 	int srcu_idx, err = 0;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 
 	if (kvm->arch.guest_pmap[gfn] != KVM_INVALID_PAGE)
 		return 0;
@@ -262,7 +262,7 @@ int kvm_mips_handle_kseg0_tlb_fault(unsigned long badvaddr,
 				    struct kvm_vcpu *vcpu)
 {
 	gfn_t gfn;
-	pfn_t pfn0, pfn1;
+	kvm_pfn_t pfn0, pfn1;
 	unsigned long vaddr = 0;
 	unsigned long entryhi = 0, entrylo0 = 0, entrylo1 = 0;
 	int even;
@@ -313,7 +313,7 @@ EXPORT_SYMBOL(kvm_mips_handle_kseg0_tlb_fault);
 int kvm_mips_handle_commpage_tlb_fault(unsigned long badvaddr,
 	struct kvm_vcpu *vcpu)
 {
-	pfn_t pfn0, pfn1;
+	kvm_pfn_t pfn0, pfn1;
 	unsigned long flags, old_entryhi = 0, vaddr = 0;
 	unsigned long entrylo0 = 0, entrylo1 = 0;
 
@@ -360,7 +360,7 @@ int kvm_mips_handle_mapped_seg_tlb_fault(struct kvm_vcpu *vcpu,
 {
 	unsigned long entryhi = 0, entrylo0 = 0, entrylo1 = 0;
 	struct kvm *kvm = vcpu->kvm;
-	pfn_t pfn0, pfn1;
+	kvm_pfn_t pfn0, pfn1;
 
 	if ((tlb->tlb_hi & VPN2_MASK) == 0) {
 		pfn0 = 0;
diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index 9fac01cb89c1..8f39796c9da8 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -154,8 +154,8 @@ extern void kvmppc_set_bat(struct kvm_vcpu *vcpu, struct kvmppc_bat *bat,
 			   bool upper, u32 val);
 extern void kvmppc_giveup_ext(struct kvm_vcpu *vcpu, ulong msr);
 extern int kvmppc_emulate_paired_single(struct kvm_run *run, struct kvm_vcpu *vcpu);
-extern pfn_t kvmppc_gpa_to_pfn(struct kvm_vcpu *vcpu, gpa_t gpa, bool writing,
-			bool *writable);
+extern kvm_pfn_t kvmppc_gpa_to_pfn(struct kvm_vcpu *vcpu, gpa_t gpa,
+			bool writing, bool *writable);
 extern void kvmppc_add_revmap_chain(struct kvm *kvm, struct revmap_entry *rev,
 			unsigned long *rmap, long pte_index, int realmode);
 extern void kvmppc_update_rmap_change(unsigned long *rmap, unsigned long psize);
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index c6ef05bd0765..2241d5357129 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -515,7 +515,7 @@ void kvmppc_claim_lpid(long lpid);
 void kvmppc_free_lpid(long lpid);
 void kvmppc_init_lpid(unsigned long nr_lpids);
 
-static inline void kvmppc_mmu_flush_icache(pfn_t pfn)
+static inline void kvmppc_mmu_flush_icache(kvm_pfn_t pfn)
 {
 	struct page *page;
 	/*
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 099c79d8c160..638c6d9be9e0 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -366,7 +366,7 @@ int kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvmppc_core_prepare_to_enter);
 
-pfn_t kvmppc_gpa_to_pfn(struct kvm_vcpu *vcpu, gpa_t gpa, bool writing,
+kvm_pfn_t kvmppc_gpa_to_pfn(struct kvm_vcpu *vcpu, gpa_t gpa, bool writing,
 			bool *writable)
 {
 	ulong mp_pa = vcpu->arch.magic_page_pa & KVM_PAM;
@@ -379,9 +379,9 @@ pfn_t kvmppc_gpa_to_pfn(struct kvm_vcpu *vcpu, gpa_t gpa, bool writing,
 	gpa &= ~0xFFFULL;
 	if (unlikely(mp_pa) && unlikely((gpa & KVM_PAM) == mp_pa)) {
 		ulong shared_page = ((ulong)vcpu->arch.shared) & PAGE_MASK;
-		pfn_t pfn;
+		kvm_pfn_t pfn;
 
-		pfn = (pfn_t)virt_to_phys((void*)shared_page) >> PAGE_SHIFT;
+		pfn = (kvm_pfn_t)virt_to_phys((void*)shared_page) >> PAGE_SHIFT;
 		get_page(pfn_to_page(pfn));
 		if (writable)
 			*writable = true;
diff --git a/arch/powerpc/kvm/book3s_32_mmu_host.c b/arch/powerpc/kvm/book3s_32_mmu_host.c
index d5c9bfeb0c9c..55c4d51ea3e2 100644
--- a/arch/powerpc/kvm/book3s_32_mmu_host.c
+++ b/arch/powerpc/kvm/book3s_32_mmu_host.c
@@ -142,7 +142,7 @@ extern char etext[];
 int kvmppc_mmu_map_page(struct kvm_vcpu *vcpu, struct kvmppc_pte *orig_pte,
 			bool iswrite)
 {
-	pfn_t hpaddr;
+	kvm_pfn_t hpaddr;
 	u64 vpn;
 	u64 vsid;
 	struct kvmppc_sid_map *map;
diff --git a/arch/powerpc/kvm/book3s_64_mmu_host.c b/arch/powerpc/kvm/book3s_64_mmu_host.c
index 79ad35abd196..913cd2198fa6 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_host.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_host.c
@@ -83,7 +83,7 @@ int kvmppc_mmu_map_page(struct kvm_vcpu *vcpu, struct kvmppc_pte *orig_pte,
 			bool iswrite)
 {
 	unsigned long vpn;
-	pfn_t hpaddr;
+	kvm_pfn_t hpaddr;
 	ulong hash, hpteg;
 	u64 vsid;
 	int ret;
diff --git a/arch/powerpc/kvm/e500.h b/arch/powerpc/kvm/e500.h
index 72920bed3ac6..94f04fcb373e 100644
--- a/arch/powerpc/kvm/e500.h
+++ b/arch/powerpc/kvm/e500.h
@@ -41,7 +41,7 @@ enum vcpu_ftr {
 #define E500_TLB_MAS2_ATTR	(0x7f)
 
 struct tlbe_ref {
-	pfn_t pfn;		/* valid only for TLB0, except briefly */
+	kvm_pfn_t pfn;		/* valid only for TLB0, except briefly */
 	unsigned int flags;	/* E500_TLB_* */
 };
 
diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
index 4d33e199edcc..8a5bb6dfcc2d 100644
--- a/arch/powerpc/kvm/e500_mmu_host.c
+++ b/arch/powerpc/kvm/e500_mmu_host.c
@@ -163,9 +163,9 @@ void kvmppc_map_magic(struct kvm_vcpu *vcpu)
 	struct kvm_book3e_206_tlb_entry magic;
 	ulong shared_page = ((ulong)vcpu->arch.shared) & PAGE_MASK;
 	unsigned int stid;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 
-	pfn = (pfn_t)virt_to_phys((void *)shared_page) >> PAGE_SHIFT;
+	pfn = (kvm_pfn_t)virt_to_phys((void *)shared_page) >> PAGE_SHIFT;
 	get_page(pfn_to_page(pfn));
 
 	preempt_disable();
@@ -246,7 +246,7 @@ static inline int tlbe_is_writable(struct kvm_book3e_206_tlb_entry *tlbe)
 
 static inline void kvmppc_e500_ref_setup(struct tlbe_ref *ref,
 					 struct kvm_book3e_206_tlb_entry *gtlbe,
-					 pfn_t pfn, unsigned int wimg)
+					 kvm_pfn_t pfn, unsigned int wimg)
 {
 	ref->pfn = pfn;
 	ref->flags = E500_TLB_VALID;
@@ -309,7 +309,7 @@ static void kvmppc_e500_setup_stlbe(
 	int tsize, struct tlbe_ref *ref, u64 gvaddr,
 	struct kvm_book3e_206_tlb_entry *stlbe)
 {
-	pfn_t pfn = ref->pfn;
+	kvm_pfn_t pfn = ref->pfn;
 	u32 pr = vcpu->arch.shared->msr & MSR_PR;
 
 	BUG_ON(!(ref->flags & E500_TLB_VALID));
diff --git a/arch/powerpc/kvm/trace_pr.h b/arch/powerpc/kvm/trace_pr.h
index 810507cb688a..d44f324184fb 100644
--- a/arch/powerpc/kvm/trace_pr.h
+++ b/arch/powerpc/kvm/trace_pr.h
@@ -30,7 +30,7 @@ TRACE_EVENT(kvm_book3s_reenter,
 #ifdef CONFIG_PPC_BOOK3S_64
 
 TRACE_EVENT(kvm_book3s_64_mmu_map,
-	TP_PROTO(int rflags, ulong hpteg, ulong va, pfn_t hpaddr,
+	TP_PROTO(int rflags, ulong hpteg, ulong va, kvm_pfn_t hpaddr,
 		 struct kvmppc_pte *orig_pte),
 	TP_ARGS(rflags, hpteg, va, hpaddr, orig_pte),
 
diff --git a/arch/x86/kvm/iommu.c b/arch/x86/kvm/iommu.c
index 5c520ebf6343..a22a488b4622 100644
--- a/arch/x86/kvm/iommu.c
+++ b/arch/x86/kvm/iommu.c
@@ -43,11 +43,11 @@ static int kvm_iommu_unmap_memslots(struct kvm *kvm);
 static void kvm_iommu_put_pages(struct kvm *kvm,
 				gfn_t base_gfn, unsigned long npages);
 
-static pfn_t kvm_pin_pages(struct kvm_memory_slot *slot, gfn_t gfn,
+static kvm_pfn_t kvm_pin_pages(struct kvm_memory_slot *slot, gfn_t gfn,
 			   unsigned long npages)
 {
 	gfn_t end_gfn;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 
 	pfn     = gfn_to_pfn_memslot(slot, gfn);
 	end_gfn = gfn + npages;
@@ -62,7 +62,8 @@ static pfn_t kvm_pin_pages(struct kvm_memory_slot *slot, gfn_t gfn,
 	return pfn;
 }
 
-static void kvm_unpin_pages(struct kvm *kvm, pfn_t pfn, unsigned long npages)
+static void kvm_unpin_pages(struct kvm *kvm, kvm_pfn_t pfn,
+		unsigned long npages)
 {
 	unsigned long i;
 
@@ -73,7 +74,7 @@ static void kvm_unpin_pages(struct kvm *kvm, pfn_t pfn, unsigned long npages)
 int kvm_iommu_map_pages(struct kvm *kvm, struct kvm_memory_slot *slot)
 {
 	gfn_t gfn, end_gfn;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 	int r = 0;
 	struct iommu_domain *domain = kvm->arch.iommu_domain;
 	int flags;
@@ -275,7 +276,7 @@ static void kvm_iommu_put_pages(struct kvm *kvm,
 {
 	struct iommu_domain *domain;
 	gfn_t end_gfn, gfn;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 	u64 phys;
 
 	domain  = kvm->arch.iommu_domain;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ff606f507913..6ab963ae0427 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -259,7 +259,7 @@ static unsigned get_mmio_spte_access(u64 spte)
 }
 
 static bool set_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, gfn_t gfn,
-			  pfn_t pfn, unsigned access)
+			  kvm_pfn_t pfn, unsigned access)
 {
 	if (unlikely(is_noslot_pfn(pfn))) {
 		mark_mmio_spte(vcpu, sptep, gfn, access);
@@ -325,7 +325,7 @@ static int is_last_spte(u64 pte, int level)
 	return 0;
 }
 
-static pfn_t spte_to_pfn(u64 pte)
+static kvm_pfn_t spte_to_pfn(u64 pte)
 {
 	return (pte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
 }
@@ -587,7 +587,7 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
  */
 static int mmu_spte_clear_track_bits(u64 *sptep)
 {
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 	u64 old_spte = *sptep;
 
 	if (!spte_has_volatile_bits(old_spte))
@@ -1369,7 +1369,7 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp,
 	int need_flush = 0;
 	u64 new_spte;
 	pte_t *ptep = (pte_t *)data;
-	pfn_t new_pfn;
+	kvm_pfn_t new_pfn;
 
 	WARN_ON(pte_huge(*ptep));
 	new_pfn = pte_pfn(*ptep);
@@ -2456,7 +2456,7 @@ static int mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
 	return 0;
 }
 
-static bool kvm_is_mmio_pfn(pfn_t pfn)
+static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
 {
 	if (pfn_valid(pfn))
 		return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
@@ -2466,7 +2466,7 @@ static bool kvm_is_mmio_pfn(pfn_t pfn)
 
 static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		    unsigned pte_access, int level,
-		    gfn_t gfn, pfn_t pfn, bool speculative,
+		    gfn_t gfn, kvm_pfn_t pfn, bool speculative,
 		    bool can_unsync, bool host_writable)
 {
 	u64 spte;
@@ -2546,7 +2546,7 @@ done:
 
 static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 			 unsigned pte_access, int write_fault, int *emulate,
-			 int level, gfn_t gfn, pfn_t pfn, bool speculative,
+			 int level, gfn_t gfn, kvm_pfn_t pfn, bool speculative,
 			 bool host_writable)
 {
 	int was_rmapped = 0;
@@ -2606,7 +2606,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 	kvm_release_pfn_clean(pfn);
 }
 
-static pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
+static kvm_pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
 				     bool no_dirty_log)
 {
 	struct kvm_memory_slot *slot;
@@ -2689,7 +2689,7 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
 }
 
 static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
-			int map_writable, int level, gfn_t gfn, pfn_t pfn,
+			int map_writable, int level, gfn_t gfn, kvm_pfn_t pfn,
 			bool prefault)
 {
 	struct kvm_shadow_walk_iterator iterator;
@@ -2739,7 +2739,7 @@ static void kvm_send_hwpoison_signal(unsigned long address, struct task_struct *
 	send_sig_info(SIGBUS, &info, tsk);
 }
 
-static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, pfn_t pfn)
+static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
 {
 	/*
 	 * Do not cache the mmio info caused by writing the readonly gfn
@@ -2759,9 +2759,10 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, pfn_t pfn)
 }
 
 static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
-					gfn_t *gfnp, pfn_t *pfnp, int *levelp)
+					gfn_t *gfnp, kvm_pfn_t *pfnp,
+					int *levelp)
 {
-	pfn_t pfn = *pfnp;
+	kvm_pfn_t pfn = *pfnp;
 	gfn_t gfn = *gfnp;
 	int level = *levelp;
 
@@ -2800,7 +2801,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
 }
 
 static bool handle_abnormal_pfn(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
-				pfn_t pfn, unsigned access, int *ret_val)
+				kvm_pfn_t pfn, unsigned access, int *ret_val)
 {
 	bool ret = true;
 
@@ -2954,7 +2955,7 @@ exit:
 }
 
 static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
-			 gva_t gva, pfn_t *pfn, bool write, bool *writable);
+			 gva_t gva, kvm_pfn_t *pfn, bool write, bool *writable);
 static void make_mmu_pages_available(struct kvm_vcpu *vcpu);
 
 static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, u32 error_code,
@@ -2963,7 +2964,7 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, u32 error_code,
 	int r;
 	int level;
 	int force_pt_level;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 	unsigned long mmu_seq;
 	bool map_writable, write = error_code & PFERR_WRITE_MASK;
 
@@ -3435,7 +3436,7 @@ static bool can_do_async_pf(struct kvm_vcpu *vcpu)
 }
 
 static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
-			 gva_t gva, pfn_t *pfn, bool write, bool *writable)
+			 gva_t gva, kvm_pfn_t *pfn, bool write, bool *writable)
 {
 	struct kvm_memory_slot *slot;
 	bool async;
@@ -3473,7 +3474,7 @@ check_hugepage_cache_consistency(struct kvm_vcpu *vcpu, gfn_t gfn, int level)
 static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
 			  bool prefault)
 {
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 	int r;
 	int level;
 	int force_pt_level;
@@ -4627,7 +4628,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 	u64 *sptep;
 	struct rmap_iterator iter;
 	int need_tlb_flush = 0;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 	struct kvm_mmu_page *sp;
 
 restart:
diff --git a/arch/x86/kvm/mmu_audit.c b/arch/x86/kvm/mmu_audit.c
index 03d518e499a6..37a4d14115c0 100644
--- a/arch/x86/kvm/mmu_audit.c
+++ b/arch/x86/kvm/mmu_audit.c
@@ -97,7 +97,7 @@ static void audit_mappings(struct kvm_vcpu *vcpu, u64 *sptep, int level)
 {
 	struct kvm_mmu_page *sp;
 	gfn_t gfn;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 	hpa_t hpa;
 
 	sp = page_header(__pa(sptep));
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 736e6ab8784d..9dd02cb74724 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -456,7 +456,7 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 {
 	unsigned pte_access;
 	gfn_t gfn;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 
 	if (FNAME(prefetch_invalid_gpte)(vcpu, sp, spte, gpte))
 		return false;
@@ -551,7 +551,7 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw,
 static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 			 struct guest_walker *gw,
 			 int write_fault, int hlevel,
-			 pfn_t pfn, bool map_writable, bool prefault)
+			 kvm_pfn_t pfn, bool map_writable, bool prefault)
 {
 	struct kvm_mmu_page *sp = NULL;
 	struct kvm_shadow_walk_iterator it;
@@ -696,7 +696,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 	int user_fault = error_code & PFERR_USER_MASK;
 	struct guest_walker walker;
 	int r;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 	int level = PT_PAGE_TABLE_LEVEL;
 	int force_pt_level;
 	unsigned long mmu_seq;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 6a8bc64566ab..52aba0f1207b 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4046,7 +4046,7 @@ out:
 static int init_rmode_identity_map(struct kvm *kvm)
 {
 	int i, idx, r = 0;
-	pfn_t identity_map_pfn;
+	kvm_pfn_t identity_map_pfn;
 	u32 tmp;
 
 	if (!enable_ept)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9a9a19830321..60b08427ca6c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4935,7 +4935,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gva_t cr2,
 				  int emulation_type)
 {
 	gpa_t gpa = cr2;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 
 	if (emulation_type & EMULTYPE_NO_REEXECUTE)
 		return false;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1bef9e21e725..2420b43f3acc 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -65,7 +65,7 @@
  * error pfns indicate that the gfn is in slot but faild to
  * translate it to pfn on host.
  */
-static inline bool is_error_pfn(pfn_t pfn)
+static inline bool is_error_pfn(kvm_pfn_t pfn)
 {
 	return !!(pfn & KVM_PFN_ERR_MASK);
 }
@@ -75,13 +75,13 @@ static inline bool is_error_pfn(pfn_t pfn)
  * translated to pfn - it is not in slot or failed to
  * translate it to pfn.
  */
-static inline bool is_error_noslot_pfn(pfn_t pfn)
+static inline bool is_error_noslot_pfn(kvm_pfn_t pfn)
 {
 	return !!(pfn & KVM_PFN_ERR_NOSLOT_MASK);
 }
 
 /* noslot pfn indicates that the gfn is not in slot. */
-static inline bool is_noslot_pfn(pfn_t pfn)
+static inline bool is_noslot_pfn(kvm_pfn_t pfn)
 {
 	return pfn == KVM_PFN_NOSLOT;
 }
@@ -569,19 +569,20 @@ void kvm_release_page_clean(struct page *page);
 void kvm_release_page_dirty(struct page *page);
 void kvm_set_page_accessed(struct page *page);
 
-pfn_t gfn_to_pfn_atomic(struct kvm *kvm, gfn_t gfn);
-pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
-pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
+kvm_pfn_t gfn_to_pfn_atomic(struct kvm *kvm, gfn_t gfn);
+kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
+kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
 		      bool *writable);
-pfn_t gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn);
-pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn);
-pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn, bool atomic,
-			   bool *async, bool write_fault, bool *writable);
+kvm_pfn_t gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn);
+kvm_pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn);
+kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
+			       bool atomic, bool *async, bool write_fault,
+			       bool *writable);
 
-void kvm_release_pfn_clean(pfn_t pfn);
-void kvm_set_pfn_dirty(pfn_t pfn);
-void kvm_set_pfn_accessed(pfn_t pfn);
-void kvm_get_pfn(pfn_t pfn);
+void kvm_release_pfn_clean(kvm_pfn_t pfn);
+void kvm_set_pfn_dirty(kvm_pfn_t pfn);
+void kvm_set_pfn_accessed(kvm_pfn_t pfn);
+void kvm_get_pfn(kvm_pfn_t pfn);
 
 int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
 			int len);
@@ -607,8 +608,8 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
 
 struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu);
 struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn);
-pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn);
-pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn);
+kvm_pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn);
+kvm_pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn);
 struct page *kvm_vcpu_gfn_to_page(struct kvm_vcpu *vcpu, gfn_t gfn);
 unsigned long kvm_vcpu_gfn_to_hva(struct kvm_vcpu *vcpu, gfn_t gfn);
 unsigned long kvm_vcpu_gfn_to_hva_prot(struct kvm_vcpu *vcpu, gfn_t gfn, bool *writable);
@@ -789,7 +790,7 @@ void kvm_arch_sync_events(struct kvm *kvm);
 int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu);
 void kvm_vcpu_kick(struct kvm_vcpu *vcpu);
 
-bool kvm_is_reserved_pfn(pfn_t pfn);
+bool kvm_is_reserved_pfn(kvm_pfn_t pfn);
 
 struct kvm_irq_ack_notifier {
 	struct hlist_node link;
@@ -940,7 +941,7 @@ static inline gfn_t gpa_to_gfn(gpa_t gpa)
 	return (gfn_t)(gpa >> PAGE_SHIFT);
 }
 
-static inline hpa_t pfn_to_hpa(pfn_t pfn)
+static inline hpa_t pfn_to_hpa(kvm_pfn_t pfn)
 {
 	return (hpa_t)pfn << PAGE_SHIFT;
 }
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 1b47a185c2f0..8bf259dae9f6 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -53,7 +53,7 @@ typedef unsigned long  hva_t;
 typedef u64            hpa_t;
 typedef u64            hfn_t;
 
-typedef hfn_t pfn_t;
+typedef hfn_t kvm_pfn_t;
 
 struct gfn_to_hva_cache {
 	u64 generation;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8db1d9361993..02cd2eddd3ff 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -111,7 +111,7 @@ static void hardware_disable_all(void);
 
 static void kvm_io_bus_destroy(struct kvm_io_bus *bus);
 
-static void kvm_release_pfn_dirty(pfn_t pfn);
+static void kvm_release_pfn_dirty(kvm_pfn_t pfn);
 static void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot, gfn_t gfn);
 
 __visible bool kvm_rebooting;
@@ -119,7 +119,7 @@ EXPORT_SYMBOL_GPL(kvm_rebooting);
 
 static bool largepages_enabled = true;
 
-bool kvm_is_reserved_pfn(pfn_t pfn)
+bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
 {
 	if (pfn_valid(pfn))
 		return PageReserved(pfn_to_page(pfn));
@@ -1296,7 +1296,7 @@ static inline int check_user_page_hwpoison(unsigned long addr)
  * true indicates success, otherwise false is returned.
  */
 static bool hva_to_pfn_fast(unsigned long addr, bool atomic, bool *async,
-			    bool write_fault, bool *writable, pfn_t *pfn)
+			    bool write_fault, bool *writable, kvm_pfn_t *pfn)
 {
 	struct page *page[1];
 	int npages;
@@ -1329,7 +1329,7 @@ static bool hva_to_pfn_fast(unsigned long addr, bool atomic, bool *async,
  * 1 indicates success, -errno is returned if error is detected.
  */
 static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
-			   bool *writable, pfn_t *pfn)
+			   bool *writable, kvm_pfn_t *pfn)
 {
 	struct page *page[1];
 	int npages = 0;
@@ -1393,11 +1393,11 @@ static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
  * 2): @write_fault = false && @writable, @writable will tell the caller
  *     whether the mapping is writable.
  */
-static pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
+static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
 			bool write_fault, bool *writable)
 {
 	struct vm_area_struct *vma;
-	pfn_t pfn = 0;
+	kvm_pfn_t pfn = 0;
 	int npages;
 
 	/* we can do it either atomically or asynchronously, not both */
@@ -1438,8 +1438,9 @@ exit:
 	return pfn;
 }
 
-pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn, bool atomic,
-			   bool *async, bool write_fault, bool *writable)
+kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
+			       bool atomic, bool *async, bool write_fault,
+			       bool *writable)
 {
 	unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
 
@@ -1460,7 +1461,7 @@ pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn, bool atomic,
 }
 EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);
 
-pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
+kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
 		      bool *writable)
 {
 	return __gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn, false, NULL,
@@ -1468,37 +1469,37 @@ pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_prot);
 
-pfn_t gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
+kvm_pfn_t gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
 {
 	return __gfn_to_pfn_memslot(slot, gfn, false, NULL, true, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot);
 
-pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn)
+kvm_pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn)
 {
 	return __gfn_to_pfn_memslot(slot, gfn, true, NULL, true, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic);
 
-pfn_t gfn_to_pfn_atomic(struct kvm *kvm, gfn_t gfn)
+kvm_pfn_t gfn_to_pfn_atomic(struct kvm *kvm, gfn_t gfn)
 {
 	return gfn_to_pfn_memslot_atomic(gfn_to_memslot(kvm, gfn), gfn);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_atomic);
 
-pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn)
+kvm_pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn)
 {
 	return gfn_to_pfn_memslot_atomic(kvm_vcpu_gfn_to_memslot(vcpu, gfn), gfn);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_pfn_atomic);
 
-pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
+kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
 {
 	return gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn);
 
-pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn)
+kvm_pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn)
 {
 	return gfn_to_pfn_memslot(kvm_vcpu_gfn_to_memslot(vcpu, gfn), gfn);
 }
@@ -1521,7 +1522,7 @@ int gfn_to_page_many_atomic(struct kvm_memory_slot *slot, gfn_t gfn,
 }
 EXPORT_SYMBOL_GPL(gfn_to_page_many_atomic);
 
-static struct page *kvm_pfn_to_page(pfn_t pfn)
+static struct page *kvm_pfn_to_page(kvm_pfn_t pfn)
 {
 	if (is_error_noslot_pfn(pfn))
 		return KVM_ERR_PTR_BAD_PAGE;
@@ -1536,7 +1537,7 @@ static struct page *kvm_pfn_to_page(pfn_t pfn)
 
 struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
 {
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 
 	pfn = gfn_to_pfn(kvm, gfn);
 
@@ -1546,7 +1547,7 @@ EXPORT_SYMBOL_GPL(gfn_to_page);
 
 struct page *kvm_vcpu_gfn_to_page(struct kvm_vcpu *vcpu, gfn_t gfn)
 {
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 
 	pfn = kvm_vcpu_gfn_to_pfn(vcpu, gfn);
 
@@ -1562,7 +1563,7 @@ void kvm_release_page_clean(struct page *page)
 }
 EXPORT_SYMBOL_GPL(kvm_release_page_clean);
 
-void kvm_release_pfn_clean(pfn_t pfn)
+void kvm_release_pfn_clean(kvm_pfn_t pfn)
 {
 	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
 		put_page(pfn_to_page(pfn));
@@ -1577,13 +1578,13 @@ void kvm_release_page_dirty(struct page *page)
 }
 EXPORT_SYMBOL_GPL(kvm_release_page_dirty);
 
-static void kvm_release_pfn_dirty(pfn_t pfn)
+static void kvm_release_pfn_dirty(kvm_pfn_t pfn)
 {
 	kvm_set_pfn_dirty(pfn);
 	kvm_release_pfn_clean(pfn);
 }
 
-void kvm_set_pfn_dirty(pfn_t pfn)
+void kvm_set_pfn_dirty(kvm_pfn_t pfn)
 {
 	if (!kvm_is_reserved_pfn(pfn)) {
 		struct page *page = pfn_to_page(pfn);
@@ -1594,14 +1595,14 @@ void kvm_set_pfn_dirty(pfn_t pfn)
 }
 EXPORT_SYMBOL_GPL(kvm_set_pfn_dirty);
 
-void kvm_set_pfn_accessed(pfn_t pfn)
+void kvm_set_pfn_accessed(kvm_pfn_t pfn)
 {
 	if (!kvm_is_reserved_pfn(pfn))
 		mark_page_accessed(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_set_pfn_accessed);
 
-void kvm_get_pfn(pfn_t pfn)
+void kvm_get_pfn(kvm_pfn_t pfn)
 {
 	if (!kvm_is_reserved_pfn(pfn))
 		get_page(pfn_to_page(pfn));


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [PATCH v3 07/15] kvm: rename pfn_t to kvm_pfn_t
@ 2015-11-02  4:30   ` Dan Williams
  0 siblings, 0 replies; 95+ messages in thread
From: Dan Williams @ 2015-11-02  4:30 UTC (permalink / raw)
  To: axboe
  Cc: jack, linux-nvdimm, david, linux-kernel, Paolo Bonzini,
	ross.zwisler, hch, Christoffer Dall

The core has developed a need for a "pfn_t" type [1].  Move the existing
pfn_t in KVM to kvm_pfn_t [2].

[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html

Cc: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/arm/include/asm/kvm_mmu.h        |    5 ++--
 arch/arm/kvm/mmu.c                    |   10 ++++---
 arch/arm64/include/asm/kvm_mmu.h      |    3 +-
 arch/mips/include/asm/kvm_host.h      |    6 ++--
 arch/mips/kvm/emulate.c               |    2 +
 arch/mips/kvm/tlb.c                   |   14 +++++-----
 arch/powerpc/include/asm/kvm_book3s.h |    4 +--
 arch/powerpc/include/asm/kvm_ppc.h    |    2 +
 arch/powerpc/kvm/book3s.c             |    6 ++--
 arch/powerpc/kvm/book3s_32_mmu_host.c |    2 +
 arch/powerpc/kvm/book3s_64_mmu_host.c |    2 +
 arch/powerpc/kvm/e500.h               |    2 +
 arch/powerpc/kvm/e500_mmu_host.c      |    8 +++---
 arch/powerpc/kvm/trace_pr.h           |    2 +
 arch/x86/kvm/iommu.c                  |   11 ++++----
 arch/x86/kvm/mmu.c                    |   37 +++++++++++++-------------
 arch/x86/kvm/mmu_audit.c              |    2 +
 arch/x86/kvm/paging_tmpl.h            |    6 ++--
 arch/x86/kvm/vmx.c                    |    2 +
 arch/x86/kvm/x86.c                    |    2 +
 include/linux/kvm_host.h              |   37 +++++++++++++-------------
 include/linux/kvm_types.h             |    2 +
 virt/kvm/kvm_main.c                   |   47 +++++++++++++++++----------------
 23 files changed, 110 insertions(+), 104 deletions(-)

diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index 405aa1883307..8ebd282dfc2b 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -182,7 +182,8 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu *vcpu)
 	return (vcpu->arch.cp15[c1_SCTLR] & 0b101) == 0b101;
 }
 
-static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
+static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu,
+					       kvm_pfn_t pfn,
 					       unsigned long size,
 					       bool ipa_uncached)
 {
@@ -246,7 +247,7 @@ static inline void __kvm_flush_dcache_pte(pte_t pte)
 static inline void __kvm_flush_dcache_pmd(pmd_t pmd)
 {
 	unsigned long size = PMD_SIZE;
-	pfn_t pfn = pmd_pfn(pmd);
+	kvm_pfn_t pfn = pmd_pfn(pmd);
 
 	while (size) {
 		void *va = kmap_atomic_pfn(pfn);
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 6984342da13d..e2dcbfdc4a8c 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -988,9 +988,9 @@ out:
 	return ret;
 }
 
-static bool transparent_hugepage_adjust(pfn_t *pfnp, phys_addr_t *ipap)
+static bool transparent_hugepage_adjust(kvm_pfn_t *pfnp, phys_addr_t *ipap)
 {
-	pfn_t pfn = *pfnp;
+	kvm_pfn_t pfn = *pfnp;
 	gfn_t gfn = *ipap >> PAGE_SHIFT;
 
 	if (PageTransCompound(pfn_to_page(pfn))) {
@@ -1202,7 +1202,7 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 	kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
 }
 
-static void coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
+static void coherent_cache_guest_page(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
 				      unsigned long size, bool uncached)
 {
 	__coherent_cache_guest_page(vcpu, pfn, size, uncached);
@@ -1219,7 +1219,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	struct kvm *kvm = vcpu->kvm;
 	struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
 	struct vm_area_struct *vma;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 	pgprot_t mem_type = PAGE_S2;
 	bool fault_ipa_uncached;
 	bool logging_active = memslot_is_logging(memslot);
@@ -1347,7 +1347,7 @@ static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
 {
 	pmd_t *pmd;
 	pte_t *pte;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 	bool pfn_valid = false;
 
 	trace_kvm_access_fault(fault_ipa);
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 61505676d085..385fc8cef82d 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -230,7 +230,8 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu *vcpu)
 	return (vcpu_sys_reg(vcpu, SCTLR_EL1) & 0b101) == 0b101;
 }
 
-static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
+static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu,
+					       kvm_pfn_t pfn,
 					       unsigned long size,
 					       bool ipa_uncached)
 {
diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index 5a1a882e0a75..9c67f05a0a1b 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -101,9 +101,9 @@
 #define CAUSEF_DC			(_ULCAST_(1) << 27)
 
 extern atomic_t kvm_mips_instance;
-extern pfn_t(*kvm_mips_gfn_to_pfn) (struct kvm *kvm, gfn_t gfn);
-extern void (*kvm_mips_release_pfn_clean) (pfn_t pfn);
-extern bool(*kvm_mips_is_error_pfn) (pfn_t pfn);
+extern kvm_pfn_t (*kvm_mips_gfn_to_pfn)(struct kvm *kvm, gfn_t gfn);
+extern void (*kvm_mips_release_pfn_clean)(kvm_pfn_t pfn);
+extern bool (*kvm_mips_is_error_pfn)(kvm_pfn_t pfn);
 
 struct kvm_vm_stat {
 	u32 remote_tlb_flush;
diff --git a/arch/mips/kvm/emulate.c b/arch/mips/kvm/emulate.c
index d5fa3eaf39a1..476296cf37d3 100644
--- a/arch/mips/kvm/emulate.c
+++ b/arch/mips/kvm/emulate.c
@@ -1525,7 +1525,7 @@ int kvm_mips_sync_icache(unsigned long va, struct kvm_vcpu *vcpu)
 	struct kvm *kvm = vcpu->kvm;
 	unsigned long pa;
 	gfn_t gfn;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 
 	gfn = va >> PAGE_SHIFT;
 
diff --git a/arch/mips/kvm/tlb.c b/arch/mips/kvm/tlb.c
index aed0ac2a4972..570479c03bdc 100644
--- a/arch/mips/kvm/tlb.c
+++ b/arch/mips/kvm/tlb.c
@@ -38,13 +38,13 @@ atomic_t kvm_mips_instance;
 EXPORT_SYMBOL(kvm_mips_instance);
 
 /* These function pointers are initialized once the KVM module is loaded */
-pfn_t (*kvm_mips_gfn_to_pfn)(struct kvm *kvm, gfn_t gfn);
+kvm_pfn_t (*kvm_mips_gfn_to_pfn)(struct kvm *kvm, gfn_t gfn);
 EXPORT_SYMBOL(kvm_mips_gfn_to_pfn);
 
-void (*kvm_mips_release_pfn_clean)(pfn_t pfn);
+void (*kvm_mips_release_pfn_clean)(kvm_pfn_t pfn);
 EXPORT_SYMBOL(kvm_mips_release_pfn_clean);
 
-bool (*kvm_mips_is_error_pfn)(pfn_t pfn);
+bool (*kvm_mips_is_error_pfn)(kvm_pfn_t pfn);
 EXPORT_SYMBOL(kvm_mips_is_error_pfn);
 
 uint32_t kvm_mips_get_kernel_asid(struct kvm_vcpu *vcpu)
@@ -144,7 +144,7 @@ EXPORT_SYMBOL(kvm_mips_dump_guest_tlbs);
 static int kvm_mips_map_page(struct kvm *kvm, gfn_t gfn)
 {
 	int srcu_idx, err = 0;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 
 	if (kvm->arch.guest_pmap[gfn] != KVM_INVALID_PAGE)
 		return 0;
@@ -262,7 +262,7 @@ int kvm_mips_handle_kseg0_tlb_fault(unsigned long badvaddr,
 				    struct kvm_vcpu *vcpu)
 {
 	gfn_t gfn;
-	pfn_t pfn0, pfn1;
+	kvm_pfn_t pfn0, pfn1;
 	unsigned long vaddr = 0;
 	unsigned long entryhi = 0, entrylo0 = 0, entrylo1 = 0;
 	int even;
@@ -313,7 +313,7 @@ EXPORT_SYMBOL(kvm_mips_handle_kseg0_tlb_fault);
 int kvm_mips_handle_commpage_tlb_fault(unsigned long badvaddr,
 	struct kvm_vcpu *vcpu)
 {
-	pfn_t pfn0, pfn1;
+	kvm_pfn_t pfn0, pfn1;
 	unsigned long flags, old_entryhi = 0, vaddr = 0;
 	unsigned long entrylo0 = 0, entrylo1 = 0;
 
@@ -360,7 +360,7 @@ int kvm_mips_handle_mapped_seg_tlb_fault(struct kvm_vcpu *vcpu,
 {
 	unsigned long entryhi = 0, entrylo0 = 0, entrylo1 = 0;
 	struct kvm *kvm = vcpu->kvm;
-	pfn_t pfn0, pfn1;
+	kvm_pfn_t pfn0, pfn1;
 
 	if ((tlb->tlb_hi & VPN2_MASK) == 0) {
 		pfn0 = 0;
diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index 9fac01cb89c1..8f39796c9da8 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -154,8 +154,8 @@ extern void kvmppc_set_bat(struct kvm_vcpu *vcpu, struct kvmppc_bat *bat,
 			   bool upper, u32 val);
 extern void kvmppc_giveup_ext(struct kvm_vcpu *vcpu, ulong msr);
 extern int kvmppc_emulate_paired_single(struct kvm_run *run, struct kvm_vcpu *vcpu);
-extern pfn_t kvmppc_gpa_to_pfn(struct kvm_vcpu *vcpu, gpa_t gpa, bool writing,
-			bool *writable);
+extern kvm_pfn_t kvmppc_gpa_to_pfn(struct kvm_vcpu *vcpu, gpa_t gpa,
+			bool writing, bool *writable);
 extern void kvmppc_add_revmap_chain(struct kvm *kvm, struct revmap_entry *rev,
 			unsigned long *rmap, long pte_index, int realmode);
 extern void kvmppc_update_rmap_change(unsigned long *rmap, unsigned long psize);
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index c6ef05bd0765..2241d5357129 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -515,7 +515,7 @@ void kvmppc_claim_lpid(long lpid);
 void kvmppc_free_lpid(long lpid);
 void kvmppc_init_lpid(unsigned long nr_lpids);
 
-static inline void kvmppc_mmu_flush_icache(pfn_t pfn)
+static inline void kvmppc_mmu_flush_icache(kvm_pfn_t pfn)
 {
 	struct page *page;
 	/*
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 099c79d8c160..638c6d9be9e0 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -366,7 +366,7 @@ int kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvmppc_core_prepare_to_enter);
 
-pfn_t kvmppc_gpa_to_pfn(struct kvm_vcpu *vcpu, gpa_t gpa, bool writing,
+kvm_pfn_t kvmppc_gpa_to_pfn(struct kvm_vcpu *vcpu, gpa_t gpa, bool writing,
 			bool *writable)
 {
 	ulong mp_pa = vcpu->arch.magic_page_pa & KVM_PAM;
@@ -379,9 +379,9 @@ pfn_t kvmppc_gpa_to_pfn(struct kvm_vcpu *vcpu, gpa_t gpa, bool writing,
 	gpa &= ~0xFFFULL;
 	if (unlikely(mp_pa) && unlikely((gpa & KVM_PAM) == mp_pa)) {
 		ulong shared_page = ((ulong)vcpu->arch.shared) & PAGE_MASK;
-		pfn_t pfn;
+		kvm_pfn_t pfn;
 
-		pfn = (pfn_t)virt_to_phys((void*)shared_page) >> PAGE_SHIFT;
+		pfn = (kvm_pfn_t)virt_to_phys((void*)shared_page) >> PAGE_SHIFT;
 		get_page(pfn_to_page(pfn));
 		if (writable)
 			*writable = true;
diff --git a/arch/powerpc/kvm/book3s_32_mmu_host.c b/arch/powerpc/kvm/book3s_32_mmu_host.c
index d5c9bfeb0c9c..55c4d51ea3e2 100644
--- a/arch/powerpc/kvm/book3s_32_mmu_host.c
+++ b/arch/powerpc/kvm/book3s_32_mmu_host.c
@@ -142,7 +142,7 @@ extern char etext[];
 int kvmppc_mmu_map_page(struct kvm_vcpu *vcpu, struct kvmppc_pte *orig_pte,
 			bool iswrite)
 {
-	pfn_t hpaddr;
+	kvm_pfn_t hpaddr;
 	u64 vpn;
 	u64 vsid;
 	struct kvmppc_sid_map *map;
diff --git a/arch/powerpc/kvm/book3s_64_mmu_host.c b/arch/powerpc/kvm/book3s_64_mmu_host.c
index 79ad35abd196..913cd2198fa6 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_host.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_host.c
@@ -83,7 +83,7 @@ int kvmppc_mmu_map_page(struct kvm_vcpu *vcpu, struct kvmppc_pte *orig_pte,
 			bool iswrite)
 {
 	unsigned long vpn;
-	pfn_t hpaddr;
+	kvm_pfn_t hpaddr;
 	ulong hash, hpteg;
 	u64 vsid;
 	int ret;
diff --git a/arch/powerpc/kvm/e500.h b/arch/powerpc/kvm/e500.h
index 72920bed3ac6..94f04fcb373e 100644
--- a/arch/powerpc/kvm/e500.h
+++ b/arch/powerpc/kvm/e500.h
@@ -41,7 +41,7 @@ enum vcpu_ftr {
 #define E500_TLB_MAS2_ATTR	(0x7f)
 
 struct tlbe_ref {
-	pfn_t pfn;		/* valid only for TLB0, except briefly */
+	kvm_pfn_t pfn;		/* valid only for TLB0, except briefly */
 	unsigned int flags;	/* E500_TLB_* */
 };
 
diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
index 4d33e199edcc..8a5bb6dfcc2d 100644
--- a/arch/powerpc/kvm/e500_mmu_host.c
+++ b/arch/powerpc/kvm/e500_mmu_host.c
@@ -163,9 +163,9 @@ void kvmppc_map_magic(struct kvm_vcpu *vcpu)
 	struct kvm_book3e_206_tlb_entry magic;
 	ulong shared_page = ((ulong)vcpu->arch.shared) & PAGE_MASK;
 	unsigned int stid;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 
-	pfn = (pfn_t)virt_to_phys((void *)shared_page) >> PAGE_SHIFT;
+	pfn = (kvm_pfn_t)virt_to_phys((void *)shared_page) >> PAGE_SHIFT;
 	get_page(pfn_to_page(pfn));
 
 	preempt_disable();
@@ -246,7 +246,7 @@ static inline int tlbe_is_writable(struct kvm_book3e_206_tlb_entry *tlbe)
 
 static inline void kvmppc_e500_ref_setup(struct tlbe_ref *ref,
 					 struct kvm_book3e_206_tlb_entry *gtlbe,
-					 pfn_t pfn, unsigned int wimg)
+					 kvm_pfn_t pfn, unsigned int wimg)
 {
 	ref->pfn = pfn;
 	ref->flags = E500_TLB_VALID;
@@ -309,7 +309,7 @@ static void kvmppc_e500_setup_stlbe(
 	int tsize, struct tlbe_ref *ref, u64 gvaddr,
 	struct kvm_book3e_206_tlb_entry *stlbe)
 {
-	pfn_t pfn = ref->pfn;
+	kvm_pfn_t pfn = ref->pfn;
 	u32 pr = vcpu->arch.shared->msr & MSR_PR;
 
 	BUG_ON(!(ref->flags & E500_TLB_VALID));
diff --git a/arch/powerpc/kvm/trace_pr.h b/arch/powerpc/kvm/trace_pr.h
index 810507cb688a..d44f324184fb 100644
--- a/arch/powerpc/kvm/trace_pr.h
+++ b/arch/powerpc/kvm/trace_pr.h
@@ -30,7 +30,7 @@ TRACE_EVENT(kvm_book3s_reenter,
 #ifdef CONFIG_PPC_BOOK3S_64
 
 TRACE_EVENT(kvm_book3s_64_mmu_map,
-	TP_PROTO(int rflags, ulong hpteg, ulong va, pfn_t hpaddr,
+	TP_PROTO(int rflags, ulong hpteg, ulong va, kvm_pfn_t hpaddr,
 		 struct kvmppc_pte *orig_pte),
 	TP_ARGS(rflags, hpteg, va, hpaddr, orig_pte),
 
diff --git a/arch/x86/kvm/iommu.c b/arch/x86/kvm/iommu.c
index 5c520ebf6343..a22a488b4622 100644
--- a/arch/x86/kvm/iommu.c
+++ b/arch/x86/kvm/iommu.c
@@ -43,11 +43,11 @@ static int kvm_iommu_unmap_memslots(struct kvm *kvm);
 static void kvm_iommu_put_pages(struct kvm *kvm,
 				gfn_t base_gfn, unsigned long npages);
 
-static pfn_t kvm_pin_pages(struct kvm_memory_slot *slot, gfn_t gfn,
+static kvm_pfn_t kvm_pin_pages(struct kvm_memory_slot *slot, gfn_t gfn,
 			   unsigned long npages)
 {
 	gfn_t end_gfn;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 
 	pfn     = gfn_to_pfn_memslot(slot, gfn);
 	end_gfn = gfn + npages;
@@ -62,7 +62,8 @@ static pfn_t kvm_pin_pages(struct kvm_memory_slot *slot, gfn_t gfn,
 	return pfn;
 }
 
-static void kvm_unpin_pages(struct kvm *kvm, pfn_t pfn, unsigned long npages)
+static void kvm_unpin_pages(struct kvm *kvm, kvm_pfn_t pfn,
+		unsigned long npages)
 {
 	unsigned long i;
 
@@ -73,7 +74,7 @@ static void kvm_unpin_pages(struct kvm *kvm, pfn_t pfn, unsigned long npages)
 int kvm_iommu_map_pages(struct kvm *kvm, struct kvm_memory_slot *slot)
 {
 	gfn_t gfn, end_gfn;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 	int r = 0;
 	struct iommu_domain *domain = kvm->arch.iommu_domain;
 	int flags;
@@ -275,7 +276,7 @@ static void kvm_iommu_put_pages(struct kvm *kvm,
 {
 	struct iommu_domain *domain;
 	gfn_t end_gfn, gfn;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 	u64 phys;
 
 	domain  = kvm->arch.iommu_domain;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ff606f507913..6ab963ae0427 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -259,7 +259,7 @@ static unsigned get_mmio_spte_access(u64 spte)
 }
 
 static bool set_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, gfn_t gfn,
-			  pfn_t pfn, unsigned access)
+			  kvm_pfn_t pfn, unsigned access)
 {
 	if (unlikely(is_noslot_pfn(pfn))) {
 		mark_mmio_spte(vcpu, sptep, gfn, access);
@@ -325,7 +325,7 @@ static int is_last_spte(u64 pte, int level)
 	return 0;
 }
 
-static pfn_t spte_to_pfn(u64 pte)
+static kvm_pfn_t spte_to_pfn(u64 pte)
 {
 	return (pte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
 }
@@ -587,7 +587,7 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
  */
 static int mmu_spte_clear_track_bits(u64 *sptep)
 {
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 	u64 old_spte = *sptep;
 
 	if (!spte_has_volatile_bits(old_spte))
@@ -1369,7 +1369,7 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp,
 	int need_flush = 0;
 	u64 new_spte;
 	pte_t *ptep = (pte_t *)data;
-	pfn_t new_pfn;
+	kvm_pfn_t new_pfn;
 
 	WARN_ON(pte_huge(*ptep));
 	new_pfn = pte_pfn(*ptep);
@@ -2456,7 +2456,7 @@ static int mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
 	return 0;
 }
 
-static bool kvm_is_mmio_pfn(pfn_t pfn)
+static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
 {
 	if (pfn_valid(pfn))
 		return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
@@ -2466,7 +2466,7 @@ static bool kvm_is_mmio_pfn(pfn_t pfn)
 
 static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		    unsigned pte_access, int level,
-		    gfn_t gfn, pfn_t pfn, bool speculative,
+		    gfn_t gfn, kvm_pfn_t pfn, bool speculative,
 		    bool can_unsync, bool host_writable)
 {
 	u64 spte;
@@ -2546,7 +2546,7 @@ done:
 
 static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 			 unsigned pte_access, int write_fault, int *emulate,
-			 int level, gfn_t gfn, pfn_t pfn, bool speculative,
+			 int level, gfn_t gfn, kvm_pfn_t pfn, bool speculative,
 			 bool host_writable)
 {
 	int was_rmapped = 0;
@@ -2606,7 +2606,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 	kvm_release_pfn_clean(pfn);
 }
 
-static pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
+static kvm_pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
 				     bool no_dirty_log)
 {
 	struct kvm_memory_slot *slot;
@@ -2689,7 +2689,7 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
 }
 
 static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
-			int map_writable, int level, gfn_t gfn, pfn_t pfn,
+			int map_writable, int level, gfn_t gfn, kvm_pfn_t pfn,
 			bool prefault)
 {
 	struct kvm_shadow_walk_iterator iterator;
@@ -2739,7 +2739,7 @@ static void kvm_send_hwpoison_signal(unsigned long address, struct task_struct *
 	send_sig_info(SIGBUS, &info, tsk);
 }
 
-static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, pfn_t pfn)
+static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
 {
 	/*
 	 * Do not cache the mmio info caused by writing the readonly gfn
@@ -2759,9 +2759,10 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, pfn_t pfn)
 }
 
 static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
-					gfn_t *gfnp, pfn_t *pfnp, int *levelp)
+					gfn_t *gfnp, kvm_pfn_t *pfnp,
+					int *levelp)
 {
-	pfn_t pfn = *pfnp;
+	kvm_pfn_t pfn = *pfnp;
 	gfn_t gfn = *gfnp;
 	int level = *levelp;
 
@@ -2800,7 +2801,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
 }
 
 static bool handle_abnormal_pfn(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
-				pfn_t pfn, unsigned access, int *ret_val)
+				kvm_pfn_t pfn, unsigned access, int *ret_val)
 {
 	bool ret = true;
 
@@ -2954,7 +2955,7 @@ exit:
 }
 
 static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
-			 gva_t gva, pfn_t *pfn, bool write, bool *writable);
+			 gva_t gva, kvm_pfn_t *pfn, bool write, bool *writable);
 static void make_mmu_pages_available(struct kvm_vcpu *vcpu);
 
 static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, u32 error_code,
@@ -2963,7 +2964,7 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, u32 error_code,
 	int r;
 	int level;
 	int force_pt_level;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 	unsigned long mmu_seq;
 	bool map_writable, write = error_code & PFERR_WRITE_MASK;
 
@@ -3435,7 +3436,7 @@ static bool can_do_async_pf(struct kvm_vcpu *vcpu)
 }
 
 static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
-			 gva_t gva, pfn_t *pfn, bool write, bool *writable)
+			 gva_t gva, kvm_pfn_t *pfn, bool write, bool *writable)
 {
 	struct kvm_memory_slot *slot;
 	bool async;
@@ -3473,7 +3474,7 @@ check_hugepage_cache_consistency(struct kvm_vcpu *vcpu, gfn_t gfn, int level)
 static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
 			  bool prefault)
 {
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 	int r;
 	int level;
 	int force_pt_level;
@@ -4627,7 +4628,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 	u64 *sptep;
 	struct rmap_iterator iter;
 	int need_tlb_flush = 0;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 	struct kvm_mmu_page *sp;
 
 restart:
diff --git a/arch/x86/kvm/mmu_audit.c b/arch/x86/kvm/mmu_audit.c
index 03d518e499a6..37a4d14115c0 100644
--- a/arch/x86/kvm/mmu_audit.c
+++ b/arch/x86/kvm/mmu_audit.c
@@ -97,7 +97,7 @@ static void audit_mappings(struct kvm_vcpu *vcpu, u64 *sptep, int level)
 {
 	struct kvm_mmu_page *sp;
 	gfn_t gfn;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 	hpa_t hpa;
 
 	sp = page_header(__pa(sptep));
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 736e6ab8784d..9dd02cb74724 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -456,7 +456,7 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 {
 	unsigned pte_access;
 	gfn_t gfn;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 
 	if (FNAME(prefetch_invalid_gpte)(vcpu, sp, spte, gpte))
 		return false;
@@ -551,7 +551,7 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw,
 static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 			 struct guest_walker *gw,
 			 int write_fault, int hlevel,
-			 pfn_t pfn, bool map_writable, bool prefault)
+			 kvm_pfn_t pfn, bool map_writable, bool prefault)
 {
 	struct kvm_mmu_page *sp = NULL;
 	struct kvm_shadow_walk_iterator it;
@@ -696,7 +696,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 	int user_fault = error_code & PFERR_USER_MASK;
 	struct guest_walker walker;
 	int r;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 	int level = PT_PAGE_TABLE_LEVEL;
 	int force_pt_level;
 	unsigned long mmu_seq;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 6a8bc64566ab..52aba0f1207b 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4046,7 +4046,7 @@ out:
 static int init_rmode_identity_map(struct kvm *kvm)
 {
 	int i, idx, r = 0;
-	pfn_t identity_map_pfn;
+	kvm_pfn_t identity_map_pfn;
 	u32 tmp;
 
 	if (!enable_ept)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9a9a19830321..60b08427ca6c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4935,7 +4935,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gva_t cr2,
 				  int emulation_type)
 {
 	gpa_t gpa = cr2;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 
 	if (emulation_type & EMULTYPE_NO_REEXECUTE)
 		return false;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1bef9e21e725..2420b43f3acc 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -65,7 +65,7 @@
  * error pfns indicate that the gfn is in slot but faild to
  * translate it to pfn on host.
  */
-static inline bool is_error_pfn(pfn_t pfn)
+static inline bool is_error_pfn(kvm_pfn_t pfn)
 {
 	return !!(pfn & KVM_PFN_ERR_MASK);
 }
@@ -75,13 +75,13 @@ static inline bool is_error_pfn(pfn_t pfn)
  * translated to pfn - it is not in slot or failed to
  * translate it to pfn.
  */
-static inline bool is_error_noslot_pfn(pfn_t pfn)
+static inline bool is_error_noslot_pfn(kvm_pfn_t pfn)
 {
 	return !!(pfn & KVM_PFN_ERR_NOSLOT_MASK);
 }
 
 /* noslot pfn indicates that the gfn is not in slot. */
-static inline bool is_noslot_pfn(pfn_t pfn)
+static inline bool is_noslot_pfn(kvm_pfn_t pfn)
 {
 	return pfn == KVM_PFN_NOSLOT;
 }
@@ -569,19 +569,20 @@ void kvm_release_page_clean(struct page *page);
 void kvm_release_page_dirty(struct page *page);
 void kvm_set_page_accessed(struct page *page);
 
-pfn_t gfn_to_pfn_atomic(struct kvm *kvm, gfn_t gfn);
-pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
-pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
+kvm_pfn_t gfn_to_pfn_atomic(struct kvm *kvm, gfn_t gfn);
+kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
+kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
 		      bool *writable);
-pfn_t gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn);
-pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn);
-pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn, bool atomic,
-			   bool *async, bool write_fault, bool *writable);
+kvm_pfn_t gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn);
+kvm_pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn);
+kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
+			       bool atomic, bool *async, bool write_fault,
+			       bool *writable);
 
-void kvm_release_pfn_clean(pfn_t pfn);
-void kvm_set_pfn_dirty(pfn_t pfn);
-void kvm_set_pfn_accessed(pfn_t pfn);
-void kvm_get_pfn(pfn_t pfn);
+void kvm_release_pfn_clean(kvm_pfn_t pfn);
+void kvm_set_pfn_dirty(kvm_pfn_t pfn);
+void kvm_set_pfn_accessed(kvm_pfn_t pfn);
+void kvm_get_pfn(kvm_pfn_t pfn);
 
 int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
 			int len);
@@ -607,8 +608,8 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
 
 struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu);
 struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn);
-pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn);
-pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn);
+kvm_pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn);
+kvm_pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn);
 struct page *kvm_vcpu_gfn_to_page(struct kvm_vcpu *vcpu, gfn_t gfn);
 unsigned long kvm_vcpu_gfn_to_hva(struct kvm_vcpu *vcpu, gfn_t gfn);
 unsigned long kvm_vcpu_gfn_to_hva_prot(struct kvm_vcpu *vcpu, gfn_t gfn, bool *writable);
@@ -789,7 +790,7 @@ void kvm_arch_sync_events(struct kvm *kvm);
 int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu);
 void kvm_vcpu_kick(struct kvm_vcpu *vcpu);
 
-bool kvm_is_reserved_pfn(pfn_t pfn);
+bool kvm_is_reserved_pfn(kvm_pfn_t pfn);
 
 struct kvm_irq_ack_notifier {
 	struct hlist_node link;
@@ -940,7 +941,7 @@ static inline gfn_t gpa_to_gfn(gpa_t gpa)
 	return (gfn_t)(gpa >> PAGE_SHIFT);
 }
 
-static inline hpa_t pfn_to_hpa(pfn_t pfn)
+static inline hpa_t pfn_to_hpa(kvm_pfn_t pfn)
 {
 	return (hpa_t)pfn << PAGE_SHIFT;
 }
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 1b47a185c2f0..8bf259dae9f6 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -53,7 +53,7 @@ typedef unsigned long  hva_t;
 typedef u64            hpa_t;
 typedef u64            hfn_t;
 
-typedef hfn_t pfn_t;
+typedef hfn_t kvm_pfn_t;
 
 struct gfn_to_hva_cache {
 	u64 generation;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8db1d9361993..02cd2eddd3ff 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -111,7 +111,7 @@ static void hardware_disable_all(void);
 
 static void kvm_io_bus_destroy(struct kvm_io_bus *bus);
 
-static void kvm_release_pfn_dirty(pfn_t pfn);
+static void kvm_release_pfn_dirty(kvm_pfn_t pfn);
 static void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot, gfn_t gfn);
 
 __visible bool kvm_rebooting;
@@ -119,7 +119,7 @@ EXPORT_SYMBOL_GPL(kvm_rebooting);
 
 static bool largepages_enabled = true;
 
-bool kvm_is_reserved_pfn(pfn_t pfn)
+bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
 {
 	if (pfn_valid(pfn))
 		return PageReserved(pfn_to_page(pfn));
@@ -1296,7 +1296,7 @@ static inline int check_user_page_hwpoison(unsigned long addr)
  * true indicates success, otherwise false is returned.
  */
 static bool hva_to_pfn_fast(unsigned long addr, bool atomic, bool *async,
-			    bool write_fault, bool *writable, pfn_t *pfn)
+			    bool write_fault, bool *writable, kvm_pfn_t *pfn)
 {
 	struct page *page[1];
 	int npages;
@@ -1329,7 +1329,7 @@ static bool hva_to_pfn_fast(unsigned long addr, bool atomic, bool *async,
  * 1 indicates success, -errno is returned if error is detected.
  */
 static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
-			   bool *writable, pfn_t *pfn)
+			   bool *writable, kvm_pfn_t *pfn)
 {
 	struct page *page[1];
 	int npages = 0;
@@ -1393,11 +1393,11 @@ static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
  * 2): @write_fault = false && @writable, @writable will tell the caller
  *     whether the mapping is writable.
  */
-static pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
+static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
 			bool write_fault, bool *writable)
 {
 	struct vm_area_struct *vma;
-	pfn_t pfn = 0;
+	kvm_pfn_t pfn = 0;
 	int npages;
 
 	/* we can do it either atomically or asynchronously, not both */
@@ -1438,8 +1438,9 @@ exit:
 	return pfn;
 }
 
-pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn, bool atomic,
-			   bool *async, bool write_fault, bool *writable)
+kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
+			       bool atomic, bool *async, bool write_fault,
+			       bool *writable)
 {
 	unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
 
@@ -1460,7 +1461,7 @@ pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn, bool atomic,
 }
 EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);
 
-pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
+kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
 		      bool *writable)
 {
 	return __gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn, false, NULL,
@@ -1468,37 +1469,37 @@ pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_prot);
 
-pfn_t gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
+kvm_pfn_t gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
 {
 	return __gfn_to_pfn_memslot(slot, gfn, false, NULL, true, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot);
 
-pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn)
+kvm_pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn)
 {
 	return __gfn_to_pfn_memslot(slot, gfn, true, NULL, true, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic);
 
-pfn_t gfn_to_pfn_atomic(struct kvm *kvm, gfn_t gfn)
+kvm_pfn_t gfn_to_pfn_atomic(struct kvm *kvm, gfn_t gfn)
 {
 	return gfn_to_pfn_memslot_atomic(gfn_to_memslot(kvm, gfn), gfn);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_atomic);
 
-pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn)
+kvm_pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn)
 {
 	return gfn_to_pfn_memslot_atomic(kvm_vcpu_gfn_to_memslot(vcpu, gfn), gfn);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_pfn_atomic);
 
-pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
+kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
 {
 	return gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn);
 
-pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn)
+kvm_pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn)
 {
 	return gfn_to_pfn_memslot(kvm_vcpu_gfn_to_memslot(vcpu, gfn), gfn);
 }
@@ -1521,7 +1522,7 @@ int gfn_to_page_many_atomic(struct kvm_memory_slot *slot, gfn_t gfn,
 }
 EXPORT_SYMBOL_GPL(gfn_to_page_many_atomic);
 
-static struct page *kvm_pfn_to_page(pfn_t pfn)
+static struct page *kvm_pfn_to_page(kvm_pfn_t pfn)
 {
 	if (is_error_noslot_pfn(pfn))
 		return KVM_ERR_PTR_BAD_PAGE;
@@ -1536,7 +1537,7 @@ static struct page *kvm_pfn_to_page(pfn_t pfn)
 
 struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
 {
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 
 	pfn = gfn_to_pfn(kvm, gfn);
 
@@ -1546,7 +1547,7 @@ EXPORT_SYMBOL_GPL(gfn_to_page);
 
 struct page *kvm_vcpu_gfn_to_page(struct kvm_vcpu *vcpu, gfn_t gfn)
 {
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 
 	pfn = kvm_vcpu_gfn_to_pfn(vcpu, gfn);
 
@@ -1562,7 +1563,7 @@ void kvm_release_page_clean(struct page *page)
 }
 EXPORT_SYMBOL_GPL(kvm_release_page_clean);
 
-void kvm_release_pfn_clean(pfn_t pfn)
+void kvm_release_pfn_clean(kvm_pfn_t pfn)
 {
 	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
 		put_page(pfn_to_page(pfn));
@@ -1577,13 +1578,13 @@ void kvm_release_page_dirty(struct page *page)
 }
 EXPORT_SYMBOL_GPL(kvm_release_page_dirty);
 
-static void kvm_release_pfn_dirty(pfn_t pfn)
+static void kvm_release_pfn_dirty(kvm_pfn_t pfn)
 {
 	kvm_set_pfn_dirty(pfn);
 	kvm_release_pfn_clean(pfn);
 }
 
-void kvm_set_pfn_dirty(pfn_t pfn)
+void kvm_set_pfn_dirty(kvm_pfn_t pfn)
 {
 	if (!kvm_is_reserved_pfn(pfn)) {
 		struct page *page = pfn_to_page(pfn);
@@ -1594,14 +1595,14 @@ void kvm_set_pfn_dirty(pfn_t pfn)
 }
 EXPORT_SYMBOL_GPL(kvm_set_pfn_dirty);
 
-void kvm_set_pfn_accessed(pfn_t pfn)
+void kvm_set_pfn_accessed(kvm_pfn_t pfn)
 {
 	if (!kvm_is_reserved_pfn(pfn))
 		mark_page_accessed(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_set_pfn_accessed);
 
-void kvm_get_pfn(pfn_t pfn)
+void kvm_get_pfn(kvm_pfn_t pfn)
 {
 	if (!kvm_is_reserved_pfn(pfn))
 		get_page(pfn_to_page(pfn));


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [PATCH v3 08/15] mm, dax, pmem: introduce pfn_t
  2015-11-02  4:29 ` Dan Williams
@ 2015-11-02  4:30   ` Dan Williams
  -1 siblings, 0 replies; 95+ messages in thread
From: Dan Williams @ 2015-11-02  4:30 UTC (permalink / raw)
  To: axboe
  Cc: Dave Hansen, jack, linux-nvdimm, david, linux-kernel,
	ross.zwisler, Andrew Morton, hch

For the purpose of communicating the optional presence of a 'struct
page' for the pfn returned from ->direct_access(), introduce a type that
encapsulates a page-frame-number plus flags.  These flags contain the
historical "page_link" encoding for a scatterlist entry, but can also
denote "device memory".  Where "device memory" is a set of pfns that are
not part of the kernel's linear mapping by default, but are accessed via
the same memory controller as ram.

The motivation for this new type is large capacity persistent memory
that needs struct page entries in the 'memmap' to support 3rd party DMA
(i.e. O_DIRECT I/O with a persistent memory source/target).  However, it
is also needed to maintain a list of mapped inodes that must be unmapped
at driver teardown or freeze_bdev() time.
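
As a rough illustration, not part of the patch, a consumer of the new
type might look like the sketch below; bdev_direct_access() and the
pfn_t helpers are the ones introduced in this series, while the wrapper
function itself is hypothetical:

	/*
	 * Hypothetical caller: ask for a direct mapping of 'sector' and
	 * report whether the returned pfn is backed by a struct page,
	 * i.e. whether page-based (3rd party DMA) access is possible.
	 */
	static bool example_dax_has_memmap(struct block_device *bdev,
			sector_t sector, long size)
	{
		void __pmem *addr;
		pfn_t pfn;
		long avail;

		avail = bdev_direct_access(bdev, sector, &addr, &pfn, size);
		if (avail < 0)
			return false;

		/* PFN_MAP set, or PFN_DEV clear: a memmap entry exists */
		return pfn_t_has_page(pfn);
	}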

Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Hansen <dave@sr71.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/powerpc/sysdev/axonram.c |    8 ++---
 drivers/block/brd.c           |    4 +--
 drivers/nvdimm/pmem.c         |    6 +++-
 drivers/s390/block/dcssblk.c  |   10 +++---
 fs/block_dev.c                |    2 +
 fs/dax.c                      |   19 ++++++------
 include/linux/blkdev.h        |    4 +--
 include/linux/mm.h            |   65 +++++++++++++++++++++++++++++++++++++++++
 include/linux/pfn.h           |    9 ++++++
 9 files changed, 101 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index d2b79bc336c1..59ca4c0ab529 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -141,15 +141,13 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
  */
 static long
 axon_ram_direct_access(struct block_device *device, sector_t sector,
-		       void __pmem **kaddr, unsigned long *pfn)
+		       void __pmem **kaddr, pfn_t *pfn)
 {
 	struct axon_ram_bank *bank = device->bd_disk->private_data;
 	loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;
-	void *addr = (void *)(bank->ph_addr + offset);
-
-	*kaddr = (void __pmem *)addr;
-	*pfn = virt_to_phys(addr) >> PAGE_SHIFT;
 
+	*kaddr = (void __pmem __force *) bank->io_addr + offset;
+	*pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV);
 	return bank->size - offset;
 }
 
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index b9794aeeb878..0bbc60463779 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -374,7 +374,7 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
 
 #ifdef CONFIG_BLK_DEV_RAM_DAX
 static long brd_direct_access(struct block_device *bdev, sector_t sector,
-			void __pmem **kaddr, unsigned long *pfn)
+			void __pmem **kaddr, pfn_t *pfn)
 {
 	struct brd_device *brd = bdev->bd_disk->private_data;
 	struct page *page;
@@ -385,7 +385,7 @@ static long brd_direct_access(struct block_device *bdev, sector_t sector,
 	if (!page)
 		return -ENOSPC;
 	*kaddr = (void __pmem *)page_address(page);
-	*pfn = page_to_pfn(page);
+	*pfn = page_to_pfn_t(page);
 
 	return PAGE_SIZE;
 }
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 93472953e231..09093372e5f0 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -39,6 +39,7 @@ struct pmem_device {
 	phys_addr_t		phys_addr;
 	/* when non-zero this device is hosting a 'pfn' instance */
 	phys_addr_t		data_offset;
+	unsigned long		pfn_flags;
 	void __pmem		*virt_addr;
 	size_t			size;
 };
@@ -106,7 +107,7 @@ static long pmem_direct_access(struct block_device *bdev, sector_t sector,
 	resource_size_t offset = sector * 512 + pmem->data_offset;
 
 	*kaddr = pmem->virt_addr + offset;
-	*pfn = __phys_to_pfn(pmem->phys_addr + offset, pmem->pfn_flags);
+	*pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);
 
 	return pmem->size - offset;
 }
@@ -144,8 +145,10 @@ static struct pmem_device *pmem_alloc(struct device *dev,
 	if (!q)
 		return ERR_PTR(-ENOMEM);
 
+	pmem->pfn_flags = PFN_DEV;
 	if (pmem_should_map_pages(dev)) {
 		pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, res);
+		pmem->pfn_flags |= PFN_MAP;
 	} else
 		pmem->virt_addr = (void __pmem *) devm_memremap(dev,
 				pmem->phys_addr, pmem->size,
@@ -356,6 +359,7 @@ static int nvdimm_namespace_attach_pfn(struct nd_namespace_common *ndns)
 	pmem = dev_get_drvdata(dev);
 	devm_memunmap(dev, (void __force *) pmem->virt_addr);
 	pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, &nsio->res);
+	pmem->pfn_flags |= PFN_MAP;
 	if (IS_ERR(pmem->virt_addr)) {
 		rc = PTR_ERR(pmem->virt_addr);
 		goto err;
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 5ed44fe21380..e2b2839e4de5 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -29,7 +29,7 @@ static int dcssblk_open(struct block_device *bdev, fmode_t mode);
 static void dcssblk_release(struct gendisk *disk, fmode_t mode);
 static void dcssblk_make_request(struct request_queue *q, struct bio *bio);
 static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
-			 void __pmem **kaddr, unsigned long *pfn);
+			 void __pmem **kaddr, pfn_t *pfn);
 
 static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";
 
@@ -881,20 +881,18 @@ fail:
 
 static long
 dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
-			void __pmem **kaddr, unsigned long *pfn)
+			void __pmem **kaddr, pfn_t *pfn)
 {
 	struct dcssblk_dev_info *dev_info;
 	unsigned long offset, dev_sz;
-	void *addr;
 
 	dev_info = bdev->bd_disk->private_data;
 	if (!dev_info)
 		return -ENODEV;
 	dev_sz = dev_info->end - dev_info->start;
 	offset = secnum * 512;
-	addr = (void *) (dev_info->start + offset);
-	*pfn = virt_to_phys(addr) >> PAGE_SHIFT;
-	*kaddr = (void __pmem *) addr;
+	*kaddr = (void __pmem *) (dev_info->start + offset);
+	*pfn = phys_to_pfn_t(dev_info->start + offset, PFN_DEV);
 
 	return dev_sz - offset;
 }
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 0a793c7930eb..84b042778812 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -442,7 +442,7 @@ EXPORT_SYMBOL_GPL(bdev_write_page);
  * accessible at this address.
  */
 long bdev_direct_access(struct block_device *bdev, sector_t sector,
-			void __pmem **addr, unsigned long *pfn, long size)
+			void __pmem **addr, pfn_t *pfn, long size)
 {
 	long avail;
 	const struct block_device_operations *ops = bdev->bd_disk->fops;
diff --git a/fs/dax.c b/fs/dax.c
index a480729c00ec..4d6861f022d9 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -31,7 +31,7 @@
 #include <linux/sizes.h>
 
 static void __pmem *__dax_map_atomic(struct block_device *bdev, sector_t sector,
-		long size, unsigned long *pfn, long *len)
+		long size, pfn_t *pfn, long *len)
 {
 	long rc;
 	void __pmem *addr;
@@ -52,7 +52,7 @@ static void __pmem *__dax_map_atomic(struct block_device *bdev, sector_t sector,
 static void __pmem *dax_map_atomic(struct block_device *bdev, sector_t sector,
 		long size)
 {
-	unsigned long pfn;
+	pfn_t pfn;
 
 	return __dax_map_atomic(bdev, sector, size, &pfn, NULL);
 }
@@ -72,8 +72,8 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 	might_sleep();
 	do {
 		void __pmem *addr;
-		unsigned long pfn;
 		long count, sz;
+		pfn_t pfn;
 
 		sz = min_t(long, size, SZ_1M);
 		addr = __dax_map_atomic(bdev, sector, size, &pfn, &count);
@@ -141,7 +141,7 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 	struct block_device *bdev = NULL;
 	int rw = iov_iter_rw(iter), rc;
 	long map_len = 0;
-	unsigned long pfn;
+	pfn_t pfn;
 	void __pmem *addr = NULL;
 	void __pmem *kmap = (void __pmem *) ERR_PTR(-EIO);
 	bool hole = false;
@@ -333,9 +333,9 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	unsigned long vaddr = (unsigned long)vmf->virtual_address;
 	struct address_space *mapping = inode->i_mapping;
 	struct block_device *bdev = bh->b_bdev;
-	unsigned long pfn;
 	void __pmem *addr;
 	pgoff_t size;
+	pfn_t pfn;
 	int error;
 
 	i_mmap_lock_read(mapping);
@@ -366,7 +366,7 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	}
 	dax_unmap_atomic(bdev, addr);
 
-	error = vm_insert_mixed(vma, vaddr, pfn);
+	error = vm_insert_mixed(vma, vaddr, pfn_t_to_pfn(pfn));
 
  out:
 	i_mmap_unlock_read(mapping);
@@ -655,8 +655,8 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		result = VM_FAULT_NOPAGE;
 		spin_unlock(ptl);
 	} else {
+		pfn_t pfn;
 		long length;
-		unsigned long pfn;
 		void __pmem *kaddr = __dax_map_atomic(bdev,
 				to_sector(&bh, inode), HPAGE_SIZE, &pfn,
 				&length);
@@ -665,7 +665,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 			result = VM_FAULT_SIGBUS;
 			goto out;
 		}
-		if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR)) {
+		if ((length < PMD_SIZE) || (pfn_t_to_pfn(pfn) & PG_PMD_COLOUR)) {
 			dax_unmap_atomic(bdev, kaddr);
 			goto fallback;
 		}
@@ -679,7 +679,8 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		}
 		dax_unmap_atomic(bdev, kaddr);
 
-		result |= vmf_insert_pfn_pmd(vma, address, pmd, pfn, write);
+		result |= vmf_insert_pfn_pmd(vma, address, pmd,
+				pfn_t_to_pfn(pfn), write);
 	}
 
  out:
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 59a770dad804..b78e01542e9e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1620,7 +1620,7 @@ struct block_device_operations {
 	int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
 	int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
 	long (*direct_access)(struct block_device *, sector_t, void __pmem **,
-			unsigned long *pfn);
+			pfn_t *);
 	unsigned int (*check_events) (struct gendisk *disk,
 				      unsigned int clearing);
 	/* ->media_changed() is DEPRECATED, use ->check_events() instead */
@@ -1639,7 +1639,7 @@ extern int bdev_read_page(struct block_device *, sector_t, struct page *);
 extern int bdev_write_page(struct block_device *, sector_t, struct page *,
 						struct writeback_control *);
 extern long bdev_direct_access(struct block_device *, sector_t,
-		void __pmem **addr, unsigned long *pfn, long size);
+		void __pmem **addr, pfn_t *pfn, long size);
 #else /* CONFIG_BLOCK */
 
 struct block_device;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 80001de019ba..b8a90c481ae4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -927,6 +927,71 @@ static inline void set_page_memcg(struct page *page, struct mem_cgroup *memcg)
 #endif
 
 /*
+ * PFN_FLAGS_MASK - mask of all the possible valid pfn_t flags
+ * PFN_SG_CHAIN - pfn is a pointer to the next scatterlist entry
+ * PFN_SG_LAST - pfn references a page and is the last scatterlist entry
+ * PFN_DEV - pfn is not covered by system memmap by default
+ * PFN_MAP - pfn has a dynamic page mapping established by a device driver
+ */
+#define PFN_FLAGS_MASK (~PAGE_MASK << (BITS_PER_LONG - PAGE_SHIFT))
+#define PFN_SG_CHAIN (1UL << (BITS_PER_LONG - 1))
+#define PFN_SG_LAST (1UL << (BITS_PER_LONG - 2))
+#define PFN_DEV (1UL << (BITS_PER_LONG - 3))
+#define PFN_MAP (1UL << (BITS_PER_LONG - 4))
+
+static inline pfn_t __pfn_to_pfn_t(unsigned long pfn, unsigned long flags)
+{
+	pfn_t pfn_t = { .val = pfn | (flags & PFN_FLAGS_MASK), };
+
+	return pfn_t;
+}
+
+/* a default pfn to pfn_t conversion assumes that @pfn is pfn_valid() */
+static inline pfn_t pfn_to_pfn_t(unsigned long pfn)
+{
+	return __pfn_to_pfn_t(pfn, 0);
+}
+
+static inline pfn_t phys_to_pfn_t(dma_addr_t addr, unsigned long flags)
+{
+	return __pfn_to_pfn_t(addr >> PAGE_SHIFT, flags);
+}
+
+static inline bool pfn_t_has_page(pfn_t pfn)
+{
+	return (pfn.val & PFN_MAP) == PFN_MAP || (pfn.val & PFN_DEV) == 0;
+}
+
+static inline unsigned long pfn_t_to_pfn(pfn_t pfn)
+{
+	return pfn.val & ~PFN_FLAGS_MASK;
+}
+
+static inline struct page *pfn_t_to_page(pfn_t pfn)
+{
+	if (pfn_t_has_page(pfn))
+		return pfn_to_page(pfn_t_to_pfn(pfn));
+	return NULL;
+}
+
+static inline dma_addr_t pfn_t_to_phys(pfn_t pfn)
+{
+	return PFN_PHYS(pfn_t_to_pfn(pfn));
+}
+
+static inline void *pfn_t_to_virt(pfn_t pfn)
+{
+	if (pfn_t_has_page(pfn))
+		return __va(pfn_t_to_phys(pfn));
+	return NULL;
+}
+
+static inline pfn_t page_to_pfn_t(struct page *page)
+{
+	return pfn_to_pfn_t(page_to_pfn(page));
+}
+
+/*
  * Some inline functions in vmstat.h depend on page_zone()
  */
 #include <linux/vmstat.h>
diff --git a/include/linux/pfn.h b/include/linux/pfn.h
index 7646637221f3..96df85985f16 100644
--- a/include/linux/pfn.h
+++ b/include/linux/pfn.h
@@ -3,6 +3,15 @@
 
 #ifndef __ASSEMBLY__
 #include <linux/types.h>
+
+/*
+ * pfn_t: encapsulates a page-frame number that is optionally backed
+ * by memmap (struct page).  Whether a pfn_t has a 'struct page'
+ * backing is indicated by flags in the high bits of the value.
+ */
+typedef struct {
+	unsigned long val;
+} pfn_t;
 #endif
 
 #define PFN_ALIGN(x)	(((unsigned long)(x) + (PAGE_SIZE - 1)) & PAGE_MASK)


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [PATCH v3 09/15] block: notify queue death confirmation
  2015-11-02  4:29 ` Dan Williams
@ 2015-11-02  4:30   ` Dan Williams
  -1 siblings, 0 replies; 95+ messages in thread
From: Dan Williams @ 2015-11-02  4:30 UTC (permalink / raw)
  To: axboe
  Cc: Jens Axboe, jack, linux-nvdimm, david, linux-kernel, ross.zwisler, hch

The pmem driver arranges for references to be taken against the queue
while pages it allocated via devm_memremap_pages() are in use.  At
shutdown time, before those pages can be deallocated, they need to be
unmapped and guaranteed to be idle.  The unmap scan can only be done
once we are certain no new page references will be taken.  Once the blk
queue's percpu_ref is confirmed dead, the dax core ceases allowing new
references and these "device" pages can be freed.
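
As a sketch of the contract this adds, assuming only the helpers that
appear in the diff below (the ordering notes are an interpretation, not
new API):

	/* freeze side: kill the ref and request a confirmation callback */
	percpu_ref_kill_and_confirm(&q->q_usage_counter,
			blk_confirm_queue_death);

	/*
	 * blk_confirm_queue_death() runs once the ref is confirmed dead,
	 * sets q->q_usage_dead and wakes q_freeze_wq, so a driver can
	 * block until no new blk_queue_enter() reference can be taken:
	 */
	blk_wait_queue_dead(q);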

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 block/blk-core.c       |   12 +++++++++---
 block/blk-mq.c         |   19 +++++++++++++++----
 include/linux/blkdev.h |    4 +++-
 3 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 6ebe33ed5154..5159946a2b41 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -516,6 +516,12 @@ void blk_set_queue_dying(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_set_queue_dying);
 
+void blk_wait_queue_dead(struct request_queue *q)
+{
+	wait_event(q->q_freeze_wq, q->q_usage_dead);
+}
+EXPORT_SYMBOL(blk_wait_queue_dead);
+
 /**
  * blk_cleanup_queue - shutdown a request queue
  * @q: request queue to shutdown
@@ -641,7 +647,7 @@ int blk_queue_enter(struct request_queue *q, gfp_t gfp)
 		if (!(gfp & __GFP_WAIT))
 			return -EBUSY;
 
-		ret = wait_event_interruptible(q->mq_freeze_wq,
+		ret = wait_event_interruptible(q->q_freeze_wq,
 				!atomic_read(&q->mq_freeze_depth) ||
 				blk_queue_dying(q));
 		if (blk_queue_dying(q))
@@ -661,7 +667,7 @@ static void blk_queue_usage_counter_release(struct percpu_ref *ref)
 	struct request_queue *q =
 		container_of(ref, struct request_queue, q_usage_counter);
 
-	wake_up_all(&q->mq_freeze_wq);
+	wake_up_all(&q->q_freeze_wq);
 }
 
 struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
@@ -723,7 +729,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	q->bypass_depth = 1;
 	__set_bit(QUEUE_FLAG_BYPASS, &q->queue_flags);
 
-	init_waitqueue_head(&q->mq_freeze_wq);
+	init_waitqueue_head(&q->q_freeze_wq);
 
 	/*
 	 * Init percpu_ref in atomic mode so that it's faster to shutdown.
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6c240712553a..e0417febbcd4 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -78,13 +78,23 @@ static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
 	clear_bit(CTX_TO_BIT(hctx, ctx), &bm->word);
 }
 
+static void blk_confirm_queue_death(struct percpu_ref *ref)
+{
+	struct request_queue *q = container_of(ref, typeof(*q),
+			q_usage_counter);
+
+	q->q_usage_dead = 1;
+	wake_up_all(&q->q_freeze_wq);
+}
+
 void blk_mq_freeze_queue_start(struct request_queue *q)
 {
 	int freeze_depth;
 
 	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
 	if (freeze_depth == 1) {
-		percpu_ref_kill(&q->q_usage_counter);
+		percpu_ref_kill_and_confirm(&q->q_usage_counter,
+				blk_confirm_queue_death);
 		blk_mq_run_hw_queues(q, false);
 	}
 }
@@ -92,7 +102,7 @@ EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_start);
 
 static void blk_mq_freeze_queue_wait(struct request_queue *q)
 {
-	wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
+	wait_event(q->q_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
 }
 
 /*
@@ -130,7 +140,8 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
 	WARN_ON_ONCE(freeze_depth < 0);
 	if (!freeze_depth) {
 		percpu_ref_reinit(&q->q_usage_counter);
-		wake_up_all(&q->mq_freeze_wq);
+		q->q_usage_dead = 0;
+		wake_up_all(&q->q_freeze_wq);
 	}
 }
 EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
@@ -149,7 +160,7 @@ void blk_mq_wake_waiters(struct request_queue *q)
 	 * dying, we need to ensure that processes currently waiting on
 	 * the queue are notified as well.
 	 */
-	wake_up_all(&q->mq_freeze_wq);
+	wake_up_all(&q->q_freeze_wq);
 }
 
 bool blk_mq_can_queue(struct blk_mq_hw_ctx *hctx)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index b78e01542e9e..e121e5e0c6ac 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -431,6 +431,7 @@ struct request_queue {
 	 */
 	unsigned int		flush_flags;
 	unsigned int		flush_not_queueable:1;
+	unsigned int		q_usage_dead:1;
 	struct blk_flush_queue	*fq;
 
 	struct list_head	requeue_list;
@@ -453,7 +454,7 @@ struct request_queue {
 	struct throtl_data *td;
 #endif
 	struct rcu_head		rcu_head;
-	wait_queue_head_t	mq_freeze_wq;
+	wait_queue_head_t	q_freeze_wq;
 	struct percpu_ref	q_usage_counter;
 	struct list_head	all_q_node;
 
@@ -953,6 +954,7 @@ extern struct request_queue *blk_init_queue_node(request_fn_proc *rfn,
 extern struct request_queue *blk_init_queue(request_fn_proc *, spinlock_t *);
 extern struct request_queue *blk_init_allocated_queue(struct request_queue *,
 						      request_fn_proc *, spinlock_t *);
+extern void blk_wait_queue_dead(struct request_queue *q);
 extern void blk_cleanup_queue(struct request_queue *);
 extern void blk_queue_make_request(struct request_queue *, make_request_fn *);
 extern void blk_queue_bounce_limit(struct request_queue *, u64);


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [PATCH v3 10/15] dax, pmem: introduce zone_device_revoke() and devm_memunmap_pages()
  2015-11-02  4:29 ` Dan Williams
@ 2015-11-02  4:30   ` Dan Williams
  -1 siblings, 0 replies; 95+ messages in thread
From: Dan Williams @ 2015-11-02  4:30 UTC (permalink / raw)
  To: axboe
  Cc: Dave Hansen, jack, linux-nvdimm, david, linux-kernel, Jan Kara,
	Matthew Wilcox, Andrew Morton, ross.zwisler, hch

Before we allow ZONE_DEVICE pages to be put into active use outside of
the pmem driver, we need a mechanism to revoke access and assert they
are idle when the driver is shut down.  devm_memunmap_pages() checks that
the reference count passed in at devm_memremap_pages() time is dead, and
then uses zone_device_revoke() to unmap any active inode mappings.

The pmem driver uses the q_usage_counter percpu_ref from its
request_queue as the reference count for devm_memremap_pages().
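
Condensed from pmem_detach_disk() in the diff below (the
pmem_should_map_pages() check is elided for brevity), the resulting
teardown ordering is roughly:

	/* stop the disk, then tear down the queue asynchronously; it
	 * cannot complete while dax mappings still hold q_usage_counter
	 * references */
	del_gendisk(pmem->pmem_disk);
	put_disk(pmem->pmem_disk);
	async_schedule_domain(async_blk_cleanup_queue, pmem, &async_pmem);

	/* wait until no new page references can be taken ... */
	blk_wait_queue_dead(q);

	/* ... revoke mappings and free the ZONE_DEVICE pages ... */
	devm_memunmap_pages(dev, (void __force *) pmem->virt_addr);

	/* ... which allows blk_cleanup_queue() to drain and finish */
	async_synchronize_full_domain(&async_pmem);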

Cc: Jan Kara <jack@suse.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/nvdimm/pmem.c |   50 +++++++++++++++++++++----
 fs/dax.c              |   20 ++++++++++
 include/linux/io.h    |   17 ---------
 include/linux/mm.h    |   25 +++++++++++++
 kernel/memremap.c     |   98 ++++++++++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 182 insertions(+), 28 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 09093372e5f0..aa2f1292120a 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -24,12 +24,15 @@
 #include <linux/memory_hotplug.h>
 #include <linux/moduleparam.h>
 #include <linux/vmalloc.h>
+#include <linux/async.h>
 #include <linux/slab.h>
 #include <linux/pmem.h>
 #include <linux/nd.h>
 #include "pfn.h"
 #include "nd.h"
 
+static ASYNC_DOMAIN_EXCLUSIVE(async_pmem);
+
 struct pmem_device {
 	struct request_queue	*pmem_queue;
 	struct gendisk		*pmem_disk;
@@ -147,7 +150,8 @@ static struct pmem_device *pmem_alloc(struct device *dev,
 
 	pmem->pfn_flags = PFN_DEV;
 	if (pmem_should_map_pages(dev)) {
-		pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, res);
+		pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, res,
+				&q->q_usage_counter);
 		pmem->pfn_flags |= PFN_MAP;
 	} else
 		pmem->virt_addr = (void __pmem *) devm_memremap(dev,
@@ -163,14 +167,43 @@ static struct pmem_device *pmem_alloc(struct device *dev,
 	return pmem;
 }
 
-static void pmem_detach_disk(struct pmem_device *pmem)
+
+static void async_blk_cleanup_queue(void *data, async_cookie_t cookie)
 {
+	struct pmem_device *pmem = data;
+
+	blk_cleanup_queue(pmem->pmem_queue);
+}
+
+static void pmem_detach_disk(struct device *dev)
+{
+	struct pmem_device *pmem = dev_get_drvdata(dev);
+	struct request_queue *q = pmem->pmem_queue;
+
 	if (!pmem->pmem_disk)
 		return;
 
 	del_gendisk(pmem->pmem_disk);
 	put_disk(pmem->pmem_disk);
-	blk_cleanup_queue(pmem->pmem_queue);
+	async_schedule_domain(async_blk_cleanup_queue, pmem, &async_pmem);
+
+	if (pmem_should_map_pages(dev)) {
+		/*
+		 * Wait for queue to go dead so that we know no new
+		 * references will be taken against the pages allocated
+		 * by devm_memremap_pages().
+		 */
+		blk_wait_queue_dead(q);
+
+		/*
+		 * Manually release the page mapping so that
+		 * blk_cleanup_queue() can complete queue draining.
+		 */
+		devm_memunmap_pages(dev, (void __force *) pmem->virt_addr);
+	}
+
+	/* Wait for blk_cleanup_queue() to finish */
+	async_synchronize_full_domain(&async_pmem);
 }
 
 static int pmem_attach_disk(struct device *dev,
@@ -299,11 +332,9 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
 static int nvdimm_namespace_detach_pfn(struct nd_namespace_common *ndns)
 {
 	struct nd_pfn *nd_pfn = to_nd_pfn(ndns->claim);
-	struct pmem_device *pmem;
 
 	/* free pmem disk */
-	pmem = dev_get_drvdata(&nd_pfn->dev);
-	pmem_detach_disk(pmem);
+	pmem_detach_disk(&nd_pfn->dev);
 
 	/* release nd_pfn resources */
 	kfree(nd_pfn->pfn_sb);
@@ -321,6 +352,7 @@ static int nvdimm_namespace_attach_pfn(struct nd_namespace_common *ndns)
 	struct nd_region *nd_region;
 	struct nd_pfn_sb *pfn_sb;
 	struct pmem_device *pmem;
+	struct request_queue *q;
 	phys_addr_t offset;
 	int rc;
 
@@ -357,8 +389,10 @@ static int nvdimm_namespace_attach_pfn(struct nd_namespace_common *ndns)
 
 	/* establish pfn range for lookup, and switch to direct map */
 	pmem = dev_get_drvdata(dev);
+	q = pmem->pmem_queue;
 	devm_memunmap(dev, (void __force *) pmem->virt_addr);
-	pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, &nsio->res);
+	pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, &nsio->res,
+			&q->q_usage_counter);
 	pmem->pfn_flags |= PFN_MAP;
 	if (IS_ERR(pmem->virt_addr)) {
 		rc = PTR_ERR(pmem->virt_addr);
@@ -428,7 +462,7 @@ static int nd_pmem_remove(struct device *dev)
 	else if (is_nd_pfn(dev))
 		nvdimm_namespace_detach_pfn(pmem->ndns);
 	else
-		pmem_detach_disk(pmem);
+		pmem_detach_disk(dev);
 
 	return 0;
 }
diff --git a/fs/dax.c b/fs/dax.c
index 4d6861f022d9..ac8992e86779 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -327,6 +327,23 @@ static int copy_user_bh(struct page *to, struct inode *inode,
 	return 0;
 }
 
+/* must be called within a dax_map_atomic / dax_unmap_atomic section */
+static void dax_account_mapping(struct block_device *bdev, pfn_t pfn,
+		struct address_space *mapping)
+{
+	/*
+	 * If we are establishing a mapping for a page mapped pfn, take an
+	 * extra reference against the request_queue.  See zone_device_revoke
+	 * for the paired decrement.
+	 */
+	if (pfn_t_has_page(pfn)) {
+		struct page *page = pfn_t_to_page(pfn);
+
+		page->mapping = mapping;
+		percpu_ref_get(&bdev->bd_queue->q_usage_counter);
+	}
+}
+
 static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 			struct vm_area_struct *vma, struct vm_fault *vmf)
 {
@@ -364,6 +381,8 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 		clear_pmem(addr, PAGE_SIZE);
 		wmb_pmem();
 	}
+
+	dax_account_mapping(bdev, pfn, mapping);
 	dax_unmap_atomic(bdev, addr);
 
 	error = vm_insert_mixed(vma, vaddr, pfn_t_to_pfn(pfn));
@@ -677,6 +696,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
 			result |= VM_FAULT_MAJOR;
 		}
+		dax_account_mapping(bdev, pfn, mapping);
 		dax_unmap_atomic(bdev, kaddr);
 
 		result |= vmf_insert_pfn_pmd(vma, address, pmd,
diff --git a/include/linux/io.h b/include/linux/io.h
index de64c1e53612..2f2f8859abd9 100644
--- a/include/linux/io.h
+++ b/include/linux/io.h
@@ -87,23 +87,6 @@ void *devm_memremap(struct device *dev, resource_size_t offset,
 		size_t size, unsigned long flags);
 void devm_memunmap(struct device *dev, void *addr);
 
-void *__devm_memremap_pages(struct device *dev, struct resource *res);
-
-#ifdef CONFIG_ZONE_DEVICE
-void *devm_memremap_pages(struct device *dev, struct resource *res);
-#else
-static inline void *devm_memremap_pages(struct device *dev, struct resource *res)
-{
-	/*
-	 * Fail attempts to call devm_memremap_pages() without
-	 * ZONE_DEVICE support enabled, this requires callers to fall
-	 * back to plain devm_memremap() based on config
-	 */
-	WARN_ON_ONCE(1);
-	return ERR_PTR(-ENXIO);
-}
-#endif
-
 /*
  * Some systems do not have legacy ISA devices.
  * /dev/port is not a valid interface on these systems.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b8a90c481ae4..f6225140b5d7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -717,6 +717,31 @@ static inline enum zone_type page_zonenum(const struct page *page)
 	return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
 }
 
+struct percpu_ref;
+struct resource;
+struct device;
+#ifdef CONFIG_ZONE_DEVICE
+void devm_memunmap_pages(struct device *dev, void *addr);
+void *devm_memremap_pages(struct device *dev, struct resource *res,
+		struct percpu_ref *ref);
+#else
+static inline void devm_memunmap_pages(struct device *dev, void *addr)
+{
+}
+
+static inline void *devm_memremap_pages(struct device *dev,
+		struct resource *res, struct percpu_ref *ref)
+{
+	/*
+	 * Fail attempts to call devm_memremap_pages() without
+	 * ZONE_DEVICE support enabled, this requires callers to fall
+	 * back to plain devm_memremap() based on config
+	 */
+	WARN_ON_ONCE(1);
+	return ERR_PTR(-ENXIO);
+}
+#endif
+
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define SECTION_IN_PAGE_FLAGS
 #endif
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 3218e8b1fc28..a73e18d8a120 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -12,9 +12,11 @@
  */
 #include <linux/device.h>
 #include <linux/types.h>
+#include <linux/fs.h>
 #include <linux/io.h>
 #include <linux/mm.h>
 #include <linux/memory_hotplug.h>
+#include <linux/percpu-refcount.h>
 
 #ifndef ioremap_cache
 /* temporary while we convert existing ioremap_cache users to memremap */
@@ -140,17 +142,88 @@ EXPORT_SYMBOL(devm_memunmap);
 #ifdef CONFIG_ZONE_DEVICE
 struct page_map {
 	struct resource res;
+	struct percpu_ref *ref;
 };
 
-static void devm_memremap_pages_release(struct device *dev, void *res)
+static unsigned long pfn_first(struct page_map *page_map)
 {
-	struct page_map *page_map = res;
+	const struct resource *res = &page_map->res;
+
+	return res->start >> PAGE_SHIFT;
+}
+
+static unsigned long pfn_end(struct page_map *page_map)
+{
+	const struct resource *res = &page_map->res;
+
+	return (res->start + resource_size(res)) >> PAGE_SHIFT;
+}
+
+#define for_each_device_pfn(pfn, map) \
+	for (pfn = pfn_first(map); pfn < pfn_end(map); pfn++)
+
+static void zone_device_revoke(struct device *dev, struct page_map *page_map)
+{
+	unsigned long pfn;
+	int retry = 3;
+	struct percpu_ref *ref = page_map->ref;
+	struct address_space *mapping_prev;
+
+	if (percpu_ref_tryget_live(ref)) {
+		dev_WARN(dev, "%s: page mapping is still live!\n", __func__);
+		percpu_ref_put(ref);
+	}
+
+ retry:
+	mapping_prev = NULL;
+	for_each_device_pfn(pfn, page_map) {
+		struct page *page = pfn_to_page(pfn);
+		struct address_space *mapping = page->mapping;
+		struct inode *inode = mapping ? mapping->host : NULL;
+
+		dev_WARN_ONCE(dev, atomic_read(&page->_count) < 1,
+				"%s: ZONE_DEVICE page was freed!\n", __func__);
+
+		/* See dax_account_mapping */
+		if (mapping) {
+			percpu_ref_put(ref);
+			page->mapping = NULL;
+		}
+
+		if (!mapping || !inode || mapping == mapping_prev) {
+			dev_WARN_ONCE(dev, atomic_read(&page->_count) > 1,
+					"%s: unexpected elevated page count pfn: %lx\n",
+					__func__, pfn);
+			continue;
+		}
+
+		unmap_mapping_range(mapping, 0, 0, 1);
+		mapping_prev = mapping;
+	}
+
+	/*
+	 * Straggling mappings may have been established immediately
+	 * after the percpu_ref was killed.
+	 */
+	if (!percpu_ref_is_zero(ref) && retry--)
+		goto retry;
+
+	if (!percpu_ref_is_zero(ref))
+		dev_warn(dev, "%s: not all references released\n", __func__);
+}
+
+static void devm_memremap_pages_release(struct device *dev, void *data)
+{
+	struct page_map *page_map = data;
+
+	zone_device_revoke(dev, page_map);
 
 	/* pages are dead and unused, undo the arch mapping */
 	arch_remove_memory(page_map->res.start, resource_size(&page_map->res));
 }
 
-void *devm_memremap_pages(struct device *dev, struct resource *res)
+void *devm_memremap_pages(struct device *dev, struct resource *res,
+		struct percpu_ref *ref)
 {
 	int is_ram = region_intersects(res->start, resource_size(res),
 			"System RAM");
@@ -172,6 +245,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res)
 		return ERR_PTR(-ENOMEM);
 
 	memcpy(&page_map->res, res, sizeof(*res));
+	page_map->ref = ref;
 
 	nid = dev_to_node(dev);
 	if (nid < 0)
@@ -187,4 +261,22 @@ void *devm_memremap_pages(struct device *dev, struct resource *res)
 	return __va(res->start);
 }
 EXPORT_SYMBOL(devm_memremap_pages);
+
+static int page_map_match(struct device *dev, void *res, void *match_data)
+{
+	struct page_map *page_map = res;
+	resource_size_t phys = *(resource_size_t *) match_data;
+
+	return page_map->res.start == phys;
+}
+
+void devm_memunmap_pages(struct device *dev, void *addr)
+{
+	resource_size_t start = __pa(addr);
+
+	if (devres_release(dev, devm_memremap_pages_release, page_map_match,
+				&start) != 0)
+		dev_WARN(dev, "failed to find page map to release\n");
+}
+EXPORT_SYMBOL(devm_memunmap_pages);
 #endif /* CONFIG_ZONE_DEVICE */


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [PATCH v3 11/15] block: introduce bdev_file_inode()
  2015-11-02  4:29 ` Dan Williams
@ 2015-11-02  4:30   ` Dan Williams
  -1 siblings, 0 replies; 95+ messages in thread
From: Dan Williams @ 2015-11-02  4:30 UTC (permalink / raw)
  To: axboe
  Cc: jack, linux-nvdimm, david, linux-kernel, Jeff Moyer, Al Viro,
	ross.zwisler, hch

Similar to the file_inode() helper, provide a helper to look up the inode for
a raw block device itself.
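
As a rough illustration (not part of the patch, and the function name below is
made up), the helper separates the inode of the /dev node from the inode of
the block device itself:

/* Illustrative sketch only; bdev_size_example() is invented and would
 * only compile inside fs/block_dev.c where bdev_file_inode() lives.
 */
static loff_t bdev_size_example(struct file *file)
{
	struct inode *devnode_inode = file_inode(file);  /* inode of the /dev node */
	struct inode *bd_inode = bdev_file_inode(file);  /* inode of the bdev itself */

	(void) devnode_inode;
	/* the block device size is tracked on the bdev inode */
	return i_size_read(bd_inode);
}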

Cc: Al Viro <viro@zeniv.linux.org.uk>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/block_dev.c |   19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 84b042778812..29cd1eb0765d 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -147,11 +147,16 @@ blkdev_get_block(struct inode *inode, sector_t iblock,
 	return 0;
 }
 
+static struct inode *bdev_file_inode(struct file *file)
+{
+	return file->f_mapping->host;
+}
+
 static ssize_t
 blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
 {
 	struct file *file = iocb->ki_filp;
-	struct inode *inode = file->f_mapping->host;
+	struct inode *inode = bdev_file_inode(file);
 
 	if (IS_DAX(inode))
 		return dax_do_io(iocb, inode, iter, offset, blkdev_get_block,
@@ -329,7 +334,7 @@ static int blkdev_write_end(struct file *file, struct address_space *mapping,
  */
 static loff_t block_llseek(struct file *file, loff_t offset, int whence)
 {
-	struct inode *bd_inode = file->f_mapping->host;
+	struct inode *bd_inode = bdev_file_inode(file);
 	loff_t retval;
 
 	mutex_lock(&bd_inode->i_mutex);
@@ -340,7 +345,7 @@ static loff_t block_llseek(struct file *file, loff_t offset, int whence)
 	
 int blkdev_fsync(struct file *filp, loff_t start, loff_t end, int datasync)
 {
-	struct inode *bd_inode = filp->f_mapping->host;
+	struct inode *bd_inode = bdev_file_inode(filp);
 	struct block_device *bdev = I_BDEV(bd_inode);
 	int error;
 	
@@ -1579,14 +1584,14 @@ EXPORT_SYMBOL(blkdev_put);
 
 static int blkdev_close(struct inode * inode, struct file * filp)
 {
-	struct block_device *bdev = I_BDEV(filp->f_mapping->host);
+	struct block_device *bdev = I_BDEV(bdev_file_inode(filp));
 	blkdev_put(bdev, filp->f_mode);
 	return 0;
 }
 
 static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 {
-	struct block_device *bdev = I_BDEV(file->f_mapping->host);
+	struct block_device *bdev = I_BDEV(bdev_file_inode(file));
 	fmode_t mode = file->f_mode;
 
 	/*
@@ -1611,7 +1616,7 @@ static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct file *file = iocb->ki_filp;
-	struct inode *bd_inode = file->f_mapping->host;
+	struct inode *bd_inode = bdev_file_inode(file);
 	loff_t size = i_size_read(bd_inode);
 	struct blk_plug plug;
 	ssize_t ret;
@@ -1643,7 +1648,7 @@ EXPORT_SYMBOL_GPL(blkdev_write_iter);
 ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
 	struct file *file = iocb->ki_filp;
-	struct inode *bd_inode = file->f_mapping->host;
+	struct inode *bd_inode = bdev_file_inode(file);
 	loff_t size = i_size_read(bd_inode);
 	loff_t pos = iocb->ki_pos;
 


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [PATCH v3 12/15] block: enable dax for raw block devices
  2015-11-02  4:29 ` Dan Williams
@ 2015-11-02  4:30   ` Dan Williams
  -1 siblings, 0 replies; 95+ messages in thread
From: Dan Williams @ 2015-11-02  4:30 UTC (permalink / raw)
  To: axboe
  Cc: jack, linux-nvdimm, david, linux-kernel, hch, Jeff Moyer,
	Jan Kara, ross.zwisler, kbuild test robot, Andrew Morton

If an application wants exclusive access to all of the persistent memory
provided by an NVDIMM namespace, it can use this raw-block-dax facility
to forgo establishing a filesystem.  This capability is targeted
primarily at hypervisors wanting to provision persistent memory for
guests.  It can be disabled / enabled dynamically via the new BLKDAXSET
ioctl.
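
A rough userspace sketch of driving the new ioctls follows; it is not part of
the patch, /dev/pmem0 is only an example device path, and BLKDAXSET /
BLKDAXGET exist only once the uapi change below is applied:

/* Hypothetical usage sketch; /dev/pmem0 is an example path and the
 * BLKDAXSET/BLKDAXGET definitions require this patch's uapi update.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(void)
{
	int fd = open("/dev/pmem0", O_RDWR);
	int enable = 1, state = -1;

	if (fd < 0)
		return 1;
	if (ioctl(fd, BLKDAXSET, &enable))	/* request DAX mmap semantics */
		perror("BLKDAXSET");
	if (ioctl(fd, BLKDAXGET, &state) == 0)	/* read back the current mode */
		printf("S_DAX: %d\n", state);
	close(fd);
	return 0;
}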

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Reviewed-by: Jan Kara <jack@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 block/ioctl.c           |   43 ++++++++++++++++++++++
 fs/block_dev.c          |   90 ++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/fs.h      |    3 ++
 include/uapi/linux/fs.h |    2 +
 4 files changed, 137 insertions(+), 1 deletion(-)

diff --git a/block/ioctl.c b/block/ioctl.c
index 8061eba42887..205d57612fbd 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -295,6 +295,35 @@ static inline int is_unrecognized_ioctl(int ret)
 		ret == -ENOIOCTLCMD;
 }
 
+#ifdef CONFIG_FS_DAX
+static int blkdev_set_dax(struct block_device *bdev, int n)
+{
+	struct gendisk *disk = bdev->bd_disk;
+	int rc = 0;
+
+	if (n)
+		n = S_DAX;
+
+	if (n && !disk->fops->direct_access)
+		return -ENOTTY;
+
+	mutex_lock(&bdev->bd_inode->i_mutex);
+	if (bdev->bd_map_count == 0)
+		inode_set_flags(bdev->bd_inode, n, S_DAX);
+	else
+		rc = -EBUSY;
+	mutex_unlock(&bdev->bd_inode->i_mutex);
+	return rc;
+}
+#else
+static int blkdev_set_dax(struct block_device *bdev, int n)
+{
+	if (n)
+		return -ENOTTY;
+	return 0;
+}
+#endif
+
 /*
  * always keep this in sync with compat_blkdev_ioctl()
  */
@@ -449,6 +478,20 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
 	case BLKTRACETEARDOWN:
 		ret = blk_trace_ioctl(bdev, cmd, (char __user *) arg);
 		break;
+	case BLKDAXSET:
+		if (!capable(CAP_SYS_ADMIN))
+			return -EACCES;
+
+		if (get_user(n, (int __user *)(arg)))
+			return -EFAULT;
+		n = !!n;
+		if (n == !!(bdev->bd_inode->i_flags & S_DAX))
+			return 0;
+
+		return blkdev_set_dax(bdev, n);
+	case BLKDAXGET:
+		return put_int(arg, !!(bdev->bd_inode->i_flags & S_DAX));
+		break;
 	default:
 		ret = __blkdev_driver_ioctl(bdev, mode, cmd, arg);
 	}
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 29cd1eb0765d..13ce6d0ff7f6 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1687,13 +1687,101 @@ static const struct address_space_operations def_blk_aops = {
 	.is_dirty_writeback = buffer_check_dirty_writeback,
 };
 
+#ifdef CONFIG_FS_DAX
+/*
+ * In the raw block case we do not need to contend with truncation nor
+ * unwritten file extents.  Without those concerns there is no need for
+ * additional locking beyond the mmap_sem context that these routines
+ * are already executing under.
+ *
+ * Note, there is no protection if the block device is dynamically
+ * resized (partition grow/shrink) during a fault. A stable block device
+ * size is already not enforced in the blkdev_direct_IO path.
+ *
+ * For DAX, it is the responsibility of the block device driver to
+ * ensure the whole-disk device size is stable while requests are in
+ * flight.
+ *
+ * Finally, unlike the filemap_page_mkwrite() case there is no
+ * filesystem superblock to sync against freezing.  We still include a
+ * pfn_mkwrite callback for dax drivers to receive write fault
+ * notifications.
+ */
+static int blkdev_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	return __dax_fault(vma, vmf, blkdev_get_block, NULL);
+}
+
+static int blkdev_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
+		pmd_t *pmd, unsigned int flags)
+{
+	return __dax_pmd_fault(vma, addr, pmd, flags, blkdev_get_block, NULL);
+}
+
+static void blkdev_vm_open(struct vm_area_struct *vma)
+{
+	struct inode *bd_inode = bdev_file_inode(vma->vm_file);
+	struct block_device *bdev = I_BDEV(bd_inode);
+
+	mutex_lock(&bd_inode->i_mutex);
+	bdev->bd_map_count++;
+	mutex_unlock(&bd_inode->i_mutex);
+}
+
+static void blkdev_vm_close(struct vm_area_struct *vma)
+{
+	struct inode *bd_inode = bdev_file_inode(vma->vm_file);
+	struct block_device *bdev = I_BDEV(bd_inode);
+
+	mutex_lock(&bd_inode->i_mutex);
+	bdev->bd_map_count--;
+	mutex_unlock(&bd_inode->i_mutex);
+}
+
+static const struct vm_operations_struct blkdev_dax_vm_ops = {
+	.open		= blkdev_vm_open,
+	.close		= blkdev_vm_close,
+	.fault		= blkdev_dax_fault,
+	.pmd_fault	= blkdev_dax_pmd_fault,
+	.pfn_mkwrite	= blkdev_dax_fault,
+};
+
+static const struct vm_operations_struct blkdev_default_vm_ops = {
+	.open		= blkdev_vm_open,
+	.close		= blkdev_vm_close,
+	.fault		= filemap_fault,
+	.map_pages	= filemap_map_pages,
+};
+
+static int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct inode *bd_inode = bdev_file_inode(file);
+	struct block_device *bdev = I_BDEV(bd_inode);
+
+	file_accessed(file);
+	mutex_lock(&bd_inode->i_mutex);
+	bdev->bd_map_count++;
+	if (IS_DAX(bd_inode)) {
+		vma->vm_ops = &blkdev_dax_vm_ops;
+		vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+	} else {
+		vma->vm_ops = &blkdev_default_vm_ops;
+	}
+	mutex_unlock(&bd_inode->i_mutex);
+
+	return 0;
+}
+#else
+#define blkdev_mmap generic_file_mmap
+#endif
+
 const struct file_operations def_blk_fops = {
 	.open		= blkdev_open,
 	.release	= blkdev_close,
 	.llseek		= block_llseek,
 	.read_iter	= blkdev_read_iter,
 	.write_iter	= blkdev_write_iter,
-	.mmap		= generic_file_mmap,
+	.mmap		= blkdev_mmap,
 	.fsync		= blkdev_fsync,
 	.unlocked_ioctl	= block_ioctl,
 #ifdef CONFIG_COMPAT
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 72d8a844c692..8fb2d4b848bf 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -482,6 +482,9 @@ struct block_device {
 	int			bd_fsfreeze_count;
 	/* Mutex for freeze */
 	struct mutex		bd_fsfreeze_mutex;
+#ifdef CONFIG_FS_DAX
+	int			bd_map_count;
+#endif
 };
 
 /*
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 9b964a5920af..cc2f0fdae707 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -152,6 +152,8 @@ struct inodes_stat_t {
 #define BLKSECDISCARD _IO(0x12,125)
 #define BLKROTATIONAL _IO(0x12,126)
 #define BLKZEROOUT _IO(0x12,127)
+#define BLKDAXSET _IO(0x12,128)
+#define BLKDAXGET _IO(0x12,129)
 
 #define BMAP_IOCTL 1		/* obsolete - kept for compatibility */
 #define FIBMAP	   _IO(0x00,1)	/* bmap access */


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [PATCH v3 13/15] block, dax: make dax mappings opt-in by default
  2015-11-02  4:29 ` Dan Williams
@ 2015-11-02  4:30   ` Dan Williams
  -1 siblings, 0 replies; 95+ messages in thread
From: Dan Williams @ 2015-11-02  4:30 UTC (permalink / raw)
  To: axboe; +Cc: jack, linux-nvdimm, david, linux-kernel, ross.zwisler, hch

Now that we have the ability to dynamically enable DAX for a raw block
inode, make the behavior opt-in by default.  DAX does not have feature
parity with pagecache-backed mappings, so applications should knowingly
enable DAX semantics.

Note, this applies only to mappings returned to userspace.  For the
synchronous usages of DAX, dax_do_io(), there is no semantic difference
with the bio submission path, so that path remains enabled by default.
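
A hypothetical sketch of the resulting opt-in flow from userspace (the device
path and mapping length are placeholders): DAX must be enabled before any
mapping is created, since blkdev_set_dax() returns -EBUSY while bd_map_count
is non-zero:

/* Hypothetical opt-in flow; the device path and length are placeholders. */
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/fs.h>

static void *map_with_dax(const char *path, size_t len)
{
	int fd = open(path, O_RDWR);
	int enable = 1;
	void *addr = MAP_FAILED;

	if (fd < 0)
		return MAP_FAILED;
	/* must happen before mmap(): BLKDAXSET fails once the bdev is mapped */
	if (ioctl(fd, BLKDAXSET, &enable) == 0)
		addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);
	return addr;
}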

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 block/ioctl.c      |    3 +--
 fs/block_dev.c     |   33 +++++++++++++++++++++++----------
 include/linux/fs.h |    8 ++++++++
 3 files changed, 32 insertions(+), 12 deletions(-)

diff --git a/block/ioctl.c b/block/ioctl.c
index 205d57612fbd..c4c3a09d9ca9 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -298,13 +298,12 @@ static inline int is_unrecognized_ioctl(int ret)
 #ifdef CONFIG_FS_DAX
 static int blkdev_set_dax(struct block_device *bdev, int n)
 {
-	struct gendisk *disk = bdev->bd_disk;
 	int rc = 0;
 
 	if (n)
 		n = S_DAX;
 
-	if (n && !disk->fops->direct_access)
+	if (n && !blkdev_dax_capable(bdev))
 		return -ENOTTY;
 
 	mutex_lock(&bdev->bd_inode->i_mutex);
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 13ce6d0ff7f6..ee34a31e6fa4 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -152,16 +152,37 @@ static struct inode *bdev_file_inode(struct file *file)
 	return file->f_mapping->host;
 }
 
+#ifdef CONFIG_FS_DAX
+bool blkdev_dax_capable(struct block_device *bdev)
+{
+	struct gendisk *disk = bdev->bd_disk;
+
+	if (!disk->fops->direct_access)
+		return false;
+
+	/*
+	 * If the partition is not aligned on a page boundary, we can't
+	 * do dax I/O to it.
+	 */
+	if ((bdev->bd_part->start_sect % (PAGE_SIZE / 512))
+			|| (bdev->bd_part->nr_sects % (PAGE_SIZE / 512)))
+		return false;
+
+	return true;
+}
+#endif
+
 static ssize_t
 blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = bdev_file_inode(file);
+	struct block_device *bdev = I_BDEV(inode);
 
-	if (IS_DAX(inode))
+	if (blkdev_dax_capable(bdev))
 		return dax_do_io(iocb, inode, iter, offset, blkdev_get_block,
 				NULL, DIO_SKIP_DIO_COUNT);
-	return __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter, offset,
+	return __blockdev_direct_IO(iocb, inode, bdev, iter, offset,
 				    blkdev_get_block, NULL, NULL,
 				    DIO_SKIP_DIO_COUNT);
 }
@@ -1185,7 +1206,6 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 		bdev->bd_disk = disk;
 		bdev->bd_queue = disk->queue;
 		bdev->bd_contains = bdev;
-		bdev->bd_inode->i_flags = disk->fops->direct_access ? S_DAX : 0;
 		if (!partno) {
 			ret = -ENXIO;
 			bdev->bd_part = disk_get_part(disk, partno);
@@ -1247,13 +1267,6 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 				goto out_clear;
 			}
 			bd_set_size(bdev, (loff_t)bdev->bd_part->nr_sects << 9);
-			/*
-			 * If the partition is not aligned on a page
-			 * boundary, we can't do dax I/O to it.
-			 */
-			if ((bdev->bd_part->start_sect % (PAGE_SIZE / 512)) ||
-			    (bdev->bd_part->nr_sects % (PAGE_SIZE / 512)))
-				bdev->bd_inode->i_flags &= ~S_DAX;
 		}
 	} else {
 		if (bdev->bd_contains == bdev) {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8fb2d4b848bf..5a9e14538f69 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2282,6 +2282,14 @@ extern struct super_block *freeze_bdev(struct block_device *);
 extern void emergency_thaw_all(void);
 extern int thaw_bdev(struct block_device *bdev, struct super_block *sb);
 extern int fsync_bdev(struct block_device *);
+#ifdef CONFIG_FS_DAX
+extern bool blkdev_dax_capable(struct block_device *bdev);
+#else
+static inline bool blkdev_dax_capable(struct block_device *bdev)
+{
+	return false;
+}
+#endif
 
 extern struct super_block *blockdev_superblock;
 


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [PATCH v3 14/15] dax: dirty extent notification
  2015-11-02  4:29 ` Dan Williams
@ 2015-11-02  4:30   ` Dan Williams
  -1 siblings, 0 replies; 95+ messages in thread
From: Dan Williams @ 2015-11-02  4:30 UTC (permalink / raw)
  To: axboe; +Cc: jack, linux-nvdimm, david, linux-kernel, ross.zwisler, hch

DAX-enabled block device drivers can use hints from fs/dax.c to
optimize their internal tracking of potentially dirty CPU cache lines.
If a DAX mapping is being used for synchronous operations, dax_do_io(),
a dax-enabled block driver knows that fs/dax.c will handle immediate
flushing.  For asynchronous mappings, i.e. those returned to userspace
via mmap, the driver can track active extents of the media for flushing.

We can later extend the DAX paths to indicate when an async mapping is
"closed", allowing the active extents to be marked clean.

Because this capability requires adding two new parameters to
->direct_access ('size' and 'flags'), convert the function to a control
parameter block.  As a result this cleans up dax_map_atomic() usage, as
there is no longer a need for a separate __dax_map_atomic(), and the
return value can match bdev_direct_access().

No functional change results from this patch, just code movement to the
new parameter scheme.
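
As a rough in-kernel sketch of the new calling convention (the helper below is
made up for illustration and omits the dax_map_atomic() / dax_unmap_atomic()
queue-reference protection that fs/dax.c adds on top):

#include <linux/blkdev.h>
#include <linux/string.h>

/* Illustrative caller only: sector/size/flags go in, addr/pfn come out. */
static long copy_page_example(struct block_device *bdev, sector_t sector,
		void *dst)
{
	struct blk_dax_ctl dax = {
		.sector = sector,	/* alignment is checked by bdev_direct_access() */
		.size = PAGE_SIZE,
		.flags = 0,		/* no BLKDAX_F_DIRTY hint for a read */
	};
	long len = bdev_direct_access(bdev, &dax);

	if (len < 0)
		return len;
	if (len < PAGE_SIZE)
		return -ENXIO;
	memcpy(dst, (void __force *) dax.addr, PAGE_SIZE);
	return 0;
}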

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/powerpc/sysdev/axonram.c |    9 +-
 drivers/block/brd.c           |   10 +-
 drivers/nvdimm/pmem.c         |   10 +-
 drivers/s390/block/dcssblk.c  |    9 +-
 fs/block_dev.c                |   17 ++--
 fs/dax.c                      |  167 ++++++++++++++++++++++-------------------
 include/linux/blkdev.h        |   24 +++++-
 7 files changed, 136 insertions(+), 110 deletions(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 59ca4c0ab529..11aeb47a6540 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -140,14 +140,13 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
  * @device, @sector, @data: see block_device_operations method
  */
 static long
-axon_ram_direct_access(struct block_device *device, sector_t sector,
-		       void __pmem **kaddr, pfn_t *pfn)
+axon_ram_direct_access(struct block_device *device, struct blk_dax_ctl *dax)
 {
 	struct axon_ram_bank *bank = device->bd_disk->private_data;
-	loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;
+	loff_t offset = (loff_t)dax->sector << AXON_RAM_SECTOR_SHIFT;
 
-	*kaddr = (void __pmem __force *) bank->io_addr + offset;
-	*pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV);
+	dax->addr = (void __pmem __force *) bank->io_addr + offset;
+	dax->pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV);
 	return bank->size - offset;
 }
 
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 0bbc60463779..686e1e7a5973 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -373,19 +373,19 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
 }
 
 #ifdef CONFIG_BLK_DEV_RAM_DAX
-static long brd_direct_access(struct block_device *bdev, sector_t sector,
-			void __pmem **kaddr, pfn_t *pfn)
+static long brd_direct_access(struct block_device *bdev,
+		struct blk_dax_ctl *dax)
 {
 	struct brd_device *brd = bdev->bd_disk->private_data;
 	struct page *page;
 
 	if (!brd)
 		return -ENODEV;
-	page = brd_insert_page(brd, sector);
+	page = brd_insert_page(brd, dax->sector);
 	if (!page)
 		return -ENOSPC;
-	*kaddr = (void __pmem *)page_address(page);
-	*pfn = page_to_pfn_t(page);
+	dax->addr = (void __pmem *)page_address(page);
+	dax->pfn = page_to_pfn_t(page);
 
 	return PAGE_SIZE;
 }
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index aa2f1292120a..3d83f3079602 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -103,14 +103,14 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
 	return 0;
 }
 
-static long pmem_direct_access(struct block_device *bdev, sector_t sector,
-		      void __pmem **kaddr, pfn_t *pfn)
+static long pmem_direct_access(struct block_device *bdev,
+		struct blk_dax_ctl *dax)
 {
 	struct pmem_device *pmem = bdev->bd_disk->private_data;
-	resource_size_t offset = sector * 512 + pmem->data_offset;
+	resource_size_t offset = dax->sector * 512 + pmem->data_offset;
 
-	*kaddr = pmem->virt_addr + offset;
-	*pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);
+	dax->addr = pmem->virt_addr + offset;
+	dax->pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);
 
 	return pmem->size - offset;
 }
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index e2b2839e4de5..6b01f56373e0 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -880,8 +880,7 @@ fail:
 }
 
 static long
-dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
-			void __pmem **kaddr, pfn_t *pfn)
+dcssblk_direct_access (struct block_device *bdev, struct blk_dax_ctl *dax)
 {
 	struct dcssblk_dev_info *dev_info;
 	unsigned long offset, dev_sz;
@@ -890,9 +889,9 @@ dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
 	if (!dev_info)
 		return -ENODEV;
 	dev_sz = dev_info->end - dev_info->start;
-	offset = secnum * 512;
-	*kaddr = (void __pmem *) (dev_info->start + offset);
-	*pfn = phys_to_pfn_t(dev_info->start + offset, PFN_DEV);
+	offset = dax->sector * 512;
+	dax->addr = (void __pmem *) (dev_info->start + offset);
+	dax->pfn = phys_to_pfn_t(dev_info->start + offset, PFN_DEV);
 
 	return dev_sz - offset;
 }
diff --git a/fs/block_dev.c b/fs/block_dev.c
index ee34a31e6fa4..d1b0bbf00bd3 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -453,10 +453,7 @@ EXPORT_SYMBOL_GPL(bdev_write_page);
 /**
  * bdev_direct_access() - Get the address for directly-accessibly memory
  * @bdev: The device containing the memory
- * @sector: The offset within the device
- * @addr: Where to put the address of the memory
- * @pfn: The Page Frame Number for the memory
- * @size: The number of bytes requested
+ * @ctl: control and output parameters for ->direct_access
  *
  * If a block device is made up of directly addressable memory, this function
  * will tell the caller the PFN and the address of the memory.  The address
@@ -467,10 +464,10 @@ EXPORT_SYMBOL_GPL(bdev_write_page);
  * Return: negative errno if an error occurs, otherwise the number of bytes
  * accessible at this address.
  */
-long bdev_direct_access(struct block_device *bdev, sector_t sector,
-			void __pmem **addr, pfn_t *pfn, long size)
+long bdev_direct_access(struct block_device *bdev, struct blk_dax_ctl *ctl)
 {
-	long avail;
+	sector_t sector, save;
+	long avail, size = ctl->size;
 	const struct block_device_operations *ops = bdev->bd_disk->fops;
 
 	/*
@@ -479,6 +476,8 @@ long bdev_direct_access(struct block_device *bdev, sector_t sector,
 	 */
 	might_sleep();
 
+	save = ctl->sector;
+	sector = ctl->sector;
 	if (size < 0)
 		return size;
 	if (!ops->direct_access)
@@ -489,7 +488,9 @@ long bdev_direct_access(struct block_device *bdev, sector_t sector,
 	sector += get_start_sect(bdev);
 	if (sector % (PAGE_SIZE / 512))
 		return -EINVAL;
-	avail = ops->direct_access(bdev, sector, addr, pfn);
+	ctl->sector = sector;
+	avail = ops->direct_access(bdev, ctl);
+	ctl->sector = save;
 	if (!avail)
 		return -ERANGE;
 	return min(avail, size);
diff --git a/fs/dax.c b/fs/dax.c
index ac8992e86779..f5835c4a7e1f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -30,36 +30,28 @@
 #include <linux/vmstat.h>
 #include <linux/sizes.h>
 
-static void __pmem *__dax_map_atomic(struct block_device *bdev, sector_t sector,
-		long size, pfn_t *pfn, long *len)
+static long dax_map_atomic(struct block_device *bdev, struct blk_dax_ctl *dax)
 {
-	long rc;
-	void __pmem *addr;
 	struct request_queue *q = bdev->bd_queue;
+	long rc = -EIO;
 
+	dax->addr = (void __pmem *) ERR_PTR(-EIO);
 	if (blk_queue_enter(q, GFP_NOWAIT) != 0)
-		return (void __pmem *) ERR_PTR(-EIO);
-	rc = bdev_direct_access(bdev, sector, &addr, pfn, size);
-	if (len)
-		*len = rc;
+		return rc;
+
+	rc = bdev_direct_access(bdev, dax);
 	if (rc < 0) {
+		dax->addr = (void __pmem *) ERR_PTR(rc);
 		blk_queue_exit(q);
-		return (void __pmem *) ERR_PTR(rc);
+		return rc;
 	}
-	return addr;
-}
-
-static void __pmem *dax_map_atomic(struct block_device *bdev, sector_t sector,
-		long size)
-{
-	pfn_t pfn;
-
-	return __dax_map_atomic(bdev, sector, size, &pfn, NULL);
+	return rc;
 }
 
-static void dax_unmap_atomic(struct block_device *bdev, void __pmem *addr)
+static void dax_unmap_atomic(struct block_device *bdev,
+		const struct blk_dax_ctl *dax)
 {
-	if (IS_ERR(addr))
+	if (IS_ERR(dax->addr))
 		return;
 	blk_queue_exit(bdev->bd_queue);
 }
@@ -67,28 +59,29 @@ static void dax_unmap_atomic(struct block_device *bdev, void __pmem *addr)
 int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 {
 	struct block_device *bdev = inode->i_sb->s_bdev;
-	sector_t sector = block << (inode->i_blkbits - 9);
+	struct blk_dax_ctl dax;
 
 	might_sleep();
+	dax.sector = block << (inode->i_blkbits - 9),
+	dax.flags = 0;
+	dax.size = size;
 	do {
-		void __pmem *addr;
 		long count, sz;
-		pfn_t pfn;
 
 		sz = min_t(long, size, SZ_1M);
-		addr = __dax_map_atomic(bdev, sector, size, &pfn, &count);
-		if (IS_ERR(addr))
-			return PTR_ERR(addr);
+		count = dax_map_atomic(bdev, &dax);
+		if (count < 0)
+			return count;
 		if (count < sz)
 			sz = count;
-		clear_pmem(addr, sz);
-		addr += sz;
-		size -= sz;
+		clear_pmem(dax.addr, sz);
+		dax_unmap_atomic(bdev, &dax);
+		dax.addr += sz;
+		dax.size -= sz;
 		BUG_ON(sz & 511);
-		sector += sz / 512;
-		dax_unmap_atomic(bdev, addr);
+		dax.sector += sz / 512;
 		cond_resched();
-	} while (size);
+	} while (dax.size);
 
 	wmb_pmem();
 	return 0;
@@ -141,9 +134,11 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 	struct block_device *bdev = NULL;
 	int rw = iov_iter_rw(iter), rc;
 	long map_len = 0;
-	pfn_t pfn;
 	void __pmem *addr = NULL;
-	void __pmem *kmap = (void __pmem *) ERR_PTR(-EIO);
+	struct blk_dax_ctl dax = {
+		.addr = (void __pmem *) ERR_PTR(-EIO),
+		.flags = 0,
+	};
 	bool hole = false;
 	bool need_wmb = false;
 
@@ -181,15 +176,15 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 				addr = NULL;
 				size = bh->b_size - first;
 			} else {
-				dax_unmap_atomic(bdev, kmap);
-				kmap = __dax_map_atomic(bdev,
-						to_sector(bh, inode),
-						bh->b_size, &pfn, &map_len);
-				if (IS_ERR(kmap)) {
-					rc = PTR_ERR(kmap);
+				dax_unmap_atomic(bdev, &dax);
+				dax.sector = to_sector(bh, inode);
+				dax.size = bh->b_size;
+				map_len = dax_map_atomic(bdev, &dax);
+				if (map_len < 0) {
+					rc = map_len;
 					break;
 				}
-				addr = kmap;
+				addr = dax.addr;
 				if (buffer_unwritten(bh) || buffer_new(bh)) {
 					dax_new_buf(addr, map_len, first, pos,
 							end);
@@ -219,7 +214,7 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 
 	if (need_wmb)
 		wmb_pmem();
-	dax_unmap_atomic(bdev, kmap);
+	dax_unmap_atomic(bdev, &dax);
 
 	return (pos == start) ? rc : pos - start;
 }
@@ -313,17 +308,20 @@ static int dax_load_hole(struct address_space *mapping, struct page *page,
 static int copy_user_bh(struct page *to, struct inode *inode,
 		struct buffer_head *bh, unsigned long vaddr)
 {
+	struct blk_dax_ctl dax = {
+		.sector = to_sector(bh, inode),
+		.size = bh->b_size,
+		.flags = 0,
+	};
 	struct block_device *bdev = bh->b_bdev;
-	void __pmem *vfrom;
 	void *vto;
 
-	vfrom = dax_map_atomic(bdev, to_sector(bh, inode), bh->b_size);
-	if (IS_ERR(vfrom))
-		return PTR_ERR(vfrom);
+	if (dax_map_atomic(bdev, &dax) < 0)
+		return PTR_ERR(dax.addr);
 	vto = kmap_atomic(to);
-	copy_user_page(vto, (void __force *)vfrom, vaddr, to);
+	copy_user_page(vto, (void __force *)dax.addr, vaddr, to);
 	kunmap_atomic(vto);
-	dax_unmap_atomic(bdev, vfrom);
+	dax_unmap_atomic(bdev, &dax);
 	return 0;
 }
 
@@ -344,15 +342,25 @@ static void dax_account_mapping(struct block_device *bdev, pfn_t pfn,
 	}
 }
 
+static unsigned long vm_fault_to_dax_flags(struct vm_fault *vmf)
+{
+	if (vmf->flags & (FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE))
+		return BLKDAX_F_DIRTY;
+	return 0;
+}
+
 static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 			struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	unsigned long vaddr = (unsigned long)vmf->virtual_address;
 	struct address_space *mapping = inode->i_mapping;
 	struct block_device *bdev = bh->b_bdev;
-	void __pmem *addr;
+	struct blk_dax_ctl dax = {
+		.sector = to_sector(bh, inode),
+		.size = bh->b_size,
+		.flags = vm_fault_to_dax_flags(vmf),
+	};
 	pgoff_t size;
-	pfn_t pfn;
 	int error;
 
 	i_mmap_lock_read(mapping);
@@ -370,22 +378,20 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 		goto out;
 	}
 
-	addr = __dax_map_atomic(bdev, to_sector(bh, inode), bh->b_size,
-			&pfn, NULL);
-	if (IS_ERR(addr)) {
-		error = PTR_ERR(addr);
+	if (dax_map_atomic(bdev, &dax) < 0) {
+		error = PTR_ERR(dax.addr);
 		goto out;
 	}
 
 	if (buffer_unwritten(bh) || buffer_new(bh)) {
-		clear_pmem(addr, PAGE_SIZE);
+		clear_pmem(dax.addr, PAGE_SIZE);
 		wmb_pmem();
 	}
 
-	dax_account_mapping(bdev, pfn, mapping);
-	dax_unmap_atomic(bdev, addr);
+	dax_account_mapping(bdev, dax.pfn, mapping);
+	dax_unmap_atomic(bdev, &dax);
 
-	error = vm_insert_mixed(vma, vaddr, pfn_t_to_pfn(pfn));
+	error = vm_insert_mixed(vma, vaddr, pfn_t_to_pfn(dax.pfn));
 
  out:
 	i_mmap_unlock_read(mapping);
@@ -674,33 +680,35 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		result = VM_FAULT_NOPAGE;
 		spin_unlock(ptl);
 	} else {
-		pfn_t pfn;
-		long length;
-		void __pmem *kaddr = __dax_map_atomic(bdev,
-				to_sector(&bh, inode), HPAGE_SIZE, &pfn,
-				&length);
-
-		if (IS_ERR(kaddr)) {
+		struct blk_dax_ctl dax = {
+			.sector = to_sector(&bh, inode),
+			.size = HPAGE_SIZE,
+			.flags = flags,
+		};
+		long length = dax_map_atomic(bdev, &dax);
+
+		if (length < 0) {
 			result = VM_FAULT_SIGBUS;
 			goto out;
 		}
-		if ((length < PMD_SIZE) || (pfn_t_to_pfn(pfn) & PG_PMD_COLOUR)) {
-			dax_unmap_atomic(bdev, kaddr);
+		if ((length < HPAGE_SIZE)
+				|| (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR)) {
+			dax_unmap_atomic(bdev, &dax);
 			goto fallback;
 		}
 
 		if (buffer_unwritten(&bh) || buffer_new(&bh)) {
-			clear_pmem(kaddr, HPAGE_SIZE);
+			clear_pmem(dax.addr, HPAGE_SIZE);
 			wmb_pmem();
 			count_vm_event(PGMAJFAULT);
 			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
 			result |= VM_FAULT_MAJOR;
 		}
-		dax_account_mapping(bdev, pfn, mapping);
-		dax_unmap_atomic(bdev, kaddr);
+		dax_account_mapping(bdev, dax.pfn, mapping);
+		dax_unmap_atomic(bdev, &dax);
 
 		result |= vmf_insert_pfn_pmd(vma, address, pmd,
-				pfn_t_to_pfn(pfn), write);
+				pfn_t_to_pfn(dax.pfn), write);
 	}
 
  out:
@@ -803,14 +811,17 @@ int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
 		return err;
 	if (buffer_written(&bh)) {
 		struct block_device *bdev = bh.b_bdev;
-		void __pmem *addr = dax_map_atomic(bdev, to_sector(&bh, inode),
-				PAGE_CACHE_SIZE);
-
-		if (IS_ERR(addr))
-			return PTR_ERR(addr);
-		clear_pmem(addr + offset, length);
+		struct blk_dax_ctl dax = {
+			.sector = to_sector(&bh, inode),
+			.size = PAGE_CACHE_SIZE,
+			.flags = 0,
+		};
+
+		if (dax_map_atomic(bdev, &dax) < 0)
+			return PTR_ERR(dax.addr);
+		clear_pmem(dax.addr + offset, length);
 		wmb_pmem();
-		dax_unmap_atomic(bdev, addr);
+		dax_unmap_atomic(bdev, &dax);
 	}
 
 	return 0;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e121e5e0c6ac..663e9974820f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1615,14 +1615,31 @@ static inline bool integrity_req_gap_front_merge(struct request *req,
 
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
 
+#define BLKDAX_F_DIRTY (1UL << 0) /* range is mapped writable to userspace */
+
+/**
+ * struct blk_dax_ctl - control and output parameters for ->direct_access
+ * @sector: (input) offset relative to a block_device
+ * @addr: (output) kernel virtual address for @sector populated by driver
+ * @flags: (input) BLKDAX_F_*
+ * @pfn: (output) page frame number for @addr populated by driver
+ * @size: (input) number of bytes requested
+ */
+struct blk_dax_ctl {
+	sector_t sector;
+	void __pmem *addr;
+	unsigned long flags;
+	long size;
+	pfn_t pfn;
+};
+
 struct block_device_operations {
 	int (*open) (struct block_device *, fmode_t);
 	void (*release) (struct gendisk *, fmode_t);
 	int (*rw_page)(struct block_device *, sector_t, struct page *, int rw);
 	int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
 	int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
-	long (*direct_access)(struct block_device *, sector_t, void __pmem **,
-			pfn_t *);
+	long (*direct_access)(struct block_device *, struct blk_dax_ctl *);
 	unsigned int (*check_events) (struct gendisk *disk,
 				      unsigned int clearing);
 	/* ->media_changed() is DEPRECATED, use ->check_events() instead */
@@ -1640,8 +1657,7 @@ extern int __blkdev_driver_ioctl(struct block_device *, fmode_t, unsigned int,
 extern int bdev_read_page(struct block_device *, sector_t, struct page *);
 extern int bdev_write_page(struct block_device *, sector_t, struct page *,
 						struct writeback_control *);
-extern long bdev_direct_access(struct block_device *, sector_t,
-		void __pmem **addr, pfn_t *pfn, long size);
+extern long bdev_direct_access(struct block_device *, struct blk_dax_ctl *);
 #else /* CONFIG_BLOCK */
 
 struct block_device;


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [PATCH v3 15/15] pmem: blkdev_issue_flush support
  2015-11-02  4:29 ` Dan Williams
@ 2015-11-02  4:31   ` Dan Williams
  -1 siblings, 0 replies; 95+ messages in thread
From: Dan Williams @ 2015-11-02  4:31 UTC (permalink / raw)
  To: axboe; +Cc: jack, linux-nvdimm, david, linux-kernel, ross.zwisler, hch

For the normal (make_request) I/O path, writes are always synchronously
flushed through to media.  However, when DAX is in use it is possible
that userspace leaves dirty data in the cpu cache.  Ideally userspace
uses cache-writeback and persistent-commit instructions directly to
flush writes to media.  If instead userspace relies on fsync()/msync()
for its consistency guarantees, then the driver needs to flush the cpu
cache manually.
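
For illustration, a minimal userspace sketch of the case being handled
(device path and sizes are assumptions, not part of this patch, and dax
mappings are assumed to already be enabled on the device):

  #include <fcntl.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
          int fd = open("/dev/pmem0", O_RDWR);
          char *p;

          if (fd < 0)
                  return 1;
          p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          if (p == MAP_FAILED)
                  return 1;

          memset(p, 0x42, 4096);  /* dirties cpu cache lines only */
          fsync(fd);              /* blkdev_issue_flush(): the driver must
                                     write back the cpu cache to media */
          munmap(p, 4096);
          close(fd);
          return 0;
  }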

Ideally an architecture would provide a single instruction to write back
all dirty lines in the cache.  In the absence of that, the driver resorts
to flushing line by line.
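
On x86 that line-by-line write-back is roughly the following (a sketch
of what __arch_wb_cache_pmem() amounts to; alignment handling is
simplified here):

  void wb_cache_range(void *addr, size_t size)
  {
          unsigned int line = boot_cpu_data.x86_clflush_size;
          void *end = addr + size;

          for (; addr < end; addr += line)
                  clwb(addr);     /* write back without invalidating */
  }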

Introduce mmio_wb_range() as the non-invalidating version of
mmio_flush_range() and arrange for a small number of flusher threads to
parallelize the work.

The flush is a nop until a userspace mapping, i.e. a BLKDAX_F_DIRTY
request, arrives, and we reduce the amount of work per flush by
tracking active dax extents.  Finer-grained 'dax_active' tracking and
clearing of mapped extents will be the subject of future experiments.
For now this enables moderately cheap fsync/msync without per-fs and mm
enabling.
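
As a rough sketch of the granularity this implies (the helper name here
is illustrative; the patch below uses dax_extent_shift() and a
dax_active bitmap):

  #define DAX_EXTENT_SHIFT 8    /* 2^8 = 256 extents per device */

  static unsigned long dax_extent_index(unsigned int size_shift,
                                        unsigned long long offset)
  {
          /* size_shift = ilog2(device size), rounded up */
          return offset >> (size_shift - DAX_EXTENT_SHIFT);
  }

  /* e.g. an 8 GiB pmem device (size_shift == 33) is tracked in
     32 MiB extents */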

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/include/asm/cacheflush.h |    4 +
 block/blk-core.c                  |    1 
 block/blk.h                       |   11 ---
 drivers/nvdimm/pmem.c             |  139 +++++++++++++++++++++++++++++++++++++
 include/linux/blkdev.h            |   11 +++
 5 files changed, 154 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/cacheflush.h b/arch/x86/include/asm/cacheflush.h
index e63aa38e85fb..3eafa8088489 100644
--- a/arch/x86/include/asm/cacheflush.h
+++ b/arch/x86/include/asm/cacheflush.h
@@ -89,6 +89,10 @@ int set_pages_rw(struct page *page, int numpages);
 
 void clflush_cache_range(void *addr, unsigned int size);
 
+#ifdef CONFIG_ARCH_HAS_PMEM_API
+#define mmio_wb_range(addr, size) __arch_wb_cache_pmem(addr, size)
+#endif
+
 #define mmio_flush_range(addr, size) clflush_cache_range(addr, size)
 
 #ifdef CONFIG_DEBUG_RODATA
diff --git a/block/blk-core.c b/block/blk-core.c
index 5159946a2b41..43e402f9c06e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -661,6 +661,7 @@ void blk_queue_exit(struct request_queue *q)
 {
 	percpu_ref_put(&q->q_usage_counter);
 }
+EXPORT_SYMBOL(blk_queue_exit);
 
 static void blk_queue_usage_counter_release(struct percpu_ref *ref)
 {
diff --git a/block/blk.h b/block/blk.h
index dc7d9411fa45..a83f14f07921 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -74,17 +74,6 @@ bool __blk_end_bidi_request(struct request *rq, int error,
 			    unsigned int nr_bytes, unsigned int bidi_bytes);
 void blk_freeze_queue(struct request_queue *q);
 
-static inline void blk_queue_enter_live(struct request_queue *q)
-{
-	/*
-	 * Given that running in generic_make_request() context
-	 * guarantees that a live reference against q_usage_counter has
-	 * been established, further references under that same context
-	 * need not check that the queue has been frozen (marked dead).
-	 */
-	percpu_ref_get(&q->q_usage_counter);
-}
-
 #ifdef CONFIG_BLK_DEV_INTEGRITY
 void blk_flush_integrity(void);
 #else
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 3d83f3079602..6f39d0017399 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -33,6 +33,9 @@
 
 static ASYNC_DOMAIN_EXCLUSIVE(async_pmem);
 
+#define NUM_FLUSH_THREADS 4
+#define DAX_EXTENT_SHIFT 8
+#define NUM_DAX_EXTENTS (1ULL << DAX_EXTENT_SHIFT)
 struct pmem_device {
 	struct request_queue	*pmem_queue;
 	struct gendisk		*pmem_disk;
@@ -45,6 +48,10 @@ struct pmem_device {
 	unsigned long		pfn_flags;
 	void __pmem		*virt_addr;
 	size_t			size;
+	unsigned long		size_shift;
+	struct bio		*flush_bio;
+	spinlock_t		lock;
+	DECLARE_BITMAP(dax_active, NUM_DAX_EXTENTS);
 };
 
 static int pmem_major;
@@ -68,6 +75,105 @@ static void pmem_do_bvec(struct pmem_device *pmem, struct page *page,
 	kunmap_atomic(mem);
 }
 
+struct pmem_flush_ctx {
+	struct pmem_device *pmem;
+	struct block_device *bdev;
+	int id;
+};
+
+static resource_size_t dax_extent_shift(struct pmem_device *pmem)
+{
+	return pmem->size_shift - DAX_EXTENT_SHIFT;
+}
+
+static resource_size_t dax_extent_size(struct pmem_device *pmem)
+{
+	return 1ULL << dax_extent_shift(pmem);
+}
+
+static void pmem_flush(void *data, async_cookie_t cookie)
+{
+	unsigned int i;
+	resource_size_t offset;
+	struct pmem_flush_ctx *ctx = data;
+	struct pmem_device *pmem = ctx->pmem;
+	struct device *dev = part_to_dev(ctx->bdev->bd_part);
+	unsigned long extent = dax_extent_size(pmem) / NUM_FLUSH_THREADS;
+
+	for_each_set_bit(i, pmem->dax_active, NUM_DAX_EXTENTS) {
+		unsigned long flush_len;
+		void *addr;
+
+		offset = dax_extent_size(pmem) * i + extent * ctx->id;
+		if (offset > pmem->size)
+			break;
+		flush_len = min_t(resource_size_t, extent, pmem->size - offset);
+		addr = (void __force *) pmem->virt_addr + offset;
+		dev_dbg(dev, "%s: %p %#lx\n", __func__, addr, flush_len);
+		while (flush_len) {
+			unsigned long len = min_t(unsigned long, flush_len, SZ_1M);
+
+#if defined(mmio_wb_range)
+			mmio_wb_range(addr, len);
+#elif defined(mmio_flush_range)
+			mmio_flush_range(addr, len);
+#else
+			dev_err_once(dev, "%s: failed, no flush method\n",
+					__func__);
+			return;
+#endif
+			flush_len -= len;
+			addr += len;
+			cond_resched();
+		}
+	}
+}
+
+static void __pmem_flush_request(void *data, async_cookie_t cookie)
+{
+	struct pmem_flush_ctx ctx[NUM_FLUSH_THREADS];
+	struct pmem_device *pmem = data;
+	struct bio *bio;
+	int i;
+
+	spin_lock(&pmem->lock);
+	bio = pmem->flush_bio;
+	pmem->flush_bio = bio->bi_next;
+	bio->bi_next = NULL;
+	spin_unlock(&pmem->lock);
+
+	for (i = 0; i < NUM_FLUSH_THREADS; i++) {
+		ctx[i].bdev = bio->bi_bdev;
+		ctx[i].pmem = pmem;
+		ctx[i].id = i;
+		cookie = async_schedule_domain(pmem_flush, &ctx[i], &async_pmem);
+	}
+	async_synchronize_cookie_domain(cookie, &async_pmem);
+	wmb_pmem();
+	bio_endio(bio);
+	blk_queue_exit(pmem->pmem_queue);
+}
+
+static void pmem_flush_request(struct pmem_device *pmem, struct bio *bio)
+{
+	int do_flush = 1;
+
+	spin_lock(&pmem->lock);
+	if (bitmap_weight(pmem->dax_active, NUM_DAX_EXTENTS) == 0) {
+		do_flush = 0;
+	} else {
+		bio->bi_next = pmem->flush_bio;
+		pmem->flush_bio = bio;
+	}
+	spin_unlock(&pmem->lock);
+
+	if (do_flush) {
+		blk_queue_enter_live(pmem->pmem_queue);
+		async_schedule(__pmem_flush_request, pmem);
+	} else
+		bio_endio(bio);
+}
+
 static void pmem_make_request(struct request_queue *q, struct bio *bio)
 {
 	bool do_acct;
@@ -87,7 +193,11 @@ static void pmem_make_request(struct request_queue *q, struct bio *bio)
 	if (bio_data_dir(bio))
 		wmb_pmem();
 
-	bio_endio(bio);
+	/* we're always durable unless/until dax is activated */
+	if (bio->bi_rw & REQ_FLUSH)
+		pmem_flush_request(pmem, bio);
+	else
+		bio_endio(bio);
 }
 
 static int pmem_rw_page(struct block_device *bdev, sector_t sector,
@@ -112,6 +222,27 @@ static long pmem_direct_access(struct block_device *bdev,
 	dax->addr = pmem->virt_addr + offset;
 	dax->pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);
 
+	if (dax->flags & BLKDAX_F_DIRTY) {
+		unsigned long start = offset >> dax_extent_shift(pmem);
+		unsigned long len;
+		size_t size;
+
+		size = min_t(size_t, pmem->size - offset, dax->size);
+		size = ALIGN(size, dax_extent_size(pmem));
+		len = max_t(unsigned long, 1, size >> dax_extent_shift(pmem));
+
+		/*
+		 * Any flush initiated after the lock is dropped observes new
+		 * dirty state
+		 */
+		spin_lock(&pmem->lock);
+		bitmap_set(pmem->dax_active, start, len);
+		spin_unlock(&pmem->lock);
+
+		dev_dbg(part_to_dev(bdev->bd_part), "dax active %lx +%lx\n",
+				start, len);
+	}
+
 	return pmem->size - offset;
 }
 
@@ -132,8 +263,12 @@ static struct pmem_device *pmem_alloc(struct device *dev,
 	if (!pmem)
 		return ERR_PTR(-ENOMEM);
 
+	spin_lock_init(&pmem->lock);
 	pmem->phys_addr = res->start;
 	pmem->size = resource_size(res);
+	pmem->size_shift = ilog2(pmem->size);
+	if (1ULL << pmem->size_shift < pmem->size)
+		pmem->size_shift++;
 	if (!arch_has_wmb_pmem())
 		dev_warn(dev, "unable to guarantee persistence of writes\n");
 
@@ -217,6 +352,8 @@ static int pmem_attach_disk(struct device *dev,
 	blk_queue_max_hw_sectors(pmem->pmem_queue, UINT_MAX);
 	blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY);
 	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, pmem->pmem_queue);
+	/* every write via pmem_make_request has FUA semantics by default */
+	blk_queue_flush(pmem->pmem_queue, REQ_FLUSH | REQ_FUA);
 
 	disk = alloc_disk_node(0, nid);
 	if (!disk) {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 663e9974820f..de8a3d58f071 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -793,6 +793,17 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 			 struct scsi_ioctl_command __user *);
 
+static inline void blk_queue_enter_live(struct request_queue *q)
+{
+	/*
+	 * Given that running in generic_make_request() context
+	 * guarantees that a live reference against q_usage_counter has
+	 * been established, further references under that same context
+	 * need not check that the queue has been frozen (marked dead).
+	 */
+	percpu_ref_get(&q->q_usage_counter);
+}
+
 extern int blk_queue_enter(struct request_queue *q, gfp_t gfp);
 extern void blk_queue_exit(struct request_queue *q);
 extern void blk_start_queue(struct request_queue *q);


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 08/15] mm, dax, pmem: introduce pfn_t
  2015-11-02  4:30   ` Dan Williams
@ 2015-11-02 16:30     ` Joe Perches
  -1 siblings, 0 replies; 95+ messages in thread
From: Joe Perches @ 2015-11-02 16:30 UTC (permalink / raw)
  To: Dan Williams
  Cc: axboe, Dave Hansen, jack, linux-nvdimm, david, linux-kernel,
	ross.zwisler, Andrew Morton, hch

On Sun, 2015-11-01 at 23:30 -0500, Dan Williams wrote:
> For the purpose of communicating the optional presence of a 'struct
> page' for the pfn returned from ->direct_access(), introduce a type that
> encapsulates a page-frame-number plus flags.  These flags contain the
> historical "page_link" encoding for a scatterlist entry, but can also
> denote "device memory".  Where "device memory" is a set of pfns that are
> not part of the kernel's linear mapping by default, but are accessed via
> the same memory controller as ram.
> 
> The motivation for this new type is large capacity persistent memory
> that needs struct page entries in the 'memmap' to support 3rd party DMA
> (i.e. O_DIRECT I/O with a persistent memory source/target).  However, we
> also need it in support of maintaining a list of mapped inodes which
> need to be unmapped at driver teardown or freeze_bdev() time.
[]
> diff --git a/include/linux/mm.h b/include/linux/mm.h
[]
> +#define PFN_FLAGS_MASK (~PAGE_MASK << (BITS_PER_LONG - PAGE_SHIFT))
> +#define PFN_SG_CHAIN (1UL << (BITS_PER_LONG - 1))
> +#define PFN_SG_LAST (1UL << (BITS_PER_LONG - 2))
> +#define PFN_DEV (1UL << (BITS_PER_LONG - 3))
> +#define PFN_MAP (1UL << (BITS_PER_LONG - 4))
[]
> diff --git a/include/linux/pfn.h b/include/linux/pfn.h
[]
> @@ -3,6 +3,15 @@
[]
> + * pfn_t: encapsulates a page-frame number that is optionally backed
> + * by memmap (struct page).  Whether a pfn_t has a 'struct page'
> + * backing is indicated by flags in the high bits of the value.
> + */
> +typedef struct {
> +	unsigned long val;
> +} pfn_t;
>  #endif

Perhaps this would be more intelligible as an
anonymous union of bit-fields and unsigned long.
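
Something along these lines, perhaps (a sketch only, not from the
patch; note that C bit-field ordering is implementation-defined, so the
field order here is merely illustrative):

  typedef union {
          unsigned long val;
          struct {
                  unsigned long pfn : BITS_PER_LONG - 4;
                  unsigned long map : 1;
                  unsigned long dev : 1;
                  unsigned long sg_last : 1;
                  unsigned long sg_chain : 1;
          };
  } pfn_t;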



^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 13/15] block, dax: make dax mappings opt-in by default
  2015-11-02  4:30   ` Dan Williams
@ 2015-11-03  0:32     ` Dave Chinner
  -1 siblings, 0 replies; 95+ messages in thread
From: Dave Chinner @ 2015-11-03  0:32 UTC (permalink / raw)
  To: Dan Williams; +Cc: axboe, jack, linux-nvdimm, linux-kernel, ross.zwisler, hch

On Sun, Nov 01, 2015 at 11:30:53PM -0500, Dan Williams wrote:
> Now that we have the ability to dynamically enable DAX for a raw block
> inode, make the behavior opt-in by default.  DAX does not have feature
> parity with pagecache backed mappings, so applications should knowingly
> enable DAX semantics.
> 
> Note, this is only for mappings returned to userspace.  For the
> synchronous usages of DAX, dax_do_io(), there is no semantic difference
> with the bio submission path, so that path remains default enabled.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  block/ioctl.c      |    3 +--
>  fs/block_dev.c     |   33 +++++++++++++++++++++++----------
>  include/linux/fs.h |    8 ++++++++
>  3 files changed, 32 insertions(+), 12 deletions(-)
> 
> diff --git a/block/ioctl.c b/block/ioctl.c
> index 205d57612fbd..c4c3a09d9ca9 100644
> --- a/block/ioctl.c
> +++ b/block/ioctl.c
> @@ -298,13 +298,12 @@ static inline int is_unrecognized_ioctl(int ret)
>  #ifdef CONFIG_FS_DAX
>  static int blkdev_set_dax(struct block_device *bdev, int n)
>  {
> -	struct gendisk *disk = bdev->bd_disk;
>  	int rc = 0;
>  
>  	if (n)
>  		n = S_DAX;
>  
> -	if (n && !disk->fops->direct_access)
> +	if (n && !blkdev_dax_capable(bdev))
>  		return -ENOTTY;
>  
>  	mutex_lock(&bdev->bd_inode->i_mutex);
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 13ce6d0ff7f6..ee34a31e6fa4 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -152,16 +152,37 @@ static struct inode *bdev_file_inode(struct file *file)
>  	return file->f_mapping->host;
>  }
>  
> +#ifdef CONFIG_FS_DAX
> +bool blkdev_dax_capable(struct block_device *bdev)
> +{
> +	struct gendisk *disk = bdev->bd_disk;
> +
> +	if (!disk->fops->direct_access)
> +		return false;
> +
> +	/*
> +	 * If the partition is not aligned on a page boundary, we can't
> +	 * do dax I/O to it.
> +	 */
> +	if ((bdev->bd_part->start_sect % (PAGE_SIZE / 512))
> +			|| (bdev->bd_part->nr_sects % (PAGE_SIZE / 512)))
> +		return false;
> +
> +	return true;

Where do you check that S_DAX has been enabled on the block device
now?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 14/15] dax: dirty extent notification
  2015-11-02  4:30   ` Dan Williams
@ 2015-11-03  1:16     ` Dave Chinner
  0 siblings, 0 replies; 95+ messages in thread
From: Dave Chinner @ 2015-11-03  1:16 UTC (permalink / raw)
  To: Dan Williams; +Cc: axboe, jack, linux-nvdimm, linux-kernel, ross.zwisler, hch

On Sun, Nov 01, 2015 at 11:30:58PM -0500, Dan Williams wrote:
> DAX-enabled block device drivers can use hints from fs/dax.c to
> optimize their internal tracking of potentially dirty cpu cache lines.
> If a DAX mapping is being used for synchronous operations, dax_do_io(),
> a dax-enabled block-driver knows that fs/dax.c will handle immediate
> flushing.  For asynchronous mappings, i.e.  returned to userspace via
> mmap, the driver can track active extents of the media for flushing.

So, essentially, you are marking the calls into the mapping functions
with BLKDAX_F_DIRTY when the mapping is requested for a write page
fault?  Hence allowing the block device to track "dirty pages"
exactly?

But, really, if we're going to use Ross's mapping tree patches that
use exceptional entries to track dirty pfns, why do we need this
special interface from DAX to the block device? Ross's changes will
track mmap'd ranges that are dirtied at the filesystem inode level,
and the fsync/writeback will trigger CPU cache writeback of those
dirty ranges. This will work for block devices that are mapped by
DAX, too, because they have an inode+mapping tree, too.

And if we are going to use Ross's infrastructure (which, when we
work the kinks out, I think we will), we really should change
dax_do_io() to track pfns that are dirtied this way, too. That will
allow us to get rid of all the cache flushing from the DAX layer
(they'll get pushed into fsync/writeback) and so we only take the
CPU cache flushing penalties when synchronous operations are
requested by userspace...

> We can later extend the DAX paths to indicate when an async mapping is
> "closed" allowing the active extents to be marked clean.

Yes, that's a basic feature of Ross's patches. Hence I think this
special case DAX<->bdev interface is the wrong direction to be
taking.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 02/15] dax: increase granularity of dax_clear_blocks() operations
  2015-11-03  0:51     ` Dave Chinner
@ 2015-11-03  3:27       ` Dan Williams
  0 siblings, 0 replies; 95+ messages in thread
From: Dan Williams @ 2015-11-03  3:27 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jens Axboe, Jan Kara, linux-nvdimm, linux-kernel, Jeff Moyer,
	Jan Kara, Ross Zwisler, Christoph Hellwig

On Mon, Nov 2, 2015 at 4:51 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Sun, Nov 01, 2015 at 11:29:53PM -0500, Dan Williams wrote:
>> dax_clear_blocks is currently performing a cond_resched() after every
>> PAGE_SIZE memset.  We need not check so frequently, for example md-raid
>> only calls cond_resched() at stripe granularity.  Also, in preparation
>> for introducing a dax_map_atomic() operation that temporarily pins a dax
>> mapping move the call to cond_resched() to the outer loop.
>>
>> Reviewed-by: Jan Kara <jack@suse.com>
>> Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  fs/dax.c |   27 ++++++++++++---------------
>>  1 file changed, 12 insertions(+), 15 deletions(-)
>>
>> diff --git a/fs/dax.c b/fs/dax.c
>> index 5dc33d788d50..f8e543839e5c 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -28,6 +28,7 @@
>>  #include <linux/sched.h>
>>  #include <linux/uio.h>
>>  #include <linux/vmstat.h>
>> +#include <linux/sizes.h>
>>
>>  int dax_clear_blocks(struct inode *inode, sector_t block, long size)
>>  {
>> @@ -38,24 +39,20 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
>>       do {
>>               void __pmem *addr;
>>               unsigned long pfn;
>> -             long count;
>> +             long count, sz;
>>
>> -             count = bdev_direct_access(bdev, sector, &addr, &pfn, size);
>> +             sz = min_t(long, size, SZ_1M);
>> +             count = bdev_direct_access(bdev, sector, &addr, &pfn, sz);
>>               if (count < 0)
>>                       return count;
>> -             BUG_ON(size < count);
>> -             while (count > 0) {
>> -                     unsigned pgsz = PAGE_SIZE - offset_in_page(addr);
>> -                     if (pgsz > count)
>> -                             pgsz = count;
>> -                     clear_pmem(addr, pgsz);
>> -                     addr += pgsz;
>> -                     size -= pgsz;
>> -                     count -= pgsz;
>> -                     BUG_ON(pgsz & 511);
>> -                     sector += pgsz / 512;
>> -                     cond_resched();
>> -             }
>> +             if (count < sz)
>> +                     sz = count;
>> +             clear_pmem(addr, sz);
>> +             addr += sz;
>> +             size -= sz;
>> +             BUG_ON(sz & 511);
>> +             sector += sz / 512;
>> +             cond_resched();
>>       } while (size);
>>
>>       wmb_pmem();
>
> dax_clear_blocks() needs to go away and be replaced by a driver
> level implementation of blkdev_issue_zerout(). This is effectively a
> block device operation (we're taking sector addresses and zeroing
> them), so it really belongs in the pmem drivers rather than the DAX
> code.
>
> I suspect a REQ_WRITE_SAME implementation is the way to go here, as
> then the filesystems can just call sb_issue_zerout() and the block
> layer zeroing will work on all types of storage without the
> filesystem having to care whether DAX is in use or not.
>
> Putting the implementation of the zeroing in the pmem drivers will
> enable the drivers to optimise the caching behaviour of block
> zeroing.  The synchronous cache flushing behaviour of this function
> is a performance killer as we are now block zeroing on allocation
> and that results in two synchronous data writes (zero on alloc,
> commit, write data, commit) for each page.
>
> The zeroing (and the data, for that matter) doesn't need to be
> committed to persistent store until the allocation is written and
> committed to the journal - that will happen with a REQ_FLUSH|REQ_FUA
> write, so it makes sense to deploy the big hammer and delay the
> blocking CPU cache flushes until the last possible moment in cases
> like this.

In pmem terms that would be a non-temporal memset plus a delayed
wmb_pmem at REQ_FLUSH time.  Better to write around the cache than
loop over the dirty-data issuing flushes after the fact.  We'll bump
the priority of the non-temporal memset implementation.
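
Roughly along these lines, with the obvious caveat that the
non-temporal memset helper does not exist yet (memset_nt() below is a
placeholder name):

        /* sketch: zero pmem without dirtying the CPU cache */
        static void pmem_zero_range(void __pmem *addr, size_t len)
        {
                /* memset_nt() is a placeholder for the non-temporal memset */
                memset_nt((void __force *)addr, 0, len);
                /*
                 * No wmb_pmem() here; draining to persistence is
                 * deferred until the driver sees REQ_FLUSH.
                 */
        }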

I like the idea of pushing this down into the driver vs your other
feedback of pushing dirty extent tracking up into the radix... but
more on that in the other thread.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 02/15] dax: increase granularity of dax_clear_blocks() operations
  2015-11-03  3:27       ` Dan Williams
@ 2015-11-03  4:48         ` Dave Chinner
  0 siblings, 0 replies; 95+ messages in thread
From: Dave Chinner @ 2015-11-03  4:48 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jens Axboe, Jan Kara, linux-nvdimm, linux-kernel, Jeff Moyer,
	Jan Kara, Ross Zwisler, Christoph Hellwig

On Mon, Nov 02, 2015 at 07:27:26PM -0800, Dan Williams wrote:
> On Mon, Nov 2, 2015 at 4:51 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Sun, Nov 01, 2015 at 11:29:53PM -0500, Dan Williams wrote:
> > The zeroing (and the data, for that matter) doesn't need to be
> > committed to persistent store until the allocation is written and
> > committed to the journal - that will happen with a REQ_FLUSH|REQ_FUA
> > write, so it makes sense to deploy the big hammer and delay the
> > blocking CPU cache flushes until the last possible moment in cases
> > like this.
> 
> In pmem terms that would be a non-temporal memset plus a delayed
> wmb_pmem at REQ_FLUSH time.  Better to write around the cache than
> loop over the dirty-data issuing flushes after the fact.  We'll bump
> the priority of the non-temporal memset implementation.

Why is it better to do two synchronous physical writes to memory
within a couple of microseconds of CPU time rather than writing them
through the cache and, in most cases, only doing one physical write
to memory in a separate context that expects to wait for a flush
to complete?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 14/15] dax: dirty extent notification
  2015-11-03  1:16     ` Dave Chinner
@ 2015-11-03  4:56       ` Dan Williams
  0 siblings, 0 replies; 95+ messages in thread
From: Dan Williams @ 2015-11-03  4:56 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jens Axboe, Jan Kara, linux-nvdimm, linux-kernel, Ross Zwisler,
	Christoph Hellwig

On Mon, Nov 2, 2015 at 5:16 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Sun, Nov 01, 2015 at 11:30:58PM -0500, Dan Williams wrote:
>> DAX-enabled block device drivers can use hints from fs/dax.c to
>> optimize their internal tracking of potentially dirty cpu cache lines.
>> If a DAX mapping is being used for synchronous operations, dax_do_io(),
>> a dax-enabled block-driver knows that fs/dax.c will handle immediate
>> flushing.  For asynchronous mappings, i.e.  returned to userspace via
>> mmap, the driver can track active extents of the media for flushing.
>
> So, essentially, you are marking the calls into the mapping calls
> with BLKDAX_F_DIRTY when the mapping is requested for a write page
> fault?  Hence allowing the block device to track "dirty pages"
> exactly?

Not pages, but larger extents (1 extent = 1/NUM_DAX_EXTENTS of the
total storage capacity), because tracking dirty mappings should be a
temporary compatibility hack and not a first-class citizen.
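
To make the granularity concrete, the sector-to-extent mapping is a
simple division (NUM_DAX_EXTENTS is the divisor mentioned above; the
helper below is illustrative only):

        /* sketch: one coarse dirty-tracking extent per slice of the disk */
        static unsigned int dax_extent_index(struct gendisk *disk,
                                             sector_t sector)
        {
                sector_t per_extent = get_capacity(disk) / NUM_DAX_EXTENTS;

                if (!per_extent)
                        return 0;
                return min_t(sector_t, sector / per_extent,
                             NUM_DAX_EXTENTS - 1);
        }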

> But, really, if we're going to use Ross's mapping tree patches that
> use exceptional entries to track dirty pfns, why do we need to this
> special interface from DAX to the block device? Ross's changes will
> track mmap'd ranges that are dirtied at the filesytem inode level,
> and the fsync/writeback will trigger CPU cache writeback of those
> dirty ranges. This will work for block devices that are mapped by
> DAX, too, because they have a inode+mapping tree, too.
>
> And if we are going to use Ross's infrastructure (which, when we
> work the kinks out of, I think we will), we really should change
> dax_do_io() to track pfns that are dirtied this way, too. That will
> allow us to get rid of all the cache flushing from the DAX layer
> (they'll get pushed into fsync/writeback) and so we only take the
> CPU cache flushing penalties when synchronous operations are
> requested by userspace...

No, we definitely can't do that.   I think your mental model of the
cache flushing is similar to the disk model where a small buffer is
flushed after a large streaming write.  Both Ross' patches and my
approach suffer from the same horror that the cache flushing is O(N)
currently, so we don't want to make it responsible for more data
ranges than strictly necessary.

>> We can later extend the DAX paths to indicate when an async mapping is
>> "closed" allowing the active extents to be marked clean.
>
> Yes, that's a basic feature of Ross's patches. Hence I think this
> special case DAX<->bdev interface is the wrong direction to be
> taking.

So here's my problem with the "track dirty mappings" in the core
mm/vfs approach: it's harder to unwind and delete when it turns out no
application actually needs it, or the platform gives us an O(1) flush
method that is independent of dirty pte tracking.

We have the NVML [1] library as the recommended method for
applications to interact with persistent memory and it is not using
fsync/msync for its synchronization primitives, it's managing the
cache directly.  The *only* user for tracking dirty DAX mappings is
unmodified legacy applications that do mmap I/O and call fsync/msync.

DAX in my opinion is not a transparent accelerator of all existing
apps, it's a targeted mechanism for applications ready to take
advantage of byte addressable persistent memory.  This is why I'm a
big supporter of your per-inode DAX control proposal.  The fact that
fsync is painful for large amounts of dirty data is a feature.  It
detects inodes that should have had DAX-disabled in the first
instance.  The only advantage of the radix approach is that the second
fsync after the big hit may be faster, but that still can't beat
either targeted disabling of DAX or updating the app to use NVML.

So, again, I remain to be convinced that we need to carry complexity
in the core kernel when we have the page cache to cover those cases.
The driver solution is a minimal extension of the data
bdev_direct_access() is already sending down to the driver, and covers
the gap without mm/fs entanglements while we figure out a longer term
solution.

[1]: https://github.com/pmem/nvml

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 02/15] dax: increase granularity of dax_clear_blocks() operations
  2015-11-03  4:48         ` Dave Chinner
@ 2015-11-03  5:31           ` Dan Williams
  0 siblings, 0 replies; 95+ messages in thread
From: Dan Williams @ 2015-11-03  5:31 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jens Axboe, Jan Kara, linux-nvdimm, linux-kernel, Jeff Moyer,
	Jan Kara, Ross Zwisler, Christoph Hellwig

On Mon, Nov 2, 2015 at 8:48 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Nov 02, 2015 at 07:27:26PM -0800, Dan Williams wrote:
>> On Mon, Nov 2, 2015 at 4:51 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Sun, Nov 01, 2015 at 11:29:53PM -0500, Dan Williams wrote:
>> > The zeroing (and the data, for that matter) doesn't need to be
>> > committed to persistent store until the allocation is written and
>> > committed to the journal - that will happen with a REQ_FLUSH|REQ_FUA
>> > write, so it makes sense to deploy the big hammer and delay the
>> > blocking CPU cache flushes until the last possible moment in cases
>> > like this.
>>
>> In pmem terms that would be a non-temporal memset plus a delayed
>> wmb_pmem at REQ_FLUSH time.  Better to write around the cache than
>> loop over the dirty-data issuing flushes after the fact.  We'll bump
>> the priority of the non-temporal memset implementation.
>
> Why is it better to do two synchronous physical writes to memory
> within a couple of microseconds of CPU time rather than writing them
> through the cache and, in most cases, only doing one physical write
> to memory in a separate context that expects to wait for a flush
> to complete?

With a switch to non-temporal writes they wouldn't be synchronous,
although it's doubtful that the subsequent writes after zeroing would
also hit the store buffer.

If we had a method to flush by physical-cache-way rather than a
virtual address then it would indeed be better to save up for one
final flush, but when we need to resort to looping through all the
virtual addresses that might have been touched, it gets expensive.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 14/15] dax: dirty extent notification
  2015-11-03  4:56       ` Dan Williams
@ 2015-11-03  5:40         ` Dave Chinner
  0 siblings, 0 replies; 95+ messages in thread
From: Dave Chinner @ 2015-11-03  5:40 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jens Axboe, Jan Kara, linux-nvdimm, linux-kernel, Ross Zwisler,
	Christoph Hellwig

On Mon, Nov 02, 2015 at 08:56:24PM -0800, Dan Williams wrote:
> On Mon, Nov 2, 2015 at 5:16 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Sun, Nov 01, 2015 at 11:30:58PM -0500, Dan Williams wrote:
> >> DAX-enabled block device drivers can use hints from fs/dax.c to
> >> optimize their internal tracking of potentially dirty cpu cache lines.
> >> If a DAX mapping is being used for synchronous operations, dax_do_io(),
> >> a dax-enabled block-driver knows that fs/dax.c will handle immediate
> >> flushing.  For asynchronous mappings, i.e.  returned to userspace via
> >> mmap, the driver can track active extents of the media for flushing.
> >
> > So, essentially, you are marking the calls into the mapping calls
> > with BLKDAX_F_DIRTY when the mapping is requested for a write page
> > fault?  Hence allowing the block device to track "dirty pages"
> > exactly?
> 
> Not pages, but larger extents (1 extent = 1/NUM_DAX_EXTENTS of the
> total storage capacity), because tracking dirty mappings should be
> temporary compatibility hack and not a first class citizen.
> 
> > But, really, if we're going to use Ross's mapping tree patches that
> > use exceptional entries to track dirty pfns, why do we need to this
> > special interface from DAX to the block device? Ross's changes will
> > track mmap'd ranges that are dirtied at the filesytem inode level,
> > and the fsync/writeback will trigger CPU cache writeback of those
> > dirty ranges. This will work for block devices that are mapped by
> > DAX, too, because they have a inode+mapping tree, too.
> >
> > And if we are going to use Ross's infrastructure (which, when we
> > work the kinks out of, I think we will), we really should change
> > dax_do_io() to track pfns that are dirtied this way, too. That will
> > allow us to get rid of all the cache flushing from the DAX layer
> > (they'll get pushed into fsync/writeback) and so we only take the
> > CPU cache flushing penalties when synchronous operations are
> > requested by userspace...
> 
> No, we definitely can't do that.   I think your mental model of the
> cache flushing is similar to the disk model where a small buffer is
> flushed after a large streaming write.  Both Ross' patches and my
> approach suffer from the same horror that the cache flushing is O(N)
> currently, so we don't want to make it responsible for more data
> ranges areas than is strictly necessary.

I didn't see anything that was O(N) in Ross's patches. What part of
the fsync algorithm that Ross proposed are you referring to here?

> >> We can later extend the DAX paths to indicate when an async mapping is
> >> "closed" allowing the active extents to be marked clean.
> >
> > Yes, that's a basic feature of Ross's patches. Hence I think this
> > special case DAX<->bdev interface is the wrong direction to be
> > taking.
> 
> So here's my problem with the "track dirty mappings" in the core
> mm/vfs approach, it's harder to unwind and delete when it turns out no
> application actually needs it, or the platform gives us an O(1) flush
> method that is independent of dirty pte tracking.
> 
> We have the NVML [1] library as the recommended method for
> applications to interact with persistent memory and it is not using
> fsync/msync for its synchronization primitives, it's managing the
> cache directly.  The *only* user for tracking dirty DAX mappings is
> unmodified legacy applications that do mmap I/O and call fsync/msync.

I'm pretty sure there are going to be many people still writing new
applications that use POSIX APIs they expect to work correctly on
pmem because, well, it's going to take 10 years before persistent
memory is common enough for most application developers to only
target storage via NVML.

The whole world is not crazy HFT applications that need to bypass
the kernel for *everything* because even a few nanoseconds of extra
latency matters.

> DAX in my opinion is not a transparent accelerator of all existing
> apps, it's a targeted mechanism for applications ready to take
> advantage of byte addressable persistent memory. 

And this is where we disagree. DAX is a method of allowing POSIX
compliant applications to get the best of both worlds - portability
with existing storage and filesystems, yet with the speed and byte
addressability of persistent storage through the use of mmap.

Applications designed specifically for persistent memory don't want
a general purpose, POSIX compatible filesystem underneath them. They
should be interacting directly with, and only with, your NVML
library. If the NVML library is implemented by using DAX on a POSIX
compatible, general purpose filesystem, then you're just going to
have to live with everything we need to do to make DAX work with
general purpose POSIX compatible applications.

DAX has always been intended as a *stopgap measure* designed to
bridge the gap between existing POSIX based storage APIs and PMEM
native filesystem implementations. You're advocating that DAX should
only be used by PMEM native applications using NVML and then saying
anything that might be needed for POSIX compatible behaviour is
unacceptable overhead...

> This is why I'm a
> big supporter of your per-inode DAX control proposal.  The fact that
> fsync is painful for large amounts of dirty data is a feature.  It
> detects inodes that should have had DAX-disabled in the first
> instance.

fsync is painful for any storage when there are large amounts of
dirty data. DAX is no different, and it's not a reason for saying
"don't use DAX". DAX + fsync should be faster than "buffered IO
through the page cache on pmem + fsync" because there is only one
memory copy being done in the DAX case.

The buffered IO case has all that per-page radix tree tracking in it,
writeback, etc. Yet:

# mount -o dax /dev/ram0 /mnt/scratch
# time xfs_io -fc "truncate 0" -c "pwrite -b 8m 0 3g" -c fsync /mnt/scratch/file
wrote 3221225472/3221225472 bytes at offset 0
3.000 GiB, 384 ops; 0:00:10.00 (305.746 MiB/sec and 38.2182 ops/sec)
0.00user 10.05system 0:10.05elapsed 100%CPU (0avgtext+0avgdata 10512maxresident)k
0inputs+0outputs (0major+2156minor)pagefaults 0swaps
# umount /mnt/scratch
# mount /dev/ram0 /mnt/scratch
# time xfs_io -fc "truncate 0" -c "pwrite -b 8m 0 3g" -c fsync /mnt/scratch/file
wrote 3221225472/3221225472 bytes at offset 0
3.000 GiB, 384 ops; 0:00:02.00 (1.218 GiB/sec and 155.9046 ops/sec)
0.00user 2.83system 0:02.86elapsed 99%CPU (0avgtext+0avgdata 10468maxresident)k
0inputs+0outputs (0major+2154minor)pagefaults 0swaps
#

So don't tell me that tracking dirty pages in the radix tree is too
slow for DAX and that DAX should not be used for POSIX IO based
applications - it should be as fast as buffered IO, if not faster,
and if it isn't then we've screwed up real bad. And right now, we're
screwing up real bad.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 02/15] dax: increase granularity of dax_clear_blocks() operations
  2015-11-03  5:31           ` Dan Williams
@ 2015-11-03  5:52             ` Dave Chinner
  0 siblings, 0 replies; 95+ messages in thread
From: Dave Chinner @ 2015-11-03  5:52 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jens Axboe, Jan Kara, linux-nvdimm, linux-kernel, Jeff Moyer,
	Jan Kara, Ross Zwisler, Christoph Hellwig

On Mon, Nov 02, 2015 at 09:31:11PM -0800, Dan Williams wrote:
> On Mon, Nov 2, 2015 at 8:48 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Nov 02, 2015 at 07:27:26PM -0800, Dan Williams wrote:
> >> On Mon, Nov 2, 2015 at 4:51 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> > On Sun, Nov 01, 2015 at 11:29:53PM -0500, Dan Williams wrote:
> >> > The zeroing (and the data, for that matter) doesn't need to be
> >> > committed to persistent store until the allocation is written and
> >> > committed to the journal - that will happen with a REQ_FLUSH|REQ_FUA
> >> > write, so it makes sense to deploy the big hammer and delay the
> >> > blocking CPU cache flushes until the last possible moment in cases
> >> > like this.
> >>
> >> In pmem terms that would be a non-temporal memset plus a delayed
> >> wmb_pmem at REQ_FLUSH time.  Better to write around the cache than
> >> loop over the dirty-data issuing flushes after the fact.  We'll bump
> >> the priority of the non-temporal memset implementation.
> >
> > Why is it better to do two synchronous physical writes to memory
> > within a couple of microseconds of CPU time rather than writing them
> > through the cache and, in most cases, only doing one physical write
> > to memory in a separate context that expects to wait for a flush
> > to complete?
> 
> With a switch to non-temporal writes they wouldn't be synchronous,
> although it's doubtful that the subsequent writes after zeroing would
> also hit the store buffer.
> 
> If we had a method to flush by physical-cache-way rather than a
> virtual address then it would indeed be better to save up for one
> final flush, but when we need to resort to looping through all the
> virtual addresses that might have touched it gets expensive.

msync() is for flushing userspace mmap'd address ranges back to
physical memory. fsync() is for flushing kernel addresses (i.e. as
returned by bdev_direct_access()) back to physical addresses.
msync() calls ->fsync() as part of its operation; fsync() does not
care about whether mmap has been sync'd first or not.

i.e. we don't care about random dirty userspace virtual mappings in
fsync() - if you have them then you need to call msync() first. So
we shouldn't ever be having to walk virtual addresses in fsync -
just the kaddr returned by bdev_direct_access() is all that fsync
needs to flush...
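
i.e. something along these lines (a sketch only; the dirty-range
lookup and error handling are elided, and clflush_cache_range() is
simply the x86 primitive for writing a range of cachelines back):

        /* sketch: flush one dirty block range via its kernel mapping */
        static long dax_writeback_range(struct block_device *bdev,
                                        sector_t sector, long size)
        {
                void __pmem *addr;
                unsigned long pfn;
                long len;

                len = bdev_direct_access(bdev, sector, &addr, &pfn, size);
                if (len < 0)
                        return len;
                clflush_cache_range((void __force *)addr, min(len, size));
                wmb_pmem();
                return len;
        }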

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 14/15] dax: dirty extent notification
  2015-11-03  5:40         ` Dave Chinner
@ 2015-11-03  7:20           ` Dan Williams
  0 siblings, 0 replies; 95+ messages in thread
From: Dan Williams @ 2015-11-03  7:20 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jens Axboe, Jan Kara, linux-nvdimm, linux-kernel, Ross Zwisler,
	Christoph Hellwig

On Mon, Nov 2, 2015 at 9:40 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Nov 02, 2015 at 08:56:24PM -0800, Dan Williams wrote:
>> No, we definitely can't do that.   I think your mental model of the
>> cache flushing is similar to the disk model where a small buffer is
>> flushed after a large streaming write.  Both Ross' patches and my
>> approach suffer from the same horror that the cache flushing is O(N)
>> currently, so we don't want to make it responsible for more data
>> ranges areas than is strictly necessary.
>
> I didn't see anything that was O(N) in Ross's patches. What part of
> the fsync algorithm that Ross proposed are you refering to here?

We have to issue clflush per touched virtual address rather than a
constant number of physical ways, or a flush-all instruction.
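
i.e. the flush cost is linear in the amount of dirtied data, along
the lines of this sketch (one clflushopt() per cacheline; x86 exposes
no set/way or flush-everything persistence primitive):

        /* sketch: flushing a range means touching every cacheline in it */
        static void flush_dirty_vaddrs(void *vaddr, size_t len)
        {
                void *p;

                for (p = vaddr; p < vaddr + len;
                                p += boot_cpu_data.x86_clflush_size)
                        clflushopt(p);
        }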

>> >> We can later extend the DAX paths to indicate when an async mapping is
>> >> "closed" allowing the active extents to be marked clean.
>> >
>> > Yes, that's a basic feature of Ross's patches. Hence I think this
>> > special case DAX<->bdev interface is the wrong direction to be
>> > taking.
>>
>> So here's my problem with the "track dirty mappings" in the core
>> mm/vfs approach, it's harder to unwind and delete when it turns out no
>> application actually needs it, or the platform gives us an O(1) flush
>> method that is independent of dirty pte tracking.
>>
>> We have the NVML [1] library as the recommended method for
>> applications to interact with persistent memory and it is not using
>> fsync/msync for its synchronization primitives, it's managing the
>> cache directly.  The *only* user for tracking dirty DAX mappings is
>> unmodified legacy applications that do mmap I/O and call fsync/msync.
>
> I'm pretty sure there are going to be many people still writing new
> applications that use POSIX APIs they expect to work correctly on
> pmem because, well, it's going to take 10 years before persistent
> memory is common enough for most application developers to only
> target storage via NVML.
>
> The whole world is not crazy HFT applications that need to bypass
> the kernel for *everything* because even a few nanoseconds of extra
> latency matters.

I agree with all of that...

>> DAX in my opinion is not a transparent accelerator of all existing
>> apps, it's a targeted mechanism for applications ready to take
>> advantage of byte addressable persistent memory.
>
> And this is where we disagree. DAX is a method of allowing POSIX
> compliant applications get the best of both worlds - portability
> with existing storage and filesystems, yet with the speed and byte
> addressiblity of persistent storage through the use of mmap.
>
> Applications designed specifically for persistent memory don't want
> a general purpose, POSIX compatible filesystem underneath them. The
> should be interacting directly with, and only with, your NVML
> library. If the NVML library is implemented by using DAX on a POSIX
> compatible, general purpose filesystem, then you're just going to
> have to live with everything we need to do to make DAX work with
> general purpose POSIX compatible applications.
>
> DAX has always been intended as a *stopgap measure* designed to
> bridge the gap between existing POSIX based storage APIs and PMEM
> native filesystem implementations. You're advocating that DAX should
> only be used by PMEM native applications using NVML and then saying
> anything that might be needed for POSIX compatible behaviour is
> unacceptible overhead...

Also agreed, up until your last sentence, which is not what I am
saying at all.  I didn't say it is unacceptable overhead; my solution
in the driver has the exact same overhead.

Where I think we actually disagree is on the acceptable cost of the
"flush cache" operation before the recommended solution becomes
locally disabling DAX, or requiring help from the platform to do this
operation more efficiently.  What I submit is unacceptable is having
the cpu loop over every address heading out to storage.  The radix
solution only makes the second fsync after the first potentially less
costly over time.

I don't think we'll need it long term, or so I hope.  The question
becomes: do we want to carry this complexity in the core, or push
selectively disabling DAX in the interim and keep the simple driver
approach for cases where it's not feasible to disable DAX?  For 4.4 we
have the practical matter of not having the time to get mm folks to
review the radix approach.

I'm not opposed to ripping out the driver solution in 4.5 when we have
the time to get Ross' implementation reviewed.  I'm also holding back
the get_user_page() patches until 4.5, and given the big fat comment
in write_protect_page() about gup-fast interactions we'll need to
think through similar implications.

>
>> This is why I'm a
>> big supporter of your per-inode DAX control proposal.  The fact that
>> fsync is painful for large amounts of dirty data is a feature.  It
>> detects inodes that should have had DAX-disabled in the first
>> instance.
>
> fsync is painful for any storage when there is large amounts of
> dirty data. DAX is no different, and it's not a reason for saying
> "don't use DAX". DAX + fsync should be faster than "buffered IO
> through the page cache on pmem + fsync" because there is only one
> memory copy being done in the DAX case.
>
> The buffered IO case has all that per-page radix tree tracking in it,
> writeback, etc. Yet:
>
> # mount -o dax /dev/ram0 /mnt/scratch
> # time xfs_io -fc "truncate 0" -c "pwrite -b 8m 0 3g" -c fsync /mnt/scratch/file
> wrote 3221225472/3221225472 bytes at offset 0
> 3.000 GiB, 384 ops; 0:00:10.00 (305.746 MiB/sec and 38.2182 ops/sec)
> 0.00user 10.05system 0:10.05elapsed 100%CPU (0avgtext+0avgdata 10512maxresident)k
> 0inputs+0outputs (0major+2156minor)pagefaults 0swaps
> # umount /mnt/scratch
> # mount /dev/ram0 /mnt/scratch
> # time xfs_io -fc "truncate 0" -c "pwrite -b 8m 0 3g" -c fsync /mnt/scratch/file
> wrote 3221225472/3221225472 bytes at offset 0
> 3.000 GiB, 384 ops; 0:00:02.00 (1.218 GiB/sec and 155.9046 ops/sec)
> 0.00user 2.83system 0:02.86elapsed 99%CPU (0avgtext+0avgdata 10468maxresident)k
> 0inputs+0outputs (0major+2154minor)pagefaults 0swaps
> #
>
> So don't tell me that tracking dirty pages in the radix tree too
> slow for DAX and that DAX should not be used for POSIX IO based
> applications - it should be as fast as buffered IO, if not faster,
> and if it isn't then we've screwed up real bad. And right now, we're
> screwing up real bad.

Again, it's not the dirty tracking in the radix tree I'm worried about;
it's looping through all the virtual addresses within those pages...

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 02/15] dax: increase granularity of dax_clear_blocks() operations
  2015-11-03  5:52             ` Dave Chinner
@ 2015-11-03  7:24               ` Dan Williams
  -1 siblings, 0 replies; 95+ messages in thread
From: Dan Williams @ 2015-11-03  7:24 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jens Axboe, Jan Kara, linux-nvdimm, linux-kernel, Jeff Moyer,
	Jan Kara, Ross Zwisler, Christoph Hellwig

On Mon, Nov 2, 2015 at 9:52 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Nov 02, 2015 at 09:31:11PM -0800, Dan Williams wrote:
>> On Mon, Nov 2, 2015 at 8:48 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Mon, Nov 02, 2015 at 07:27:26PM -0800, Dan Williams wrote:
>> >> On Mon, Nov 2, 2015 at 4:51 PM, Dave Chinner <david@fromorbit.com> wrote:
>> >> > On Sun, Nov 01, 2015 at 11:29:53PM -0500, Dan Williams wrote:
>> >> > The zeroing (and the data, for that matter) doesn't need to be
>> >> > committed to persistent store until the allocation is written and
>> >> > committed to the journal - that will happen with a REQ_FLUSH|REQ_FUA
>> >> > write, so it makes sense to deploy the big hammer and delay the
>> >> > blocking CPU cache flushes until the last possible moment in cases
>> >> > like this.
>> >>
>> >> In pmem terms that would be a non-temporal memset plus a delayed
>> >> wmb_pmem at REQ_FLUSH time.  Better to write around the cache than
>> >> loop over the dirty-data issuing flushes after the fact.  We'll bump
>> >> the priority of the non-temporal memset implementation.
>> >
>> > Why is it better to do two synchronous physical writes to memory
>> > within a couple of microseconds of CPU time rather than writing them
>> > through the cache and, in most cases, only doing one physical write
>> > to memory in a separate context that expects to wait for a flush
>> > to complete?
>>
>> With a switch to non-temporal writes they wouldn't be synchronous,
>> although it's doubtful that the subsequent writes after zeroing would
>> also hit the store buffer.
>>
>> If we had a method to flush by physical-cache-way rather than a
>> virtual address then it would indeed be better to save up for one
>> final flush, but when we need to resort to looping through all the
>> virtual addresses that might have touched it gets expensive.
>
> msync() is for flushing userspace mmap ranges addresses back to
> physical memory. fsync() is for flushing kernel addresses (i.e. as
> returned by bdev_direct_access()) back to physical addresses.
> msync() calls ->fsync() as part of it's operation, fsync() does not
> care about whether mmap has been sync'd first or not.
>
> i.e. we don't care about random dirty userspace virtual mappings in
> fsync() - if you have them then you need to call msync() first. So
> we shouldn't ever be having to walk virtual addresses in fsync -
> just the kaddr returned by bdev_direct_access() is all that fsync
> needs to flush...
>

Neither Ross' solution nor mine uses userspace addresses.  Which
comment of mine were you reacting to?

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 13/15] block, dax: make dax mappings opt-in by default
  2015-11-03  0:32     ` Dave Chinner
@ 2015-11-03  7:35       ` Dan Williams
  -1 siblings, 0 replies; 95+ messages in thread
From: Dan Williams @ 2015-11-03  7:35 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jens Axboe, Jan Kara, linux-nvdimm, linux-kernel, Ross Zwisler,
	Christoph Hellwig

On Mon, Nov 2, 2015 at 4:32 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Sun, Nov 01, 2015 at 11:30:53PM -0500, Dan Williams wrote:
>> Now that we have the ability to dynamically enable DAX for a raw block
>> inode, make the behavior opt-in by default.  DAX does not have feature
>> parity with pagecache backed mappings, so applications should knowingly
>> enable DAX semantics.
>>
>> Note, this is only for mappings returned to userspace.  For the
>> synchronous usages of DAX, dax_do_io(), there is no semantic difference
>> with the bio submission path, so that path remains default enabled.
>>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  block/ioctl.c      |    3 +--
>>  fs/block_dev.c     |   33 +++++++++++++++++++++++----------
>>  include/linux/fs.h |    8 ++++++++
>>  3 files changed, 32 insertions(+), 12 deletions(-)
>>
>> diff --git a/block/ioctl.c b/block/ioctl.c
>> index 205d57612fbd..c4c3a09d9ca9 100644
>> --- a/block/ioctl.c
>> +++ b/block/ioctl.c
>> @@ -298,13 +298,12 @@ static inline int is_unrecognized_ioctl(int ret)
>>  #ifdef CONFIG_FS_DAX
>>  static int blkdev_set_dax(struct block_device *bdev, int n)
>>  {
>> -     struct gendisk *disk = bdev->bd_disk;
>>       int rc = 0;
>>
>>       if (n)
>>               n = S_DAX;
>>
>> -     if (n && !disk->fops->direct_access)
>> +     if (n && !blkdev_dax_capable(bdev))
>>               return -ENOTTY;
>>
>>       mutex_lock(&bdev->bd_inode->i_mutex);
>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index 13ce6d0ff7f6..ee34a31e6fa4 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -152,16 +152,37 @@ static struct inode *bdev_file_inode(struct file *file)
>>       return file->f_mapping->host;
>>  }
>>
>> +#ifdef CONFIG_FS_DAX
>> +bool blkdev_dax_capable(struct block_device *bdev)
>> +{
>> +     struct gendisk *disk = bdev->bd_disk;
>> +
>> +     if (!disk->fops->direct_access)
>> +             return false;
>> +
>> +     /*
>> +      * If the partition is not aligned on a page boundary, we can't
>> +      * do dax I/O to it.
>> +      */
>> +     if ((bdev->bd_part->start_sect % (PAGE_SIZE / 512))
>> +                     || (bdev->bd_part->nr_sects % (PAGE_SIZE / 512)))
>> +             return false;
>> +
>> +     return true;
>
> Where do you check that S_DAX has been enabled on the block device
> now?
>

Only in the mmap path:

static int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
{
        struct inode *bd_inode = bdev_file_inode(file);
        struct block_device *bdev = I_BDEV(bd_inode);

        file_accessed(file);
        mutex_lock(&bd_inode->i_mutex);
        bdev->bd_map_count++;
        if (IS_DAX(bd_inode)) {
                vma->vm_ops = &blkdev_dax_vm_ops;
                vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
        } else {
                vma->vm_ops = &blkdev_default_vm_ops;
        }
        mutex_unlock(&bd_inode->i_mutex);

        return 0;
}

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 02/15] dax: increase granularity of dax_clear_blocks() operations
  2015-11-03  5:31           ` Dan Williams
@ 2015-11-03 16:21             ` Jan Kara
  -1 siblings, 0 replies; 95+ messages in thread
From: Jan Kara @ 2015-11-03 16:21 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Chinner, Jens Axboe, Jan Kara, linux-nvdimm, linux-kernel,
	Jeff Moyer, Jan Kara, Ross Zwisler, Christoph Hellwig

On Mon 02-11-15 21:31:11, Dan Williams wrote:
> On Mon, Nov 2, 2015 at 8:48 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Nov 02, 2015 at 07:27:26PM -0800, Dan Williams wrote:
> >> On Mon, Nov 2, 2015 at 4:51 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> > On Sun, Nov 01, 2015 at 11:29:53PM -0500, Dan Williams wrote:
> >> > The zeroing (and the data, for that matter) doesn't need to be
> >> > committed to persistent store until the allocation is written and
> >> > committed to the journal - that will happen with a REQ_FLUSH|REQ_FUA
> >> > write, so it makes sense to deploy the big hammer and delay the
> >> > blocking CPU cache flushes until the last possible moment in cases
> >> > like this.
> >>
> >> In pmem terms that would be a non-temporal memset plus a delayed
> >> wmb_pmem at REQ_FLUSH time.  Better to write around the cache than
> >> loop over the dirty-data issuing flushes after the fact.  We'll bump
> >> the priority of the non-temporal memset implementation.
> >
> > Why is it better to do two synchronous physical writes to memory
> > within a couple of microseconds of CPU time rather than writing them
> > through the cache and, in most cases, only doing one physical write
> > to memory in a separate context that expects to wait for a flush
> > to complete?
> 
> With a switch to non-temporal writes they wouldn't be synchronous,
> although it's doubtful that the subsequent writes after zeroing would
> also hit the store buffer.
> 
> If we had a method to flush by physical-cache-way rather than a
> virtual address then it would indeed be better to save up for one
> final flush, but when we need to resort to looping through all the
> virtual addresses that might have touched it gets expensive.

Like Dave, I'm somewhat confused by your use of "virtual addresses"
and I wasn't able to figure out what exactly you are speaking about. In
Ross' patches, fsync will iterate over all 4 KB ranges (they would be pages
if we had page cache) of the file that got dirtied and call wb_cache_pmem()
for each corresponding "physical block" - where "physical block" actually
ends up being a physical address in pmem. Is this iteration what you find
too costly?
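
Roughly, the per-range flush I have in mind looks like this (a sketch
only; wb_cache_pmem() is from Ross' series and its exact signature is
my assumption):

static void dax_flush_one_range(struct block_device *bdev, sector_t sector)
{
	void __pmem *kaddr;
	unsigned long pfn;
	long len;

	/* translate the file's "physical block" to a kernel pmem address */
	len = bdev_direct_access(bdev, sector, &kaddr, &pfn, PAGE_SIZE);
	if (len < PAGE_SIZE)
		return;			/* error handling elided */

	/* write back the cache lines covering this 4 KB range */
	wb_cache_pmem(kaddr, PAGE_SIZE);
}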

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 02/15] dax: increase granularity of dax_clear_blocks() operations
  2015-11-03  5:31           ` Dan Williams
@ 2015-11-03 17:57             ` Ross Zwisler
  -1 siblings, 0 replies; 95+ messages in thread
From: Ross Zwisler @ 2015-11-03 17:57 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Chinner, Jens Axboe, Jan Kara, linux-nvdimm, linux-kernel,
	Jeff Moyer, Jan Kara, Ross Zwisler, Christoph Hellwig

On Mon, Nov 02, 2015 at 09:31:11PM -0800, Dan Williams wrote:
> On Mon, Nov 2, 2015 at 8:48 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Nov 02, 2015 at 07:27:26PM -0800, Dan Williams wrote:
> >> On Mon, Nov 2, 2015 at 4:51 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> > On Sun, Nov 01, 2015 at 11:29:53PM -0500, Dan Williams wrote:
> >> > The zeroing (and the data, for that matter) doesn't need to be
> >> > committed to persistent store until the allocation is written and
> >> > committed to the journal - that will happen with a REQ_FLUSH|REQ_FUA
> >> > write, so it makes sense to deploy the big hammer and delay the
> >> > blocking CPU cache flushes until the last possible moment in cases
> >> > like this.
> >>
> >> In pmem terms that would be a non-temporal memset plus a delayed
> >> wmb_pmem at REQ_FLUSH time.  Better to write around the cache than
> >> loop over the dirty-data issuing flushes after the fact.  We'll bump
> >> the priority of the non-temporal memset implementation.
> >
> > Why is it better to do two synchronous physical writes to memory
> > within a couple of microseconds of CPU time rather than writing them
> > through the cache and, in most cases, only doing one physical write
> > to memory in a separate context that expects to wait for a flush
> > to complete?
> 
> With a switch to non-temporal writes they wouldn't be synchronous,
> although it's doubtful that the subsequent writes after zeroing would
> also hit the store buffer.
> 
> If we had a method to flush by physical-cache-way rather than a
> virtual address then it would indeed be better to save up for one
> final flush, but when we need to resort to looping through all the
> virtual addresses that might have touched it gets expensive.

I agree with the idea that we should avoid the "big hammer" flushing in
response to REQ_FLUSH.  Here are the steps that are needed to make sure that
something is durable on media with PMEM/DAX:

1) Write, either with non-temporal stores or with stores that use the
processor cache

2) If you wrote using the processor cache, flush or write back the processor
cache

3) wmb_pmem(), synchronizing all non-temporal writes and flushes durably to
media.

The PMEM driver does all of its I/O using steps 1 and 3 with non-temporal
stores, while mmaps handed out to userspace can use cached writes, so on
fsync/msync we do a bunch of flushes for step 2.  In either case I think we
should have the PMEM driver just do step 3, the wmb_pmem(), in response to
REQ_FLUSH.  This allows the zeroing code to just do non-temporal writes of
zeros, the DAX fsync/msync code to just do flushes (which is what my patch
set already does), and leaves the wmb_pmem() to the PMEM driver at
REQ_FLUSH time.

This makes the burden of REQ_FLUSH bearable for the PMEM driver, allowing us
to avoid looping through potentially terabytes of PMEM on each REQ_FLUSH bio.

This just means that the layers above the PMEM code either need to use
non-temporal writes for their I/Os, or do flushing, which I don't think is too
onerous.
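
In code form the split is roughly (illustrative only - wb_cache_pmem()
is from my fsync/msync series and is not upstream yet):

	/* step 1: driver I/O and zeroing use non-temporal stores */
	memcpy_to_pmem(pmem_addr, src, len);

	/* step 2: DAX fsync/msync writes back lines dirtied via the cache */
	wb_cache_pmem(pmem_addr, len);

	/* step 3: the pmem driver turns REQ_FLUSH into the single ordering
	 * point that makes steps 1 and 2 durable on media */
	wmb_pmem();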

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 03/15] block, dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()
  2015-11-02  4:29   ` Dan Williams
@ 2015-11-03 19:01     ` Ross Zwisler
  -1 siblings, 0 replies; 95+ messages in thread
From: Ross Zwisler @ 2015-11-03 19:01 UTC (permalink / raw)
  To: Dan Williams
  Cc: axboe, Jens Axboe, jack, linux-nvdimm, david, linux-kernel,
	Jeff Moyer, Jan Kara, ross.zwisler, hch

On Sun, Nov 01, 2015 at 11:29:58PM -0500, Dan Williams wrote:
> The DAX implementation needs to protect new calls to ->direct_access()
> and usage of its return value against unbind of the underlying block
> device.  Use blk_queue_enter()/blk_queue_exit() to either prevent
> blk_cleanup_queue() from proceeding, or fail the dax_map_atomic() if the
> request_queue is being torn down.
> 
> Cc: Jan Kara <jack@suse.com>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
<>
> @@ -42,9 +76,9 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
>  		long count, sz;
>  
>  		sz = min_t(long, size, SZ_1M);
> -		count = bdev_direct_access(bdev, sector, &addr, &pfn, sz);
> -		if (count < 0)
> -			return count;
> +		addr = __dax_map_atomic(bdev, sector, size, &pfn, &count);

I think you can use dax_map_atomic() here instead, allowing you to avoid
having a local pfn variable that otherwise goes unused.

> @@ -138,21 +176,27 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
>  				bh->b_size -= done;
>  			}
>  
> -			hole = iov_iter_rw(iter) != WRITE && !buffer_written(bh);
> +			hole = rw == READ && !buffer_written(bh);
>  			if (hole) {
>  				addr = NULL;
>  				size = bh->b_size - first;
>  			} else {
> -				retval = dax_get_addr(bh, &addr, blkbits);
> -				if (retval < 0)
> +				dax_unmap_atomic(bdev, kmap);
> +				kmap = __dax_map_atomic(bdev,
> +						to_sector(bh, inode),
> +						bh->b_size, &pfn, &map_len);

Same as above, you can use dax_map_atomic() here instead and nix the pfn variable.

> @@ -305,11 +353,10 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
>  		goto out;
>  	}
>  
> -	error = bdev_direct_access(bh->b_bdev, sector, &addr, &pfn, bh->b_size);
> -	if (error < 0)
> -		goto out;
> -	if (error < PAGE_SIZE) {
> -		error = -EIO;
> +	addr = __dax_map_atomic(bdev, to_sector(bh, inode), bh->b_size,
> +			&pfn, NULL);
> +	if (IS_ERR(addr)) {
> +		error = PTR_ERR(addr);

Just a note that we lost the check for bdev_direct_access() returning less
than PAGE_SIZE.  Are we sure this can't happen and that it's safe to remove
the check?

> @@ -609,15 +655,20 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  		result = VM_FAULT_NOPAGE;
>  		spin_unlock(ptl);
>  	} else {
> -		sector = bh.b_blocknr << (blkbits - 9);
> -		length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn,
> -						bh.b_size);
> -		if (length < 0) {
> +		long length;
> +		unsigned long pfn;
> +		void __pmem *kaddr = __dax_map_atomic(bdev,
> +				to_sector(&bh, inode), HPAGE_SIZE, &pfn,
> +				&length);

Let's use PMD_SIZE instead of HPAGE_SIZE to be consistent with the rest of the
DAX code.

> +
> +		if (IS_ERR(kaddr)) {
>  			result = VM_FAULT_SIGBUS;
>  			goto out;
>  		}
> -		if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR))
> +		if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR)) {
> +			dax_unmap_atomic(bdev, kaddr);
>  			goto fallback;
> +		}
>  
>  		if (buffer_unwritten(&bh) || buffer_new(&bh)) {
>  			clear_pmem(kaddr, HPAGE_SIZE);

Ditto, let's use PMD_SIZE for consistency (I realize this was changed
earlier in the series).

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 03/15] block, dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()
  2015-11-03 19:01     ` Ross Zwisler
  (?)
@ 2015-11-03 19:09     ` Jeff Moyer
  -1 siblings, 0 replies; 95+ messages in thread
From: Jeff Moyer @ 2015-11-03 19:09 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Dan Williams, axboe, Jens Axboe, jack, linux-nvdimm, david,
	linux-kernel, Jan Kara, hch

Ross Zwisler <ross.zwisler@linux.intel.com> writes:

> On Sun, Nov 01, 2015 at 11:29:58PM -0500, Dan Williams wrote:
>> The DAX implementation needs to protect new calls to ->direct_access()
>> and usage of its return value against unbind of the underlying block
>> device.  Use blk_queue_enter()/blk_queue_exit() to either prevent
>> blk_cleanup_queue() from proceeding, or fail the dax_map_atomic() if the
>> request_queue is being torn down.
>> 
>> Cc: Jan Kara <jack@suse.com>
>> Cc: Jens Axboe <axboe@kernel.dk>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Dave Chinner <david@fromorbit.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
> <>
>> @@ -42,9 +76,9 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
>>  		long count, sz;
>>  
>>  		sz = min_t(long, size, SZ_1M);
>> -		count = bdev_direct_access(bdev, sector, &addr, &pfn, sz);
>> -		if (count < 0)
>> -			return count;
>> +		addr = __dax_map_atomic(bdev, sector, size, &pfn, &count);
>
> I think you can use dax_map_atomic() here instead, allowing you to avoid
> having a local pfn variable that otherwise goes unused.

But dax_map_atomic() doesn't return the count, and I believe that is
what's used here.

>> @@ -138,21 +176,27 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
>>  				bh->b_size -= done;
>>  			}
>>  
>> -			hole = iov_iter_rw(iter) != WRITE && !buffer_written(bh);
>> +			hole = rw == READ && !buffer_written(bh);
>>  			if (hole) {
>>  				addr = NULL;
>>  				size = bh->b_size - first;
>>  			} else {
>> -				retval = dax_get_addr(bh, &addr, blkbits);
>> -				if (retval < 0)
>> +				dax_unmap_atomic(bdev, kmap);
>> +				kmap = __dax_map_atomic(bdev,
>> +						to_sector(bh, inode),
>> +						bh->b_size, &pfn, &map_len);
>
> Same as above, you can use dax_map_atomic() here instead and nix the pfn variable.

same as above.  ;-)

>> @@ -305,11 +353,10 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
>>  		goto out;
>>  	}
>>  
>> -	error = bdev_direct_access(bh->b_bdev, sector, &addr, &pfn, bh->b_size);
>> -	if (error < 0)
>> -		goto out;
>> -	if (error < PAGE_SIZE) {
>> -		error = -EIO;
>> +	addr = __dax_map_atomic(bdev, to_sector(bh, inode), bh->b_size,
>> +			&pfn, NULL);
>> +	if (IS_ERR(addr)) {
>> +		error = PTR_ERR(addr);
>
> Just a note that we lost the check for bdev_direct_access() returning less
> than PAGE_SIZE.  Are we sure this can't happen and that it's safe to remove
> the check?

Yes, it's safe, I checked during my review.  This page size assumption
is present throughout the file, and makes reviewing it very frustrating
for the uninitiated.  I think that's worth a follow-on cleanup patch.

Cheers,
Jeff

>> @@ -609,15 +655,20 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>>  		result = VM_FAULT_NOPAGE;
>>  		spin_unlock(ptl);
>>  	} else {
>> -		sector = bh.b_blocknr << (blkbits - 9);
>> -		length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn,
>> -						bh.b_size);
>> -		if (length < 0) {
>> +		long length;
>> +		unsigned long pfn;
>> +		void __pmem *kaddr = __dax_map_atomic(bdev,
>> +				to_sector(&bh, inode), HPAGE_SIZE, &pfn,
>> +				&length);
>
> Let's use PMD_SIZE instead of HPAGE_SIZE to be consistent with the rest of the
> DAX code.
>
>> +
>> +		if (IS_ERR(kaddr)) {
>>  			result = VM_FAULT_SIGBUS;
>>  			goto out;
>>  		}
>> -		if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR))
>> +		if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR)) {
>> +			dax_unmap_atomic(bdev, kaddr);
>>  			goto fallback;
>> +		}
>>  
>>  		if (buffer_unwritten(&bh) || buffer_new(&bh)) {
>>  			clear_pmem(kaddr, HPAGE_SIZE);
>
> Ditto, let's use PMD_SIZE for consistency (I realize this was changed ealier
> in the series).

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 04/15] libnvdimm, pmem: move request_queue allocation earlier in probe
  2015-11-02  4:30   ` Dan Williams
@ 2015-11-03 19:15     ` Ross Zwisler
  -1 siblings, 0 replies; 95+ messages in thread
From: Ross Zwisler @ 2015-11-03 19:15 UTC (permalink / raw)
  To: Dan Williams
  Cc: axboe, jack, linux-nvdimm, david, linux-kernel, ross.zwisler, hch

On Sun, Nov 01, 2015 at 11:30:04PM -0500, Dan Williams wrote:
> Before the dynamically allocated struct pages from devm_memremap_pages()
> can be put to use outside the driver, we need a mechanism to track
> whether they are still in use at teardown.  Towards that goal reorder
> the initialization sequence to allow the 'q_usage_counter' from the
> request_queue to be used by the devm_memremap_pages() implementation (in
> subsequent patches).
> 
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
<>
> @@ -150,16 +151,23 @@ static struct pmem_device *pmem_alloc(struct device *dev,
>  		return ERR_PTR(-EBUSY);
>  	}
>  
> -	if (pmem_should_map_pages(dev))
> +	q = blk_alloc_queue_node(GFP_KERNEL, dev_to_node(dev));
> +	if (!q)
> +		return ERR_PTR(-ENOMEM);
> +
> +	if (pmem_should_map_pages(dev)) {

No need to introduce braces for this if().

Otherwise this looks fine.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 05/15] libnvdimm, pmem: fix size trim in pmem_direct_access()
  2015-11-02  4:30   ` Dan Williams
@ 2015-11-03 19:32     ` Ross Zwisler
  -1 siblings, 0 replies; 95+ messages in thread
From: Ross Zwisler @ 2015-11-03 19:32 UTC (permalink / raw)
  To: Dan Williams
  Cc: axboe, jack, linux-nvdimm, david, linux-kernel, stable,
	ross.zwisler, hch

On Sun, Nov 01, 2015 at 11:30:10PM -0500, Dan Williams wrote:
> This masking prevents access to the end of the device via dax_do_io(),
> and is unnecessary as arch_add_memory() would have rejected an unaligned
> allocation.
> 
> Cc: <stable@vger.kernel.org>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/nvdimm/pmem.c |   17 +++--------------
>  1 file changed, 3 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index e46988fbdee5..93472953e231 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -100,26 +100,15 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
>  }
>  
>  static long pmem_direct_access(struct block_device *bdev, sector_t sector,
> -		      void __pmem **kaddr, unsigned long *pfn)
> +		      void __pmem **kaddr, pfn_t *pfn)

It seems kind of weird to change only this instance of direct_access() to
have the last argument be a pfn_t instead of an unsigned long.  If pfn_t is
more descriptive (I think it is), should we update the definition in struct
block_device_operations and all the other implementors of direct_access as
well?  If that's touching too much, let's do them all together later, but
let's not change one now and have them be inconsistent.
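
For reference, the prototype change being discussed is essentially the
following (current signature quoted from memory, so double check it
against include/linux/blkdev.h):

	/* today */
	long (*direct_access)(struct block_device *bdev, sector_t sector,
			void __pmem **kaddr, unsigned long *pfn);

	/* with this patch, for pmem only */
	long (*direct_access)(struct block_device *bdev, sector_t sector,
			void __pmem **kaddr, pfn_t *pfn);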

>  {
>  	struct pmem_device *pmem = bdev->bd_disk->private_data;
>  	resource_size_t offset = sector * 512 + pmem->data_offset;
> -	resource_size_t size;
>  
> -	if (pmem->data_offset) {
> -		/*
> -		 * Limit the direct_access() size to what is covered by
> -		 * the memmap
> -		 */
> -		size = (pmem->size - offset) & ~ND_PFN_MASK;
> -	} else
> -		size = pmem->size - offset;
> -
> -	/* FIXME convert DAX to comprehend that this mapping has a lifetime */
>  	*kaddr = pmem->virt_addr + offset;
> -	*pfn = (pmem->phys_addr + offset) >> PAGE_SHIFT;
> +	*pfn = __phys_to_pfn(pmem->phys_addr + offset, pmem->pfn_flags);

__phys_to_pfn() only takes a single argument (the paddr) in v4.3,
jens/for-4.4/integrity and in nvdimm/libnvdimm-for-next.  Is this second
argument of pfn_flags actually correct?
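
(For reference, the generic definition is a single-argument macro along
the lines of:

	#define __phys_to_pfn(paddr)	((unsigned long)((paddr) >> PAGE_SHIFT))

quoted from memory, so double check the exact form.)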

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 13/15] block, dax: make dax mappings opt-in by default
  2015-11-03  7:35       ` Dan Williams
@ 2015-11-03 20:20         ` Dave Chinner
  -1 siblings, 0 replies; 95+ messages in thread
From: Dave Chinner @ 2015-11-03 20:20 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jens Axboe, Jan Kara, linux-nvdimm, linux-kernel, Ross Zwisler,
	Christoph Hellwig

On Mon, Nov 02, 2015 at 11:35:04PM -0800, Dan Williams wrote:
> On Mon, Nov 2, 2015 at 4:32 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Sun, Nov 01, 2015 at 11:30:53PM -0500, Dan Williams wrote:
> >> Now that we have the ability to dynamically enable DAX for a raw block
> >> inode, make the behavior opt-in by default.  DAX does not have feature
> >> parity with pagecache backed mappings, so applications should knowingly
> >> enable DAX semantics.
> >>
> >> Note, this is only for mappings returned to userspace.  For the
> >> synchronous usages of DAX, dax_do_io(), there is no semantic difference
> >> with the bio submission path, so that path remains default enabled.
> >>
> >> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> >> ---
> >>  block/ioctl.c      |    3 +--
> >>  fs/block_dev.c     |   33 +++++++++++++++++++++++----------
> >>  include/linux/fs.h |    8 ++++++++
> >>  3 files changed, 32 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/block/ioctl.c b/block/ioctl.c
> >> index 205d57612fbd..c4c3a09d9ca9 100644
> >> --- a/block/ioctl.c
> >> +++ b/block/ioctl.c
> >> @@ -298,13 +298,12 @@ static inline int is_unrecognized_ioctl(int ret)
> >>  #ifdef CONFIG_FS_DAX
> >>  static int blkdev_set_dax(struct block_device *bdev, int n)
> >>  {
> >> -     struct gendisk *disk = bdev->bd_disk;
> >>       int rc = 0;
> >>
> >>       if (n)
> >>               n = S_DAX;
> >>
> >> -     if (n && !disk->fops->direct_access)
> >> +     if (n && !blkdev_dax_capable(bdev))
> >>               return -ENOTTY;
> >>
> >>       mutex_lock(&bdev->bd_inode->i_mutex);
> >> diff --git a/fs/block_dev.c b/fs/block_dev.c
> >> index 13ce6d0ff7f6..ee34a31e6fa4 100644
> >> --- a/fs/block_dev.c
> >> +++ b/fs/block_dev.c
> >> @@ -152,16 +152,37 @@ static struct inode *bdev_file_inode(struct file *file)
> >>       return file->f_mapping->host;
> >>  }
> >>
> >> +#ifdef CONFIG_FS_DAX
> >> +bool blkdev_dax_capable(struct block_device *bdev)
> >> +{
> >> +     struct gendisk *disk = bdev->bd_disk;
> >> +
> >> +     if (!disk->fops->direct_access)
> >> +             return false;
> >> +
> >> +     /*
> >> +      * If the partition is not aligned on a page boundary, we can't
> >> +      * do dax I/O to it.
> >> +      */
> >> +     if ((bdev->bd_part->start_sect % (PAGE_SIZE / 512))
> >> +                     || (bdev->bd_part->nr_sects % (PAGE_SIZE / 512)))
> >> +             return false;
> >> +
> >> +     return true;
> >
> > Where do you check that S_DAX has been enabled on the block device
> > now?
> >
> 
> Only in the mmap path:

which means blkdev_direct_IO() is now always going to go down the
dax_do_io() path for any driver with a ->direct_access method rather
than the direct IO path, regardless of whether DAX is enabled on the
device or not.

That really seems wrong to me - you've replaced explicit "is DAX
enabled" checks with "is DAX possible" checks, and so DAX paths are
used regardless of whether DAX is enabled or not. And it's not
obvious why this is done, nor is it now obvious how DAX interacts
with the block device.

This really seems like a step backwards to me.
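
To make the objection concrete, a simplified sketch of the explicit gate being argued for, keying the DAX I/O path off the inode's S_DAX flag rather than off the mere presence of ->direct_access (arguments approximate the 4.3-era code; treat as illustrative only):

static ssize_t blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
				loff_t offset)
{
	struct inode *inode = iocb->ki_filp->f_mapping->host;

	if (IS_DAX(inode))	/* S_DAX set, i.e. DAX explicitly enabled */
		return dax_do_io(iocb, inode, iter, offset, blkdev_get_block,
				NULL, DIO_SKIP_DIO_COUNT);

	/* DAX-capable but not enabled: take the normal direct I/O path */
	return __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter, offset,
				blkdev_get_block, NULL, NULL,
				DIO_SKIP_DIO_COUNT);
}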

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 14/15] dax: dirty extent notification
  2015-11-03  7:20           ` Dan Williams
@ 2015-11-03 20:51             ` Dave Chinner
  -1 siblings, 0 replies; 95+ messages in thread
From: Dave Chinner @ 2015-11-03 20:51 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jens Axboe, Jan Kara, linux-nvdimm, linux-kernel, Ross Zwisler,
	Christoph Hellwig

On Mon, Nov 02, 2015 at 11:20:49PM -0800, Dan Williams wrote:
> On Mon, Nov 2, 2015 at 9:40 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Nov 02, 2015 at 08:56:24PM -0800, Dan Williams wrote:
> >> No, we definitely can't do that.   I think your mental model of the
> >> cache flushing is similar to the disk model where a small buffer is
> >> flushed after a large streaming write.  Both Ross' patches and my
> >> approach suffer from the same horror that the cache flushing is O(N)
> >> currently, so we don't want to make it responsible for more data
> >> ranges areas than is strictly necessary.
> >
> > I didn't see anything that was O(N) in Ross's patches. What part of
> > the fsync algorithm that Ross proposed are you refering to here?
> 
> We have to issue clflush per touched virtual address rather than a
> constant number of physical ways, or a flush-all instruction.
.....
> > So don't tell me that tracking dirty pages in the radix tree too
> > slow for DAX and that DAX should not be used for POSIX IO based
> > applications - it should be as fast as buffered IO, if not faster,
> > and if it isn't then we've screwed up real bad. And right now, we're
> > screwing up real bad.
> 
> Again, it's not the dirty tracking in the radix I'm worried about it's
> looping through all the virtual addresses within those pages..

So, let me summarise what I think you've just said. You are

1. fine with looping through the virtual addresses doing cache flushes
   synchronously when doing IO despite it having significant
   latency and performance costs.

2. Happy to hack a method into DAX to bypass the filesystems by
   pushing information to the block device for it to track regions that
   need cache flushes, then add infrastructure to the block device to
   track those dirty regions and then walk those addresses and issue
   cache flushes when the filesystem issues a REQ_FLUSH IO regardless
   of whether the filesystem actually needs those cachelines flushed
   for that specific IO?

3. Not happy to use the generic mm/vfs level infrastructure
   architected specifically to provide the exact asynchronous
   cache flushing/writeback semantics we require because it will
   cause too many cache flushes, even though the number of cache
   flushes will be, at worst, the same as in 2).


1) will work, but as we can see it is *slow*. 3) is what Ross is
implementing - it's a tried and tested architecture that all mm/fs
developers understand, and his explanation of why it will work for
pmem is pretty solid and completely platform/hardware architecture
independent.

Which leaves this question: How does 2) save us anything in terms of
avoiding iterating virtual addresses and issuing cache flushes
over 3)? And is it sufficient to justify hacking a bypass into DAX
and the additional driver level complexity of having to add dirty
region tracking, flushing and cleaning to REQ_FLUSH operations?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 02/15] dax: increase granularity of dax_clear_blocks() operations
  2015-11-03 17:57             ` Ross Zwisler
@ 2015-11-03 20:59               ` Dave Chinner
  -1 siblings, 0 replies; 95+ messages in thread
From: Dave Chinner @ 2015-11-03 20:59 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Dan Williams, Jens Axboe, Jan Kara, linux-nvdimm, linux-kernel,
	Jeff Moyer, Jan Kara, Christoph Hellwig

On Tue, Nov 03, 2015 at 10:57:57AM -0700, Ross Zwisler wrote:
> On Mon, Nov 02, 2015 at 09:31:11PM -0800, Dan Williams wrote:
> > On Mon, Nov 2, 2015 at 8:48 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > On Mon, Nov 02, 2015 at 07:27:26PM -0800, Dan Williams wrote:
> > >> On Mon, Nov 2, 2015 at 4:51 PM, Dave Chinner <david@fromorbit.com> wrote:
> > >> > On Sun, Nov 01, 2015 at 11:29:53PM -0500, Dan Williams wrote:
> > >> > The zeroing (and the data, for that matter) doesn't need to be
> > >> > committed to persistent store until the allocation is written and
> > >> > committed to the journal - that will happen with a REQ_FLUSH|REQ_FUA
> > >> > write, so it makes sense to deploy the big hammer and delay the
> > >> > blocking CPU cache flushes until the last possible moment in cases
> > >> > like this.
> > >>
> > >> In pmem terms that would be a non-temporal memset plus a delayed
> > >> wmb_pmem at REQ_FLUSH time.  Better to write around the cache than
> > >> loop over the dirty-data issuing flushes after the fact.  We'll bump
> > >> the priority of the non-temporal memset implementation.
> > >
> > > Why is it better to do two synchronous physical writes to memory
> > > within a couple of microseconds of CPU time rather than writing them
> > > through the cache and, in most cases, only doing one physical write
> > > to memory in a separate context that expects to wait for a flush
> > > to complete?
> > 
> > With a switch to non-temporal writes they wouldn't be synchronous,
> > although it's doubtful that the subsequent writes after zeroing would
> > also hit the store buffer.
> > 
> > If we had a method to flush by physical-cache-way rather than a
> > virtual address then it would indeed be better to save up for one
> > final flush, but when we need to resort to looping through all the
> > virtual addresses that might have touched it gets expensive.
> 
> I agree with the idea that we should avoid the "big hammer" flushing in
> response to REQ_FLUSH.  Here are the steps that are needed to make sure that
> something is durable on media with PMEM/DAX:
> 
> 1) Write, either with non-temporal stores or with stores that use the
> processor cache
> 
> 2) If you wrote using the processor cache, flush or write back the processor
> cache
> 
> 3) wmb_pmem(), synchronizing all non-temporal writes and flushes durably to
> media.

Right, and when you look at buffered IO, we have:

1) write to page cache, mark page dirty
2) if you have dirty cached pages, flush dirty pages to device
3) REQ_FLUSH causes everything to be durable.

> PMEM does all I/O using 1 and 3 with non-temporal stores, and mmaps that go to
> userspace can use cached writes, so on fsync/msync we do a bunch of flushes
> for step 2.  In either case I think we should have the PMEM driver just do
> step 3, the wmb_pmem(), in response to REQ_FLUSH.  This allows the zeroing
> code to just do non-temporal writes of zeros, the DAX fsync/msync code to just
> do flushes (which is what my patch set already does), and just leave the
> wmb_pmem() to the PMEM driver at REQ_FLUSH time.
>
> This just means that the layers above the PMEM code either need to use
> non-temporal writes for their I/Os, or do flushing, which I don't think is too
> onerous.

Agreed - it fits neatly into the existing infrastructure and
algorithms and there's no evidence to suggest that using the
existing infrastructure is going to cause undue burden on PMEM based
workloads. Hence I really think this is the right way to proceed...
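
To make the durability steps quoted above concrete, a minimal sketch using the 4.3-era pmem helpers; it assumes memcpy_to_pmem() is non-temporal (as on x86 with the PMEM API enabled), so the explicit cache-flush step drops out:

/*
 * Steps 1 and 3 with non-temporal stores; step 2 (cache writeback) is only
 * needed for the cached-write variant.  The wmb_pmem() here could instead
 * be deferred to the driver's REQ_FLUSH handling, as proposed above.
 */
static void pmem_copy_durable(void __pmem *dst, const void *src, size_t n)
{
	memcpy_to_pmem(dst, src, n);	/* step 1: write around the CPU cache */
	wmb_pmem();			/* step 3: drain the stores to media */
}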

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 14/15] dax: dirty extent notification
  2015-11-03  4:56       ` Dan Williams
@ 2015-11-03 21:18         ` Ross Zwisler
  -1 siblings, 0 replies; 95+ messages in thread
From: Ross Zwisler @ 2015-11-03 21:18 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Chinner, Jens Axboe, Jan Kara, linux-nvdimm, linux-kernel,
	Ross Zwisler, Christoph Hellwig

On Mon, Nov 02, 2015 at 08:56:24PM -0800, Dan Williams wrote:
> On Mon, Nov 2, 2015 at 5:16 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Sun, Nov 01, 2015 at 11:30:58PM -0500, Dan Williams wrote:
<>
> > Yes, that's a basic feature of Ross's patches. Hence I think this
> > special case DAX<->bdev interface is the wrong direction to be
> > taking.
> 
> So here's my problem with the "track dirty mappings" in the core
> mm/vfs approach, it's harder to unwind and delete when it turns out no
> application actually needs it, or the platform gives us an O(1) flush
> method that is independent of dirty pte tracking.

I don't think that we'll ever be able to "unwind and delete" the dirty page
tracking.  Even if *some* platform gives us an O(1) flush method independent
of PTE tracking, that will be an optimized path that will need to live
alongside the path that we'll need to keep for other architectures (like all the
ones that exist today).

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 14/15] dax: dirty extent notification
  2015-11-03 20:51             ` Dave Chinner
@ 2015-11-03 21:19               ` Dan Williams
  -1 siblings, 0 replies; 95+ messages in thread
From: Dan Williams @ 2015-11-03 21:19 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jens Axboe, Jan Kara, linux-nvdimm, linux-kernel, Ross Zwisler,
	Christoph Hellwig

On Tue, Nov 3, 2015 at 12:51 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Nov 02, 2015 at 11:20:49PM -0800, Dan Williams wrote:
>> On Mon, Nov 2, 2015 at 9:40 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Mon, Nov 02, 2015 at 08:56:24PM -0800, Dan Williams wrote:
>> >> No, we definitely can't do that.   I think your mental model of the
>> >> cache flushing is similar to the disk model where a small buffer is
>> >> flushed after a large streaming write.  Both Ross' patches and my
>> >> approach suffer from the same horror that the cache flushing is O(N)
>> >> currently, so we don't want to make it responsible for more data
>> >> ranges areas than is strictly necessary.
>> >
>> > I didn't see anything that was O(N) in Ross's patches. What part of
>> > the fsync algorithm that Ross proposed are you refering to here?
>>
>> We have to issue clflush per touched virtual address rather than a
>> constant number of physical ways, or a flush-all instruction.
> .....
>> > So don't tell me that tracking dirty pages in the radix tree too
>> > slow for DAX and that DAX should not be used for POSIX IO based
>> > applications - it should be as fast as buffered IO, if not faster,
>> > and if it isn't then we've screwed up real bad. And right now, we're
>> > screwing up real bad.
>>
>> Again, it's not the dirty tracking in the radix I'm worried about it's
>> looping through all the virtual addresses within those pages..
>
> So, let me summarise what I think you've just said. You are
>
> 1. fine with looping through the virtual addresses doing cache flushes
>    synchronously when doing IO despite it having significant
>    latency and performance costs.

No, like I said in the blkdev_issue_zeroout thread we need to replace
looping flushes with non-temporal stores and delayed wmb_pmem()
wherever possible.
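
As a rough sketch of the two zeroing strategies under discussion, using the 4.3-era helpers and assuming clear_pmem() gets the proposed non-temporal implementation:

/* cached memset followed by an O(size) loop of cache-line flushes */
static void zero_then_flush(void __pmem *addr, size_t size)
{
	memset((void __force *)addr, 0, size);
	clflush_cache_range((void __force *)addr, size);
	wmb_pmem();
}

/* write-around zeroing; the single wmb_pmem() is left to REQ_FLUSH time */
static void zero_nontemporal(void __pmem *addr, size_t size)
{
	clear_pmem(addr, size);
}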

> 2. Happy to hack a method into DAX to bypass the filesystems by
>    pushing information to the block device for it to track regions that
>    need cache flushes, then add infrastructure to the block device to
>    track those dirty regions and then walk those addresses and issue
>    cache flushes when the filesystem issues a REQ_FLUSH IO regardless
>    of whether the filesystem actually needs those cachelines flushed
>    for that specific IO?

I'm happier with a temporary driver level hack than a temporary core
kernel change.  This requirement to flush by virtual address is
something that, in my opinion, must be addressed by the platform with
a reliable global flush or by walking a small constant number of
physical-cache-ways.  I think we're getting ahead of ourselves jumping
to solving this in the core kernel while the question of how to do
efficient large flushes is still pending.

> 3. Not happy to use the generic mm/vfs level infrastructure
>    architected specifically to provide the exact asynchronous
>    cache flushing/writeback semantics we require because it will
>    cause too many cache flushes, even though the number of cache
>    flushes will be, at worst, the same as in 2).

Correct, because if/when a platform solution arrives the need to track
dirty pfns evaporates.

> 1) will work, but as we can see it is *slow*. 3) is what Ross is
> implementing - it's a tried and tested architecture that all mm/fs
> developers understand, and his explanation of why it will work for
> pmem is pretty solid and completely platform/hardware architecture
> independent.
>
> Which leaves this question: How does 2) save us anything in terms of
> avoiding iterating virtual addresses and issuing cache flushes
> over 3)? And is it sufficient to justify hacking a bypass into DAX
> and the additional driver level complexity of having to add dirty
> region tracking, flushing and cleaning to REQ_FLUSH operations?
>

Given that what we are talking about amounts to a hardware workaround, I
think that kind of logic belongs in a driver.  If the cache flushing
gets fixed and we stop needing to track individual cachelines the
flush implementation will look and feel much more like existing
storage drivers.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 14/15] dax: dirty extent notification
  2015-11-03 21:18         ` Ross Zwisler
@ 2015-11-03 21:34           ` Dan Williams
  -1 siblings, 0 replies; 95+ messages in thread
From: Dan Williams @ 2015-11-03 21:34 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Dave Chinner, Jens Axboe, Jan Kara, linux-nvdimm, linux-kernel,
	Christoph Hellwig

On Tue, Nov 3, 2015 at 1:18 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Mon, Nov 02, 2015 at 08:56:24PM -0800, Dan Williams wrote:
>> On Mon, Nov 2, 2015 at 5:16 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Sun, Nov 01, 2015 at 11:30:58PM -0500, Dan Williams wrote:
> <>
>> > Yes, that's a basic feature of Ross's patches. Hence I think this
>> > special case DAX<->bdev interface is the wrong direction to be
>> > taking.
>>
>> So here's my problem with the "track dirty mappings" in the core
>> mm/vfs approach, it's harder to unwind and delete when it turns out no
>> application actually needs it, or the platform gives us an O(1) flush
>> method that is independent of dirty pte tracking.
>
> I don't think that we'll ever be able to "unwind and delete" the dirty page
> tracking.  Even if *some* platform gives us an O(1) flush method independent
> of PTE tracking, that will be an optimized path that will need to live
> alongside the path that we'll need to keep for other architectures (like all the
> ones that exist today).

Other architectures are in worse shape when we start talking about
virtually tagged caches.  If an architecture can't support DAX
efficiently I don't know why we would jump through hoops in the core
to allow it to use DAX.  In the interim to make forward progress we
have a safe workaround in the driver, and when the reports come in
about fsync taking too much time the response is "switch from fsync to
NVML" or "turn off DAX (ideally at the inode)".  What's broken about
those mitigation options?

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 14/15] dax: dirty extent notification
  2015-11-03 20:51             ` Dave Chinner
@ 2015-11-03 21:37               ` Ross Zwisler
  -1 siblings, 0 replies; 95+ messages in thread
From: Ross Zwisler @ 2015-11-03 21:37 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Dan Williams, Jens Axboe, Jan Kara, linux-nvdimm, linux-kernel,
	Ross Zwisler, Christoph Hellwig

On Wed, Nov 04, 2015 at 07:51:31AM +1100, Dave Chinner wrote:
> On Mon, Nov 02, 2015 at 11:20:49PM -0800, Dan Williams wrote:
> > On Mon, Nov 2, 2015 at 9:40 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > On Mon, Nov 02, 2015 at 08:56:24PM -0800, Dan Williams wrote:
> > >> No, we definitely can't do that.   I think your mental model of the
> > >> cache flushing is similar to the disk model where a small buffer is
> > >> flushed after a large streaming write.  Both Ross' patches and my
> > >> approach suffer from the same horror that the cache flushing is O(N)
> > >> currently, so we don't want to make it responsible for more data
> > >> ranges areas than is strictly necessary.
> > >
> > > I didn't see anything that was O(N) in Ross's patches. What part of
> > > the fsync algorithm that Ross proposed are you refering to here?
> > 
> > We have to issue clflush per touched virtual address rather than a
> > constant number of physical ways, or a flush-all instruction.
> .....
> > > So don't tell me that tracking dirty pages in the radix tree too
> > > slow for DAX and that DAX should not be used for POSIX IO based
> > > applications - it should be as fast as buffered IO, if not faster,
> > > and if it isn't then we've screwed up real bad. And right now, we're
> > > screwing up real bad.
> > 
> > Again, it's not the dirty tracking in the radix I'm worried about it's
> > looping through all the virtual addresses within those pages..
> 
> So, let me summarise what I think you've just said. You are
> 
> 1. fine with looping through the virtual addresses doing cache flushes
>    synchronously when doing IO despite it having significant
>    latency and performance costs.
> 
> 2. Happy to hack a method into DAX to bypass the filesystems by
>    pushing information to the block device for it to track regions that
>    need cache flushes, then add infrastructure to the block device to
>    track those dirty regions and then walk those addresses and issue
>    cache flushes when the filesystem issues a REQ_FLUSH IO regardless
>    of whether the filesystem actually needs those cachelines flushed
>    for that specific IO?
> 
> 3. Not happy to use the generic mm/vfs level infrastructure
>    architected specifically to provide the exact asynchronous
>    cache flushing/writeback semantics we require because it will
>    cause too many cache flushes, even though the number of cache
>    flushes will be, at worst, the same as in 2).
> 
> 
> 1) will work, but as we can see it is *slow*. 3) is what Ross is
> implementing - it's a tried and tested architecture that all mm/fs
> developers understand, and his explanation of why it will work for
> pmem is pretty solid and completely platform/hardware architecture
> independent.
> 
> Which leaves this question: How does 2) save us anything in terms of
> avoiding iterating virtual addresses and issuing cache flushes
> over 3)? And is it sufficient to justify hacking a bypass into DAX
> and the additional driver level complexity of having to add dirty
> region tracking, flushing and cleaning to REQ_FLUSH operations?

I also don't see a benefit of pushing this into the driver.  The generic
writeback infrastructure that is already in place seems to fit perfectly with
what we are trying to do.  I feel like putting the flushing infrastructure
into the driver, as with my first failed attempt at msync support, ends up
solving one aspect of the problem in a non-generic way that is ultimately
fatally flawed.

The driver inherently doesn't have enough information to solve this problem -
we really do need to involve the filesystem and mm layers.  For example:

1) The driver can't easily mark regions as clean once they have been flushed,
meaning that every time you dirty data you add to an ever increasing list of
things that will be flushed on the next REQ_FLUSH.

2) The driver doesn't know how inodes map to blocks, so when you get a
REQ_FLUSH for an fsync you end up flushing the dirty regions for *the entire
block device*, not just the one inode.

3) The driver doesn't understand how mmap ranges map to block regions, so if
someone msyncs a single page (causing a REQ_FLUSH) on a single mmap you will
once again flush every region that has ever been dirtied on the entire block
device.

Each of these cases is handled by the existing writeback infrastructure.  I'm
strongly in favor of waiting and solving this issue with the radix tree
patches.
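
For context, a minimal sketch of the per-mapping dirty tracking the radix tree patches build on: tag the entry dirty at fault/write time, then walk only that inode's tagged entries at fsync/msync time (illustrative, not the actual patch code):

static void dax_mark_dirty(struct address_space *mapping, pgoff_t index)
{
	spin_lock_irq(&mapping->tree_lock);
	radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
	spin_unlock_irq(&mapping->tree_lock);
}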

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 05/15] libnvdimm, pmem: fix size trim in pmem_direct_access()
  2015-11-03 19:32     ` Ross Zwisler
@ 2015-11-03 21:39       ` Dan Williams
  -1 siblings, 0 replies; 95+ messages in thread
From: Dan Williams @ 2015-11-03 21:39 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jens Axboe, Jan Kara, linux-nvdimm, david, linux-kernel, stable,
	Christoph Hellwig

On Tue, Nov 3, 2015 at 11:32 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Sun, Nov 01, 2015 at 11:30:10PM -0500, Dan Williams wrote:
>> This masking prevents access to the end of the device via dax_do_io(),
>> and is unnecessary as arch_add_memory() would have rejected an unaligned
>> allocation.
>>
>> Cc: <stable@vger.kernel.org>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  drivers/nvdimm/pmem.c |   17 +++--------------
>>  1 file changed, 3 insertions(+), 14 deletions(-)
>>
>> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
>> index e46988fbdee5..93472953e231 100644
>> --- a/drivers/nvdimm/pmem.c
>> +++ b/drivers/nvdimm/pmem.c
>> @@ -100,26 +100,15 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
>>  }
>>
>>  static long pmem_direct_access(struct block_device *bdev, sector_t sector,
>> -                   void __pmem **kaddr, unsigned long *pfn)
>> +                   void __pmem **kaddr, pfn_t *pfn)
>
> It seems kind of weird to change only this instance of direct_access() to have
> the last argument as a pfn_t instead of an unsigned long?   If pfn_t is more
> descriptive (I think it is) should we update the definition in struct
> block_device_operations and all the other implementors of direct_access as
> well?  If that's touching too much, let's do them all together later, but
> let's not change one now and have them be inconsistent.
>

Oh, nice catch, that's just a mistake when I moved this patch earlier
in the series... and I wonder why 0day didn't complain about it?  In
any event, will fix.

>>  {
>>       struct pmem_device *pmem = bdev->bd_disk->private_data;
>>       resource_size_t offset = sector * 512 + pmem->data_offset;
>> -     resource_size_t size;
>>
>> -     if (pmem->data_offset) {
>> -             /*
>> -              * Limit the direct_access() size to what is covered by
>> -              * the memmap
>> -              */
>> -             size = (pmem->size - offset) & ~ND_PFN_MASK;
>> -     } else
>> -             size = pmem->size - offset;
>> -
>> -     /* FIXME convert DAX to comprehend that this mapping has a lifetime */
>>       *kaddr = pmem->virt_addr + offset;
>> -     *pfn = (pmem->phys_addr + offset) >> PAGE_SHIFT;
>> +     *pfn = __phys_to_pfn(pmem->phys_addr + offset, pmem->pfn_flags);
>
> __phys_to_pfn() only takes a single argument (the paddr) in v4.3,
> jens/for-4.4/integrity and in nvdimm/libnvdimm-for-next.  Is this second
> argument of pfn_flags actually correct?

Yeah, this shouldn't even compile.

Thanks Ross!

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 14/15] dax: dirty extent notification
  2015-11-03 21:37               ` Ross Zwisler
@ 2015-11-03 21:43                 ` Dan Williams
  -1 siblings, 0 replies; 95+ messages in thread
From: Dan Williams @ 2015-11-03 21:43 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Dave Chinner, Jens Axboe, Jan Kara, linux-nvdimm, linux-kernel,
	Christoph Hellwig

On Tue, Nov 3, 2015 at 1:37 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Wed, Nov 04, 2015 at 07:51:31AM +1100, Dave Chinner wrote:
>> On Mon, Nov 02, 2015 at 11:20:49PM -0800, Dan Williams wrote:
>> > On Mon, Nov 2, 2015 at 9:40 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > > On Mon, Nov 02, 2015 at 08:56:24PM -0800, Dan Williams wrote:
>> > >> No, we definitely can't do that.   I think your mental model of the
>> > >> cache flushing is similar to the disk model where a small buffer is
>> > >> flushed after a large streaming write.  Both Ross' patches and my
>> > >> approach suffer from the same horror that the cache flushing is O(N)
>> > >> currently, so we don't want to make it responsible for more data
>> > >> ranges areas than is strictly necessary.
>> > >
>> > > I didn't see anything that was O(N) in Ross's patches. What part of
>> > > the fsync algorithm that Ross proposed are you refering to here?
>> >
>> > We have to issue clflush per touched virtual address rather than a
>> > constant number of physical ways, or a flush-all instruction.
>> .....
>> > > So don't tell me that tracking dirty pages in the radix tree too
>> > > slow for DAX and that DAX should not be used for POSIX IO based
>> > > applications - it should be as fast as buffered IO, if not faster,
>> > > and if it isn't then we've screwed up real bad. And right now, we're
>> > > screwing up real bad.
>> >
>> > Again, it's not the dirty tracking in the radix I'm worried about it's
>> > looping through all the virtual addresses within those pages..
>>
>> So, let me summarise what I think you've just said. You are
>>
>> 1. fine with looping through the virtual addresses doing cache flushes
>>    synchronously when doing IO despite it having significant
>>    latency and performance costs.
>>
>> 2. Happy to hack a method into DAX to bypass the filesystems by
>>    pushing information to the block device for it to track regions that
>>    need cache flushes, then add infrastructure to the block device to
>>    track those dirty regions and then walk those addresses and issue
>>    cache flushes when the filesystem issues a REQ_FLUSH IO regardless
>>    of whether the filesystem actually needs those cachelines flushed
>>    for that specific IO?
>>
>> 3. Not happy to use the generic mm/vfs level infrastructure
>>    architected specifically to provide the exact asynchronous
>>    cache flushing/writeback semantics we require because it will
>>    cause too many cache flushes, even though the number of cache
>>    flushes will be, at worst, the same as in 2).
>>
>>
>> 1) will work, but as we can see it is *slow*. 3) is what Ross is
>> implementing - it's a tried and tested architecture that all mm/fs
>> developers understand, and his explanation of why it will work for
>> pmem is pretty solid and completely platform/hardware architecture
>> independent.
>>
>> Which leaves this question: How does 2) save us anything in terms of
>> avoiding iterating virtual addresses and issuing cache flushes
>> over 3)? And is it sufficient to justify hacking a bypass into DAX
>> and the additional driver level complexity of having to add dirty
>> region tracking, flushing and cleaning to REQ_FLUSH operations?
>
> I also don't see a benefit of pushing this into the driver.  The generic
> writeback infrastructure that is already in place seems to fit perfectly with
> what we are trying to do.  I feel like putting the flushing infrastructure
> into the driver, as with my first failed attempt at msync support, ends up
> solving one aspect of the problem in a non-generic way that is ultimately
> fatally flawed.
>
> The driver inherently doesn't have enough information to solve this problem -
> we really do need to involve the filesystem and mm layers.  For example:
>
> 1) The driver can't easily mark regions as clean once they have been flushed,
> meaning that every time you dirty data you add to an ever increasing list of
> things that will be flushed on the next REQ_FLUSH.
>
> 2) The driver doesn't know how inodes map to blocks, so when you get a
> REQ_FLUSH for an fsync you end up flushing the dirty regions for *the entire
> block device*, not just the one inode.
>
> 3) The driver doesn't understand how mmap ranges map to block regions, so if
> someone msyncs a single page (causing a REQ_FLUSH) on a single mmap you will
> once again flush every region that has ever been dirtied on the entire block
> device.
>
> Each of these cases is handled by the existing writeback infrastructure.  I'm
> strongly in favor of waiting and solving this issue with the radix tree
> patches.

Again, all of these holes are mitigated by turning off DAX or fixing
the app.  The radix solution does nothing to address the worst-case
flushing and will spin, single-threaded, flushing the world.  So I
remain strongly against the core change, but that's ultimately not my
call.

Looks like we're leaving this broken for 4.4...
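
To make the O(N) point above concrete: a minimal illustrative sketch, not
code from either patch set, of how flushing dirtied pmem ranges on x86 walks
every touched virtual address one cacheline at a time (the helper name and
the hard-coded 64-byte line size are assumptions for illustration only):

#include <linux/types.h>
#include <asm/special_insns.h>

/*
 * Illustrative only: flush each cacheline backing a dirty virtual range.
 * One clflushopt per cacheline, so the cost scales with the amount of
 * dirtied address space rather than being bounded by a flush-all style
 * operation.
 */
static void flush_dirty_range(void *vaddr, size_t len)
{
	const unsigned long line = 64;	/* assumed x86 cacheline size */
	void *end = vaddr + len;
	void *p = (void *)((unsigned long)vaddr & ~(line - 1));

	for (; p < end; p += line)
		clflushopt(p);
}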

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 03/15] block, dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()
  2015-11-03 19:01     ` Ross Zwisler
@ 2015-11-03 22:50       ` Dan Williams
  -1 siblings, 0 replies; 95+ messages in thread
From: Dan Williams @ 2015-11-03 22:50 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jens Axboe, Jens Axboe, Jan Kara, linux-nvdimm, david,
	linux-kernel, Jeff Moyer, Jan Kara, Christoph Hellwig

On Tue, Nov 3, 2015 at 11:01 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Sun, Nov 01, 2015 at 11:29:58PM -0500, Dan Williams wrote:
>> The DAX implementation needs to protect new calls to ->direct_access()
>> and usage of its return value against unbind of the underlying block
>> device.  Use blk_queue_enter()/blk_queue_exit() to either prevent
>> blk_cleanup_queue() from proceeding, or fail the dax_map_atomic() if the
>> request_queue is being torn down.
>>
>> Cc: Jan Kara <jack@suse.com>
>> Cc: Jens Axboe <axboe@kernel.dk>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Dave Chinner <david@fromorbit.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
> <>
[ trimmed the comments that Jeff responded to ]

>> @@ -305,11 +353,10 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
>>               goto out;
>>       }
>>
>> -     error = bdev_direct_access(bh->b_bdev, sector, &addr, &pfn, bh->b_size);
>> -     if (error < 0)
>> -             goto out;
>> -     if (error < PAGE_SIZE) {
>> -             error = -EIO;
>> +     addr = __dax_map_atomic(bdev, to_sector(bh, inode), bh->b_size,
>> +                     &pfn, NULL);
>> +     if (IS_ERR(addr)) {
>> +             error = PTR_ERR(addr);
>
> Just a note that we lost the check for bdev_direct_access() returning less
> than PAGE_SIZE.  Are we sure this can't happen and that it's safe to remove
> the check?

Yes, as Jeff recommends, I'll do a follow-on patch to make this an
explicit guarantee of bdev_direct_access(), just like the page
alignment.

>
>> @@ -609,15 +655,20 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>>               result = VM_FAULT_NOPAGE;
>>               spin_unlock(ptl);
>>       } else {
>> -             sector = bh.b_blocknr << (blkbits - 9);
>> -             length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn,
>> -                                             bh.b_size);
>> -             if (length < 0) {
>> +             long length;
>> +             unsigned long pfn;
>> +             void __pmem *kaddr = __dax_map_atomic(bdev,
>> +                             to_sector(&bh, inode), HPAGE_SIZE, &pfn,
>> +                             &length);
>
> Let's use PMD_SIZE instead of HPAGE_SIZE to be consistent with the rest of the
> DAX code.
>

I changed to HPAGE_SIZE on advice from Dave Hansen.  I'll insert a
preceding cleanup patch in this series to do the conversion since we
should be consistent with the use of PAGE_SIZE in the other dax paths.

>> +
>> +             if (IS_ERR(kaddr)) {
>>                       result = VM_FAULT_SIGBUS;
>>                       goto out;
>>               }
>> -             if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR))
>> +             if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR)) {
>> +                     dax_unmap_atomic(bdev, kaddr);
>>                       goto fallback;
>> +             }
>>
>>               if (buffer_unwritten(&bh) || buffer_new(&bh)) {
>>                       clear_pmem(kaddr, HPAGE_SIZE);
>
> Ditto, let's use PMD_SIZE for consistency (I realize this was changed earlier
> in the series).

Ditto on the rebuttal.
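
For context on the review above, a rough sketch of the lifetime rule
dax_map_atomic() is meant to enforce, written against the
blk_queue_enter()/bdev_direct_access() prototypes of that era; the real
patch's signature and error handling may differ:

#include <linux/blkdev.h>
#include <linux/err.h>

/*
 * Sketch: pin the request_queue before calling ->direct_access() so
 * blk_cleanup_queue() cannot complete while the returned mapping is in
 * use, and fail the mapping if the queue is already being torn down.
 */
static void __pmem *__dax_map_atomic(struct block_device *bdev,
		sector_t sector, long size, unsigned long *pfn,
		long *len)
{
	struct request_queue *q = bdev->bd_queue;
	void __pmem *addr;
	long rc;

	if (blk_queue_enter(q, GFP_NOWAIT) != 0)
		return (void __pmem *) ERR_PTR(-ENXIO);

	rc = bdev_direct_access(bdev, sector, &addr, pfn, size);
	if (len)
		*len = rc;
	if (rc < 0) {
		blk_queue_exit(q);
		return (void __pmem *) ERR_PTR(rc);
	}
	return addr;
}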

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 13/15] block, dax: make dax mappings opt-in by default
  2015-11-03 20:20         ` Dave Chinner
@ 2015-11-03 23:04           ` Dan Williams
  -1 siblings, 0 replies; 95+ messages in thread
From: Dan Williams @ 2015-11-03 23:04 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jens Axboe, Jan Kara, linux-nvdimm, linux-kernel, Ross Zwisler,
	Christoph Hellwig

On Tue, Nov 3, 2015 at 12:20 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Nov 02, 2015 at 11:35:04PM -0800, Dan Williams wrote:
>> On Mon, Nov 2, 2015 at 4:32 PM, Dave Chinner <david@fromorbit.com> wrote:
[..]
>> Only in the mmap path:
>
> which means blkdev_direct_IO() is now always going to go down the
> dax_do_io() path for any driver with a ->direct_access method rather
> than the direct IO path, regardless of whether DAX is enabled on the
> device or not.
>
> That really seems wrong to me - you've replaced explicit "is DAX
> enabled" checks with "is DAX possible" checks, and so DAX paths are
> used regardless of whether DAX is enabled or not. And it's not
> obvious why this is done, nor is it now obvious how DAX interacts
> with the block device.
>
> This really seems like a step backwards to me.

I think the reason it is not obvious is that the original justification
for the bypass, as stated in commit bbab37ddc20b ("block: Add support
for DAX reads/writes to block devices"), was:

    "instead of allocating a DIO and a BIO"

It turns out it's faster and, as far as I can tell, semantically
equivalent to the __blockdev_direct_IO() path.  The DAX mmap path in
comparison has plenty of sharp edges and semantic differences that
would be avoided by turning off DAX.

I'm not opposed to also turning off dax_do_io() when S_DAX is clear,
but I don't currently see the point.  At the very least I need to add
the above comments to the code, but do you still think opt-in DAX is a
backwards step?

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 13/15] block, dax: make dax mappings opt-in by default
  2015-11-03 23:04           ` Dan Williams
@ 2015-11-04 19:23             ` Dan Williams
  -1 siblings, 0 replies; 95+ messages in thread
From: Dan Williams @ 2015-11-04 19:23 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jens Axboe, Jan Kara, linux-nvdimm, linux-kernel, Ross Zwisler,
	Christoph Hellwig

On Tue, Nov 3, 2015 at 3:04 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Tue, Nov 3, 2015 at 12:20 PM, Dave Chinner <david@fromorbit.com> wrote:
>> On Mon, Nov 02, 2015 at 11:35:04PM -0800, Dan Williams wrote:
>>> On Mon, Nov 2, 2015 at 4:32 PM, Dave Chinner <david@fromorbit.com> wrote:
> [..]
>>> Only in the mmap path:
>>
>> which means blkdev_direct_IO() is now always going to go down the
>> dax_do_io() path for any driver with a ->direct_access method rather
>> than the direct IO path, regardless of whether DAX is enabled on the
>> device or not.
>>
>> That really seems wrong to me - you've replaced explicit "is DAX
>> enabled" checks with "is DAX possible" checks, and so DAX paths are
>> used regardless of whether DAX is enabled or not. And it's not
>> obvious why this is done, nor is it now obvious how DAX interacts
>> with the block device.
>>
>> This really seems like a step backwards to me.
>
> I think the reason it is not obvious is that the original justification
> for the bypass, as stated in commit bbab37ddc20b ("block: Add support
> for DAX reads/writes to block devices"), was:
>
>     "instead of allocating a DIO and a BIO"
>
> It turns out it's faster and, as far as I can tell, semantically
> equivalent to the __blockdev_direct_IO() path.  The DAX mmap path in
> comparison has plenty of sharp edges and semantic differences that
> would be avoided by turning off DAX.
>
> I'm not opposed to also turning off dax_do_io() when S_DAX is clear,
> but I don't currently see the point.  At the very least I need to add
> the above comments to the code, but do you still think opt-in DAX is a
> backwards step?

I thought of one way dax_do_io() breaks current semantics: it defeats
blktrace and I/O stat tracking.  I'll restore the existing behavior
that gates dax_do_io() on S_DAX.
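
A minimal sketch of the restored behavior described above, modelled on the
4.3-era fs/block_dev.c (blkdev_get_block is the file-local helper there;
argument lists are approximate, not the exact follow-up patch):

#include <linux/fs.h>
#include <linux/blkdev.h>

static ssize_t
blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
{
	struct file *file = iocb->ki_filp;
	struct inode *inode = file->f_mapping->host;

	/* only take the DAX shortcut when DAX is enabled on this inode */
	if (IS_DAX(inode))
		return dax_do_io(iocb, inode, iter, offset, blkdev_get_block,
				NULL, DIO_SKIP_DIO_COUNT);
	return __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter, offset,
				    blkdev_get_block, NULL, NULL,
				    DIO_SKIP_DIO_COUNT);
}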

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 03/15] block, dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()
  2015-11-02  4:29   ` Dan Williams
@ 2016-01-18 10:42     ` Geert Uytterhoeven
  -1 siblings, 0 replies; 95+ messages in thread
From: Geert Uytterhoeven @ 2016-01-18 10:42 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jens Axboe, Jens Axboe, Jan Kara, linux-nvdimm, Dave Chinner,
	linux-kernel, Jeff Moyer, Jan Kara, Ross Zwisler,
	Christoph Hellwig

Hi Dan,

On Mon, Nov 2, 2015 at 5:29 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> The DAX implementation needs to protect new calls to ->direct_access()
> and usage of its return value against unbind of the underlying block
> device.  Use blk_queue_enter()/blk_queue_exit() to either prevent
> blk_cleanup_queue() from proceeding, or fail the dax_map_atomic() if the
> request_queue is being torn down.

> index f8e543839e5c..a480729c00ec 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c

>  static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
>                       loff_t start, loff_t end, get_block_t get_block,
>                       struct buffer_head *bh)
>  {
> -       ssize_t retval = 0;
> -       loff_t pos = start;
> -       loff_t max = start;
> -       loff_t bh_max = start;
> -       void __pmem *addr;
> +       loff_t pos = start, max = start, bh_max = start;
> +       struct block_device *bdev = NULL;
> +       int rw = iov_iter_rw(iter), rc;

fs/dax.c:138: warning: ‘rc’ may be used uninitialized in this function

rc will be uninitialized if start == end, because the while loop below is
then never executed.  I don't know whether it's 100% guaranteed that this
can't happen.

Note that the old retval was preinitialized to 0.

> +       long map_len = 0;
> +       unsigned long pfn;
> +       void __pmem *addr = NULL;
> +       void __pmem *kmap = (void __pmem *) ERR_PTR(-EIO);
>         bool hole = false;
>         bool need_wmb = false;
>
> -       if (iov_iter_rw(iter) != WRITE)
> +       if (rw == READ)
>                 end = min(end, i_size_read(inode));
>
>         while (pos < end) {

> @@ -175,8 +219,9 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
>
>         if (need_wmb)
>                 wmb_pmem();
> +       dax_unmap_atomic(bdev, kmap);
>
> -       return (pos == start) ? retval : pos - start;
> +       return (pos == start) ? rc : pos - start;
>  }
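
The fix implied by the warning above is presumably just giving rc a defined
value for the start == end case, matching the old retval = 0 behaviour; a
sketch of the hunk, not a tested patch:

-       int rw = iov_iter_rw(iter), rc;
+       int rw = iov_iter_rw(iter), rc = 0;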


Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 95+ messages in thread

end of thread, other threads:[~2016-01-18 10:42 UTC | newest]

Thread overview: 95+ messages
2015-11-02  4:29 [PATCH v3 00/15] block, dax updates for 4.4 Dan Williams
2015-11-02  4:29 ` Dan Williams
2015-11-02  4:29 ` [PATCH v3 01/15] pmem, dax: clean up clear_pmem() Dan Williams
2015-11-02  4:29   ` Dan Williams
2015-11-02  4:29 ` [PATCH v3 02/15] dax: increase granularity of dax_clear_blocks() operations Dan Williams
2015-11-02  4:29   ` Dan Williams
2015-11-03  0:51   ` Dave Chinner
2015-11-03  0:51     ` Dave Chinner
2015-11-03  3:27     ` Dan Williams
2015-11-03  3:27       ` Dan Williams
2015-11-03  4:48       ` Dave Chinner
2015-11-03  4:48         ` Dave Chinner
2015-11-03  5:31         ` Dan Williams
2015-11-03  5:31           ` Dan Williams
2015-11-03  5:52           ` Dave Chinner
2015-11-03  5:52             ` Dave Chinner
2015-11-03  7:24             ` Dan Williams
2015-11-03  7:24               ` Dan Williams
2015-11-03 16:21           ` Jan Kara
2015-11-03 16:21             ` Jan Kara
2015-11-03 17:57           ` Ross Zwisler
2015-11-03 17:57             ` Ross Zwisler
2015-11-03 20:59             ` Dave Chinner
2015-11-03 20:59               ` Dave Chinner
2015-11-02  4:29 ` [PATCH v3 03/15] block, dax: fix lifetime of in-kernel dax mappings with dax_map_atomic() Dan Williams
2015-11-02  4:29   ` Dan Williams
2015-11-03 19:01   ` Ross Zwisler
2015-11-03 19:01     ` Ross Zwisler
2015-11-03 19:09     ` Jeff Moyer
2015-11-03 22:50     ` Dan Williams
2015-11-03 22:50       ` Dan Williams
2016-01-18 10:42   ` Geert Uytterhoeven
2016-01-18 10:42     ` Geert Uytterhoeven
2015-11-02  4:30 ` [PATCH v3 04/15] libnvdimm, pmem: move request_queue allocation earlier in probe Dan Williams
2015-11-02  4:30   ` Dan Williams
2015-11-03 19:15   ` Ross Zwisler
2015-11-03 19:15     ` Ross Zwisler
2015-11-02  4:30 ` [PATCH v3 05/15] libnvdimm, pmem: fix size trim in pmem_direct_access() Dan Williams
2015-11-02  4:30   ` Dan Williams
2015-11-03 19:32   ` Ross Zwisler
2015-11-03 19:32     ` Ross Zwisler
2015-11-03 21:39     ` Dan Williams
2015-11-03 21:39       ` Dan Williams
2015-11-02  4:30 ` [PATCH v3 06/15] um: kill pfn_t Dan Williams
2015-11-02  4:30   ` Dan Williams
2015-11-02  4:30 ` [PATCH v3 07/15] kvm: rename pfn_t to kvm_pfn_t Dan Williams
2015-11-02  4:30   ` Dan Williams
2015-11-02  4:30 ` [PATCH v3 08/15] mm, dax, pmem: introduce pfn_t Dan Williams
2015-11-02  4:30   ` Dan Williams
2015-11-02 16:30   ` Joe Perches
2015-11-02 16:30     ` Joe Perches
2015-11-02  4:30 ` [PATCH v3 09/15] block: notify queue death confirmation Dan Williams
2015-11-02  4:30   ` Dan Williams
2015-11-02  4:30 ` [PATCH v3 10/15] dax, pmem: introduce zone_device_revoke() and devm_memunmap_pages() Dan Williams
2015-11-02  4:30   ` Dan Williams
2015-11-02  4:30 ` [PATCH v3 11/15] block: introduce bdev_file_inode() Dan Williams
2015-11-02  4:30   ` Dan Williams
2015-11-02  4:30 ` [PATCH v3 12/15] block: enable dax for raw block devices Dan Williams
2015-11-02  4:30   ` Dan Williams
2015-11-02  4:30 ` [PATCH v3 13/15] block, dax: make dax mappings opt-in by default Dan Williams
2015-11-02  4:30   ` Dan Williams
2015-11-03  0:32   ` Dave Chinner
2015-11-03  0:32     ` Dave Chinner
2015-11-03  7:35     ` Dan Williams
2015-11-03  7:35       ` Dan Williams
2015-11-03 20:20       ` Dave Chinner
2015-11-03 20:20         ` Dave Chinner
2015-11-03 23:04         ` Dan Williams
2015-11-03 23:04           ` Dan Williams
2015-11-04 19:23           ` Dan Williams
2015-11-04 19:23             ` Dan Williams
2015-11-02  4:30 ` [PATCH v3 14/15] dax: dirty extent notification Dan Williams
2015-11-02  4:30   ` Dan Williams
2015-11-03  1:16   ` Dave Chinner
2015-11-03  1:16     ` Dave Chinner
2015-11-03  4:56     ` Dan Williams
2015-11-03  4:56       ` Dan Williams
2015-11-03  5:40       ` Dave Chinner
2015-11-03  5:40         ` Dave Chinner
2015-11-03  7:20         ` Dan Williams
2015-11-03  7:20           ` Dan Williams
2015-11-03 20:51           ` Dave Chinner
2015-11-03 20:51             ` Dave Chinner
2015-11-03 21:19             ` Dan Williams
2015-11-03 21:19               ` Dan Williams
2015-11-03 21:37             ` Ross Zwisler
2015-11-03 21:37               ` Ross Zwisler
2015-11-03 21:43               ` Dan Williams
2015-11-03 21:43                 ` Dan Williams
2015-11-03 21:18       ` Ross Zwisler
2015-11-03 21:18         ` Ross Zwisler
2015-11-03 21:34         ` Dan Williams
2015-11-03 21:34           ` Dan Williams
2015-11-02  4:31 ` [PATCH v3 15/15] pmem: blkdev_issue_flush support Dan Williams
2015-11-02  4:31   ` Dan Williams
