* [PATCH v2 00/20] get_user_pages() for dax mappings
@ 2015-10-10  0:55 Dan Williams
  2015-10-10  0:55 ` [PATCH v2 01/20] block: generic request_queue reference counting Dan Williams
                   ` (20 more replies)
  0 siblings, 21 replies; 37+ messages in thread
From: Dan Williams @ 2015-10-10  0:55 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: linux-mips, Dave Hansen, Boaz Harrosh, David Airlie,
	Catalin Marinas, Dave Hansen, Dave Chinner, Keith Busch,
	linux-mm, Paul Mackerras, H. Peter Anvin, hch, Russell King,
	Richard Weinberger, Peter Zijlstra, Jeff Moyer, Ingo Molnar,
	Benjamin Herrenschmidt, Matthew Wilcox, ross.zwisler,
	Gleb Natapov, Marc Zyngier, Will Deacon, Jeff Dike,
	Alexander Viro, Thomas Gleixner, Jens Axboe, linux-kernel,
	Ralf Baechle, Alexander Graf, Paolo Bonzini, Andrew Morton,
	Christoffer Dall

Changes since v1 [1]:
1/ Rebased on the accepted cleanups to the memremap() api and the NUMA
   hints for devm allocations. (see libnvdimm-for-next [2]).

2/ Rebased on DAX fixes from Ross [3], currently in -mm, and Dave [4],
   applied locally for now.

3/ Renamed __pfn_t to pfn_t and converted KVM and UM accordingly (Dave
   Hansen)

4/ Make pfn-to-pfn_t conversions a nop (binary identical) for typical
   mapped pfns (Dave Hansen)

5/ Fixed up the devm_memremap_pages() api to require passing in a
   percpu_ref object.  Addresses a crash reported by Logan.

6/ Moved the back pointer from a page to its hosting 'struct
   dev_pagemap' to share storage with the 'lru' field rather than
   'mapping'.  This enables us to revoke mappings at
   devm_memunmap_pages() time and addresses a crash reported by Logan.

7/ Rework dax_map_bh() into dax_map_atomic() to avoid proliferating
   buffer_head usage deeper into the dax implementation.  Also addresses
   a crash reported by Logan (Dave Chinner)

8/ Include an initial, only lightly tested, implementation of revoking
   usages of ZONE_DEVICE pages when the driver disables the pmem device.
   This coordinates with blk_cleanup_queue() for the pmem gendisk, see
   patch 19.

9/ Include a cleaned up version of the vmem_altmap infrastructure
   allowing the struct page memmap to optionally be allocated from pmem
   itself.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
[2]: https://git.kernel.org/cgit/linux/kernel/git/nvdimm/nvdimm.git/log/?h=libnvdimm-for-next
[3]: https://git.kernel.org/cgit/linux/kernel/git/nvdimm/nvdimm.git/commit/?h=dax-fixes&id=93fdde069dce
[4]: https://lists.01.org/pipermail/linux-nvdimm/2015-October/002286.html

---
To date, we have implemented two I/O usage models for persistent memory,
PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
userspace).  This series adds a third, DAX-GUP, that allows DAX mappings
to be the target of direct-I/O.  It allows userspace to coordinate
DMA/RDMA from/to persistent memory.

The implementation leverages the ZONE_DEVICE mm-zone that went into
4.3-rc1 to flag pages that are owned and dynamically mapped by a device
driver.  The pmem driver, after mapping a persistent memory range into
the system memmap via devm_memremap_pages(), arranges for DAX to
distinguish pfn-only versus page-backed pmem-pfns via flags in the new
__pfn_t type.  The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn,
flags the resulting pte(s) inserted into the process page tables with a
new _PAGE_DEVMAP flag.  Later, when get_user_pages() walks the ptes, it
keys off _PAGE_DEVMAP to pin the device hosting the page range, keeping
it active.  Finally, get_page() and put_page() are modified to take
references against the device-driver-established page mapping.
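
Condensed, the end-to-end flow looks like the sketch below.  This is
illustrative only: the helper names follow patches 13, 14, 17, and 20 of
this series, but the signatures are approximate.

	/* driver (pmem): the pfn is device-backed and has a struct page */
	pfn_t pfn = phys_to_pfn_t(phys, PFN_DEV | PFN_MAP);

	/* dax fault: the inserted pte is tagged with _PAGE_DEVMAP */
	vm_insert_mixed(vma, vaddr, pfn);

	/* gup: a devmap pte pins the hosting device's pagemap */
	if (pte_devmap(pte)) {
		pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
		if (!pgmap)
			return 0;	/* device is being torn down */
	}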

This series is available via git here:

  git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm libnvdimm-pending

---

Dan Williams (20):
      block: generic request_queue reference counting
      dax: increase granularity of dax_clear_blocks() operations
      block, dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()
      mm: introduce __get_dev_pagemap()
      x86, mm: introduce vmem_altmap to augment vmemmap_populate()
      libnvdimm, pfn, pmem: allocate memmap array in persistent memory
      avr32: convert to asm-generic/memory_model.h
      hugetlb: fix compile error on tile
      frv: fix compiler warning from definition of __pmd()
      um: kill pfn_t
      kvm: rename pfn_t to kvm_pfn_t
      mips: fix PAGE_MASK definition
      mm, dax, pmem: introduce pfn_t
      mm, dax, gpu: convert vm_insert_mixed to pfn_t, introduce _PAGE_DEVMAP
      mm, dax: convert vmf_insert_pfn_pmd() to pfn_t
      list: introduce list_poison() and LIST_POISON3
      mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup
      block: notify queue death confirmation
      mm, pmem: devm_memunmap_pages(), truncate and unmap ZONE_DEVICE pages
      mm, x86: get_user_pages() for dax mappings


 arch/alpha/include/asm/pgtable.h        |    1 
 arch/arm/include/asm/kvm_mmu.h          |    5 -
 arch/arm/kvm/mmu.c                      |   10 +
 arch/arm64/include/asm/kvm_mmu.h        |    3 
 arch/avr32/include/asm/page.h           |    8 -
 arch/frv/include/asm/page.h             |    2 
 arch/ia64/include/asm/pgtable.h         |    1 
 arch/m68k/include/asm/page_mm.h         |    1 
 arch/m68k/include/asm/page_no.h         |    1 
 arch/mips/include/asm/kvm_host.h        |    6 -
 arch/mips/include/asm/page.h            |    2 
 arch/mips/kvm/emulate.c                 |    2 
 arch/mips/kvm/tlb.c                     |   14 +
 arch/parisc/include/asm/pgtable.h       |    1 
 arch/powerpc/include/asm/kvm_book3s.h   |    4 
 arch/powerpc/include/asm/kvm_ppc.h      |    2 
 arch/powerpc/include/asm/pgtable.h      |    1 
 arch/powerpc/kvm/book3s.c               |    6 -
 arch/powerpc/kvm/book3s_32_mmu_host.c   |    2 
 arch/powerpc/kvm/book3s_64_mmu_host.c   |    2 
 arch/powerpc/kvm/e500.h                 |    2 
 arch/powerpc/kvm/e500_mmu_host.c        |    8 -
 arch/powerpc/kvm/trace_pr.h             |    2 
 arch/powerpc/sysdev/axonram.c           |    8 -
 arch/sparc/include/asm/pgtable_64.h     |    2 
 arch/tile/include/asm/pgtable.h         |    1 
 arch/um/include/asm/page.h              |    6 -
 arch/um/include/asm/pgtable-3level.h    |    5 -
 arch/um/include/asm/pgtable.h           |    2 
 arch/x86/include/asm/pgtable.h          |   24 ++
 arch/x86/include/asm/pgtable_types.h    |    7 +
 arch/x86/kvm/iommu.c                    |   11 +
 arch/x86/kvm/mmu.c                      |   37 ++--
 arch/x86/kvm/mmu_audit.c                |    2 
 arch/x86/kvm/paging_tmpl.h              |    6 -
 arch/x86/kvm/vmx.c                      |    2 
 arch/x86/kvm/x86.c                      |    2 
 arch/x86/mm/gup.c                       |   56 +++++-
 arch/x86/mm/init_64.c                   |   32 +++
 arch/x86/mm/pat.c                       |    4 
 block/blk-core.c                        |   79 +++++++-
 block/blk-mq-sysfs.c                    |    6 -
 block/blk-mq.c                          |   87 +++------
 block/blk-sysfs.c                       |    3 
 block/blk.h                             |   12 +
 drivers/block/brd.c                     |    4 
 drivers/gpu/drm/exynos/exynos_drm_gem.c |    3 
 drivers/gpu/drm/gma500/framebuffer.c    |    3 
 drivers/gpu/drm/msm/msm_gem.c           |    3 
 drivers/gpu/drm/omapdrm/omap_gem.c      |    6 -
 drivers/gpu/drm/ttm/ttm_bo_vm.c         |    3 
 drivers/nvdimm/pfn_devs.c               |    3 
 drivers/nvdimm/pmem.c                   |  128 +++++++++----
 drivers/s390/block/dcssblk.c            |   10 -
 fs/block_dev.c                          |    2 
 fs/dax.c                                |  199 +++++++++++++--------
 include/asm-generic/pgtable.h           |    6 -
 include/linux/blk-mq.h                  |    1 
 include/linux/blkdev.h                  |   12 +
 include/linux/huge_mm.h                 |    2 
 include/linux/hugetlb.h                 |    1 
 include/linux/io.h                      |   17 --
 include/linux/kvm_host.h                |   37 ++--
 include/linux/kvm_types.h               |    2 
 include/linux/list.h                    |   14 +
 include/linux/memory_hotplug.h          |    3 
 include/linux/mm.h                      |  300 +++++++++++++++++++++++++++++--
 include/linux/mm_types.h                |    5 +
 include/linux/pfn.h                     |    9 +
 include/linux/poison.h                  |    1 
 kernel/memremap.c                       |  187 +++++++++++++++++++
 lib/list_debug.c                        |    2 
 mm/gup.c                                |   11 +
 mm/huge_memory.c                        |   10 +
 mm/hugetlb.c                            |   18 ++
 mm/memory.c                             |   17 +-
 mm/memory_hotplug.c                     |   66 +++++--
 mm/page_alloc.c                         |   10 +
 mm/sparse-vmemmap.c                     |   37 ++++
 mm/sparse.c                             |    8 +
 mm/swap.c                               |   15 ++
 virt/kvm/kvm_main.c                     |   47 ++---
 82 files changed, 1264 insertions(+), 418 deletions(-)


* [PATCH v2 01/20] block: generic request_queue reference counting
  2015-10-10  0:55 [PATCH v2 00/20] get_user_pages() for dax mappings Dan Williams
@ 2015-10-10  0:55 ` Dan Williams
  2015-10-11 12:59   ` Christoph Hellwig
  2015-10-10  0:55 ` [PATCH v2 02/20] dax: increase granularity of dax_clear_blocks() operations Dan Williams
                   ` (19 subsequent siblings)
  20 siblings, 1 reply; 37+ messages in thread
From: Dan Williams @ 2015-10-10  0:55 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jens Axboe, linux-kernel, Keith Busch, linux-mm, ross.zwisler, hch

Allow pmem, and other synchronous/bio-based block drivers, to fall back
on a per-cpu reference count managed by the core for tracking queue
live/dead state.

The existing per-cpu reference count for the blk_mq case is promoted to
be used in all block i/o scenarios.  This involves initializing it by
default, waiting for it to drop to zero at exit, and holding a live
reference over the invocation of q->make_request_fn() in
generic_make_request().  The blk_mq code continues to take its own
reference per blk_mq request and retains the ability to freeze the
queue, but the check that the queue is frozen is moved to
generic_make_request().
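
Condensed from the diff below, the new q->q_usage_counter lifecycle is:

	/* blk_alloc_queue_node(): start in atomic mode for fast shutdown */
	percpu_ref_init(&q->q_usage_counter,
			blk_queue_usage_counter_release,
			PERCPU_REF_INIT_ATOMIC, GFP_KERNEL);

	/* generic_make_request(): hold a live reference per bio */
	if (blk_queue_enter(q, __GFP_WAIT) == 0) {
		q->make_request_fn(q, bio);
		blk_queue_exit(q);
	} else
		bio_io_error(bio);	/* queue is dying */

	/* blk_cleanup_queue(): kill the ref and wait for it to drain */
	blk_freeze_queue(q);
	...
	percpu_ref_exit(&q->q_usage_counter);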

This fixes crash signatures like the following:

 BUG: unable to handle kernel paging request at ffff880140000000
 [..]
 Call Trace:
  [<ffffffff8145e8bf>] ? copy_user_handle_tail+0x5f/0x70
  [<ffffffffa004e1e0>] pmem_do_bvec.isra.11+0x70/0xf0 [nd_pmem]
  [<ffffffffa004e331>] pmem_make_request+0xd1/0x200 [nd_pmem]
  [<ffffffff811c3162>] ? mempool_alloc+0x72/0x1a0
  [<ffffffff8141f8b6>] generic_make_request+0xd6/0x110
  [<ffffffff8141f966>] submit_bio+0x76/0x170
  [<ffffffff81286dff>] submit_bh_wbc+0x12f/0x160
  [<ffffffff81286e62>] submit_bh+0x12/0x20
  [<ffffffff813395bd>] jbd2_write_superblock+0x8d/0x170
  [<ffffffff8133974d>] jbd2_mark_journal_empty+0x5d/0x90
  [<ffffffff813399cb>] jbd2_journal_destroy+0x24b/0x270
  [<ffffffff810bc4ca>] ? put_pwq_unlocked+0x2a/0x30
  [<ffffffff810bc6f5>] ? destroy_workqueue+0x225/0x250
  [<ffffffff81303494>] ext4_put_super+0x64/0x360
  [<ffffffff8124ab1a>] generic_shutdown_super+0x6a/0xf0

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 block/blk-core.c       |   71 +++++++++++++++++++++++++++++++++++++------
 block/blk-mq-sysfs.c   |    6 ----
 block/blk-mq.c         |   80 ++++++++++++++----------------------------------
 block/blk-sysfs.c      |    3 +-
 block/blk.h            |   14 ++++++++
 include/linux/blk-mq.h |    1 -
 include/linux/blkdev.h |    2 +
 7 files changed, 102 insertions(+), 75 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 2eb722d48773..9b4d735cb5b8 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -554,13 +554,10 @@ void blk_cleanup_queue(struct request_queue *q)
 	 * Drain all requests queued before DYING marking. Set DEAD flag to
 	 * prevent that q->request_fn() gets invoked after draining finished.
 	 */
-	if (q->mq_ops) {
-		blk_mq_freeze_queue(q);
-		spin_lock_irq(lock);
-	} else {
-		spin_lock_irq(lock);
+	blk_freeze_queue(q);
+	spin_lock_irq(lock);
+	if (!q->mq_ops)
 		__blk_drain_queue(q, true);
-	}
 	queue_flag_set(QUEUE_FLAG_DEAD, q);
 	spin_unlock_irq(lock);
 
@@ -570,6 +567,7 @@ void blk_cleanup_queue(struct request_queue *q)
 
 	if (q->mq_ops)
 		blk_mq_free_queue(q);
+	percpu_ref_exit(&q->q_usage_counter);
 
 	spin_lock_irq(lock);
 	if (q->queue_lock != &q->__queue_lock)
@@ -629,6 +627,40 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(blk_alloc_queue);
 
+int blk_queue_enter(struct request_queue *q, gfp_t gfp)
+{
+	while (true) {
+		int ret;
+
+		if (percpu_ref_tryget_live(&q->q_usage_counter))
+			return 0;
+
+		if (!(gfp & __GFP_WAIT))
+			return -EBUSY;
+
+		ret = wait_event_interruptible(q->mq_freeze_wq,
+				!atomic_read(&q->mq_freeze_depth) ||
+				blk_queue_dying(q));
+		if (blk_queue_dying(q))
+			return -ENODEV;
+		if (ret)
+			return ret;
+	}
+}
+
+void blk_queue_exit(struct request_queue *q)
+{
+	percpu_ref_put(&q->q_usage_counter);
+}
+
+static void blk_queue_usage_counter_release(struct percpu_ref *ref)
+{
+	struct request_queue *q =
+		container_of(ref, struct request_queue, q_usage_counter);
+
+	wake_up_all(&q->mq_freeze_wq);
+}
+
 struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 {
 	struct request_queue *q;
@@ -690,11 +722,22 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 
 	init_waitqueue_head(&q->mq_freeze_wq);
 
-	if (blkcg_init_queue(q))
+	/*
+	 * Init percpu_ref in atomic mode so that it's faster to shutdown.
+	 * See blk_register_queue() for details.
+	 */
+	if (percpu_ref_init(&q->q_usage_counter,
+				blk_queue_usage_counter_release,
+				PERCPU_REF_INIT_ATOMIC, GFP_KERNEL))
 		goto fail_bdi;
 
+	if (blkcg_init_queue(q))
+		goto fail_ref;
+
 	return q;
 
+fail_ref:
+	percpu_ref_exit(&q->q_usage_counter);
 fail_bdi:
 	bdi_destroy(&q->backing_dev_info);
 fail_split:
@@ -1966,9 +2009,19 @@ void generic_make_request(struct bio *bio)
 	do {
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
-		q->make_request_fn(q, bio);
+		if (likely(blk_queue_enter(q, __GFP_WAIT) == 0)) {
+
+			q->make_request_fn(q, bio);
+
+			blk_queue_exit(q);
 
-		bio = bio_list_pop(current->bio_list);
+			bio = bio_list_pop(current->bio_list);
+		} else {
+			struct bio *bio_next = bio_list_pop(current->bio_list);
+
+			bio_io_error(bio);
+			bio = bio_next;
+		}
 	} while (bio);
 	current->bio_list = NULL; /* deactivate */
 }
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 788fffd9b409..6f57a110289c 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -413,12 +413,6 @@ static void blk_mq_sysfs_init(struct request_queue *q)
 		kobject_init(&ctx->kobj, &blk_mq_ctx_ktype);
 }
 
-/* see blk_register_queue() */
-void blk_mq_finish_init(struct request_queue *q)
-{
-	percpu_ref_switch_to_percpu(&q->mq_usage_counter);
-}
-
 int blk_mq_register_disk(struct gendisk *disk)
 {
 	struct device *dev = disk_to_dev(disk);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7785ae96267a..c371aeda2986 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -77,47 +77,13 @@ static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
 	clear_bit(CTX_TO_BIT(hctx, ctx), &bm->word);
 }
 
-static int blk_mq_queue_enter(struct request_queue *q, gfp_t gfp)
-{
-	while (true) {
-		int ret;
-
-		if (percpu_ref_tryget_live(&q->mq_usage_counter))
-			return 0;
-
-		if (!(gfp & __GFP_WAIT))
-			return -EBUSY;
-
-		ret = wait_event_interruptible(q->mq_freeze_wq,
-				!atomic_read(&q->mq_freeze_depth) ||
-				blk_queue_dying(q));
-		if (blk_queue_dying(q))
-			return -ENODEV;
-		if (ret)
-			return ret;
-	}
-}
-
-static void blk_mq_queue_exit(struct request_queue *q)
-{
-	percpu_ref_put(&q->mq_usage_counter);
-}
-
-static void blk_mq_usage_counter_release(struct percpu_ref *ref)
-{
-	struct request_queue *q =
-		container_of(ref, struct request_queue, mq_usage_counter);
-
-	wake_up_all(&q->mq_freeze_wq);
-}
-
 void blk_mq_freeze_queue_start(struct request_queue *q)
 {
 	int freeze_depth;
 
 	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
 	if (freeze_depth == 1) {
-		percpu_ref_kill(&q->mq_usage_counter);
+		percpu_ref_kill(&q->q_usage_counter);
 		blk_mq_run_hw_queues(q, false);
 	}
 }
@@ -125,18 +91,34 @@ EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_start);
 
 static void blk_mq_freeze_queue_wait(struct request_queue *q)
 {
-	wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->mq_usage_counter));
+	wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
 }
 
 /*
  * Guarantee no request is in use, so we can change any data structure of
  * the queue afterward.
  */
-void blk_mq_freeze_queue(struct request_queue *q)
+void blk_freeze_queue(struct request_queue *q)
 {
+	/*
+	 * In the !blk_mq case we are only calling this to kill the
+	 * q_usage_counter, otherwise this increases the freeze depth
+	 * and waits for it to return to zero.  For this reason there is
+	 * no blk_unfreeze_queue(), and blk_freeze_queue() is not
+	 * exported to drivers as the only user for unfreeze is blk_mq.
+	 */
 	blk_mq_freeze_queue_start(q);
 	blk_mq_freeze_queue_wait(q);
 }
+
+void blk_mq_freeze_queue(struct request_queue *q)
+{
+	/*
+	 * ...just an alias to keep freeze and unfreeze actions balanced
+	 * in the blk_mq_* namespace
+	 */
+	blk_freeze_queue(q);
+}
 EXPORT_SYMBOL_GPL(blk_mq_freeze_queue);
 
 void blk_mq_unfreeze_queue(struct request_queue *q)
@@ -146,7 +128,7 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
 	freeze_depth = atomic_dec_return(&q->mq_freeze_depth);
 	WARN_ON_ONCE(freeze_depth < 0);
 	if (!freeze_depth) {
-		percpu_ref_reinit(&q->mq_usage_counter);
+		percpu_ref_reinit(&q->q_usage_counter);
 		wake_up_all(&q->mq_freeze_wq);
 	}
 }
@@ -255,7 +237,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, int rw, gfp_t gfp,
 	struct blk_mq_alloc_data alloc_data;
 	int ret;
 
-	ret = blk_mq_queue_enter(q, gfp);
+	ret = blk_queue_enter(q, gfp);
 	if (ret)
 		return ERR_PTR(ret);
 
@@ -278,7 +260,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, int rw, gfp_t gfp,
 	}
 	blk_mq_put_ctx(ctx);
 	if (!rq) {
-		blk_mq_queue_exit(q);
+		blk_queue_exit(q);
 		return ERR_PTR(-EWOULDBLOCK);
 	}
 	return rq;
@@ -297,7 +279,7 @@ static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx,
 
 	clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
 	blk_mq_put_tag(hctx, tag, &ctx->last_tag);
-	blk_mq_queue_exit(q);
+	blk_queue_exit(q);
 }
 
 void blk_mq_free_hctx_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
@@ -1176,11 +1158,7 @@ static struct request *blk_mq_map_request(struct request_queue *q,
 	int rw = bio_data_dir(bio);
 	struct blk_mq_alloc_data alloc_data;
 
-	if (unlikely(blk_mq_queue_enter(q, GFP_KERNEL))) {
-		bio_io_error(bio);
-		return NULL;
-	}
-
+	blk_queue_enter_live(q);
 	ctx = blk_mq_get_ctx(q);
 	hctx = q->mq_ops->map_queue(q, ctx->cpu);
 
@@ -1989,14 +1967,6 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 		hctxs[i]->queue_num = i;
 	}
 
-	/*
-	 * Init percpu_ref in atomic mode so that it's faster to shutdown.
-	 * See blk_register_queue() for details.
-	 */
-	if (percpu_ref_init(&q->mq_usage_counter, blk_mq_usage_counter_release,
-			    PERCPU_REF_INIT_ATOMIC, GFP_KERNEL))
-		goto err_hctxs;
-
 	setup_timer(&q->timeout, blk_mq_rq_timer, (unsigned long) q);
 	blk_queue_rq_timeout(q, set->timeout ? set->timeout : 30 * HZ);
 
@@ -2077,8 +2047,6 @@ void blk_mq_free_queue(struct request_queue *q)
 
 	blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
 	blk_mq_free_hw_queues(q, set);
-
-	percpu_ref_exit(&q->mq_usage_counter);
 }
 
 /* Basically redo blk_mq_init_queue with queue frozen */
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 3e44a9da2a13..61fc2633bbea 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -599,9 +599,8 @@ int blk_register_queue(struct gendisk *disk)
 	 */
 	if (!blk_queue_init_done(q)) {
 		queue_flag_set_unlocked(QUEUE_FLAG_INIT_DONE, q);
+		percpu_ref_switch_to_percpu(&q->q_usage_counter);
 		blk_queue_bypass_end(q);
-		if (q->mq_ops)
-			blk_mq_finish_init(q);
 	}
 
 	ret = blk_trace_init_sysfs(dev);
diff --git a/block/blk.h b/block/blk.h
index 98614ad37c81..5b2cd393afbe 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -72,6 +72,20 @@ void blk_dequeue_request(struct request *rq);
 void __blk_queue_free_tags(struct request_queue *q);
 bool __blk_end_bidi_request(struct request *rq, int error,
 			    unsigned int nr_bytes, unsigned int bidi_bytes);
+int blk_queue_enter(struct request_queue *q, gfp_t gfp);
+void blk_queue_exit(struct request_queue *q);
+void blk_freeze_queue(struct request_queue *q);
+
+static inline void blk_queue_enter_live(struct request_queue *q)
+{
+	/*
+	 * Given that running in generic_make_request() context
+	 * guarantees that a live reference against q_usage_counter has
+	 * been established, further references under that same context
+	 * need not check that the queue has been frozen (marked dead).
+	 */
+	percpu_ref_get(&q->q_usage_counter);
+}
 
 void blk_rq_timed_out_timer(unsigned long data);
 unsigned long blk_rq_timeout(unsigned long timeout);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 5e7d43ab61c0..83cc9d4e5455 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -166,7 +166,6 @@ enum {
 struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *);
 struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 						  struct request_queue *q);
-void blk_mq_finish_init(struct request_queue *q);
 int blk_mq_register_disk(struct gendisk *);
 void blk_mq_unregister_disk(struct gendisk *);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 19c2e947d4d1..dd761b663e6a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -450,7 +450,7 @@ struct request_queue {
 #endif
 	struct rcu_head		rcu_head;
 	wait_queue_head_t	mq_freeze_wq;
-	struct percpu_ref	mq_usage_counter;
+	struct percpu_ref	q_usage_counter;
 	struct list_head	all_q_node;
 
 	struct blk_mq_tag_set	*tag_set;


* [PATCH v2 02/20] dax: increase granularity of dax_clear_blocks() operations
  2015-10-10  0:55 [PATCH v2 00/20] get_user_pages() for dax mappings Dan Williams
  2015-10-10  0:55 ` [PATCH v2 01/20] block: generic request_queue reference counting Dan Williams
@ 2015-10-10  0:55 ` Dan Williams
  2015-10-10  0:55 ` [PATCH v2 03/20] block, dax: fix lifetime of in-kernel dax mappings with dax_map_atomic() Dan Williams
                   ` (18 subsequent siblings)
  20 siblings, 0 replies; 37+ messages in thread
From: Dan Williams @ 2015-10-10  0:55 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: linux-mm, ross.zwisler, linux-kernel, hch

dax_clear_blocks() is currently performing a cond_resched() after every
PAGE_SIZE memset.  We need not check so frequently; for example, md-raid
only calls cond_resched() at stripe granularity.  Also, in preparation
for introducing a dax_map_atomic() operation that temporarily pins a dax
mapping, move the call to cond_resched() to the outer loop.
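
For scale, with 4K pages a 1GiB clear currently performs 262,144
cond_resched() checks; at the SZ_1M granularity used below that drops to
1,024.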

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c |   27 ++++++++++++---------------
 1 file changed, 12 insertions(+), 15 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index cc9a6e3d7389..7031b0312596 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -28,6 +28,7 @@
 #include <linux/sched.h>
 #include <linux/uio.h>
 #include <linux/vmstat.h>
+#include <linux/sizes.h>
 
 /*
  * dax_clear_blocks() is called from within transaction context from XFS,
@@ -43,24 +44,20 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 	do {
 		void __pmem *addr;
 		unsigned long pfn;
-		long count;
+		long count, sz;
 
-		count = bdev_direct_access(bdev, sector, &addr, &pfn, size);
+		sz = min_t(long, size, SZ_1M);
+		count = bdev_direct_access(bdev, sector, &addr, &pfn, sz);
 		if (count < 0)
 			return count;
-		BUG_ON(size < count);
-		while (count > 0) {
-			unsigned pgsz = PAGE_SIZE - offset_in_page(addr);
-			if (pgsz > count)
-				pgsz = count;
-			clear_pmem(addr, pgsz);
-			addr += pgsz;
-			size -= pgsz;
-			count -= pgsz;
-			BUG_ON(pgsz & 511);
-			sector += pgsz / 512;
-			cond_resched();
-		}
+		if (count < sz)
+			sz = count;
+		clear_pmem(addr, sz);
+		addr += sz;
+		size -= sz;
+		BUG_ON(sz & 511);
+		sector += sz / 512;
+		cond_resched();
 	} while (size);
 
 	wmb_pmem();


* [PATCH v2 03/20] block, dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()
  2015-10-10  0:55 [PATCH v2 00/20] get_user_pages() for dax mappings Dan Williams
  2015-10-10  0:55 ` [PATCH v2 01/20] block: generic request_queue reference counting Dan Williams
  2015-10-10  0:55 ` [PATCH v2 02/20] dax: increase granularity of dax_clear_blocks() operations Dan Williams
@ 2015-10-10  0:55 ` Dan Williams
  2015-10-10  0:55 ` [PATCH v2 04/20] mm: introduce __get_dev_pagemap() Dan Williams
                   ` (17 subsequent siblings)
  20 siblings, 0 replies; 37+ messages in thread
From: Dan Williams @ 2015-10-10  0:55 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jens Axboe, Boaz Harrosh, Dave Chinner, linux-kernel, linux-mm,
	ross.zwisler, hch

The DAX implementation needs to protect new calls to ->direct_access()
and usage of its return value against unbind of the underlying block
device.  Use blk_queue_enter()/blk_queue_exit() to either prevent
blk_cleanup_queue() from proceeding, or fail the dax_map_atomic() if the
request_queue is being torn down.
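
Condensed from the diff below, every in-kernel dax access now brackets
the ->direct_access() result with a queue reference:

	addr = dax_map_atomic(bdev, sector, size);
	if (IS_ERR(addr))
		return PTR_ERR(addr);	/* -EIO if the queue is dying */
	/* ...use the persistent memory mapping... */
	dax_unmap_atomic(bdev, addr);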

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 block/blk.h            |    2 -
 fs/dax.c               |  165 ++++++++++++++++++++++++++++++++----------------
 include/linux/blkdev.h |    2 +
 3 files changed, 112 insertions(+), 57 deletions(-)

diff --git a/block/blk.h b/block/blk.h
index 5b2cd393afbe..0f8de0dda768 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -72,8 +72,6 @@ void blk_dequeue_request(struct request *rq);
 void __blk_queue_free_tags(struct request_queue *q);
 bool __blk_end_bidi_request(struct request *rq, int error,
 			    unsigned int nr_bytes, unsigned int bidi_bytes);
-int blk_queue_enter(struct request_queue *q, gfp_t gfp);
-void blk_queue_exit(struct request_queue *q);
 void blk_freeze_queue(struct request_queue *q);
 
 static inline void blk_queue_enter_live(struct request_queue *q)
diff --git a/fs/dax.c b/fs/dax.c
index 7031b0312596..9549cd523649 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -30,6 +30,40 @@
 #include <linux/vmstat.h>
 #include <linux/sizes.h>
 
+static void __pmem *__dax_map_atomic(struct block_device *bdev, sector_t sector,
+		long size, unsigned long *pfn, long *len)
+{
+	long rc;
+	void __pmem *addr;
+	struct request_queue *q = bdev->bd_queue;
+
+	if (blk_queue_enter(q, GFP_NOWAIT) != 0)
+		return (void __pmem *) ERR_PTR(-EIO);
+	rc = bdev_direct_access(bdev, sector, &addr, pfn, size);
+	if (len)
+		*len = rc;
+	if (rc < 0) {
+		blk_queue_exit(q);
+		return (void __pmem *) ERR_PTR(rc);
+	}
+	return addr;
+}
+
+static void __pmem *dax_map_atomic(struct block_device *bdev, sector_t sector,
+		long size)
+{
+	unsigned long pfn;
+
+	return __dax_map_atomic(bdev, sector, size, &pfn, NULL);
+}
+
+static void dax_unmap_atomic(struct block_device *bdev, void __pmem *addr)
+{
+	if (IS_ERR(addr))
+		return;
+	blk_queue_exit(bdev->bd_queue);
+}
+
 /*
  * dax_clear_blocks() is called from within transaction context from XFS,
  * and hence this means the stack from this point must follow GFP_NOFS
@@ -47,9 +81,9 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 		long count, sz;
 
 		sz = min_t(long, size, SZ_1M);
-		count = bdev_direct_access(bdev, sector, &addr, &pfn, sz);
-		if (count < 0)
-			return count;
+		addr = __dax_map_atomic(bdev, sector, size, &pfn, &count);
+		if (IS_ERR(addr))
+			return PTR_ERR(addr);
 		if (count < sz)
 			sz = count;
 		clear_pmem(addr, sz);
@@ -57,6 +91,7 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 		size -= sz;
 		BUG_ON(sz & 511);
 		sector += sz / 512;
+		dax_unmap_atomic(bdev, addr);
 		cond_resched();
 	} while (size);
 
@@ -65,14 +100,6 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 }
 EXPORT_SYMBOL_GPL(dax_clear_blocks);
 
-static long dax_get_addr(struct buffer_head *bh, void __pmem **addr,
-		unsigned blkbits)
-{
-	unsigned long pfn;
-	sector_t sector = bh->b_blocknr << (blkbits - 9);
-	return bdev_direct_access(bh->b_bdev, sector, addr, &pfn, bh->b_size);
-}
-
 /* the clear_pmem() calls are ordered by a wmb_pmem() in the caller */
 static void dax_new_buf(void __pmem *addr, unsigned size, unsigned first,
 		loff_t pos, loff_t end)
@@ -102,19 +129,30 @@ static bool buffer_size_valid(struct buffer_head *bh)
 	return bh->b_state != 0;
 }
 
+
+static sector_t to_sector(const struct buffer_head *bh,
+		const struct inode *inode)
+{
+	sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
+
+	return sector;
+}
+
 static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 		      loff_t start, loff_t end, get_block_t get_block,
 		      struct buffer_head *bh)
 {
-	ssize_t retval = 0;
-	loff_t pos = start;
-	loff_t max = start;
-	loff_t bh_max = start;
-	void __pmem *addr;
+	loff_t pos = start, max = start, bh_max = start;
+	struct block_device *bdev = NULL;
+	int rw = iov_iter_rw(iter), rc;
+	long map_len = 0;
+	unsigned long pfn;
+	void __pmem *addr = NULL;
+	void __pmem *kmap = (void __pmem *) ERR_PTR(-EIO);
 	bool hole = false;
 	bool need_wmb = false;
 
-	if (iov_iter_rw(iter) != WRITE)
+	if (rw == READ)
 		end = min(end, i_size_read(inode));
 
 	while (pos < end) {
@@ -129,13 +167,13 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 			if (pos == bh_max) {
 				bh->b_size = PAGE_ALIGN(end - pos);
 				bh->b_state = 0;
-				retval = get_block(inode, block, bh,
-						   iov_iter_rw(iter) == WRITE);
-				if (retval)
+				rc = get_block(inode, block, bh, rw == WRITE);
+				if (rc)
 					break;
 				if (!buffer_size_valid(bh))
 					bh->b_size = 1 << blkbits;
 				bh_max = pos - first + bh->b_size;
+				bdev = bh->b_bdev;
 			} else {
 				unsigned done = bh->b_size -
 						(bh_max - (pos - first));
@@ -143,21 +181,27 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 				bh->b_size -= done;
 			}
 
-			hole = iov_iter_rw(iter) != WRITE && !buffer_written(bh);
+			hole = rw == READ && !buffer_written(bh);
 			if (hole) {
 				addr = NULL;
 				size = bh->b_size - first;
 			} else {
-				retval = dax_get_addr(bh, &addr, blkbits);
-				if (retval < 0)
+				dax_unmap_atomic(bdev, kmap);
+				kmap = __dax_map_atomic(bdev,
+						to_sector(bh, inode),
+						bh->b_size, &pfn, &map_len);
+				if (IS_ERR(kmap)) {
+					rc = PTR_ERR(kmap);
 					break;
+				}
+				addr = kmap;
 				if (buffer_unwritten(bh) || buffer_new(bh)) {
-					dax_new_buf(addr, retval, first, pos,
-									end);
+					dax_new_buf(addr, map_len, first, pos,
+							end);
 					need_wmb = true;
 				}
 				addr += first;
-				size = retval - first;
+				size = map_len - first;
 			}
 			max = min(pos + size, end);
 		}
@@ -180,8 +224,9 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 
 	if (need_wmb)
 		wmb_pmem();
+	dax_unmap_atomic(bdev, kmap);
 
-	return (pos == start) ? retval : pos - start;
+	return (pos == start) ? rc : pos - start;
 }
 
 /**
@@ -270,28 +315,31 @@ static int dax_load_hole(struct address_space *mapping, struct page *page,
 	return VM_FAULT_LOCKED;
 }
 
-static int copy_user_bh(struct page *to, struct buffer_head *bh,
-			unsigned blkbits, unsigned long vaddr)
+static int copy_user_bh(struct page *to, struct inode *inode,
+		struct buffer_head *bh, unsigned long vaddr)
 {
+	struct block_device *bdev = bh->b_bdev;
 	void __pmem *vfrom;
 	void *vto;
 
-	if (dax_get_addr(bh, &vfrom, blkbits) < 0)
-		return -EIO;
+	vfrom = dax_map_atomic(bdev, to_sector(bh, inode), bh->b_size);
+	if (IS_ERR(vfrom))
+		return PTR_ERR(vfrom);
 	vto = kmap_atomic(to);
 	copy_user_page(vto, (void __force *)vfrom, vaddr, to);
 	kunmap_atomic(vto);
+	dax_unmap_atomic(bdev, vfrom);
 	return 0;
 }
 
 static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 			struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	struct address_space *mapping = inode->i_mapping;
-	sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
 	unsigned long vaddr = (unsigned long)vmf->virtual_address;
-	void __pmem *addr;
+	struct address_space *mapping = inode->i_mapping;
+	struct block_device *bdev = bh->b_bdev;
 	unsigned long pfn;
+	void __pmem *addr;
 	pgoff_t size;
 	int error;
 
@@ -310,11 +358,10 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 		goto out;
 	}
 
-	error = bdev_direct_access(bh->b_bdev, sector, &addr, &pfn, bh->b_size);
-	if (error < 0)
-		goto out;
-	if (error < PAGE_SIZE) {
-		error = -EIO;
+	addr = __dax_map_atomic(bdev, to_sector(bh, inode), bh->b_size,
+			&pfn, NULL);
+	if (IS_ERR(addr)) {
+		error = PTR_ERR(addr);
 		goto out;
 	}
 
@@ -322,6 +369,7 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 		clear_pmem(addr, PAGE_SIZE);
 		wmb_pmem();
 	}
+	dax_unmap_atomic(bdev, addr);
 
 	error = vm_insert_mixed(vma, vaddr, pfn);
 
@@ -417,7 +465,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	if (vmf->cow_page) {
 		struct page *new_page = vmf->cow_page;
 		if (buffer_written(&bh))
-			error = copy_user_bh(new_page, &bh, blkbits, vaddr);
+			error = copy_user_bh(new_page, inode, &bh, vaddr);
 		else
 			clear_user_highpage(new_page, vaddr);
 		if (error)
@@ -529,11 +577,9 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	unsigned blkbits = inode->i_blkbits;
 	unsigned long pmd_addr = address & PMD_MASK;
 	bool write = flags & FAULT_FLAG_WRITE;
-	long length;
-	void __pmem *kaddr;
+	struct block_device *bdev;
 	pgoff_t size, pgoff;
-	sector_t block, sector;
-	unsigned long pfn;
+	sector_t block;
 	int result = 0;
 
 	/* Fall back to PTEs if we're going to COW */
@@ -557,9 +603,9 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	block = (sector_t)pgoff << (PAGE_SHIFT - blkbits);
 
 	bh.b_size = PMD_SIZE;
-	length = get_block(inode, block, &bh, write);
-	if (length)
+	if (get_block(inode, block, &bh, write) != 0)
 		return VM_FAULT_SIGBUS;
+	bdev = bh.b_bdev;
 	i_mmap_lock_read(mapping);
 
 	/*
@@ -614,15 +660,20 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		result = VM_FAULT_NOPAGE;
 		spin_unlock(ptl);
 	} else {
-		sector = bh.b_blocknr << (blkbits - 9);
-		length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn,
-						bh.b_size);
-		if (length < 0) {
+		long length;
+		unsigned long pfn;
+		void __pmem *kaddr = __dax_map_atomic(bdev,
+				to_sector(&bh, inode), HPAGE_SIZE, &pfn,
+				&length);
+
+		if (IS_ERR(kaddr)) {
 			result = VM_FAULT_SIGBUS;
 			goto out;
 		}
-		if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR))
+		if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR)) {
+			dax_unmap_atomic(bdev, kaddr);
 			goto fallback;
+		}
 
 		if (buffer_unwritten(&bh) || buffer_new(&bh)) {
 			clear_pmem(kaddr, HPAGE_SIZE);
@@ -631,6 +682,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
 			result |= VM_FAULT_MAJOR;
 		}
+		dax_unmap_atomic(bdev, kaddr);
 
 		result |= vmf_insert_pfn_pmd(vma, address, pmd, pfn, write);
 	}
@@ -734,12 +786,15 @@ int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
 	if (err < 0)
 		return err;
 	if (buffer_written(&bh)) {
-		void __pmem *addr;
-		err = dax_get_addr(&bh, &addr, inode->i_blkbits);
-		if (err < 0)
-			return err;
+		struct block_device *bdev = bh.b_bdev;
+		void __pmem *addr = dax_map_atomic(bdev, to_sector(&bh, inode),
+				PAGE_CACHE_SIZE);
+
+		if (IS_ERR(addr))
+			return PTR_ERR(addr);
 		clear_pmem(addr + offset, length);
 		wmb_pmem();
+		dax_unmap_atomic(bdev, addr);
 	}
 
 	return 0;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index dd761b663e6a..cd091cb2b96e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -788,6 +788,8 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 			 struct scsi_ioctl_command __user *);
 
+extern int blk_queue_enter(struct request_queue *q, gfp_t gfp);
+extern void blk_queue_exit(struct request_queue *q);
 extern void blk_start_queue(struct request_queue *q);
 extern void blk_stop_queue(struct request_queue *q);
 extern void blk_sync_queue(struct request_queue *q);


* [PATCH v2 04/20] mm: introduce __get_dev_pagemap()
  2015-10-10  0:55 [PATCH v2 00/20] get_user_pages() for dax mappings Dan Williams
                   ` (2 preceding siblings ...)
  2015-10-10  0:55 ` [PATCH v2 03/20] block, dax: fix lifetime of in-kernel dax mappings with dax_map_atomic() Dan Williams
@ 2015-10-10  0:55 ` Dan Williams
  2015-10-10  0:55 ` [PATCH v2 05/20] x86, mm: introduce vmem_altmap to augment vmemmap_populate() Dan Williams
                   ` (16 subsequent siblings)
  20 siblings, 0 replies; 37+ messages in thread
From: Dan Williams @ 2015-10-10  0:55 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dave Chinner, linux-kernel, linux-mm, ross.zwisler, Andrew Morton, hch

There are several scenarios where we need to retrieve and update
metadata associated with a given devm_memremap_pages() mapping, and the
only lookup key available is a pfn in the range:

1/ We want to augment vmemmap_populate() (called via arch_add_memory())
   to allocate memmap storage from pre-allocated pages reserved by the
   device driver.  At vmemmap_alloc_block_buf() time it grabs device pages
   rather than page allocator pages.  This is in support of
   devm_memremap_pages() mappings where the memmap is too large to fit in
   main memory (i.e. large persistent memory devices).

2/ Taking a reference against the mapping when inserting device pages
   into the address_space radix tree of a given inode.  This facilitates
   unmap_mapping_range() and truncate_inode_pages() operations when the
   driver is tearing down the mapping.

3/ get_user_pages() operations on ZONE_DEVICE memory require taking a
   reference against the mapping so that the driver teardown path can
   revoke and drain usage of device pages.
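
In each case the lookup follows the same pattern.  A minimal sketch
against the interface added below, which assumes rcu_read_lock() is held
across the walk:

	rcu_read_lock();
	pgmap = __get_dev_pagemap(PFN_PHYS(pfn));
	if (pgmap)
		dev = pgmap->dev;	/* consult the mapping's metadata */
	rcu_read_unlock();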

Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/mm.h |   18 ++++++++++++++++++
 kernel/memremap.c  |   40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 58 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 80001de019ba..30c3c8764649 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -717,6 +717,24 @@ static inline enum zone_type page_zonenum(const struct page *page)
 	return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
 }
 
+/**
+ * struct dev_pagemap - metadata for ZONE_DEVICE mappings
+ * @dev: host device of the mapping for debug
+ */
+struct dev_pagemap {
+	/* TODO: vmem_altmap and percpu_ref count */
+	struct device *dev;
+};
+
+#ifdef CONFIG_ZONE_DEVICE
+struct dev_pagemap *__get_dev_pagemap(resource_size_t phys);
+#else
+static inline struct dev_pagemap *get_dev_pagemap(resource_size_t phys)
+{
+	return NULL;
+}
+#endif
+
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define SECTION_IN_PAGE_FLAGS
 #endif
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 3218e8b1fc28..64bfd9fa93aa 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -10,6 +10,7 @@
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  * General Public License for more details.
  */
+#include <linux/rculist.h>
 #include <linux/device.h>
 #include <linux/types.h>
 #include <linux/io.h>
@@ -138,18 +139,52 @@ void devm_memunmap(struct device *dev, void *addr)
 EXPORT_SYMBOL(devm_memunmap);
 
 #ifdef CONFIG_ZONE_DEVICE
+static LIST_HEAD(ranges);
+static DEFINE_SPINLOCK(range_lock);
+
 struct page_map {
 	struct resource res;
+	struct dev_pagemap pgmap;
+	struct list_head list;
 };
 
+static void add_page_map(struct page_map *page_map)
+{
+	spin_lock(&range_lock);
+	list_add_rcu(&page_map->list, &ranges);
+	spin_unlock(&range_lock);
+}
+
+static void del_page_map(struct page_map *page_map)
+{
+	spin_lock(&range_lock);
+	list_del_rcu(&page_map->list);
+	spin_unlock(&range_lock);
+}
+
 static void devm_memremap_pages_release(struct device *dev, void *res)
 {
 	struct page_map *page_map = res;
 
+	del_page_map(page_map);
+
 	/* pages are dead and unused, undo the arch mapping */
 	arch_remove_memory(page_map->res.start, resource_size(&page_map->res));
 }
 
+/* assumes rcu_read_lock() held at entry */
+struct dev_pagemap *__get_dev_pagemap(resource_size_t phys)
+{
+	struct page_map *page_map;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	list_for_each_entry_rcu(page_map, &ranges, list)
+		if (phys >= page_map->res.start && phys <= page_map->res.end)
+			return &page_map->pgmap;
+	return NULL;
+}
+
 void *devm_memremap_pages(struct device *dev, struct resource *res)
 {
 	int is_ram = region_intersects(res->start, resource_size(res),
@@ -173,12 +208,17 @@ void *devm_memremap_pages(struct device *dev, struct resource *res)
 
 	memcpy(&page_map->res, res, sizeof(*res));
 
+	page_map->pgmap.dev = dev;
+	INIT_LIST_HEAD(&page_map->list);
+	add_page_map(page_map);
+
 	nid = dev_to_node(dev);
 	if (nid < 0)
 		nid = numa_mem_id();
 
 	error = arch_add_memory(nid, res->start, resource_size(res), true);
 	if (error) {
+		del_page_map(page_map);
 		devres_free(page_map);
 		return ERR_PTR(error);
 	}


* [PATCH v2 05/20] x86, mm: introduce vmem_altmap to augment vmemmap_populate()
  2015-10-10  0:55 [PATCH v2 00/20] get_user_pages() for dax mappings Dan Williams
                   ` (3 preceding siblings ...)
  2015-10-10  0:55 ` [PATCH v2 04/20] mm: introduce __get_dev_pagemap() Dan Williams
@ 2015-10-10  0:55 ` Dan Williams
  2015-10-19 22:53   ` Williams, Dan J
  2015-10-10  0:55 ` [PATCH v2 06/20] libnvdimm, pfn, pmem: allocate memmap array in persistent memory Dan Williams
                   ` (15 subsequent siblings)
  20 siblings, 1 reply; 37+ messages in thread
From: Dan Williams @ 2015-10-10  0:55 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dave Hansen, linux-kernel, linux-mm, Ingo Molnar, H. Peter Anvin,
	Andrew Morton, ross.zwisler, hch

In support of providing struct page for large persistent memory
capacities, use struct vmem_altmap to change the default policy for
allocating memory for the memmap array.  The default vmemmap_populate()
allocates page table storage from the page allocator.  Given persistent
memory capacities relative to DRAM, it may not be feasible to store the
memmap in 'System Memory'.  Instead, vmem_altmap represents pre-allocated
"device pages" to satisfy vmemmap_alloc_block_buf() requests.
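
A minimal sketch of the intended driver-side usage (the real pmem
enabling follows in patch 6; 'nr_driver_pfns' and 'nr_memmap_pfns' are
hypothetical placeholders):

	struct vmem_altmap altmap = {
		.base_pfn = res->start >> PAGE_SHIFT,
		.reserve = nr_driver_pfns,	/* pages kept back for driver use */
		.free = nr_memmap_pfns,		/* pages set aside for memmap storage */
	};

	addr = devm_memremap_pages(dev, res, &altmap);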

Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/m68k/include/asm/page_mm.h |    1 
 arch/m68k/include/asm/page_no.h |    1 
 arch/x86/mm/init_64.c           |   32 ++++++++++---
 drivers/nvdimm/pmem.c           |    6 ++
 include/linux/io.h              |   17 -------
 include/linux/memory_hotplug.h  |    3 +
 include/linux/mm.h              |   95 ++++++++++++++++++++++++++++++++++++++-
 kernel/memremap.c               |   69 +++++++++++++++++++++++++---
 mm/memory_hotplug.c             |   66 +++++++++++++++++++--------
 mm/page_alloc.c                 |   10 ++++
 mm/sparse-vmemmap.c             |   37 ++++++++++++++-
 mm/sparse.c                     |    8 ++-
 12 files changed, 282 insertions(+), 63 deletions(-)

diff --git a/arch/m68k/include/asm/page_mm.h b/arch/m68k/include/asm/page_mm.h
index 5029f73e6294..884f2f7e4caf 100644
--- a/arch/m68k/include/asm/page_mm.h
+++ b/arch/m68k/include/asm/page_mm.h
@@ -125,6 +125,7 @@ static inline void *__va(unsigned long x)
  */
 #define virt_to_pfn(kaddr)	(__pa(kaddr) >> PAGE_SHIFT)
 #define pfn_to_virt(pfn)	__va((pfn) << PAGE_SHIFT)
+#define	__pfn_to_phys(pfn)	PFN_PHYS(pfn)
 
 extern int m68k_virt_to_node_shift;
 
diff --git a/arch/m68k/include/asm/page_no.h b/arch/m68k/include/asm/page_no.h
index ef209169579a..7845eca0b36d 100644
--- a/arch/m68k/include/asm/page_no.h
+++ b/arch/m68k/include/asm/page_no.h
@@ -24,6 +24,7 @@ extern unsigned long memory_end;
 
 #define virt_to_pfn(kaddr)	(__pa(kaddr) >> PAGE_SHIFT)
 #define pfn_to_virt(pfn)	__va((pfn) << PAGE_SHIFT)
+#define	__pfn_to_phys(pfn)	PFN_PHYS(pfn)
 
 #define virt_to_page(addr)	(mem_map + (((unsigned long)(addr)-PAGE_OFFSET) >> PAGE_SHIFT))
 #define page_to_virt(page)	__va(((((page) - mem_map) << PAGE_SHIFT) + PAGE_OFFSET))
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index e5d42f1a2a71..cabf8ceb0a6b 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -714,6 +714,12 @@ static void __meminit free_pagetable(struct page *page, int order)
 {
 	unsigned long magic;
 	unsigned int nr_pages = 1 << order;
+	struct vmem_altmap *altmap = to_vmem_altmap((unsigned long) page);
+
+	if (altmap) {
+		vmem_altmap_free(altmap, nr_pages);
+		return;
+	}
 
 	/* bootmem page has reserved flag */
 	if (PageReserved(page)) {
@@ -1018,13 +1024,19 @@ int __ref arch_remove_memory(u64 start, u64 size)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
+	struct page *page = pfn_to_page(start_pfn);
+	struct vmem_altmap *altmap;
 	struct zone *zone;
 	int ret;
 
-	zone = page_zone(pfn_to_page(start_pfn));
-	kernel_physical_mapping_remove(start, start + size);
+	/* With altmap the first mapped page is offset from @start */
+	altmap = to_vmem_altmap((unsigned long) page);
+	if (altmap)
+		page += vmem_altmap_offset(altmap);
+	zone = page_zone(page);
 	ret = __remove_pages(zone, start_pfn, nr_pages);
 	WARN_ON_ONCE(ret);
+	kernel_physical_mapping_remove(start, start + size);
 
 	return ret;
 }
@@ -1234,7 +1246,7 @@ static void __meminitdata *p_start, *p_end;
 static int __meminitdata node_start;
 
 static int __meminit vmemmap_populate_hugepages(unsigned long start,
-						unsigned long end, int node)
+		unsigned long end, int node, struct vmem_altmap *altmap)
 {
 	unsigned long addr;
 	unsigned long next;
@@ -1257,7 +1269,7 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
 		if (pmd_none(*pmd)) {
 			void *p;
 
-			p = vmemmap_alloc_block_buf(PMD_SIZE, node);
+			p = vmemmap_alloc_block_buf(PMD_SIZE, node, altmap);
 			if (p) {
 				pte_t entry;
 
@@ -1278,7 +1290,8 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
 				addr_end = addr + PMD_SIZE;
 				p_end = p + PMD_SIZE;
 				continue;
-			}
+			} else if (altmap)
+				return -ENOMEM; /* no fallback */
 		} else if (pmd_large(*pmd)) {
 			vmemmap_verify((pte_t *)pmd, node, addr, next);
 			continue;
@@ -1292,11 +1305,16 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
 
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
 {
+	struct vmem_altmap *altmap = to_vmem_altmap(start);
 	int err;
 
 	if (cpu_has_pse)
-		err = vmemmap_populate_hugepages(start, end, node);
-	else
+		err = vmemmap_populate_hugepages(start, end, node, altmap);
+	else if (altmap) {
+		pr_err_once("%s: no cpu support for altmap allocations\n",
+				__func__);
+		err = -ENOMEM;
+	} else
 		err = vmemmap_populate_basepages(start, end, node);
 	if (!err)
 		sync_global_pgds(start, end - 1, 0);
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 349f03e7ed06..3c5b8f585441 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -151,7 +151,8 @@ static struct pmem_device *pmem_alloc(struct device *dev,
 	}
 
 	if (pmem_should_map_pages(dev))
-		pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, res);
+		pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, res,
+				NULL);
 	else
 		pmem->virt_addr = (void __pmem *) devm_memremap(dev,
 				pmem->phys_addr, pmem->size,
@@ -362,7 +363,8 @@ static int nvdimm_namespace_attach_pfn(struct nd_namespace_common *ndns)
 	/* establish pfn range for lookup, and switch to direct map */
 	pmem = dev_get_drvdata(dev);
 	devm_memunmap(dev, (void __force *) pmem->virt_addr);
-	pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, &nsio->res);
+	pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, &nsio->res,
+			NULL);
 	if (IS_ERR(pmem->virt_addr)) {
 		rc = PTR_ERR(pmem->virt_addr);
 		goto err;
diff --git a/include/linux/io.h b/include/linux/io.h
index de64c1e53612..2f2f8859abd9 100644
--- a/include/linux/io.h
+++ b/include/linux/io.h
@@ -87,23 +87,6 @@ void *devm_memremap(struct device *dev, resource_size_t offset,
 		size_t size, unsigned long flags);
 void devm_memunmap(struct device *dev, void *addr);
 
-void *__devm_memremap_pages(struct device *dev, struct resource *res);
-
-#ifdef CONFIG_ZONE_DEVICE
-void *devm_memremap_pages(struct device *dev, struct resource *res);
-#else
-static inline void *devm_memremap_pages(struct device *dev, struct resource *res)
-{
-	/*
-	 * Fail attempts to call devm_memremap_pages() without
-	 * ZONE_DEVICE support enabled, this requires callers to fall
-	 * back to plain devm_memremap() based on config
-	 */
-	WARN_ON_ONCE(1);
-	return ERR_PTR(-ENXIO);
-}
-#endif
-
 /*
  * Some systems do not have legacy ISA devices.
  * /dev/port is not a valid interface on these systems.
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 8f60e899b33c..178e000a7983 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -273,7 +273,8 @@ extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
 extern bool is_memblock_offlined(struct memory_block *mem);
 extern void remove_memory(int nid, u64 start, u64 size);
 extern int sparse_add_one_section(struct zone *zone, unsigned long start_pfn);
-extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms);
+extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms,
+		unsigned long map_offset);
 extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
 					  unsigned long pnum);
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 30c3c8764649..b5628cfbf649 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -718,18 +718,106 @@ static inline enum zone_type page_zonenum(const struct page *page)
 }
 
 /**
+ * struct vmem_altmap - pre-allocated storage for vmemmap_populate
+ * @base_pfn: base of the entire dev_pagemap mapping
+ * @reserve: pages mapped, but reserved for driver use (relative to @base)
+ * @free: free pages set aside in the mapping for memmap storage
+ * @align: pages reserved to meet allocation alignments
+ * @alloc: track pages consumed, private to vmemmap_populate()
+ */
+struct vmem_altmap {
+	const unsigned long base_pfn;
+	const unsigned long reserve;
+	unsigned long free;
+	unsigned long align;
+	unsigned long alloc;
+};
+
+static inline unsigned long vmem_altmap_nr_free(struct vmem_altmap *altmap)
+{
+	unsigned long allocated = altmap->alloc + altmap->align;
+
+	if (altmap->free > allocated)
+		return altmap->free - allocated;
+	return 0;
+}
+
+static inline unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
+{
+	/* number of pfns from base where pfn_to_page() is valid */
+	return altmap->reserve + altmap->free;
+}
+
+static inline unsigned long vmem_altmap_next_pfn(struct vmem_altmap *altmap)
+{
+	return altmap->base_pfn + altmap->reserve + altmap->alloc
+		+ altmap->align;
+}
+
+/**
+ * vmem_altmap_alloc - allocate pages from the vmem_altmap reservation
+ * @altmap - reserved page pool for the allocation
+ * @nr_pfns - size (in pages) of the allocation
+ *
+ * Allocations are aligned to the size of the request
+ */
+static inline unsigned long vmem_altmap_alloc(struct vmem_altmap *altmap,
+		unsigned long nr_pfns)
+{
+	unsigned long pfn = vmem_altmap_next_pfn(altmap);
+	unsigned long nr_align;
+
+	nr_align = 1UL << find_first_bit(&nr_pfns, BITS_PER_LONG);
+	nr_align = ALIGN(pfn, nr_align) - pfn;
+
+	if (nr_pfns + nr_align > vmem_altmap_nr_free(altmap))
+		return ULONG_MAX;
+	altmap->alloc += nr_pfns;
+	altmap->align += nr_align;
+	return pfn + nr_align;
+}
+
+static inline void vmem_altmap_free(struct vmem_altmap *altmap,
+		unsigned long nr_pfns)
+{
+	altmap->alloc -= nr_pfns;
+}
+
+/**
  * struct dev_pagemap - metadata for ZONE_DEVICE mappings
+ * @altmap: pre-allocated/reserved memory for vmemmap allocations
  * @dev: host device of the mapping for debug
  */
 struct dev_pagemap {
-	/* TODO: vmem_altmap and percpu_ref count */
+	struct vmem_altmap *altmap;
+	const struct resource *res;
 	struct device *dev;
 };
 
 #ifdef CONFIG_ZONE_DEVICE
 struct dev_pagemap *__get_dev_pagemap(resource_size_t phys);
+void *devm_memremap_pages(struct device *dev, struct resource *res,
+		struct vmem_altmap *altmap);
+struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start);
 #else
-static inline struct dev_pagemap *get_dev_pagemap(resource_size_t phys)
+static inline struct dev_pagemap *__get_dev_pagemap(resource_size_t phys)
+{
+	return NULL;
+}
+
+static inline void *devm_memremap_pages(struct device *dev, struct resource *res,
+		struct vmem_altmap *altmap)
+{
+	/*
+	 * Fail attempts to call devm_memremap_pages() without
+	 * ZONE_DEVICE support enabled, this requires callers to fall
+	 * back to plain devm_memremap() based on config
+	 */
+	WARN_ON_ONCE(1);
+	return ERR_PTR(-ENXIO);
+}
+
+static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
 {
 	return NULL;
 }
@@ -2245,7 +2333,8 @@ pud_t *vmemmap_pud_populate(pgd_t *pgd, unsigned long addr, int node);
 pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
 pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node);
 void *vmemmap_alloc_block(unsigned long size, int node);
-void *vmemmap_alloc_block_buf(unsigned long size, int node);
+void *vmemmap_alloc_block_buf(unsigned long size, int node,
+		struct vmem_altmap *altmap);
 void vmemmap_verify(pte_t *, int, unsigned long, unsigned long);
 int vmemmap_populate_basepages(unsigned long start, unsigned long end,
 			       int node);
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 64bfd9fa93aa..75161bb68af1 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -146,6 +146,7 @@ struct page_map {
 	struct resource res;
 	struct dev_pagemap pgmap;
 	struct list_head list;
+	struct vmem_altmap altmap;
 };
 
 static void add_page_map(struct page_map *page_map)
@@ -162,14 +163,17 @@ static void del_page_map(struct page_map *page_map)
 	spin_unlock(&range_lock);
 }
 
-static void devm_memremap_pages_release(struct device *dev, void *res)
+static void devm_memremap_pages_release(struct device *dev, void *data)
 {
-	struct page_map *page_map = res;
-
-	del_page_map(page_map);
+	struct page_map *page_map = data;
+	struct resource *res = &page_map->res;
+	struct dev_pagemap *pgmap = &page_map->pgmap;
 
 	/* pages are dead and unused, undo the arch mapping */
-	arch_remove_memory(page_map->res.start, resource_size(&page_map->res));
+	arch_remove_memory(res->start, resource_size(res));
+	dev_WARN_ONCE(dev, pgmap->altmap && pgmap->altmap->alloc,
+			"%s: failed to free all reserved pages\n", __func__);
+	del_page_map(page_map);
 }
 
 /* assumes rcu_read_lock() held at entry */
@@ -185,10 +189,22 @@ struct dev_pagemap *__get_dev_pagemap(resource_size_t phys)
 	return NULL;
 }
 
-void *devm_memremap_pages(struct device *dev, struct resource *res)
+/**
+ * devm_memremap_pages - remap and provide memmap backing for the given resource
+ * @dev: hosting device for @res
+ * @res: "host memory" address range
+ * @altmap: optional descriptor for allocating the memmap from @res
+ *
+ * Note: the expectation is that @res is a host memory range that could
+ * feasibly be treated as a "System RAM" range, i.e. not a device MMIO
+ * range, but this is not enforced.
+ */
+void *devm_memremap_pages(struct device *dev, struct resource *res,
+		struct vmem_altmap *altmap)
 {
 	int is_ram = region_intersects(res->start, resource_size(res),
 			"System RAM");
+	struct dev_pagemap *pgmap;
 	struct page_map *page_map;
 	int error, nid;
 
@@ -205,10 +221,15 @@ void *devm_memremap_pages(struct device *dev, struct resource *res)
 			sizeof(*page_map), GFP_KERNEL, dev_to_node(dev));
 	if (!page_map)
 		return ERR_PTR(-ENOMEM);
+	pgmap = &page_map->pgmap;
 
 	memcpy(&page_map->res, res, sizeof(*res));
-
-	page_map->pgmap.dev = dev;
+	if (altmap) {
+		memcpy(&page_map->altmap, altmap, sizeof(*altmap));
+		pgmap->altmap = &page_map->altmap;
+	}
+	pgmap->dev = dev;
+	pgmap->res = &page_map->res;
 	INIT_LIST_HEAD(&page_map->list);
 	add_page_map(page_map);
 
@@ -227,4 +248,36 @@ void *devm_memremap_pages(struct device *dev, struct resource *res)
 	return __va(res->start);
 }
 EXPORT_SYMBOL(devm_memremap_pages);
+
+/*
+ * Unconditionally retrieve a dev_pagemap associated with the given physical
+ * address; this is only for use by arch_{add|remove}_memory() when setting
+ * up and tearing down the memmap.
+ */
+static struct dev_pagemap *lookup_dev_pagemap(resource_size_t phys)
+{
+	struct dev_pagemap *pgmap;
+
+	rcu_read_lock();
+	pgmap = __get_dev_pagemap(phys);
+	rcu_read_unlock();
+	return pgmap;
+}
+
+struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
+{
+	/*
+	 * 'memmap_start' is the virtual address for the first "struct
+	 * page" in this range of the vmemmap array.  In the case of
+	 * CONFIG_SPARSE_VMEMMAP a page_to_pfn conversion is simple
+	 * pointer arithmetic, so we can perform this to_vmem_altmap()
+	 * conversion without concern for the initialization state of
+	 * the struct page fields.
+	 */
+	struct page *page = (struct page *) memmap_start;
+	struct dev_pagemap *pgmap;
+
+	pgmap = lookup_dev_pagemap(__pfn_to_phys(page_to_pfn(page)));
+	return pgmap ? pgmap->altmap : NULL;
+}
 #endif /* CONFIG_ZONE_DEVICE */
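
The pointer arithmetic that to_vmem_altmap() relies on is the
CONFIG_SPARSEMEM_VMEMMAP flavor of the generic memory model, which is
(roughly, per asm-generic/memory_model.h) pure offset math against the
vmemmap base:

	/* no struct page fields are read, so this is safe pre-init */
	#define __pfn_to_page(pfn)	(vmemmap + (pfn))
	#define __page_to_pfn(page)	(unsigned long)((page) - vmemmap)
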
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index aa992e2df58a..3521df153de3 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -505,10 +505,25 @@ int __ref __add_pages(int nid, struct zone *zone, unsigned long phys_start_pfn,
 	unsigned long i;
 	int err = 0;
 	int start_sec, end_sec;
+	struct vmem_altmap *altmap;
+
 	/* when initializing mem_map, align the hot-added range to a section */
 	start_sec = pfn_to_section_nr(phys_start_pfn);
 	end_sec = pfn_to_section_nr(phys_start_pfn + nr_pages - 1);
 
+	altmap = to_vmem_altmap((unsigned long) pfn_to_page(phys_start_pfn));
+	if (altmap) {
+		/*
+		 * Validate altmap is within bounds of the total request
+		 */
+		if (altmap->base_pfn != phys_start_pfn
+				|| vmem_altmap_offset(altmap) > nr_pages) {
+			pr_warn_once("memory add fail, invalid altmap\n");
+			return -EINVAL;
+		}
+		altmap->alloc = 0;
+	}
+
 	for (i = start_sec; i <= end_sec; i++) {
 		err = __add_section(nid, zone, section_nr_to_pfn(i));
 
@@ -730,7 +745,8 @@ static void __remove_zone(struct zone *zone, unsigned long start_pfn)
 	pgdat_resize_unlock(zone->zone_pgdat, &flags);
 }
 
-static int __remove_section(struct zone *zone, struct mem_section *ms)
+static int __remove_section(struct zone *zone, struct mem_section *ms,
+		unsigned long map_offset)
 {
 	unsigned long start_pfn;
 	int scn_nr;
@@ -747,7 +763,7 @@ static int __remove_section(struct zone *zone, struct mem_section *ms)
 	start_pfn = section_nr_to_pfn(scn_nr);
 	__remove_zone(zone, start_pfn);
 
-	sparse_remove_one_section(zone, ms);
+	sparse_remove_one_section(zone, ms, map_offset);
 	return 0;
 }
 
@@ -766,9 +782,32 @@ int __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
 		 unsigned long nr_pages)
 {
 	unsigned long i;
-	int sections_to_remove;
-	resource_size_t start, size;
-	int ret = 0;
+	unsigned long map_offset = 0;
+	int sections_to_remove, ret = 0;
+
+	/* In the ZONE_DEVICE case the device driver owns the memory region */
+	if (is_dev_zone(zone)) {
+		struct page *page = pfn_to_page(phys_start_pfn);
+		struct vmem_altmap *altmap;
+
+		altmap = to_vmem_altmap((unsigned long) page);
+		if (altmap)
+			map_offset = vmem_altmap_offset(altmap);
+	} else {
+		resource_size_t start, size;
+
+		start = phys_start_pfn << PAGE_SHIFT;
+		size = nr_pages * PAGE_SIZE;
+
+		ret = release_mem_region_adjustable(&iomem_resource, start,
+					size);
+		if (ret) {
+			resource_size_t endres = start + size - 1;
+
+			pr_warn("Unable to release resource <%pa-%pa> (%d)\n",
+					&start, &endres, ret);
+		}
+	}
 
 	/*
 	 * We can only remove entire sections
@@ -776,23 +815,12 @@ int __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
 	BUG_ON(phys_start_pfn & ~PAGE_SECTION_MASK);
 	BUG_ON(nr_pages % PAGES_PER_SECTION);
 
-	start = phys_start_pfn << PAGE_SHIFT;
-	size = nr_pages * PAGE_SIZE;
-
-	/* in the ZONE_DEVICE case device driver owns the memory region */
-	if (!is_dev_zone(zone))
-		ret = release_mem_region_adjustable(&iomem_resource, start, size);
-	if (ret) {
-		resource_size_t endres = start + size - 1;
-
-		pr_warn("Unable to release resource <%pa-%pa> (%d)\n",
-				&start, &endres, ret);
-	}
-
 	sections_to_remove = nr_pages / PAGES_PER_SECTION;
 	for (i = 0; i < sections_to_remove; i++) {
 		unsigned long pfn = phys_start_pfn + i*PAGES_PER_SECTION;
-		ret = __remove_section(zone, __pfn_to_section(pfn));
+
+		ret = __remove_section(zone, __pfn_to_section(pfn), map_offset);
+		map_offset = 0;
 		if (ret)
 			break;
 	}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48aaf7b9f253..9dfc431d6271 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4620,8 +4620,9 @@ static void setup_zone_migrate_reserve(struct zone *zone)
 void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 		unsigned long start_pfn, enum memmap_context context)
 {
-	pg_data_t *pgdat = NODE_DATA(nid);
+	struct vmem_altmap *altmap = to_vmem_altmap(__pfn_to_phys(start_pfn));
 	unsigned long end_pfn = start_pfn + size;
+	pg_data_t *pgdat = NODE_DATA(nid);
 	unsigned long pfn;
 	struct zone *z;
 	unsigned long nr_initialised = 0;
@@ -4629,6 +4630,13 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 	if (highest_memmap_pfn < end_pfn - 1)
 		highest_memmap_pfn = end_pfn - 1;
 
+	/*
+	 * Honor reservation requested by the driver for this ZONE_DEVICE
+	 * memory
+	 */
+	if (altmap && start_pfn == altmap->base_pfn)
+		start_pfn += altmap->reserve;
+
 	z = &pgdat->node_zones[zone];
 	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
 		/*
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 4cba9c2783a1..96c1dca4ce6a 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -70,7 +70,7 @@ void * __meminit vmemmap_alloc_block(unsigned long size, int node)
 }
 
 /* need to make sure size is all the same during early stage */
-void * __meminit vmemmap_alloc_block_buf(unsigned long size, int node)
+static void * __meminit __vmemmap_alloc_block_buf(unsigned long size, int node)
 {
 	void *ptr;
 
@@ -87,6 +87,39 @@ void * __meminit vmemmap_alloc_block_buf(unsigned long size, int node)
 	return ptr;
 }
 
+static void * __meminit altmap_alloc_block_buf(unsigned long size,
+		struct vmem_altmap *altmap)
+{
+	unsigned long pfn, nr_pfns;
+	void *ptr;
+
+	if (size & ~PAGE_MASK) {
+		pr_warn_once("%s: allocations must be a multiple of PAGE_SIZE (%ld)\n",
+				__func__, size);
+		return NULL;
+	}
+
+	nr_pfns = size >> PAGE_SHIFT;
+	pfn = vmem_altmap_alloc(altmap, nr_pfns);
+	if (pfn < ULONG_MAX)
+		ptr = __va(__pfn_to_phys(pfn));
+	else
+		ptr = NULL;
+	pr_debug("%s: pfn: %#lx alloc: %ld align: %ld nr: %#lx\n",
+			__func__, pfn, altmap->alloc, altmap->align, nr_pfns);
+
+	return ptr;
+}
+
+/* need to make sure size is all the same during early stage */
+void * __meminit vmemmap_alloc_block_buf(unsigned long size, int node,
+		struct vmem_altmap *altmap)
+{
+	if (altmap)
+		return altmap_alloc_block_buf(size, altmap);
+	return __vmemmap_alloc_block_buf(size, node);
+}
+
 void __meminit vmemmap_verify(pte_t *pte, int node,
 				unsigned long start, unsigned long end)
 {
@@ -103,7 +136,7 @@ pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node)
 	pte_t *pte = pte_offset_kernel(pmd, addr);
 	if (pte_none(*pte)) {
 		pte_t entry;
-		void *p = vmemmap_alloc_block_buf(PAGE_SIZE, node);
+		void *p = __vmemmap_alloc_block_buf(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
 		entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
diff --git a/mm/sparse.c b/mm/sparse.c
index d1b48b691ac8..3717ceed4177 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -748,7 +748,7 @@ static void clear_hwpoisoned_pages(struct page *memmap, int nr_pages)
 	if (!memmap)
 		return;
 
-	for (i = 0; i < PAGES_PER_SECTION; i++) {
+	for (i = 0; i < nr_pages; i++) {
 		if (PageHWPoison(&memmap[i])) {
 			atomic_long_sub(1, &num_poisoned_pages);
 			ClearPageHWPoison(&memmap[i]);
@@ -788,7 +788,8 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap)
 		free_map_bootmem(memmap);
 }
 
-void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
+void sparse_remove_one_section(struct zone *zone, struct mem_section *ms,
+		unsigned long map_offset)
 {
 	struct page *memmap = NULL;
 	unsigned long *usemap = NULL, flags;
@@ -804,7 +805,8 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
 	}
 	pgdat_resize_unlock(pgdat, &flags);
 
-	clear_hwpoisoned_pages(memmap, PAGES_PER_SECTION);
+	clear_hwpoisoned_pages(memmap + map_offset,
+			PAGES_PER_SECTION - map_offset);
 	free_section_usemap(memmap, usemap);
 }
 #endif /* CONFIG_MEMORY_HOTREMOVE */


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 06/20] libnvdimm, pfn, pmem: allocate memmap array in persistent memory
  2015-10-10  0:55 [PATCH v2 00/20] get_user_pages() for dax mappings Dan Williams
                   ` (4 preceding siblings ...)
  2015-10-10  0:55 ` [PATCH v2 05/20] x86, mm: introduce vmem_altmap to augment vmemmap_populate() Dan Williams
@ 2015-10-10  0:55 ` Dan Williams
  2015-10-10  0:56 ` [PATCH v2 07/20] avr32: convert to asm-generic/memory_model.h Dan Williams
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 37+ messages in thread
From: Dan Williams @ 2015-10-10  0:55 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dave Chinner, linux-kernel, linux-mm, ross.zwisler, Andrew Morton, hch

Use the new vmem_altmap capability to enable the pmem driver to arrange
for a struct page memmap to be established in persistent memory.
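
As a sketch of the resulting layout (hypothetical numbers -- e.g. 4K
pages and a 2M data offset -- not values taken from this patch): the
first 8K of the namespace covers the info block plus padding and is
only reserved, while the remaining space up to the data offset becomes
allocatable memmap capacity:

	struct vmem_altmap __altmap = {
		.base_pfn = __phys_to_pfn(nsio->res.start),
		.reserve  = __phys_to_pfn(SZ_8K),	/* 2 pfns: info block */
	};
	/* e.g. offset == SZ_2M: free = (2M - 8K) / 4K = 510 pfns of
	 * memmap storage carved out of the namespace itself */
	altmap->free = __phys_to_pfn(offset - SZ_8K);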

Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/nvdimm/pfn_devs.c |    3 +--
 drivers/nvdimm/pmem.c     |   19 +++++++++++++++++--
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 71805a1aa0f3..a642cfacee07 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -83,8 +83,7 @@ static ssize_t mode_store(struct device *dev,
 
 		if (strncmp(buf, "pmem\n", n) == 0
 				|| strncmp(buf, "pmem", n) == 0) {
-			/* TODO: allocate from PMEM support */
-			rc = -ENOTTY;
+			nd_pfn->mode = PFN_MODE_PMEM;
 		} else if (strncmp(buf, "ram\n", n) == 0
 				|| strncmp(buf, "ram", n) == 0)
 			nd_pfn->mode = PFN_MODE_RAM;
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 3c5b8f585441..bb66158c0505 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -322,12 +322,16 @@ static int nvdimm_namespace_attach_pfn(struct nd_namespace_common *ndns)
 	struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
 	struct nd_pfn *nd_pfn = to_nd_pfn(ndns->claim);
 	struct device *dev = &nd_pfn->dev;
-	struct vmem_altmap *altmap;
 	struct nd_region *nd_region;
+	struct vmem_altmap *altmap;
 	struct nd_pfn_sb *pfn_sb;
 	struct pmem_device *pmem;
 	phys_addr_t offset;
 	int rc;
+	struct vmem_altmap __altmap = {
+		.base_pfn = __phys_to_pfn(nsio->res.start),
+		.reserve = __phys_to_pfn(SZ_8K),
+	};
 
 	if (!nd_pfn->uuid || !nd_pfn->ndns)
 		return -ENODEV;
@@ -355,6 +359,17 @@ static int nvdimm_namespace_attach_pfn(struct nd_namespace_common *ndns)
 			return -EINVAL;
 		nd_pfn->npfns = le64_to_cpu(pfn_sb->npfns);
 		altmap = NULL;
+	} else if (nd_pfn->mode == PFN_MODE_PMEM) {
+		nd_pfn->npfns = (resource_size(&nsio->res) - offset)
+			/ PAGE_SIZE;
+		if (le64_to_cpu(nd_pfn->pfn_sb->npfns) > nd_pfn->npfns)
+			dev_info(&nd_pfn->dev,
+					"number of pfns truncated from %lld to %ld\n",
+					le64_to_cpu(nd_pfn->pfn_sb->npfns),
+					nd_pfn->npfns);
+		altmap = &__altmap;
+		altmap->free = __phys_to_pfn(offset - SZ_8K);
+		altmap->alloc = 0;
 	} else {
 		rc = -ENXIO;
 		goto err;
@@ -364,7 +379,7 @@ static int nvdimm_namespace_attach_pfn(struct nd_namespace_common *ndns)
 	pmem = dev_get_drvdata(dev);
 	devm_memunmap(dev, (void __force *) pmem->virt_addr);
 	pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, &nsio->res,
-			NULL);
+			altmap);
 	if (IS_ERR(pmem->virt_addr)) {
 		rc = PTR_ERR(pmem->virt_addr);
 		goto err;


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 07/20] avr32: convert to asm-generic/memory_model.h
  2015-10-10  0:55 [PATCH v2 00/20] get_user_pages() for dax mappings Dan Williams
                   ` (5 preceding siblings ...)
  2015-10-10  0:55 ` [PATCH v2 06/20] libnvdimm, pfn, pmem: allocate memmap array in persistent memory Dan Williams
@ 2015-10-10  0:56 ` Dan Williams
  2015-10-10  0:56 ` [PATCH v2 08/20] hugetlb: fix compile error on tile Dan Williams
                   ` (13 subsequent siblings)
  20 siblings, 0 replies; 37+ messages in thread
From: Dan Williams @ 2015-10-10  0:56 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: linux-mm, ross.zwisler, linux-kernel, hch

Switch avr32/include/asm/page.h to use the common definitions for
pfn_to_page(), page_to_pfn(), and ARCH_PFN_OFFSET.
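
For FLATMEM (!CONFIG_NEED_MULTIPLE_NODES) builds,
asm-generic/memory_model.h supplies the same math the arch-local macros
open-coded, roughly:

	#define __pfn_to_page(pfn)	(mem_map + ((pfn) - ARCH_PFN_OFFSET))
	#define __page_to_pfn(page)	((unsigned long)((page) - mem_map) + \
					 ARCH_PFN_OFFSET)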

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/avr32/include/asm/page.h |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/avr32/include/asm/page.h b/arch/avr32/include/asm/page.h
index f805d1cb11bc..c5d2a3e2c62f 100644
--- a/arch/avr32/include/asm/page.h
+++ b/arch/avr32/include/asm/page.h
@@ -83,11 +83,9 @@ static inline int get_order(unsigned long size)
 
 #ifndef CONFIG_NEED_MULTIPLE_NODES
 
-#define PHYS_PFN_OFFSET		(CONFIG_PHYS_OFFSET >> PAGE_SHIFT)
+#define ARCH_PFN_OFFSET		(CONFIG_PHYS_OFFSET >> PAGE_SHIFT)
 
-#define pfn_to_page(pfn)	(mem_map + ((pfn) - PHYS_PFN_OFFSET))
-#define page_to_pfn(page)	((unsigned long)((page) - mem_map) + PHYS_PFN_OFFSET)
-#define pfn_valid(pfn)		((pfn) >= PHYS_PFN_OFFSET && (pfn) < (PHYS_PFN_OFFSET + max_mapnr))
+#define pfn_valid(pfn)		((pfn) >= ARCH_PFN_OFFSET && (pfn) < (ARCH_PFN_OFFSET + max_mapnr))
 #endif /* CONFIG_NEED_MULTIPLE_NODES */
 
 #define virt_to_page(kaddr)	pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
@@ -101,4 +99,6 @@ static inline int get_order(unsigned long size)
  */
 #define HIGHMEM_START		0x20000000UL
 
+#include <asm-generic/memory_model.h>
+
 #endif /* __ASM_AVR32_PAGE_H */


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 08/20] hugetlb: fix compile error on tile
  2015-10-10  0:55 [PATCH v2 00/20] get_user_pages() for dax mappings Dan Williams
                   ` (6 preceding siblings ...)
  2015-10-10  0:56 ` [PATCH v2 07/20] avr32: convert to asm-generic/memory_model.h Dan Williams
@ 2015-10-10  0:56 ` Dan Williams
  2015-10-10  0:56 ` [PATCH v2 09/20] frv: fix compiler warning from definition of __pmd() Dan Williams
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 37+ messages in thread
From: Dan Williams @ 2015-10-10  0:56 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: linux-mm, ross.zwisler, linux-kernel, hch

Include asm/pgtable.h to get the definition of pud_t and fix:

include/linux/hugetlb.h:203:29: error: unknown type name 'pud_t'

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/hugetlb.h |    1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 5e35379f58a5..ad5539cf52bf 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -8,6 +8,7 @@
 #include <linux/cgroup.h>
 #include <linux/list.h>
 #include <linux/kref.h>
+#include <asm/pgtable.h>
 
 struct ctl_table;
 struct user_struct;


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 09/20] frv: fix compiler warning from definition of __pmd()
  2015-10-10  0:55 [PATCH v2 00/20] get_user_pages() for dax mappings Dan Williams
                   ` (7 preceding siblings ...)
  2015-10-10  0:56 ` [PATCH v2 08/20] hugetlb: fix compile error on tile Dan Williams
@ 2015-10-10  0:56 ` Dan Williams
  2015-10-10  0:56 ` [PATCH v2 10/20] um: kill pfn_t Dan Williams
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 37+ messages in thread
From: Dan Williams @ 2015-10-10  0:56 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: linux-mm, ross.zwisler, linux-kernel, hch

Take into account that the pmd_t type is an array inside a struct, so it
needs two levels of braces to initialize.  Otherwise, a usage of __pmd()
generates a warning:

include/linux/mm.h:986:2: warning: missing braces around initializer [-Wmissing-braces]
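
A standalone illustration of the initializer rule, using a hypothetical
type that mirrors the shape of frv's pmd_t:

	typedef struct { unsigned long ste[64]; } pmd_t; /* array in a struct */

	pmd_t a = { { 0 } };	/* outer braces: struct, inner braces: array */
	pmd_t b = { 0 };	/* valid C, but trips -Wmissing-braces */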

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/frv/include/asm/page.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/frv/include/asm/page.h b/arch/frv/include/asm/page.h
index 8c97068ac8fc..688d8076a43a 100644
--- a/arch/frv/include/asm/page.h
+++ b/arch/frv/include/asm/page.h
@@ -34,7 +34,7 @@ typedef struct page *pgtable_t;
 #define pgprot_val(x)	((x).pgprot)
 
 #define __pte(x)	((pte_t) { (x) } )
-#define __pmd(x)	((pmd_t) { (x) } )
+#define __pmd(x)	((pmd_t) { { (x) } } )
 #define __pud(x)	((pud_t) { (x) } )
 #define __pgd(x)	((pgd_t) { (x) } )
 #define __pgprot(x)	((pgprot_t) { (x) } )


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 10/20] um: kill pfn_t
  2015-10-10  0:55 [PATCH v2 00/20] get_user_pages() for dax mappings Dan Williams
                   ` (8 preceding siblings ...)
  2015-10-10  0:56 ` [PATCH v2 09/20] frv: fix compiler warning from definition of __pmd() Dan Williams
@ 2015-10-10  0:56 ` Dan Williams
  2015-10-10  0:56 ` [PATCH v2 11/20] kvm: rename pfn_t to kvm_pfn_t Dan Williams
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 37+ messages in thread
From: Dan Williams @ 2015-10-10  0:56 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dave Hansen, Richard Weinberger, Jeff Dike, linux-kernel,
	linux-mm, ross.zwisler, hch

The core has developed a need for a "pfn_t" type [1].  Convert the usage
of pfn_t by usermode-linux to an unsigned long, and update pfn_to_phys()
to drop its expectation of a typed pfn.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html

Cc: Dave Hansen <dave@sr71.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/um/include/asm/page.h           |    6 +++---
 arch/um/include/asm/pgtable-3level.h |    4 ++--
 arch/um/include/asm/pgtable.h        |    2 +-
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/um/include/asm/page.h b/arch/um/include/asm/page.h
index 71c5d132062a..fe26a5e06268 100644
--- a/arch/um/include/asm/page.h
+++ b/arch/um/include/asm/page.h
@@ -18,6 +18,7 @@
 
 struct page;
 
+#include <linux/pfn.h>
 #include <linux/types.h>
 #include <asm/vm-flags.h>
 
@@ -76,7 +77,6 @@ typedef struct { unsigned long pmd; } pmd_t;
 #define pte_is_zero(p) (!((p).pte & ~_PAGE_NEWPAGE))
 #define pte_set_val(p, phys, prot) (p).pte = (phys | pgprot_val(prot))
 
-typedef unsigned long pfn_t;
 typedef unsigned long phys_t;
 
 #endif
@@ -109,8 +109,8 @@ extern unsigned long uml_physmem;
 #define __pa(virt) to_phys((void *) (unsigned long) (virt))
 #define __va(phys) to_virt((unsigned long) (phys))
 
-#define phys_to_pfn(p) ((pfn_t) ((p) >> PAGE_SHIFT))
-#define pfn_to_phys(pfn) ((phys_t) ((pfn) << PAGE_SHIFT))
+#define phys_to_pfn(p) ((p) >> PAGE_SHIFT)
+#define pfn_to_phys(pfn) PFN_PHYS(pfn)
 
 #define pfn_valid(pfn) ((pfn) < max_mapnr)
 #define virt_addr_valid(v) pfn_valid(phys_to_pfn(__pa(v)))
diff --git a/arch/um/include/asm/pgtable-3level.h b/arch/um/include/asm/pgtable-3level.h
index 2b4274e7c095..bae8523a162f 100644
--- a/arch/um/include/asm/pgtable-3level.h
+++ b/arch/um/include/asm/pgtable-3level.h
@@ -98,7 +98,7 @@ static inline unsigned long pte_pfn(pte_t pte)
 	return phys_to_pfn(pte_val(pte));
 }
 
-static inline pte_t pfn_pte(pfn_t page_nr, pgprot_t pgprot)
+static inline pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot)
 {
 	pte_t pte;
 	phys_t phys = pfn_to_phys(page_nr);
@@ -107,7 +107,7 @@ static inline pte_t pfn_pte(pfn_t page_nr, pgprot_t pgprot)
 	return pte;
 }
 
-static inline pmd_t pfn_pmd(pfn_t page_nr, pgprot_t pgprot)
+static inline pmd_t pfn_pmd(unsigned long page_nr, pgprot_t pgprot)
 {
 	return __pmd((page_nr << PAGE_SHIFT) | pgprot_val(pgprot));
 }
diff --git a/arch/um/include/asm/pgtable.h b/arch/um/include/asm/pgtable.h
index 18eb9924dda3..7485398d0737 100644
--- a/arch/um/include/asm/pgtable.h
+++ b/arch/um/include/asm/pgtable.h
@@ -271,7 +271,7 @@ static inline int pte_same(pte_t pte_a, pte_t pte_b)
 
 #define phys_to_page(phys) pfn_to_page(phys_to_pfn(phys))
 #define __virt_to_page(virt) phys_to_page(__pa(virt))
-#define page_to_phys(page) pfn_to_phys((pfn_t) page_to_pfn(page))
+#define page_to_phys(page) pfn_to_phys(page_to_pfn(page))
 #define virt_to_page(addr) __virt_to_page((const unsigned long) addr)
 
 #define mk_pte(page, pgprot) \


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 11/20] kvm: rename pfn_t to kvm_pfn_t
  2015-10-10  0:55 [PATCH v2 00/20] get_user_pages() for dax mappings Dan Williams
                   ` (9 preceding siblings ...)
  2015-10-10  0:56 ` [PATCH v2 10/20] um: kill pfn_t Dan Williams
@ 2015-10-10  0:56 ` Dan Williams
  2015-10-10 15:35   ` Christoffer Dall
  2015-10-10 20:35   ` Paolo Bonzini
  2015-10-10  0:56 ` [PATCH v2 12/20] mips: fix PAGE_MASK definition Dan Williams
                   ` (9 subsequent siblings)
  20 siblings, 2 replies; 37+ messages in thread
From: Dan Williams @ 2015-10-10  0:56 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dave Hansen, Russell King, linux-mm, Gleb Natapov,
	Catalin Marinas, Will Deacon, linux-kernel, Ralf Baechle,
	Marc Zyngier, Paul Mackerras, Christoffer Dall,
	Benjamin Herrenschmidt, Paolo Bonzini, ross.zwisler, hch,
	Alexander Graf

The core has developed a need for a "pfn_t" type [1].  Move the existing
pfn_t in KVM to kvm_pfn_t [2].

[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html

Cc: Dave Hansen <dave@sr71.net>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Alexander Graf <agraf@suse.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/arm/include/asm/kvm_mmu.h        |    5 ++--
 arch/arm/kvm/mmu.c                    |   10 ++++---
 arch/arm64/include/asm/kvm_mmu.h      |    3 +-
 arch/mips/include/asm/kvm_host.h      |    6 ++--
 arch/mips/kvm/emulate.c               |    2 +
 arch/mips/kvm/tlb.c                   |   14 +++++-----
 arch/powerpc/include/asm/kvm_book3s.h |    4 +--
 arch/powerpc/include/asm/kvm_ppc.h    |    2 +
 arch/powerpc/kvm/book3s.c             |    6 ++--
 arch/powerpc/kvm/book3s_32_mmu_host.c |    2 +
 arch/powerpc/kvm/book3s_64_mmu_host.c |    2 +
 arch/powerpc/kvm/e500.h               |    2 +
 arch/powerpc/kvm/e500_mmu_host.c      |    8 +++---
 arch/powerpc/kvm/trace_pr.h           |    2 +
 arch/x86/kvm/iommu.c                  |   11 ++++----
 arch/x86/kvm/mmu.c                    |   37 +++++++++++++-------------
 arch/x86/kvm/mmu_audit.c              |    2 +
 arch/x86/kvm/paging_tmpl.h            |    6 ++--
 arch/x86/kvm/vmx.c                    |    2 +
 arch/x86/kvm/x86.c                    |    2 +
 include/linux/kvm_host.h              |   37 +++++++++++++-------------
 include/linux/kvm_types.h             |    2 +
 virt/kvm/kvm_main.c                   |   47 +++++++++++++++++----------------
 23 files changed, 110 insertions(+), 104 deletions(-)

diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index 405aa1883307..8ebd282dfc2b 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -182,7 +182,8 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu *vcpu)
 	return (vcpu->arch.cp15[c1_SCTLR] & 0b101) == 0b101;
 }
 
-static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
+static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu,
+					       kvm_pfn_t pfn,
 					       unsigned long size,
 					       bool ipa_uncached)
 {
@@ -246,7 +247,7 @@ static inline void __kvm_flush_dcache_pte(pte_t pte)
 static inline void __kvm_flush_dcache_pmd(pmd_t pmd)
 {
 	unsigned long size = PMD_SIZE;
-	pfn_t pfn = pmd_pfn(pmd);
+	kvm_pfn_t pfn = pmd_pfn(pmd);
 
 	while (size) {
 		void *va = kmap_atomic_pfn(pfn);
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 6984342da13d..e2dcbfdc4a8c 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -988,9 +988,9 @@ out:
 	return ret;
 }
 
-static bool transparent_hugepage_adjust(pfn_t *pfnp, phys_addr_t *ipap)
+static bool transparent_hugepage_adjust(kvm_pfn_t *pfnp, phys_addr_t *ipap)
 {
-	pfn_t pfn = *pfnp;
+	kvm_pfn_t pfn = *pfnp;
 	gfn_t gfn = *ipap >> PAGE_SHIFT;
 
 	if (PageTransCompound(pfn_to_page(pfn))) {
@@ -1202,7 +1202,7 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 	kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
 }
 
-static void coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
+static void coherent_cache_guest_page(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
 				      unsigned long size, bool uncached)
 {
 	__coherent_cache_guest_page(vcpu, pfn, size, uncached);
@@ -1219,7 +1219,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	struct kvm *kvm = vcpu->kvm;
 	struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
 	struct vm_area_struct *vma;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 	pgprot_t mem_type = PAGE_S2;
 	bool fault_ipa_uncached;
 	bool logging_active = memslot_is_logging(memslot);
@@ -1347,7 +1347,7 @@ static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
 {
 	pmd_t *pmd;
 	pte_t *pte;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 	bool pfn_valid = false;
 
 	trace_kvm_access_fault(fault_ipa);
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 61505676d085..385fc8cef82d 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -230,7 +230,8 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu *vcpu)
 	return (vcpu_sys_reg(vcpu, SCTLR_EL1) & 0b101) == 0b101;
 }
 
-static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
+static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu,
+					       kvm_pfn_t pfn,
 					       unsigned long size,
 					       bool ipa_uncached)
 {
diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index 5a1a882e0a75..9c67f05a0a1b 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -101,9 +101,9 @@
 #define CAUSEF_DC			(_ULCAST_(1) << 27)
 
 extern atomic_t kvm_mips_instance;
-extern pfn_t(*kvm_mips_gfn_to_pfn) (struct kvm *kvm, gfn_t gfn);
-extern void (*kvm_mips_release_pfn_clean) (pfn_t pfn);
-extern bool(*kvm_mips_is_error_pfn) (pfn_t pfn);
+extern kvm_pfn_t (*kvm_mips_gfn_to_pfn)(struct kvm *kvm, gfn_t gfn);
+extern void (*kvm_mips_release_pfn_clean)(kvm_pfn_t pfn);
+extern bool (*kvm_mips_is_error_pfn)(kvm_pfn_t pfn);
 
 struct kvm_vm_stat {
 	u32 remote_tlb_flush;
diff --git a/arch/mips/kvm/emulate.c b/arch/mips/kvm/emulate.c
index d5fa3eaf39a1..476296cf37d3 100644
--- a/arch/mips/kvm/emulate.c
+++ b/arch/mips/kvm/emulate.c
@@ -1525,7 +1525,7 @@ int kvm_mips_sync_icache(unsigned long va, struct kvm_vcpu *vcpu)
 	struct kvm *kvm = vcpu->kvm;
 	unsigned long pa;
 	gfn_t gfn;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 
 	gfn = va >> PAGE_SHIFT;
 
diff --git a/arch/mips/kvm/tlb.c b/arch/mips/kvm/tlb.c
index aed0ac2a4972..570479c03bdc 100644
--- a/arch/mips/kvm/tlb.c
+++ b/arch/mips/kvm/tlb.c
@@ -38,13 +38,13 @@ atomic_t kvm_mips_instance;
 EXPORT_SYMBOL(kvm_mips_instance);
 
 /* These function pointers are initialized once the KVM module is loaded */
-pfn_t (*kvm_mips_gfn_to_pfn)(struct kvm *kvm, gfn_t gfn);
+kvm_pfn_t (*kvm_mips_gfn_to_pfn)(struct kvm *kvm, gfn_t gfn);
 EXPORT_SYMBOL(kvm_mips_gfn_to_pfn);
 
-void (*kvm_mips_release_pfn_clean)(pfn_t pfn);
+void (*kvm_mips_release_pfn_clean)(kvm_pfn_t pfn);
 EXPORT_SYMBOL(kvm_mips_release_pfn_clean);
 
-bool (*kvm_mips_is_error_pfn)(pfn_t pfn);
+bool (*kvm_mips_is_error_pfn)(kvm_pfn_t pfn);
 EXPORT_SYMBOL(kvm_mips_is_error_pfn);
 
 uint32_t kvm_mips_get_kernel_asid(struct kvm_vcpu *vcpu)
@@ -144,7 +144,7 @@ EXPORT_SYMBOL(kvm_mips_dump_guest_tlbs);
 static int kvm_mips_map_page(struct kvm *kvm, gfn_t gfn)
 {
 	int srcu_idx, err = 0;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 
 	if (kvm->arch.guest_pmap[gfn] != KVM_INVALID_PAGE)
 		return 0;
@@ -262,7 +262,7 @@ int kvm_mips_handle_kseg0_tlb_fault(unsigned long badvaddr,
 				    struct kvm_vcpu *vcpu)
 {
 	gfn_t gfn;
-	pfn_t pfn0, pfn1;
+	kvm_pfn_t pfn0, pfn1;
 	unsigned long vaddr = 0;
 	unsigned long entryhi = 0, entrylo0 = 0, entrylo1 = 0;
 	int even;
@@ -313,7 +313,7 @@ EXPORT_SYMBOL(kvm_mips_handle_kseg0_tlb_fault);
 int kvm_mips_handle_commpage_tlb_fault(unsigned long badvaddr,
 	struct kvm_vcpu *vcpu)
 {
-	pfn_t pfn0, pfn1;
+	kvm_pfn_t pfn0, pfn1;
 	unsigned long flags, old_entryhi = 0, vaddr = 0;
 	unsigned long entrylo0 = 0, entrylo1 = 0;
 
@@ -360,7 +360,7 @@ int kvm_mips_handle_mapped_seg_tlb_fault(struct kvm_vcpu *vcpu,
 {
 	unsigned long entryhi = 0, entrylo0 = 0, entrylo1 = 0;
 	struct kvm *kvm = vcpu->kvm;
-	pfn_t pfn0, pfn1;
+	kvm_pfn_t pfn0, pfn1;
 
 	if ((tlb->tlb_hi & VPN2_MASK) == 0) {
 		pfn0 = 0;
diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index 9fac01cb89c1..8f39796c9da8 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -154,8 +154,8 @@ extern void kvmppc_set_bat(struct kvm_vcpu *vcpu, struct kvmppc_bat *bat,
 			   bool upper, u32 val);
 extern void kvmppc_giveup_ext(struct kvm_vcpu *vcpu, ulong msr);
 extern int kvmppc_emulate_paired_single(struct kvm_run *run, struct kvm_vcpu *vcpu);
-extern pfn_t kvmppc_gpa_to_pfn(struct kvm_vcpu *vcpu, gpa_t gpa, bool writing,
-			bool *writable);
+extern kvm_pfn_t kvmppc_gpa_to_pfn(struct kvm_vcpu *vcpu, gpa_t gpa,
+			bool writing, bool *writable);
 extern void kvmppc_add_revmap_chain(struct kvm *kvm, struct revmap_entry *rev,
 			unsigned long *rmap, long pte_index, int realmode);
 extern void kvmppc_update_rmap_change(unsigned long *rmap, unsigned long psize);
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index c6ef05bd0765..2241d5357129 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -515,7 +515,7 @@ void kvmppc_claim_lpid(long lpid);
 void kvmppc_free_lpid(long lpid);
 void kvmppc_init_lpid(unsigned long nr_lpids);
 
-static inline void kvmppc_mmu_flush_icache(pfn_t pfn)
+static inline void kvmppc_mmu_flush_icache(kvm_pfn_t pfn)
 {
 	struct page *page;
 	/*
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 099c79d8c160..638c6d9be9e0 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -366,7 +366,7 @@ int kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvmppc_core_prepare_to_enter);
 
-pfn_t kvmppc_gpa_to_pfn(struct kvm_vcpu *vcpu, gpa_t gpa, bool writing,
+kvm_pfn_t kvmppc_gpa_to_pfn(struct kvm_vcpu *vcpu, gpa_t gpa, bool writing,
 			bool *writable)
 {
 	ulong mp_pa = vcpu->arch.magic_page_pa & KVM_PAM;
@@ -379,9 +379,9 @@ pfn_t kvmppc_gpa_to_pfn(struct kvm_vcpu *vcpu, gpa_t gpa, bool writing,
 	gpa &= ~0xFFFULL;
 	if (unlikely(mp_pa) && unlikely((gpa & KVM_PAM) == mp_pa)) {
 		ulong shared_page = ((ulong)vcpu->arch.shared) & PAGE_MASK;
-		pfn_t pfn;
+		kvm_pfn_t pfn;
 
-		pfn = (pfn_t)virt_to_phys((void*)shared_page) >> PAGE_SHIFT;
+		pfn = (kvm_pfn_t)virt_to_phys((void*)shared_page) >> PAGE_SHIFT;
 		get_page(pfn_to_page(pfn));
 		if (writable)
 			*writable = true;
diff --git a/arch/powerpc/kvm/book3s_32_mmu_host.c b/arch/powerpc/kvm/book3s_32_mmu_host.c
index d5c9bfeb0c9c..55c4d51ea3e2 100644
--- a/arch/powerpc/kvm/book3s_32_mmu_host.c
+++ b/arch/powerpc/kvm/book3s_32_mmu_host.c
@@ -142,7 +142,7 @@ extern char etext[];
 int kvmppc_mmu_map_page(struct kvm_vcpu *vcpu, struct kvmppc_pte *orig_pte,
 			bool iswrite)
 {
-	pfn_t hpaddr;
+	kvm_pfn_t hpaddr;
 	u64 vpn;
 	u64 vsid;
 	struct kvmppc_sid_map *map;
diff --git a/arch/powerpc/kvm/book3s_64_mmu_host.c b/arch/powerpc/kvm/book3s_64_mmu_host.c
index 79ad35abd196..913cd2198fa6 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_host.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_host.c
@@ -83,7 +83,7 @@ int kvmppc_mmu_map_page(struct kvm_vcpu *vcpu, struct kvmppc_pte *orig_pte,
 			bool iswrite)
 {
 	unsigned long vpn;
-	pfn_t hpaddr;
+	kvm_pfn_t hpaddr;
 	ulong hash, hpteg;
 	u64 vsid;
 	int ret;
diff --git a/arch/powerpc/kvm/e500.h b/arch/powerpc/kvm/e500.h
index 72920bed3ac6..94f04fcb373e 100644
--- a/arch/powerpc/kvm/e500.h
+++ b/arch/powerpc/kvm/e500.h
@@ -41,7 +41,7 @@ enum vcpu_ftr {
 #define E500_TLB_MAS2_ATTR	(0x7f)
 
 struct tlbe_ref {
-	pfn_t pfn;		/* valid only for TLB0, except briefly */
+	kvm_pfn_t pfn;		/* valid only for TLB0, except briefly */
 	unsigned int flags;	/* E500_TLB_* */
 };
 
diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
index 4d33e199edcc..8a5bb6dfcc2d 100644
--- a/arch/powerpc/kvm/e500_mmu_host.c
+++ b/arch/powerpc/kvm/e500_mmu_host.c
@@ -163,9 +163,9 @@ void kvmppc_map_magic(struct kvm_vcpu *vcpu)
 	struct kvm_book3e_206_tlb_entry magic;
 	ulong shared_page = ((ulong)vcpu->arch.shared) & PAGE_MASK;
 	unsigned int stid;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 
-	pfn = (pfn_t)virt_to_phys((void *)shared_page) >> PAGE_SHIFT;
+	pfn = (kvm_pfn_t)virt_to_phys((void *)shared_page) >> PAGE_SHIFT;
 	get_page(pfn_to_page(pfn));
 
 	preempt_disable();
@@ -246,7 +246,7 @@ static inline int tlbe_is_writable(struct kvm_book3e_206_tlb_entry *tlbe)
 
 static inline void kvmppc_e500_ref_setup(struct tlbe_ref *ref,
 					 struct kvm_book3e_206_tlb_entry *gtlbe,
-					 pfn_t pfn, unsigned int wimg)
+					 kvm_pfn_t pfn, unsigned int wimg)
 {
 	ref->pfn = pfn;
 	ref->flags = E500_TLB_VALID;
@@ -309,7 +309,7 @@ static void kvmppc_e500_setup_stlbe(
 	int tsize, struct tlbe_ref *ref, u64 gvaddr,
 	struct kvm_book3e_206_tlb_entry *stlbe)
 {
-	pfn_t pfn = ref->pfn;
+	kvm_pfn_t pfn = ref->pfn;
 	u32 pr = vcpu->arch.shared->msr & MSR_PR;
 
 	BUG_ON(!(ref->flags & E500_TLB_VALID));
diff --git a/arch/powerpc/kvm/trace_pr.h b/arch/powerpc/kvm/trace_pr.h
index 810507cb688a..d44f324184fb 100644
--- a/arch/powerpc/kvm/trace_pr.h
+++ b/arch/powerpc/kvm/trace_pr.h
@@ -30,7 +30,7 @@ TRACE_EVENT(kvm_book3s_reenter,
 #ifdef CONFIG_PPC_BOOK3S_64
 
 TRACE_EVENT(kvm_book3s_64_mmu_map,
-	TP_PROTO(int rflags, ulong hpteg, ulong va, pfn_t hpaddr,
+	TP_PROTO(int rflags, ulong hpteg, ulong va, kvm_pfn_t hpaddr,
 		 struct kvmppc_pte *orig_pte),
 	TP_ARGS(rflags, hpteg, va, hpaddr, orig_pte),
 
diff --git a/arch/x86/kvm/iommu.c b/arch/x86/kvm/iommu.c
index 5c520ebf6343..a22a488b4622 100644
--- a/arch/x86/kvm/iommu.c
+++ b/arch/x86/kvm/iommu.c
@@ -43,11 +43,11 @@ static int kvm_iommu_unmap_memslots(struct kvm *kvm);
 static void kvm_iommu_put_pages(struct kvm *kvm,
 				gfn_t base_gfn, unsigned long npages);
 
-static pfn_t kvm_pin_pages(struct kvm_memory_slot *slot, gfn_t gfn,
+static kvm_pfn_t kvm_pin_pages(struct kvm_memory_slot *slot, gfn_t gfn,
 			   unsigned long npages)
 {
 	gfn_t end_gfn;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 
 	pfn     = gfn_to_pfn_memslot(slot, gfn);
 	end_gfn = gfn + npages;
@@ -62,7 +62,8 @@ static pfn_t kvm_pin_pages(struct kvm_memory_slot *slot, gfn_t gfn,
 	return pfn;
 }
 
-static void kvm_unpin_pages(struct kvm *kvm, pfn_t pfn, unsigned long npages)
+static void kvm_unpin_pages(struct kvm *kvm, kvm_pfn_t pfn,
+		unsigned long npages)
 {
 	unsigned long i;
 
@@ -73,7 +74,7 @@ static void kvm_unpin_pages(struct kvm *kvm, pfn_t pfn, unsigned long npages)
 int kvm_iommu_map_pages(struct kvm *kvm, struct kvm_memory_slot *slot)
 {
 	gfn_t gfn, end_gfn;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 	int r = 0;
 	struct iommu_domain *domain = kvm->arch.iommu_domain;
 	int flags;
@@ -275,7 +276,7 @@ static void kvm_iommu_put_pages(struct kvm *kvm,
 {
 	struct iommu_domain *domain;
 	gfn_t end_gfn, gfn;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 	u64 phys;
 
 	domain  = kvm->arch.iommu_domain;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ff606f507913..6ab963ae0427 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -259,7 +259,7 @@ static unsigned get_mmio_spte_access(u64 spte)
 }
 
 static bool set_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, gfn_t gfn,
-			  pfn_t pfn, unsigned access)
+			  kvm_pfn_t pfn, unsigned access)
 {
 	if (unlikely(is_noslot_pfn(pfn))) {
 		mark_mmio_spte(vcpu, sptep, gfn, access);
@@ -325,7 +325,7 @@ static int is_last_spte(u64 pte, int level)
 	return 0;
 }
 
-static pfn_t spte_to_pfn(u64 pte)
+static kvm_pfn_t spte_to_pfn(u64 pte)
 {
 	return (pte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
 }
@@ -587,7 +587,7 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
  */
 static int mmu_spte_clear_track_bits(u64 *sptep)
 {
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 	u64 old_spte = *sptep;
 
 	if (!spte_has_volatile_bits(old_spte))
@@ -1369,7 +1369,7 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp,
 	int need_flush = 0;
 	u64 new_spte;
 	pte_t *ptep = (pte_t *)data;
-	pfn_t new_pfn;
+	kvm_pfn_t new_pfn;
 
 	WARN_ON(pte_huge(*ptep));
 	new_pfn = pte_pfn(*ptep);
@@ -2456,7 +2456,7 @@ static int mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
 	return 0;
 }
 
-static bool kvm_is_mmio_pfn(pfn_t pfn)
+static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
 {
 	if (pfn_valid(pfn))
 		return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
@@ -2466,7 +2466,7 @@ static bool kvm_is_mmio_pfn(pfn_t pfn)
 
 static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		    unsigned pte_access, int level,
-		    gfn_t gfn, pfn_t pfn, bool speculative,
+		    gfn_t gfn, kvm_pfn_t pfn, bool speculative,
 		    bool can_unsync, bool host_writable)
 {
 	u64 spte;
@@ -2546,7 +2546,7 @@ done:
 
 static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 			 unsigned pte_access, int write_fault, int *emulate,
-			 int level, gfn_t gfn, pfn_t pfn, bool speculative,
+			 int level, gfn_t gfn, kvm_pfn_t pfn, bool speculative,
 			 bool host_writable)
 {
 	int was_rmapped = 0;
@@ -2606,7 +2606,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 	kvm_release_pfn_clean(pfn);
 }
 
-static pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
+static kvm_pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
 				     bool no_dirty_log)
 {
 	struct kvm_memory_slot *slot;
@@ -2689,7 +2689,7 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
 }
 
 static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
-			int map_writable, int level, gfn_t gfn, pfn_t pfn,
+			int map_writable, int level, gfn_t gfn, kvm_pfn_t pfn,
 			bool prefault)
 {
 	struct kvm_shadow_walk_iterator iterator;
@@ -2739,7 +2739,7 @@ static void kvm_send_hwpoison_signal(unsigned long address, struct task_struct *
 	send_sig_info(SIGBUS, &info, tsk);
 }
 
-static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, pfn_t pfn)
+static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
 {
 	/*
 	 * Do not cache the mmio info caused by writing the readonly gfn
@@ -2759,9 +2759,10 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, pfn_t pfn)
 }
 
 static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
-					gfn_t *gfnp, pfn_t *pfnp, int *levelp)
+					gfn_t *gfnp, kvm_pfn_t *pfnp,
+					int *levelp)
 {
-	pfn_t pfn = *pfnp;
+	kvm_pfn_t pfn = *pfnp;
 	gfn_t gfn = *gfnp;
 	int level = *levelp;
 
@@ -2800,7 +2801,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
 }
 
 static bool handle_abnormal_pfn(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
-				pfn_t pfn, unsigned access, int *ret_val)
+				kvm_pfn_t pfn, unsigned access, int *ret_val)
 {
 	bool ret = true;
 
@@ -2954,7 +2955,7 @@ exit:
 }
 
 static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
-			 gva_t gva, pfn_t *pfn, bool write, bool *writable);
+			 gva_t gva, kvm_pfn_t *pfn, bool write, bool *writable);
 static void make_mmu_pages_available(struct kvm_vcpu *vcpu);
 
 static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, u32 error_code,
@@ -2963,7 +2964,7 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, u32 error_code,
 	int r;
 	int level;
 	int force_pt_level;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 	unsigned long mmu_seq;
 	bool map_writable, write = error_code & PFERR_WRITE_MASK;
 
@@ -3435,7 +3436,7 @@ static bool can_do_async_pf(struct kvm_vcpu *vcpu)
 }
 
 static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
-			 gva_t gva, pfn_t *pfn, bool write, bool *writable)
+			 gva_t gva, kvm_pfn_t *pfn, bool write, bool *writable)
 {
 	struct kvm_memory_slot *slot;
 	bool async;
@@ -3473,7 +3474,7 @@ check_hugepage_cache_consistency(struct kvm_vcpu *vcpu, gfn_t gfn, int level)
 static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
 			  bool prefault)
 {
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 	int r;
 	int level;
 	int force_pt_level;
@@ -4627,7 +4628,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 	u64 *sptep;
 	struct rmap_iterator iter;
 	int need_tlb_flush = 0;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 	struct kvm_mmu_page *sp;
 
 restart:
diff --git a/arch/x86/kvm/mmu_audit.c b/arch/x86/kvm/mmu_audit.c
index 03d518e499a6..37a4d14115c0 100644
--- a/arch/x86/kvm/mmu_audit.c
+++ b/arch/x86/kvm/mmu_audit.c
@@ -97,7 +97,7 @@ static void audit_mappings(struct kvm_vcpu *vcpu, u64 *sptep, int level)
 {
 	struct kvm_mmu_page *sp;
 	gfn_t gfn;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 	hpa_t hpa;
 
 	sp = page_header(__pa(sptep));
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 736e6ab8784d..9dd02cb74724 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -456,7 +456,7 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 {
 	unsigned pte_access;
 	gfn_t gfn;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 
 	if (FNAME(prefetch_invalid_gpte)(vcpu, sp, spte, gpte))
 		return false;
@@ -551,7 +551,7 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw,
 static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 			 struct guest_walker *gw,
 			 int write_fault, int hlevel,
-			 pfn_t pfn, bool map_writable, bool prefault)
+			 kvm_pfn_t pfn, bool map_writable, bool prefault)
 {
 	struct kvm_mmu_page *sp = NULL;
 	struct kvm_shadow_walk_iterator it;
@@ -696,7 +696,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 	int user_fault = error_code & PFERR_USER_MASK;
 	struct guest_walker walker;
 	int r;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 	int level = PT_PAGE_TABLE_LEVEL;
 	int force_pt_level;
 	unsigned long mmu_seq;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 06ef4908ba61..d401ed6874bd 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4046,7 +4046,7 @@ out:
 static int init_rmode_identity_map(struct kvm *kvm)
 {
 	int i, idx, r = 0;
-	pfn_t identity_map_pfn;
+	kvm_pfn_t identity_map_pfn;
 	u32 tmp;
 
 	if (!enable_ept)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 92511d4b7236..8fc5ca584edf 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4935,7 +4935,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gva_t cr2,
 				  int emulation_type)
 {
 	gpa_t gpa = cr2;
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 
 	if (emulation_type & EMULTYPE_NO_REEXECUTE)
 		return false;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1bef9e21e725..2420b43f3acc 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -65,7 +65,7 @@
  * error pfns indicate that the gfn is in slot but failed to
  * translate it to pfn on host.
  */
-static inline bool is_error_pfn(pfn_t pfn)
+static inline bool is_error_pfn(kvm_pfn_t pfn)
 {
 	return !!(pfn & KVM_PFN_ERR_MASK);
 }
@@ -75,13 +75,13 @@ static inline bool is_error_pfn(pfn_t pfn)
  * translated to pfn - it is not in slot or failed to
  * translate it to pfn.
  */
-static inline bool is_error_noslot_pfn(pfn_t pfn)
+static inline bool is_error_noslot_pfn(kvm_pfn_t pfn)
 {
 	return !!(pfn & KVM_PFN_ERR_NOSLOT_MASK);
 }
 
 /* noslot pfn indicates that the gfn is not in slot. */
-static inline bool is_noslot_pfn(pfn_t pfn)
+static inline bool is_noslot_pfn(kvm_pfn_t pfn)
 {
 	return pfn == KVM_PFN_NOSLOT;
 }
@@ -569,19 +569,20 @@ void kvm_release_page_clean(struct page *page);
 void kvm_release_page_dirty(struct page *page);
 void kvm_set_page_accessed(struct page *page);
 
-pfn_t gfn_to_pfn_atomic(struct kvm *kvm, gfn_t gfn);
-pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
-pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
+kvm_pfn_t gfn_to_pfn_atomic(struct kvm *kvm, gfn_t gfn);
+kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
+kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
 		      bool *writable);
-pfn_t gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn);
-pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn);
-pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn, bool atomic,
-			   bool *async, bool write_fault, bool *writable);
+kvm_pfn_t gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn);
+kvm_pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn);
+kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
+			       bool atomic, bool *async, bool write_fault,
+			       bool *writable);
 
-void kvm_release_pfn_clean(pfn_t pfn);
-void kvm_set_pfn_dirty(pfn_t pfn);
-void kvm_set_pfn_accessed(pfn_t pfn);
-void kvm_get_pfn(pfn_t pfn);
+void kvm_release_pfn_clean(kvm_pfn_t pfn);
+void kvm_set_pfn_dirty(kvm_pfn_t pfn);
+void kvm_set_pfn_accessed(kvm_pfn_t pfn);
+void kvm_get_pfn(kvm_pfn_t pfn);
 
 int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
 			int len);
@@ -607,8 +608,8 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
 
 struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu);
 struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn);
-pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn);
-pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn);
+kvm_pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn);
+kvm_pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn);
 struct page *kvm_vcpu_gfn_to_page(struct kvm_vcpu *vcpu, gfn_t gfn);
 unsigned long kvm_vcpu_gfn_to_hva(struct kvm_vcpu *vcpu, gfn_t gfn);
 unsigned long kvm_vcpu_gfn_to_hva_prot(struct kvm_vcpu *vcpu, gfn_t gfn, bool *writable);
@@ -789,7 +790,7 @@ void kvm_arch_sync_events(struct kvm *kvm);
 int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu);
 void kvm_vcpu_kick(struct kvm_vcpu *vcpu);
 
-bool kvm_is_reserved_pfn(pfn_t pfn);
+bool kvm_is_reserved_pfn(kvm_pfn_t pfn);
 
 struct kvm_irq_ack_notifier {
 	struct hlist_node link;
@@ -940,7 +941,7 @@ static inline gfn_t gpa_to_gfn(gpa_t gpa)
 	return (gfn_t)(gpa >> PAGE_SHIFT);
 }
 
-static inline hpa_t pfn_to_hpa(pfn_t pfn)
+static inline hpa_t pfn_to_hpa(kvm_pfn_t pfn)
 {
 	return (hpa_t)pfn << PAGE_SHIFT;
 }
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 1b47a185c2f0..8bf259dae9f6 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -53,7 +53,7 @@ typedef unsigned long  hva_t;
 typedef u64            hpa_t;
 typedef u64            hfn_t;
 
-typedef hfn_t pfn_t;
+typedef hfn_t kvm_pfn_t;
 
 struct gfn_to_hva_cache {
 	u64 generation;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8db1d9361993..02cd2eddd3ff 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -111,7 +111,7 @@ static void hardware_disable_all(void);
 
 static void kvm_io_bus_destroy(struct kvm_io_bus *bus);
 
-static void kvm_release_pfn_dirty(pfn_t pfn);
+static void kvm_release_pfn_dirty(kvm_pfn_t pfn);
 static void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot, gfn_t gfn);
 
 __visible bool kvm_rebooting;
@@ -119,7 +119,7 @@ EXPORT_SYMBOL_GPL(kvm_rebooting);
 
 static bool largepages_enabled = true;
 
-bool kvm_is_reserved_pfn(pfn_t pfn)
+bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
 {
 	if (pfn_valid(pfn))
 		return PageReserved(pfn_to_page(pfn));
@@ -1296,7 +1296,7 @@ static inline int check_user_page_hwpoison(unsigned long addr)
  * true indicates success, otherwise false is returned.
  */
 static bool hva_to_pfn_fast(unsigned long addr, bool atomic, bool *async,
-			    bool write_fault, bool *writable, pfn_t *pfn)
+			    bool write_fault, bool *writable, kvm_pfn_t *pfn)
 {
 	struct page *page[1];
 	int npages;
@@ -1329,7 +1329,7 @@ static bool hva_to_pfn_fast(unsigned long addr, bool atomic, bool *async,
  * 1 indicates success, -errno is returned if error is detected.
  */
 static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
-			   bool *writable, pfn_t *pfn)
+			   bool *writable, kvm_pfn_t *pfn)
 {
 	struct page *page[1];
 	int npages = 0;
@@ -1393,11 +1393,11 @@ static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
  * 2): @write_fault = false && @writable, @writable will tell the caller
  *     whether the mapping is writable.
  */
-static pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
+static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
 			bool write_fault, bool *writable)
 {
 	struct vm_area_struct *vma;
-	pfn_t pfn = 0;
+	kvm_pfn_t pfn = 0;
 	int npages;
 
 	/* we can do it either atomically or asynchronously, not both */
@@ -1438,8 +1438,9 @@ exit:
 	return pfn;
 }
 
-pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn, bool atomic,
-			   bool *async, bool write_fault, bool *writable)
+kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
+			       bool atomic, bool *async, bool write_fault,
+			       bool *writable)
 {
 	unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
 
@@ -1460,7 +1461,7 @@ pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn, bool atomic,
 }
 EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);
 
-pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
+kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
 		      bool *writable)
 {
 	return __gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn, false, NULL,
@@ -1468,37 +1469,37 @@ pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_prot);
 
-pfn_t gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
+kvm_pfn_t gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
 {
 	return __gfn_to_pfn_memslot(slot, gfn, false, NULL, true, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot);
 
-pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn)
+kvm_pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn)
 {
 	return __gfn_to_pfn_memslot(slot, gfn, true, NULL, true, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic);
 
-pfn_t gfn_to_pfn_atomic(struct kvm *kvm, gfn_t gfn)
+kvm_pfn_t gfn_to_pfn_atomic(struct kvm *kvm, gfn_t gfn)
 {
 	return gfn_to_pfn_memslot_atomic(gfn_to_memslot(kvm, gfn), gfn);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_atomic);
 
-pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn)
+kvm_pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn)
 {
 	return gfn_to_pfn_memslot_atomic(kvm_vcpu_gfn_to_memslot(vcpu, gfn), gfn);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_pfn_atomic);
 
-pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
+kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
 {
 	return gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn);
 
-pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn)
+kvm_pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn)
 {
 	return gfn_to_pfn_memslot(kvm_vcpu_gfn_to_memslot(vcpu, gfn), gfn);
 }
@@ -1521,7 +1522,7 @@ int gfn_to_page_many_atomic(struct kvm_memory_slot *slot, gfn_t gfn,
 }
 EXPORT_SYMBOL_GPL(gfn_to_page_many_atomic);
 
-static struct page *kvm_pfn_to_page(pfn_t pfn)
+static struct page *kvm_pfn_to_page(kvm_pfn_t pfn)
 {
 	if (is_error_noslot_pfn(pfn))
 		return KVM_ERR_PTR_BAD_PAGE;
@@ -1536,7 +1537,7 @@ static struct page *kvm_pfn_to_page(pfn_t pfn)
 
 struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
 {
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 
 	pfn = gfn_to_pfn(kvm, gfn);
 
@@ -1546,7 +1547,7 @@ EXPORT_SYMBOL_GPL(gfn_to_page);
 
 struct page *kvm_vcpu_gfn_to_page(struct kvm_vcpu *vcpu, gfn_t gfn)
 {
-	pfn_t pfn;
+	kvm_pfn_t pfn;
 
 	pfn = kvm_vcpu_gfn_to_pfn(vcpu, gfn);
 
@@ -1562,7 +1563,7 @@ void kvm_release_page_clean(struct page *page)
 }
 EXPORT_SYMBOL_GPL(kvm_release_page_clean);
 
-void kvm_release_pfn_clean(pfn_t pfn)
+void kvm_release_pfn_clean(kvm_pfn_t pfn)
 {
 	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
 		put_page(pfn_to_page(pfn));
@@ -1577,13 +1578,13 @@ void kvm_release_page_dirty(struct page *page)
 }
 EXPORT_SYMBOL_GPL(kvm_release_page_dirty);
 
-static void kvm_release_pfn_dirty(pfn_t pfn)
+static void kvm_release_pfn_dirty(kvm_pfn_t pfn)
 {
 	kvm_set_pfn_dirty(pfn);
 	kvm_release_pfn_clean(pfn);
 }
 
-void kvm_set_pfn_dirty(pfn_t pfn)
+void kvm_set_pfn_dirty(kvm_pfn_t pfn)
 {
 	if (!kvm_is_reserved_pfn(pfn)) {
 		struct page *page = pfn_to_page(pfn);
@@ -1594,14 +1595,14 @@ void kvm_set_pfn_dirty(pfn_t pfn)
 }
 EXPORT_SYMBOL_GPL(kvm_set_pfn_dirty);
 
-void kvm_set_pfn_accessed(pfn_t pfn)
+void kvm_set_pfn_accessed(kvm_pfn_t pfn)
 {
 	if (!kvm_is_reserved_pfn(pfn))
 		mark_page_accessed(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_set_pfn_accessed);
 
-void kvm_get_pfn(pfn_t pfn)
+void kvm_get_pfn(kvm_pfn_t pfn)
 {
 	if (!kvm_is_reserved_pfn(pfn))
 		get_page(pfn_to_page(pfn));


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 12/20] mips: fix PAGE_MASK definition
  2015-10-10  0:55 [PATCH v2 00/20] get_user_pages() for dax mappings Dan Williams
                   ` (10 preceding siblings ...)
  2015-10-10  0:56 ` [PATCH v2 11/20] kvm: rename pfn_t to kvm_pfn_t Dan Williams
@ 2015-10-10  0:56 ` Dan Williams
  2015-10-10  0:56 ` [PATCH v2 13/20] mm, dax, pmem: introduce pfn_t Dan Williams
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 37+ messages in thread
From: Dan Williams @ 2015-10-10  0:56 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: linux-mips, linux-kernel, Ralf Baechle, linux-mm, ross.zwisler, hch

Make PAGE_MASK an unsigned long, like it is on x86, to avoid:

In file included from arch/mips/kernel/asm-offsets.c:14:0:
include/linux/mm.h: In function '__pfn_to_pfn_t':
include/linux/mm.h:1050:2: warning: left shift count >= width of type
  pfn_t pfn_t = { .val = pfn | (flags & PFN_FLAGS_MASK), };

...where PFN_FLAGS_MASK is:

#define PFN_FLAGS_MASK (~PAGE_MASK << (BITS_PER_LONG - PAGE_SHIFT))
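
For illustration, a minimal user-space sketch (not kernel code; PAGE_SHIFT
and BITS_PER_LONG hard-coded to the 64-bit MIPS values) of why the old
definition warns: (1 << PAGE_SHIFT) is a signed int, so the old PAGE_MASK
and ~PAGE_MASK are only 32 bits wide, and shifting by
(BITS_PER_LONG - PAGE_SHIFT) = 48 exceeds the width of int.  Deriving
PAGE_MASK from PAGE_SIZE keeps the whole expression unsigned long:

#include <stdio.h>

#define PAGE_SHIFT	16
#define BITS_PER_LONG	64
#define OLD_PAGE_MASK	(~((1 << PAGE_SHIFT) - 1))	/* int: 32 bits */
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define NEW_PAGE_MASK	(~(PAGE_SIZE - 1))		/* unsigned long */

int main(void)
{
	/* 4 vs 8 on LP64: only the new mask is wide enough to shift by 48 */
	printf("sizeof(old)=%zu sizeof(new)=%zu\n",
	       sizeof(OLD_PAGE_MASK), sizeof(NEW_PAGE_MASK));
	/* ~OLD_PAGE_MASK << 48 is what gcc flags: shift count >= width */
	printf("PFN_FLAGS_MASK=%#lx\n",
	       ~NEW_PAGE_MASK << (BITS_PER_LONG - PAGE_SHIFT));
	return 0;
}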

Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-mips@linux-mips.org
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/mips/include/asm/page.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/mips/include/asm/page.h b/arch/mips/include/asm/page.h
index 89dd7fed1a57..ad1fccdb8d13 100644
--- a/arch/mips/include/asm/page.h
+++ b/arch/mips/include/asm/page.h
@@ -33,7 +33,7 @@
 #define PAGE_SHIFT	16
 #endif
 #define PAGE_SIZE	(_AC(1,UL) << PAGE_SHIFT)
-#define PAGE_MASK	(~((1 << PAGE_SHIFT) - 1))
+#define PAGE_MASK	(~(PAGE_SIZE - 1))
 
 /*
  * This is used for calculating the real page sizes


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 13/20] mm, dax, pmem: introduce pfn_t
  2015-10-10  0:55 [PATCH v2 00/20] get_user_pages() for dax mappings Dan Williams
                   ` (11 preceding siblings ...)
  2015-10-10  0:56 ` [PATCH v2 12/20] mips: fix PAGE_MASK definition Dan Williams
@ 2015-10-10  0:56 ` Dan Williams
  2015-10-10  0:56 ` [PATCH v2 14/20] mm, dax, gpu: convert vm_insert_mixed to pfn_t, introduce _PAGE_DEVMAP Dan Williams
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 37+ messages in thread
From: Dan Williams @ 2015-10-10  0:56 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dave Hansen, linux-kernel, linux-mm, ross.zwisler, Andrew Morton, hch

In preparation for enabling get_user_pages() operations on dax mappings,
introduce a type that encapsulates a page-frame-number and can also
encode other information.  That other information is the historical
"page_link" encoding in a scatterlist, but it can also denote "device
memory": a set of pfns that are not part of the kernel's linear mapping
by default, but are accessed via the same memory controller as ram.  The
motivation for this new type is large-capacity persistent memory that
optionally has struct page entries in the 'memmap'.

When a driver, like pmem, has established a devm_memremap_pages()
mapping, it needs to communicate to upper layers that the pfn has a page
backing.  This property will be leveraged in a later patch to enable
dax-gup.  For now, update all the ->direct_access() implementations to
communicate whether the returned pfn range is mapped.
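
As an aside, a minimal user-space sketch of the encoding (the helpers
mirror the hunks below; PAGE_SHIFT = 12 and 64-bit longs are assumed
here, and this is illustrative rather than kernel code):

#include <stdio.h>
#include <stdbool.h>

#define BITS_PER_LONG	64
#define PAGE_SHIFT	12
#define PAGE_MASK	(~((1UL << PAGE_SHIFT) - 1))
#define PFN_FLAGS_MASK	(~PAGE_MASK << (BITS_PER_LONG - PAGE_SHIFT))
#define PFN_DEV		(1UL << (BITS_PER_LONG - 3))
#define PFN_MAP		(1UL << (BITS_PER_LONG - 4))

typedef struct { unsigned long val; } pfn_t;

static pfn_t __pfn_to_pfn_t(unsigned long pfn, unsigned long flags)
{
	pfn_t pfn_t = { .val = pfn | (flags & PFN_FLAGS_MASK), };

	return pfn_t;
}

static bool pfn_t_has_page(pfn_t pfn)
{
	/* page backed if dynamically mapped (PFN_MAP) or plain ram (!PFN_DEV) */
	return (pfn.val & PFN_MAP) == PFN_MAP || (pfn.val & PFN_DEV) == 0;
}

int main(void)
{
	pfn_t ram = __pfn_to_pfn_t(0x1234, 0);
	pfn_t dev = __pfn_to_pfn_t(0x1234, PFN_DEV);
	pfn_t map = __pfn_to_pfn_t(0x1234, PFN_DEV | PFN_MAP);

	/* prints: ram=1 dev=0 dev+map=1 */
	printf("ram=%d dev=%d dev+map=%d\n", pfn_t_has_page(ram),
	       pfn_t_has_page(dev), pfn_t_has_page(map));
	return 0;
}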

Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Hansen <dave@sr71.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/powerpc/sysdev/axonram.c |    8 ++---
 drivers/block/brd.c           |    4 +--
 drivers/nvdimm/pmem.c         |   25 ++++++----------
 drivers/s390/block/dcssblk.c  |   10 +++---
 fs/block_dev.c                |    2 +
 fs/dax.c                      |   19 ++++++------
 include/linux/blkdev.h        |    4 +--
 include/linux/mm.h            |   65 +++++++++++++++++++++++++++++++++++++++++
 include/linux/pfn.h           |    9 ++++++
 9 files changed, 105 insertions(+), 41 deletions(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index d2b79bc336c1..59ca4c0ab529 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -141,15 +141,13 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
  */
 static long
 axon_ram_direct_access(struct block_device *device, sector_t sector,
-		       void __pmem **kaddr, unsigned long *pfn)
+		       void __pmem **kaddr, pfn_t *pfn)
 {
 	struct axon_ram_bank *bank = device->bd_disk->private_data;
 	loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;
-	void *addr = (void *)(bank->ph_addr + offset);
-
-	*kaddr = (void __pmem *)addr;
-	*pfn = virt_to_phys(addr) >> PAGE_SHIFT;
 
+	*kaddr = (void __pmem __force *) bank->io_addr + offset;
+	*pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV);
 	return bank->size - offset;
 }
 
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index b9794aeeb878..0bbc60463779 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -374,7 +374,7 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
 
 #ifdef CONFIG_BLK_DEV_RAM_DAX
 static long brd_direct_access(struct block_device *bdev, sector_t sector,
-			void __pmem **kaddr, unsigned long *pfn)
+			void __pmem **kaddr, pfn_t *pfn)
 {
 	struct brd_device *brd = bdev->bd_disk->private_data;
 	struct page *page;
@@ -385,7 +385,7 @@ static long brd_direct_access(struct block_device *bdev, sector_t sector,
 	if (!page)
 		return -ENOSPC;
 	*kaddr = (void __pmem *)page_address(page);
-	*pfn = page_to_pfn(page);
+	*pfn = page_to_pfn_t(page);
 
 	return PAGE_SIZE;
 }
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index bb66158c0505..c950602bbf0b 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -39,6 +39,7 @@ struct pmem_device {
 	phys_addr_t		phys_addr;
 	/* when non-zero this device is hosting a 'pfn' instance */
 	phys_addr_t		data_offset;
+	unsigned long		pfn_flags;
 	void __pmem		*virt_addr;
 	size_t			size;
 };
@@ -100,26 +101,15 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
 }
 
 static long pmem_direct_access(struct block_device *bdev, sector_t sector,
-		      void __pmem **kaddr, unsigned long *pfn)
+		      void __pmem **kaddr, pfn_t *pfn)
 {
 	struct pmem_device *pmem = bdev->bd_disk->private_data;
 	resource_size_t offset = sector * 512 + pmem->data_offset;
-	resource_size_t size;
-
-	if (pmem->data_offset) {
-		/*
-		 * Limit the direct_access() size to what is covered by
-		 * the memmap
-		 */
-		size = (pmem->size - offset) & ~ND_PFN_MASK;
-	} else
-		size = pmem->size - offset;
 
-	/* FIXME convert DAX to comprehend that this mapping has a lifetime */
 	*kaddr = pmem->virt_addr + offset;
-	*pfn = (pmem->phys_addr + offset) >> PAGE_SHIFT;
+	*pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);
 
-	return size;
+	return pmem->size - offset;
 }
 
 static const struct block_device_operations pmem_fops = {
@@ -150,10 +140,12 @@ static struct pmem_device *pmem_alloc(struct device *dev,
 		return ERR_PTR(-EBUSY);
 	}
 
-	if (pmem_should_map_pages(dev))
+	pmem->pfn_flags = PFN_DEV;
+	if (pmem_should_map_pages(dev)) {
 		pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, res,
 				NULL);
-	else
+		pmem->pfn_flags |= PFN_MAP;
+	} else
 		pmem->virt_addr = (void __pmem *) devm_memremap(dev,
 				pmem->phys_addr, pmem->size,
 				ARCH_MEMREMAP_PMEM);
@@ -380,6 +372,7 @@ static int nvdimm_namespace_attach_pfn(struct nd_namespace_common *ndns)
 	devm_memunmap(dev, (void __force *) pmem->virt_addr);
 	pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, &nsio->res,
 			altmap);
+	pmem->pfn_flags |= PFN_MAP;
 	if (IS_ERR(pmem->virt_addr)) {
 		rc = PTR_ERR(pmem->virt_addr);
 		goto err;
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 5ed44fe21380..e2b2839e4de5 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -29,7 +29,7 @@ static int dcssblk_open(struct block_device *bdev, fmode_t mode);
 static void dcssblk_release(struct gendisk *disk, fmode_t mode);
 static void dcssblk_make_request(struct request_queue *q, struct bio *bio);
 static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
-			 void __pmem **kaddr, unsigned long *pfn);
+			 void __pmem **kaddr, pfn_t *pfn);
 
 static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";
 
@@ -881,20 +881,18 @@ fail:
 
 static long
 dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
-			void __pmem **kaddr, unsigned long *pfn)
+			void __pmem **kaddr, pfn_t *pfn)
 {
 	struct dcssblk_dev_info *dev_info;
 	unsigned long offset, dev_sz;
-	void *addr;
 
 	dev_info = bdev->bd_disk->private_data;
 	if (!dev_info)
 		return -ENODEV;
 	dev_sz = dev_info->end - dev_info->start;
 	offset = secnum * 512;
-	addr = (void *) (dev_info->start + offset);
-	*pfn = virt_to_phys(addr) >> PAGE_SHIFT;
-	*kaddr = (void __pmem *) addr;
+	*kaddr = (void __pmem *) (dev_info->start + offset);
+	*pfn = phys_to_pfn_t(dev_info->start + offset, PFN_DEV);
 
 	return dev_sz - offset;
 }
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 073bb57adab1..c37a193695d4 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -442,7 +442,7 @@ EXPORT_SYMBOL_GPL(bdev_write_page);
  * accessible at this address.
  */
 long bdev_direct_access(struct block_device *bdev, sector_t sector,
-			void __pmem **addr, unsigned long *pfn, long size)
+			void __pmem **addr, pfn_t *pfn, long size)
 {
 	long avail;
 	const struct block_device_operations *ops = bdev->bd_disk->fops;
diff --git a/fs/dax.c b/fs/dax.c
index 9549cd523649..7496a776e1a6 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -31,7 +31,7 @@
 #include <linux/sizes.h>
 
 static void __pmem *__dax_map_atomic(struct block_device *bdev, sector_t sector,
-		long size, unsigned long *pfn, long *len)
+		long size, pfn_t *pfn, long *len)
 {
 	long rc;
 	void __pmem *addr;
@@ -52,7 +52,7 @@ static void __pmem *__dax_map_atomic(struct block_device *bdev, sector_t sector,
 static void __pmem *dax_map_atomic(struct block_device *bdev, sector_t sector,
 		long size)
 {
-	unsigned long pfn;
+	pfn_t pfn;
 
 	return __dax_map_atomic(bdev, sector, size, &pfn, NULL);
 }
@@ -77,8 +77,8 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 	might_sleep();
 	do {
 		void __pmem *addr;
-		unsigned long pfn;
 		long count, sz;
+		pfn_t pfn;
 
 		sz = min_t(long, size, SZ_1M);
 		addr = __dax_map_atomic(bdev, sector, size, &pfn, &count);
@@ -146,7 +146,7 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 	struct block_device *bdev = NULL;
 	int rw = iov_iter_rw(iter), rc;
 	long map_len = 0;
-	unsigned long pfn;
+	pfn_t pfn;
 	void __pmem *addr = NULL;
 	void __pmem *kmap = (void __pmem *) ERR_PTR(-EIO);
 	bool hole = false;
@@ -338,9 +338,9 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	unsigned long vaddr = (unsigned long)vmf->virtual_address;
 	struct address_space *mapping = inode->i_mapping;
 	struct block_device *bdev = bh->b_bdev;
-	unsigned long pfn;
 	void __pmem *addr;
 	pgoff_t size;
+	pfn_t pfn;
 	int error;
 
 	i_mmap_lock_read(mapping);
@@ -371,7 +371,7 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	}
 	dax_unmap_atomic(bdev, addr);
 
-	error = vm_insert_mixed(vma, vaddr, pfn);
+	error = vm_insert_mixed(vma, vaddr, pfn_t_to_pfn(pfn));
 
  out:
 	i_mmap_unlock_read(mapping);
@@ -660,8 +660,8 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		result = VM_FAULT_NOPAGE;
 		spin_unlock(ptl);
 	} else {
+		pfn_t pfn;
 		long length;
-		unsigned long pfn;
 		void __pmem *kaddr = __dax_map_atomic(bdev,
 				to_sector(&bh, inode), HPAGE_SIZE, &pfn,
 				&length);
@@ -670,7 +670,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 			result = VM_FAULT_SIGBUS;
 			goto out;
 		}
-		if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR)) {
+		if ((length < PMD_SIZE) || (pfn_t_to_pfn(pfn) & PG_PMD_COLOUR)) {
 			dax_unmap_atomic(bdev, kaddr);
 			goto fallback;
 		}
@@ -684,7 +684,8 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		}
 		dax_unmap_atomic(bdev, kaddr);
 
-		result |= vmf_insert_pfn_pmd(vma, address, pmd, pfn, write);
+		result |= vmf_insert_pfn_pmd(vma, address, pmd,
+				pfn_t_to_pfn(pfn), write);
 	}
 
  out:
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index cd091cb2b96e..fb3e6886c479 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1624,7 +1624,7 @@ struct block_device_operations {
 	int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
 	int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
 	long (*direct_access)(struct block_device *, sector_t, void __pmem **,
-			unsigned long *pfn);
+			pfn_t *);
 	unsigned int (*check_events) (struct gendisk *disk,
 				      unsigned int clearing);
 	/* ->media_changed() is DEPRECATED, use ->check_events() instead */
@@ -1643,7 +1643,7 @@ extern int bdev_read_page(struct block_device *, sector_t, struct page *);
 extern int bdev_write_page(struct block_device *, sector_t, struct page *,
 						struct writeback_control *);
 extern long bdev_direct_access(struct block_device *, sector_t,
-		void __pmem **addr, unsigned long *pfn, long size);
+		void __pmem **addr, pfn_t *pfn, long size);
 #else /* CONFIG_BLOCK */
 
 struct block_device;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b5628cfbf649..7045099f1654 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1033,6 +1033,71 @@ static inline void set_page_memcg(struct page *page, struct mem_cgroup *memcg)
 #endif
 
 /*
+ * PFN_FLAGS_MASK - mask of all the possible valid pfn_t flags
+ * PFN_SG_CHAIN - pfn is a pointer to the next scatterlist entry
+ * PFN_SG_LAST - pfn references a page and is the last scatterlist entry
+ * PFN_DEV - pfn is not covered by system memmap by default
+ * PFN_MAP - pfn has a dynamic page mapping established by a device driver
+ */
+#define PFN_FLAGS_MASK (~PAGE_MASK << (BITS_PER_LONG - PAGE_SHIFT))
+#define PFN_SG_CHAIN (1UL << (BITS_PER_LONG - 1))
+#define PFN_SG_LAST (1UL << (BITS_PER_LONG - 2))
+#define PFN_DEV (1UL << (BITS_PER_LONG - 3))
+#define PFN_MAP (1UL << (BITS_PER_LONG - 4))
+
+static inline pfn_t __pfn_to_pfn_t(unsigned long pfn, unsigned long flags)
+{
+	pfn_t pfn_t = { .val = pfn | (flags & PFN_FLAGS_MASK), };
+
+	return pfn_t;
+}
+
+/* a default pfn to pfn_t conversion assumes that @pfn is pfn_valid() */
+static inline pfn_t pfn_to_pfn_t(unsigned long pfn)
+{
+	return __pfn_to_pfn_t(pfn, 0);
+}
+
+static inline pfn_t phys_to_pfn_t(dma_addr_t addr, unsigned long flags)
+{
+	return __pfn_to_pfn_t(addr >> PAGE_SHIFT, flags);
+}
+
+static inline bool pfn_t_has_page(pfn_t pfn)
+{
+	return (pfn.val & PFN_MAP) == PFN_MAP || (pfn.val & PFN_DEV) == 0;
+}
+
+static inline unsigned long pfn_t_to_pfn(pfn_t pfn)
+{
+	return pfn.val & ~PFN_FLAGS_MASK;
+}
+
+static inline struct page *pfn_t_to_page(pfn_t pfn)
+{
+	if (pfn_t_has_page(pfn))
+		return pfn_to_page(pfn_t_to_pfn(pfn));
+	return NULL;
+}
+
+static inline dma_addr_t pfn_t_to_phys(pfn_t pfn)
+{
+	return PFN_PHYS(pfn_t_to_pfn(pfn));
+}
+
+static inline void *pfn_t_to_virt(pfn_t pfn)
+{
+	if (pfn_t_has_page(pfn))
+		return __va(pfn_t_to_phys(pfn));
+	return NULL;
+}
+
+static inline pfn_t page_to_pfn_t(struct page *page)
+{
+	return pfn_to_pfn_t(page_to_pfn(page));
+}
+
+/*
  * Some inline functions in vmstat.h depend on page_zone()
  */
 #include <linux/vmstat.h>
diff --git a/include/linux/pfn.h b/include/linux/pfn.h
index 7646637221f3..96df85985f16 100644
--- a/include/linux/pfn.h
+++ b/include/linux/pfn.h
@@ -3,6 +3,15 @@
 
 #ifndef __ASSEMBLY__
 #include <linux/types.h>
+
+/*
+ * pfn_t: encapsulates a page-frame number that is optionally backed
+ * by memmap (struct page).  Whether a pfn_t has a 'struct page'
+ * backing is indicated by flags in the high bits of the value.
+ */
+typedef struct {
+	unsigned long val;
+} pfn_t;
 #endif
 
 #define PFN_ALIGN(x)	(((unsigned long)(x) + (PAGE_SIZE - 1)) & PAGE_MASK)


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 14/20] mm, dax, gpu: convert vm_insert_mixed to pfn_t, introduce _PAGE_DEVMAP
  2015-10-10  0:55 [PATCH v2 00/20] get_user_pages() for dax mappings Dan Williams
                   ` (12 preceding siblings ...)
  2015-10-10  0:56 ` [PATCH v2 13/20] mm, dax, pmem: introduce pfn_t Dan Williams
@ 2015-10-10  0:56 ` Dan Williams
  2015-10-10  0:56 ` [PATCH v2 15/20] mm, dax: convert vmf_insert_pfn_pmd() to pfn_t Dan Williams
                   ` (6 subsequent siblings)
  20 siblings, 0 replies; 37+ messages in thread
From: Dan Williams @ 2015-10-10  0:56 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dave Hansen, David Airlie, linux-kernel, linux-mm, ross.zwisler,
	Andrew Morton, hch

Convert the raw unsigned long 'pfn' argument of vm_insert_mixed() to
pfn_t so that the PFN_MAP and PFN_DEV flags can be evaluated.  When both
are set, the new _PAGE_DEVMAP flag is set in the resulting pte.  This
flag will later be used in the get_user_pages() path to keep the page
mapping, dynamically allocated by devm_memremap_pages(), pinned until
all the resulting pages are released.

There are no functional changes to the gpu drivers as a result of this
conversion.

This uncovered several architectures with no local definition of
pfn_pte(); in response, pfn_t_pte() is only defined when an architecture
opts in via "#define pfn_pte pfn_pte".

Cc: Dave Hansen <dave@sr71.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Airlie <airlied@linux.ie>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/alpha/include/asm/pgtable.h        |    1 +
 arch/parisc/include/asm/pgtable.h       |    1 +
 arch/powerpc/include/asm/pgtable.h      |    1 +
 arch/tile/include/asm/pgtable.h         |    1 +
 arch/um/include/asm/pgtable-3level.h    |    1 +
 arch/x86/include/asm/pgtable.h          |   18 ++++++++++++++++++
 arch/x86/include/asm/pgtable_types.h    |    7 ++++++-
 drivers/gpu/drm/exynos/exynos_drm_gem.c |    3 ++-
 drivers/gpu/drm/gma500/framebuffer.c    |    3 ++-
 drivers/gpu/drm/msm/msm_gem.c           |    3 ++-
 drivers/gpu/drm/omapdrm/omap_gem.c      |    6 ++++--
 drivers/gpu/drm/ttm/ttm_bo_vm.c         |    3 ++-
 fs/dax.c                                |    2 +-
 include/linux/mm.h                      |   29 ++++++++++++++++++++++++++++-
 mm/memory.c                             |   15 +++++++++------
 15 files changed, 79 insertions(+), 15 deletions(-)

diff --git a/arch/alpha/include/asm/pgtable.h b/arch/alpha/include/asm/pgtable.h
index a9a119592372..a54050fe867e 100644
--- a/arch/alpha/include/asm/pgtable.h
+++ b/arch/alpha/include/asm/pgtable.h
@@ -216,6 +216,7 @@ extern unsigned long __zero_page(void);
 })
 #endif
 
+#define pfn_pte pfn_pte
 extern inline pte_t pfn_pte(unsigned long physpfn, pgprot_t pgprot)
 { pte_t pte; pte_val(pte) = (PHYS_TWIDDLE(physpfn) << 32) | pgprot_val(pgprot); return pte; }
 
diff --git a/arch/parisc/include/asm/pgtable.h b/arch/parisc/include/asm/pgtable.h
index f93c4a4e6580..dde7dd7200bd 100644
--- a/arch/parisc/include/asm/pgtable.h
+++ b/arch/parisc/include/asm/pgtable.h
@@ -377,6 +377,7 @@ static inline pte_t pte_mkspecial(pte_t pte)	{ return pte; }
 
 #define mk_pte(page, pgprot)	pfn_pte(page_to_pfn(page), (pgprot))
 
+#define pfn_pte pfn_pte
 static inline pte_t pfn_pte(unsigned long pfn, pgprot_t pgprot)
 {
 	pte_t pte;
diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index 0717693c8428..8448ff1542e0 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -67,6 +67,7 @@ static inline int pte_present(pte_t pte)
  * Even if PTEs can be unsigned long long, a PFN is always an unsigned
  * long for now.
  */
+#define pfn_pte pfn_pte
 static inline pte_t pfn_pte(unsigned long pfn, pgprot_t pgprot) {
 	return __pte(((pte_basic_t)(pfn) << PTE_RPN_SHIFT) |
 		     pgprot_val(pgprot)); }
diff --git a/arch/tile/include/asm/pgtable.h b/arch/tile/include/asm/pgtable.h
index 2b05ccbebed9..37c9aa3a3f0c 100644
--- a/arch/tile/include/asm/pgtable.h
+++ b/arch/tile/include/asm/pgtable.h
@@ -275,6 +275,7 @@ static inline unsigned long pte_pfn(pte_t pte)
 extern pgprot_t set_remote_cache_cpu(pgprot_t prot, int cpu);
 extern int get_remote_cache_cpu(pgprot_t prot);
 
+#define pfn_pte pfn_pte
 static inline pte_t pfn_pte(unsigned long pfn, pgprot_t prot)
 {
 	return hv_pte_set_pa(prot, PFN_PHYS(pfn));
diff --git a/arch/um/include/asm/pgtable-3level.h b/arch/um/include/asm/pgtable-3level.h
index bae8523a162f..b7b51db14c2f 100644
--- a/arch/um/include/asm/pgtable-3level.h
+++ b/arch/um/include/asm/pgtable-3level.h
@@ -98,6 +98,7 @@ static inline unsigned long pte_pfn(pte_t pte)
 	return phys_to_pfn(pte_val(pte));
 }
 
+#define pfn_pte pfn_pte
 static inline pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot)
 {
 	pte_t pte;
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 867da5bbb4a3..02a54e5b7930 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -248,6 +248,11 @@ static inline pte_t pte_mkspecial(pte_t pte)
 	return pte_set_flags(pte, _PAGE_SPECIAL);
 }
 
+static inline pte_t pte_mkdevmap(pte_t pte)
+{
+	return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP);
+}
+
 static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set)
 {
 	pmdval_t v = native_pmd_val(pmd);
@@ -334,6 +339,7 @@ static inline pgprotval_t massage_pgprot(pgprot_t pgprot)
 	return protval;
 }
 
+#define pfn_pte pfn_pte
 static inline pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot)
 {
 	return __pte(((phys_addr_t)page_nr << PAGE_SHIFT) |
@@ -446,6 +452,12 @@ static inline int pte_present(pte_t a)
 	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
 }
 
+#define pte_devmap pte_devmap
+static inline int pte_devmap(pte_t a)
+{
+	return pte_flags(a) & _PAGE_DEVMAP;
+}
+
 #define pte_accessible pte_accessible
 static inline bool pte_accessible(struct mm_struct *mm, pte_t a)
 {
@@ -464,6 +476,12 @@ static inline int pte_hidden(pte_t pte)
 	return pte_flags(pte) & _PAGE_HIDDEN;
 }
 
+#define pmd_devmap pmd_devmap
+static inline int pmd_devmap(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_DEVMAP;
+}
+
 static inline int pmd_present(pmd_t pmd)
 {
 	/*
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 13f310bfc09a..42d34e795123 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -25,7 +25,9 @@
 #define _PAGE_BIT_SPLITTING	_PAGE_BIT_SOFTW2 /* only valid on a PSE pmd */
 #define _PAGE_BIT_HIDDEN	_PAGE_BIT_SOFTW3 /* hidden by kmemcheck */
 #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
-#define _PAGE_BIT_NX           63       /* No execute: only valid after cpuid check */
+#define _PAGE_BIT_SOFTW4	58	/* available for programmer */
+#define _PAGE_BIT_DEVMAP		_PAGE_BIT_SOFTW4
+#define _PAGE_BIT_NX		63	/* No execute: only valid after cpuid check */
 
 /* If _PAGE_BIT_PRESENT is clear, we use these: */
 /* - if the user mapped it with PROT_NONE; pte_present gives true */
@@ -85,8 +87,11 @@
 
 #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
 #define _PAGE_NX	(_AT(pteval_t, 1) << _PAGE_BIT_NX)
+#define _PAGE_DEVMAP	(_AT(pteval_t, 1) << _PAGE_BIT_DEVMAP)
+#define __HAVE_ARCH_PTE_DEVMAP
 #else
 #define _PAGE_NX	(_AT(pteval_t, 0))
+#define _PAGE_DEVMAP	(_AT(pteval_t, 0))
 #endif
 
 #define _PAGE_PROTNONE	(_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
diff --git a/drivers/gpu/drm/exynos/exynos_drm_gem.c b/drivers/gpu/drm/exynos/exynos_drm_gem.c
index 407afedb6003..778764bebc00 100644
--- a/drivers/gpu/drm/exynos/exynos_drm_gem.c
+++ b/drivers/gpu/drm/exynos/exynos_drm_gem.c
@@ -479,7 +479,8 @@ int exynos_drm_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	}
 
 	pfn = page_to_pfn(exynos_gem_obj->pages[page_offset]);
-	ret = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address, pfn);
+	ret = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
+			pfn_to_pfn_t(pfn, PFN_DEV));
 
 out:
 	switch (ret) {
diff --git a/drivers/gpu/drm/gma500/framebuffer.c b/drivers/gpu/drm/gma500/framebuffer.c
index 2eaf1b31c7bd..073144f197c5 100644
--- a/drivers/gpu/drm/gma500/framebuffer.c
+++ b/drivers/gpu/drm/gma500/framebuffer.c
@@ -132,7 +132,8 @@ static int psbfb_vm_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	for (i = 0; i < page_num; i++) {
 		pfn = (phys_addr >> PAGE_SHIFT);
 
-		ret = vm_insert_mixed(vma, address, pfn);
+		ret = vm_insert_mixed(vma, address,
+				__pfn_to_pfn_t(pfn, PFN_DEV));
 		if (unlikely((ret == -EBUSY) || (ret != 0 && i > 0)))
 			break;
 		else if (unlikely(ret != 0)) {
diff --git a/drivers/gpu/drm/msm/msm_gem.c b/drivers/gpu/drm/msm/msm_gem.c
index c76cc853b08a..0f4ed5bfda83 100644
--- a/drivers/gpu/drm/msm/msm_gem.c
+++ b/drivers/gpu/drm/msm/msm_gem.c
@@ -222,7 +222,8 @@ int msm_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	VERB("Inserting %p pfn %lx, pa %lx", vmf->virtual_address,
 			pfn, pfn << PAGE_SHIFT);
 
-	ret = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address, pfn);
+	ret = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
+			pfn_to_pfn_t(pfn, PFN_DEV));
 
 out_unlock:
 	mutex_unlock(&dev->struct_mutex);
diff --git a/drivers/gpu/drm/omapdrm/omap_gem.c b/drivers/gpu/drm/omapdrm/omap_gem.c
index 7ed08fdc4c42..910cb276a7ea 100644
--- a/drivers/gpu/drm/omapdrm/omap_gem.c
+++ b/drivers/gpu/drm/omapdrm/omap_gem.c
@@ -385,7 +385,8 @@ static int fault_1d(struct drm_gem_object *obj,
 	VERB("Inserting %p pfn %lx, pa %lx", vmf->virtual_address,
 			pfn, pfn << PAGE_SHIFT);
 
-	return vm_insert_mixed(vma, (unsigned long)vmf->virtual_address, pfn);
+	return vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
+			pfn_to_pfn_t(pfn, PFN_DEV));
 }
 
 /* Special handling for the case of faulting in 2d tiled buffers */
@@ -478,7 +479,8 @@ static int fault_2d(struct drm_gem_object *obj,
 			pfn, pfn << PAGE_SHIFT);
 
 	for (i = n; i > 0; i--) {
-		vm_insert_mixed(vma, (unsigned long)vaddr, pfn);
+		vm_insert_mixed(vma, (unsigned long)vaddr,
+				pfn_to_pfn_t(pfn, PFN_DEV));
 		pfn += usergart[fmt].stride_pfn;
 		vaddr += PAGE_SIZE * m;
 	}
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index 8fb7213277cc..bab765a7c501 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -229,7 +229,8 @@ static int ttm_bo_vm_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 		}
 
 		if (vma->vm_flags & VM_MIXEDMAP)
-			ret = vm_insert_mixed(&cvma, address, pfn);
+			ret = vm_insert_mixed(&cvma, address,
+					__pfn_to_pfn_t(pfn, PFN_DEV));
 		else
 			ret = vm_insert_pfn(&cvma, address, pfn);
 
diff --git a/fs/dax.c b/fs/dax.c
index 7496a776e1a6..1588edf297a2 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -371,7 +371,7 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	}
 	dax_unmap_atomic(bdev, addr);
 
-	error = vm_insert_mixed(vma, vaddr, pfn_t_to_pfn(pfn));
+	error = vm_insert_mixed(vma, vaddr, pfn);
 
  out:
 	i_mmap_unlock_read(mapping);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7045099f1654..1d405ca21c9f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1097,6 +1097,33 @@ static inline pfn_t page_to_pfn_t(struct page *page)
 	return pfn_to_pfn_t(page_to_pfn(page));
 }
 
+static inline int pfn_t_valid(pfn_t pfn)
+{
+	return pfn_valid(pfn_t_to_pfn(pfn));
+}
+
+#ifdef pfn_pte
+static inline pte_t pfn_t_pte(pfn_t pfn, pgprot_t pgprot)
+{
+	return pfn_pte(pfn_t_to_pfn(pfn), pgprot);
+}
+#endif
+
+#ifdef __HAVE_ARCH_PTE_DEVICE
+static inline bool pfn_t_has_dev_pagemap(pfn_t pfn)
+{
+	const unsigned long flags = PFN_DEV|PFN_MAP;
+
+	return (pfn.val & flags) == flags;
+}
+#else
+static inline bool pfn_t_has_dev_pagemap(pfn_t pfn)
+{
+	return false;
+}
+pte_t pte_mkdevmap(pte_t pte);
+#endif
+
 /*
  * Some inline functions in vmstat.h depend on page_zone()
  */
@@ -2280,7 +2307,7 @@ int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *);
 int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 			unsigned long pfn);
 int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
-			unsigned long pfn);
+			pfn_t pfn);
 int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len);
 
 
diff --git a/mm/memory.c b/mm/memory.c
index deb679c31f2a..5fc858570f58 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1517,7 +1517,7 @@ int vm_insert_page(struct vm_area_struct *vma, unsigned long addr,
 EXPORT_SYMBOL(vm_insert_page);
 
 static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
-			unsigned long pfn, pgprot_t prot)
+			pfn_t pfn, pgprot_t prot)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	int retval;
@@ -1533,7 +1533,10 @@ static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 		goto out_unlock;
 
 	/* Ok, finally just insert the thing.. */
-	entry = pte_mkspecial(pfn_pte(pfn, prot));
+	if (pfn_t_has_dev_pagemap(pfn))
+		entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
+	else
+		entry = pte_mkspecial(pfn_t_pte(pfn, prot));
 	set_pte_at(mm, addr, pte, entry);
 	update_mmu_cache(vma, addr, pte); /* XXX: why not for insert_page? */
 
@@ -1583,14 +1586,14 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 	if (track_pfn_insert(vma, &pgprot, pfn))
 		return -EINVAL;
 
-	ret = insert_pfn(vma, addr, pfn, pgprot);
+	ret = insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot);
 
 	return ret;
 }
 EXPORT_SYMBOL(vm_insert_pfn);
 
 int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
-			unsigned long pfn)
+			pfn_t pfn)
 {
 	BUG_ON(!(vma->vm_flags & VM_MIXEDMAP));
 
@@ -1604,10 +1607,10 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 	 * than insert_pfn).  If a zero_pfn were inserted into a VM_MIXEDMAP
 	 * without pte special, it would there be refcounted as a normal page.
 	 */
-	if (!HAVE_PTE_SPECIAL && pfn_valid(pfn)) {
+	if (!HAVE_PTE_SPECIAL && pfn_t_valid(pfn)) {
 		struct page *page;
 
-		page = pfn_to_page(pfn);
+		page = pfn_t_to_page(pfn);
 		return insert_page(vma, addr, page, vma->vm_page_prot);
 	}
 	return insert_pfn(vma, addr, pfn, vma->vm_page_prot);


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 15/20] mm, dax: convert vmf_insert_pfn_pmd() to pfn_t
  2015-10-10  0:55 [PATCH v2 00/20] get_user_pages() for dax mappings Dan Williams
                   ` (13 preceding siblings ...)
  2015-10-10  0:56 ` [PATCH v2 14/20] mm, dax, gpu: convert vm_insert_mixed to pfn_t, introduce _PAGE_DEVMAP Dan Williams
@ 2015-10-10  0:56 ` Dan Williams
  2015-10-10  0:56 ` [PATCH v2 16/20] list: introduce list_poison() and LIST_POISON3 Dan Williams
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 37+ messages in thread
From: Dan Williams @ 2015-10-10  0:56 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dave Hansen, linux-kernel, linux-mm, Alexander Viro,
	ross.zwisler, Matthew Wilcox, Andrew Morton, hch

Similar to the vm_insert_mixed() conversion, have vmf_insert_pfn_pmd()
take a pfn_t so the resulting pmd can be tagged with _PAGE_DEVMAP when
the pfn is backed by a devm_memremap_pages() mapping.
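
One detail worth calling out: the patch also adds default pte_devmap()
and pmd_devmap() stubs so generic code can test them unconditionally.  A
small stand-alone sketch of that idiom (values simplified, not kernel
code):

#include <stdio.h>

/* an arch without devmap support simply never defines these */
#ifndef pte_devmap
#define pte_devmap(pte) (0)
#endif
#ifndef pmd_devmap
#define pmd_devmap(pmd) (0)
#endif

int main(void)
{
	/* expands to (0): constant-false, so the compiler elides the branch */
	if (pmd_devmap(0x1234UL))
		printf("devmap pmd\n");
	else
		printf("no devmap support on this configuration\n");
	return 0;
}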

Cc: Dave Hansen <dave@sr71.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/sparc/include/asm/pgtable_64.h     |    2 ++
 arch/x86/include/asm/pgtable.h          |    6 ++++++
 arch/x86/mm/pat.c                       |    4 ++--
 drivers/gpu/drm/exynos/exynos_drm_gem.c |    2 +-
 drivers/gpu/drm/msm/msm_gem.c           |    2 +-
 drivers/gpu/drm/omapdrm/omap_gem.c      |    4 ++--
 fs/dax.c                                |    2 +-
 include/asm-generic/pgtable.h           |    6 ++++--
 include/linux/huge_mm.h                 |    2 +-
 include/linux/mm.h                      |   18 +++++++++++++++++-
 mm/huge_memory.c                        |   10 ++++++----
 mm/memory.c                             |    2 +-
 12 files changed, 44 insertions(+), 16 deletions(-)

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 131d36fcd07a..496ef783c68c 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -234,6 +234,7 @@ extern struct page *mem_map_zero;
  * the first physical page in the machine is at some huge physical address,
  * such as 4GB.   This is common on a partitioned E10000, for example.
  */
+#define pfn_pte pfn_pte
 static inline pte_t pfn_pte(unsigned long pfn, pgprot_t prot)
 {
 	unsigned long paddr = pfn << PAGE_SHIFT;
@@ -244,6 +245,7 @@ static inline pte_t pfn_pte(unsigned long pfn, pgprot_t prot)
 #define mk_pte(page, pgprot)	pfn_pte(page_to_pfn(page), (pgprot))
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define pfn_pmd pfn_pmd
 static inline pmd_t pfn_pmd(unsigned long page_nr, pgprot_t pgprot)
 {
 	pte_t pte = pfn_pte(page_nr, pgprot);
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 02a54e5b7930..84d1346e1cda 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -282,6 +282,11 @@ static inline pmd_t pmd_mkdirty(pmd_t pmd)
 	return pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
 }
 
+static inline pmd_t pmd_mkdevmap(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_DEVMAP);
+}
+
 static inline pmd_t pmd_mkhuge(pmd_t pmd)
 {
 	return pmd_set_flags(pmd, _PAGE_PSE);
@@ -346,6 +351,7 @@ static inline pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot)
 		     massage_pgprot(pgprot));
 }
 
+#define pfn_pmd pfn_pmd
 static inline pmd_t pfn_pmd(unsigned long page_nr, pgprot_t pgprot)
 {
 	return __pmd(((phys_addr_t)page_nr << PAGE_SHIFT) |
diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c
index 188e3e07eeeb..98efd3c02374 100644
--- a/arch/x86/mm/pat.c
+++ b/arch/x86/mm/pat.c
@@ -949,7 +949,7 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
 }
 
 int track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
-		     unsigned long pfn)
+		     pfn_t pfn)
 {
 	enum page_cache_mode pcm;
 
@@ -957,7 +957,7 @@ int track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
 		return 0;
 
 	/* Set prot based on lookup */
-	pcm = lookup_memtype((resource_size_t)pfn << PAGE_SHIFT);
+	pcm = lookup_memtype(pfn_t_to_phys(pfn));
 	*prot = __pgprot((pgprot_val(vma->vm_page_prot) & (~_PAGE_CACHE_MASK)) |
 			 cachemode2protval(pcm));
 
diff --git a/drivers/gpu/drm/exynos/exynos_drm_gem.c b/drivers/gpu/drm/exynos/exynos_drm_gem.c
index 778764bebc00..aa7709ed9ae2 100644
--- a/drivers/gpu/drm/exynos/exynos_drm_gem.c
+++ b/drivers/gpu/drm/exynos/exynos_drm_gem.c
@@ -480,7 +480,7 @@ int exynos_drm_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 
 	pfn = page_to_pfn(exynos_gem_obj->pages[page_offset]);
 	ret = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
-			pfn_to_pfn_t(pfn, PFN_DEV));
+			__pfn_to_pfn_t(pfn, PFN_DEV));
 
 out:
 	switch (ret) {
diff --git a/drivers/gpu/drm/msm/msm_gem.c b/drivers/gpu/drm/msm/msm_gem.c
index 0f4ed5bfda83..6509d9b23912 100644
--- a/drivers/gpu/drm/msm/msm_gem.c
+++ b/drivers/gpu/drm/msm/msm_gem.c
@@ -223,7 +223,7 @@ int msm_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 			pfn, pfn << PAGE_SHIFT);
 
 	ret = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
-			pfn_to_pfn_t(pfn, PFN_DEV));
+			__pfn_to_pfn_t(pfn, PFN_DEV));
 
 out_unlock:
 	mutex_unlock(&dev->struct_mutex);
diff --git a/drivers/gpu/drm/omapdrm/omap_gem.c b/drivers/gpu/drm/omapdrm/omap_gem.c
index 910cb276a7ea..94b6d23ec202 100644
--- a/drivers/gpu/drm/omapdrm/omap_gem.c
+++ b/drivers/gpu/drm/omapdrm/omap_gem.c
@@ -386,7 +386,7 @@ static int fault_1d(struct drm_gem_object *obj,
 			pfn, pfn << PAGE_SHIFT);
 
 	return vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
-			pfn_to_pfn_t(pfn, PFN_DEV));
+			__pfn_to_pfn_t(pfn, PFN_DEV));
 }
 
 /* Special handling for the case of faulting in 2d tiled buffers */
@@ -480,7 +480,7 @@ static int fault_2d(struct drm_gem_object *obj,
 
 	for (i = n; i > 0; i--) {
 		vm_insert_mixed(vma, (unsigned long)vaddr,
-				pfn_to_pfn_t(pfn, PFN_DEV));
+				__pfn_to_pfn_t(pfn, PFN_DEV));
 		pfn += usergart[fmt].stride_pfn;
 		vaddr += PAGE_SIZE * m;
 	}
diff --git a/fs/dax.c b/fs/dax.c
index 1588edf297a2..87a070d6e6dc 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -685,7 +685,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		dax_unmap_atomic(bdev, kaddr);
 
 		result |= vmf_insert_pfn_pmd(vma, address, pmd,
-				pfn_t_to_pfn(pfn), write);
+				pfn, write);
 	}
 
  out:
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 29c57b2cb344..16d2244c686f 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -1,6 +1,8 @@
 #ifndef _ASM_GENERIC_PGTABLE_H
 #define _ASM_GENERIC_PGTABLE_H
 
+#include <linux/pfn.h>
+
 #ifndef __ASSEMBLY__
 #ifdef CONFIG_MMU
 
@@ -521,7 +523,7 @@ static inline int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
  * by vm_insert_pfn().
  */
 static inline int track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
-				   unsigned long pfn)
+				   pfn_t pfn)
 {
 	return 0;
 }
@@ -549,7 +551,7 @@ extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
 			   unsigned long pfn, unsigned long addr,
 			   unsigned long size);
 extern int track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
-			    unsigned long pfn);
+			    pfn_t pfn);
 extern int track_pfn_copy(struct vm_area_struct *vma);
 extern void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
 			unsigned long size);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ecb080d6ff42..d218abedfeb9 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -34,7 +34,7 @@ extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			unsigned long addr, pgprot_t newprot,
 			int prot_numa);
 int vmf_insert_pfn_pmd(struct vm_area_struct *, unsigned long addr, pmd_t *,
-			unsigned long pfn, bool write);
+			pfn_t pfn, bool write);
 
 enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_FLAG,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1d405ca21c9f..ce173327215d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1109,7 +1109,14 @@ static inline pte_t pfn_t_pte(pfn_t pfn, pgprot_t pgprot)
 }
 #endif
 
-#ifdef __HAVE_ARCH_PTE_DEVICE
+#ifdef pfn_pmd
+static inline pmd_t pfn_t_pmd(pfn_t pfn, pgprot_t pgprot)
+{
+	return pfn_pmd(pfn_t_to_pfn(pfn), pgprot);
+}
+#endif
+
+#ifdef __HAVE_ARCH_PTE_DEVMAP
 static inline bool pfn_t_has_dev_pagemap(pfn_t pfn)
 {
 	const unsigned long flags = PFN_DEV|PFN_MAP;
@@ -1122,6 +1129,7 @@ static inline bool pfn_t_has_dev_pagemap(pfn_t pfn)
 	return false;
 }
 pte_t pte_mkdevmap(pte_t pte);
+pmd_t pmd_mkdevmap(pmd_t pmd);
 #endif
 
 /*
@@ -1887,6 +1895,14 @@ static inline void pgtable_pmd_page_dtor(struct page *page) {}
 
 #endif
 
+#ifndef pmd_devmap
+#define pmd_devmap(x) (0)
+#endif
+
+#ifndef pte_devmap
+#define pte_devmap(x) (0)
+#endif
+
 static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
 {
 	spinlock_t *ptl = pmd_lockptr(mm, pmd);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4b06b8db9df2..952b65a55bc9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -870,7 +870,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 }
 
 static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
-		pmd_t *pmd, unsigned long pfn, pgprot_t prot, bool write)
+		pmd_t *pmd, pfn_t pfn, pgprot_t prot, bool write)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pmd_t entry;
@@ -878,7 +878,9 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 
 	ptl = pmd_lock(mm, pmd);
 	if (pmd_none(*pmd)) {
-		entry = pmd_mkhuge(pfn_pmd(pfn, prot));
+		entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
+		if (pfn_t_has_dev_pagemap(pfn))
+			entry = pmd_mkdevmap(entry);
 		if (write) {
 			entry = pmd_mkyoung(pmd_mkdirty(entry));
 			entry = maybe_pmd_mkwrite(entry, vma);
@@ -890,7 +892,7 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 }
 
 int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
-			pmd_t *pmd, unsigned long pfn, bool write)
+			pmd_t *pmd, pfn_t pfn, bool write)
 {
 	pgprot_t pgprot = vma->vm_page_prot;
 	/*
@@ -902,7 +904,7 @@ int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 	BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
 						(VM_PFNMAP|VM_MIXEDMAP));
 	BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
-	BUG_ON((vma->vm_flags & VM_MIXEDMAP) && pfn_valid(pfn));
+	BUG_ON((vma->vm_flags & VM_MIXEDMAP) && pfn_t_valid(pfn));
 
 	if (addr < vma->vm_start || addr >= vma->vm_end)
 		return VM_FAULT_SIGBUS;
diff --git a/mm/memory.c b/mm/memory.c
index 5fc858570f58..06d78ac37343 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1583,7 +1583,7 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 
 	if (addr < vma->vm_start || addr >= vma->vm_end)
 		return -EFAULT;
-	if (track_pfn_insert(vma, &pgprot, pfn))
+	if (track_pfn_insert(vma, &pgprot, __pfn_to_pfn_t(pfn, PFN_DEV)))
 		return -EINVAL;
 
 	ret = insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot);


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 16/20] list: introduce list_poison() and LIST_POISON3
  2015-10-10  0:55 [PATCH v2 00/20] get_user_pages() for dax mappings Dan Williams
                   ` (14 preceding siblings ...)
  2015-10-10  0:56 ` [PATCH v2 15/20] mm, dax: convert vmf_insert_pfn_pmd() to pfn_t Dan Williams
@ 2015-10-10  0:56 ` Dan Williams
  2015-10-10  0:56 ` [PATCH v2 17/20] mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup Dan Williams
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 37+ messages in thread
From: Dan Williams @ 2015-10-10  0:56 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: linux-mm, ross.zwisler, linux-kernel, hch

ZONE_DEVICE pages always have an elevated count and will never be on an
lru reclaim list.  That space in 'struct page' can be redirected for
other uses, but for safety introduce a poison value that will reliably
trip an assertion in __list_add().  This allows half of the struct
list_head storage to be reclaimed, with some assurance backing the
assumption that the page count never goes to zero and a list_add() is
never attempted.
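
A stand-alone sketch of the mechanism (user-space C; the
POISON_POINTER_DELTA offset is simplified away and WARN() is replaced by
fprintf, otherwise this mirrors the hunks below):

#include <stdio.h>

#define LIST_POISON3 ((void *) 0x300)

struct list_head {
	struct list_head *next, *prev;
};

static void __list_add(struct list_head *new, struct list_head *prev,
		       struct list_head *next)
{
	if (new->next == LIST_POISON3 || new->prev == LIST_POISON3) {
		fprintf(stderr, "list_add attempted on poisoned entry\n");
		return;
	}
	next->prev = new;
	new->next = next;
	new->prev = prev;
	prev->next = new;
}

static void list_del_poison(struct list_head *entry)
{
	entry->prev->next = entry->next;	/* open-coded __list_del() */
	entry->next->prev = entry->prev;
	entry->next = LIST_POISON3;
	entry->prev = LIST_POISON3;
}

int main(void)
{
	struct list_head head = { &head, &head };
	struct list_head entry = { NULL, NULL };

	__list_add(&entry, &head, head.next);	/* ok */
	list_del_poison(&entry);
	__list_add(&entry, &head, head.next);	/* trips the poison check */
	return 0;
}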

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/list.h   |   14 ++++++++++++++
 include/linux/poison.h |    1 +
 lib/list_debug.c       |    2 ++
 3 files changed, 17 insertions(+)

diff --git a/include/linux/list.h b/include/linux/list.h
index 3e3e64a61002..af38cc80ae4c 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -114,6 +114,20 @@ extern void list_del(struct list_head *entry);
 #endif
 
 /**
+ * list_del_poison - poison an entry to always assert on list_add
+ * @entry: the element to delete and poison
+ *
+ * Note: the assertion on list_add() only occurs when CONFIG_DEBUG_LIST=y,
+ * otherwise this is identical to list_del()
+ */
+static inline void list_del_poison(struct list_head *entry)
+{
+	__list_del(entry->prev, entry->next);
+	entry->next = LIST_POISON3;
+	entry->prev = LIST_POISON3;
+}
+
+/**
  * list_replace - replace old entry by new one
  * @old : the element to be replaced
  * @new : the new element to insert
diff --git a/include/linux/poison.h b/include/linux/poison.h
index 317e16de09e5..31d048b3ba06 100644
--- a/include/linux/poison.h
+++ b/include/linux/poison.h
@@ -21,6 +21,7 @@
  */
 #define LIST_POISON1  ((void *) 0x100 + POISON_POINTER_DELTA)
 #define LIST_POISON2  ((void *) 0x200 + POISON_POINTER_DELTA)
+#define LIST_POISON3  ((void *) 0x300 + POISON_POINTER_DELTA)
 
 /********** include/linux/timer.h **********/
 /*
diff --git a/lib/list_debug.c b/lib/list_debug.c
index c24c2f7e296f..ec69e2b8e0fc 100644
--- a/lib/list_debug.c
+++ b/lib/list_debug.c
@@ -23,6 +23,8 @@ void __list_add(struct list_head *new,
 			      struct list_head *prev,
 			      struct list_head *next)
 {
+	WARN(new->next == LIST_POISON3 || new->prev == LIST_POISON3,
+		"list_add attempted on poisoned entry\n");
 	WARN(next->prev != prev,
 		"list_add corruption. next->prev should be "
 		"prev (%p), but was %p. (next=%p).\n",


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 17/20] mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup
  2015-10-10  0:55 [PATCH v2 00/20] get_user_pages() for dax mappings Dan Williams
                   ` (15 preceding siblings ...)
  2015-10-10  0:56 ` [PATCH v2 16/20] list: introduce list_poison() and LIST_POISON3 Dan Williams
@ 2015-10-10  0:56 ` Dan Williams
  2015-10-10  0:57 ` [PATCH v2 18/20] block: notify queue death confirmation Dan Williams
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 37+ messages in thread
From: Dan Williams @ 2015-10-10  0:56 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dave Hansen, linux-kernel, hch, linux-mm, Alexander Viro,
	Matthew Wilcox, Andrew Morton, ross.zwisler

get_dev_pagemap() enables paths like get_user_pages() to pin a
dynamically mapped pfn-range (devm_memremap_pages()) while the resulting
struct page objects are in use.  Unlike get_page() it may fail if the
device is, or is in the process of being, disabled.  While the initial
lookup of the range may be an expensive list walk, the result is cached
to speed up subsequent lookups, which are likely to be in the same
mapped range.

devm_memremap_pages() now requires a reference counter to be specified
at init time.  For pmem this means moving request_queue allocation into
pmem_alloc() so the existing queue usage counter can track "device
pages".

Cc: Dave Hansen <dave@sr71.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/nvdimm/pmem.c    |   42 +++++++++++++++++++++++++-----------------
 include/linux/mm.h       |   40 ++++++++++++++++++++++++++++++++++++++--
 include/linux/mm_types.h |    5 +++++
 kernel/memremap.c        |   46 ++++++++++++++++++++++++++++++++++++++++++----
 4 files changed, 110 insertions(+), 23 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index c950602bbf0b..f7acce594fa0 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -123,6 +123,7 @@ static struct pmem_device *pmem_alloc(struct device *dev,
 		struct resource *res, int id)
 {
 	struct pmem_device *pmem;
+	struct request_queue *q;
 
 	pmem = devm_kzalloc(dev, sizeof(*pmem), GFP_KERNEL);
 	if (!pmem)
@@ -140,19 +141,26 @@ static struct pmem_device *pmem_alloc(struct device *dev,
 		return ERR_PTR(-EBUSY);
 	}
 
+	q = blk_alloc_queue_node(GFP_KERNEL, dev_to_node(dev));
+	if (!q)
+		return ERR_PTR(-ENOMEM);
+
 	pmem->pfn_flags = PFN_DEV;
 	if (pmem_should_map_pages(dev)) {
 		pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, res,
-				NULL);
+				&q->q_usage_counter, NULL);
 		pmem->pfn_flags |= PFN_MAP;
 	} else
 		pmem->virt_addr = (void __pmem *) devm_memremap(dev,
 				pmem->phys_addr, pmem->size,
 				ARCH_MEMREMAP_PMEM);
 
-	if (IS_ERR(pmem->virt_addr))
+	if (IS_ERR(pmem->virt_addr)) {
+		blk_cleanup_queue(q);
 		return (void __force *) pmem->virt_addr;
+	}
 
+	pmem->pmem_queue = q;
 	return pmem;
 }
 
@@ -169,20 +177,15 @@ static void pmem_detach_disk(struct pmem_device *pmem)
 static int pmem_attach_disk(struct device *dev,
 		struct nd_namespace_common *ndns, struct pmem_device *pmem)
 {
-	int nid = dev_to_node(dev);
 	struct gendisk *disk;
 
-	pmem->pmem_queue = blk_alloc_queue_node(GFP_KERNEL, nid);
-	if (!pmem->pmem_queue)
-		return -ENOMEM;
-
 	blk_queue_make_request(pmem->pmem_queue, pmem_make_request);
 	blk_queue_physical_block_size(pmem->pmem_queue, PAGE_SIZE);
 	blk_queue_max_hw_sectors(pmem->pmem_queue, UINT_MAX);
 	blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY);
 	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, pmem->pmem_queue);
 
-	disk = alloc_disk_node(0, nid);
+	disk = alloc_disk_node(0, dev_to_node(dev));
 	if (!disk) {
 		blk_cleanup_queue(pmem->pmem_queue);
 		return -ENOMEM;
@@ -318,6 +321,7 @@ static int nvdimm_namespace_attach_pfn(struct nd_namespace_common *ndns)
 	struct vmem_altmap *altmap;
 	struct nd_pfn_sb *pfn_sb;
 	struct pmem_device *pmem;
+	struct request_queue *q;
 	phys_addr_t offset;
 	int rc;
 	struct vmem_altmap __altmap = {
@@ -369,9 +373,10 @@ static int nvdimm_namespace_attach_pfn(struct nd_namespace_common *ndns)
 
 	/* establish pfn range for lookup, and switch to direct map */
 	pmem = dev_get_drvdata(dev);
+	q = pmem->pmem_queue;
 	devm_memunmap(dev, (void __force *) pmem->virt_addr);
 	pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, &nsio->res,
-			altmap);
+			&q->q_usage_counter, altmap);
 	pmem->pfn_flags |= PFN_MAP;
 	if (IS_ERR(pmem->virt_addr)) {
 		rc = PTR_ERR(pmem->virt_addr);
@@ -410,19 +415,22 @@ static int nd_pmem_probe(struct device *dev)
 	dev_set_drvdata(dev, pmem);
 	ndns->rw_bytes = pmem_rw_bytes;
 
-	if (is_nd_btt(dev))
+	if (is_nd_btt(dev)) {
+		/* btt allocates its own request_queue */
+		blk_cleanup_queue(pmem->pmem_queue);
+		pmem->pmem_queue = NULL;
 		return nvdimm_namespace_attach_btt(ndns);
+	}
 
 	if (is_nd_pfn(dev))
 		return nvdimm_namespace_attach_pfn(ndns);
 
-	if (nd_btt_probe(ndns, pmem) == 0) {
-		/* we'll come back as btt-pmem */
-		return -ENXIO;
-	}
-
-	if (nd_pfn_probe(ndns, pmem) == 0) {
-		/* we'll come back as pfn-pmem */
+	if (nd_btt_probe(ndns, pmem) == 0 || nd_pfn_probe(ndns, pmem) == 0) {
+		/*
+		 * We'll come back as either btt-pmem, or pfn-pmem, so
+		 * drop the queue allocation for now.
+		 */
+		blk_cleanup_queue(pmem->pmem_queue);
 		return -ENXIO;
 	}
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ce173327215d..8a84bfb6fa6a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -15,12 +15,14 @@
 #include <linux/debug_locks.h>
 #include <linux/mm_types.h>
 #include <linux/range.h>
+#include <linux/percpu-refcount.h>
 #include <linux/pfn.h>
 #include <linux/bit_spinlock.h>
 #include <linux/shrinker.h>
 #include <linux/resource.h>
 #include <linux/page_ext.h>
 #include <linux/err.h>
+#include <linux/ioport.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -786,18 +788,21 @@ static inline void vmem_altmap_free(struct vmem_altmap *altmap,
 /**
  * struct dev_pagemap - metadata for ZONE_DEVICE mappings
  * @altmap: pre-allocated/reserved memory for vmemmap allocations
+ * @res: physical address range covered by @ref
+ * @ref: reference count that pins the devm_memremap_pages() mapping
  * @dev: host device of the mapping for debug
  */
 struct dev_pagemap {
 	struct vmem_altmap *altmap;
 	const struct resource *res;
+	struct percpu_ref *ref;
 	struct device *dev;
 };
 
 #ifdef CONFIG_ZONE_DEVICE
 struct dev_pagemap *__get_dev_pagemap(resource_size_t phys);
 void *devm_memremap_pages(struct device *dev, struct resource *res,
-		struct vmem_altmap *altmap);
+		struct percpu_ref *ref, struct vmem_altmap *altmap);
 struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start);
 #else
 static inline struct dev_pagemap *__get_dev_pagemap(resource_size_t phys)
@@ -806,7 +811,7 @@ static inline struct dev_pagemap *__get_dev_pagemap(resource_size_t phys)
 }
 
 static inline void *devm_memremap_pages(struct device *dev, struct resource *res,
-		struct vmem_altmap *altmap)
+		struct percpu_ref *ref, struct vmem_altmap *altmap)
 {
 	/*
 	 * Fail attempts to call devm_memremap_pages() without
@@ -823,6 +828,37 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
 }
 #endif
 
+/* get a live reference on the dev_pagemap hosting the given pfn */
+static inline struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
+		struct dev_pagemap *pgmap)
+{
+	resource_size_t phys = __pfn_to_phys(pfn);
+
+	/*
+	 * In the cached case we're already holding a reference so we can
+	 * simply do a blind increment
+	 */
+	if (pgmap && phys >= pgmap->res->start && phys <= pgmap->res->end) {
+		percpu_ref_get(pgmap->ref);
+		return pgmap;
+	}
+
+	/* fall back to slow path lookup */
+	rcu_read_lock();
+	pgmap = __get_dev_pagemap(phys);
+	if (pgmap && !percpu_ref_tryget_live(pgmap->ref))
+		pgmap = NULL;
+	rcu_read_unlock();
+
+	return pgmap;
+}
+
+static inline void put_dev_pagemap(struct dev_pagemap *pgmap)
+{
+	if (pgmap)
+		percpu_ref_put(pgmap->ref);
+}
+
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define SECTION_IN_PAGE_FLAGS
 #endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3d6baa7d4534..457c3e8a8f4f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -120,6 +120,11 @@ struct page {
 					 * Can be used as a generic list
 					 * by the page owner.
 					 */
+		struct dev_pagemap *pgmap; /* ZONE_DEVICE pages are never on an
+					    * lru or handled by a slab
+					    * allocator, this points to the
+					    * hosting device page map.
+					    */
 		struct {		/* slub per cpu partial pages */
 			struct page *next;	/* Next partial slab */
 #ifdef CONFIG_64BIT
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 75161bb68af1..246446ba6e2f 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -163,6 +163,28 @@ static void del_page_map(struct page_map *page_map)
 	spin_unlock(&range_lock);
 }
 
+static unsigned long pfn_first(struct dev_pagemap *pgmap)
+{
+	struct vmem_altmap *altmap = pgmap->altmap;
+	const struct resource *res = pgmap->res;
+	unsigned long pfn;
+
+	pfn = res->start >> PAGE_SHIFT;
+	if (altmap)
+		pfn += vmem_altmap_offset(altmap);
+	return pfn;
+}
+
+static unsigned long pfn_end(struct dev_pagemap *pgmap)
+{
+	const struct resource *res = pgmap->res;
+
+	return (res->start + resource_size(res)) >> PAGE_SHIFT;
+}
+
+#define for_each_device_pfn(pfn, pgmap) \
+	for (pfn = pfn_first(pgmap); pfn < pfn_end(pgmap); pfn++)
+
 static void devm_memremap_pages_release(struct device *dev, void *data)
 {
 	struct page_map *page_map = data;
@@ -193,19 +215,25 @@ struct dev_pagemap *__get_dev_pagemap(resource_size_t phys)
  * devm_memremap_pages - remap and provide memmap backing for the given resource
  * @dev: hosting device for @res
  * @res: "host memory" address range
+ * @ref: a live per-cpu reference count
  * @altmap: optional descriptor for allocating the memmap from @res
  *
- * Note, the expectation is that @res is a host memory range that could
- * feasibly be treated as a "System RAM" range, i.e. not a device mmio
- * range, but this is not enforced.
+ * Notes:
+ * 1/ @ref must be 'live' on entry and 'dead' before devm_memunmap_pages() time
+ *    (or devm release event).
+ *
+ * 2/ @res is expected to be a host memory range that could feasibly be
+ *    treated as a "System RAM" range, i.e. not a device mmio range, but
+ *    this is not enforced.
  */
 void *devm_memremap_pages(struct device *dev, struct resource *res,
-		struct vmem_altmap *altmap)
+		struct percpu_ref *ref, struct vmem_altmap *altmap)
 {
 	int is_ram = region_intersects(res->start, resource_size(res),
 			"System RAM");
 	struct dev_pagemap *pgmap;
 	struct page_map *page_map;
+	unsigned long pfn;
 	int error, nid;
 
 	if (is_ram == REGION_MIXED) {
@@ -217,6 +245,9 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 	if (is_ram == REGION_INTERSECTS)
 		return __va(res->start);
 
+	if (!ref)
+		return ERR_PTR(-EINVAL);
+
 	page_map = devres_alloc_node(devm_memremap_pages_release,
 			sizeof(*page_map), GFP_KERNEL, dev_to_node(dev));
 	if (!page_map)
@@ -229,6 +260,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 		pgmap->altmap = &page_map->altmap;
 	}
 	pgmap->dev = dev;
+	pgmap->ref = ref;
 	pgmap->res = &page_map->res;
 	INIT_LIST_HEAD(&page_map->list);
 	add_page_map(page_map);
@@ -244,6 +276,12 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 		return ERR_PTR(error);
 	}
 
+	for_each_device_pfn(pfn, pgmap) {
+		struct page *page = pfn_to_page(pfn);
+
+		list_del_poison(&page->lru);
+		page->pgmap = pgmap;
+	}
 	devres_add(dev, page_map);
 	return __va(res->start);
 }
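
For reference, a condensed caller-side sketch of the new api (illustrative
only; error handling is trimmed, and 'my_ref' stands in for whatever
percpu_ref a driver ties page lifetime to):

    /* driver init: @ref must already be live */
    void *addr = devm_memremap_pages(dev, res, &drv->my_ref, NULL);

    if (IS_ERR(addr))
        return PTR_ERR(addr);

    /* consumer: pin the hosting pgmap before trusting pfn_to_page() */
    struct dev_pagemap *pgmap = get_dev_pagemap(pfn, NULL);

    if (pgmap) {
        struct page *page = pfn_to_page(pfn);

        /* ... @page is safe to use while the reference is held ... */
        put_dev_pagemap(pgmap);
    }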


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 18/20] block: notify queue death confirmation
  2015-10-10  0:55 [PATCH v2 00/20] get_user_pages() for dax mappings Dan Williams
                   ` (16 preceding siblings ...)
  2015-10-10  0:56 ` [PATCH v2 17/20] mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup Dan Williams
@ 2015-10-10  0:57 ` Dan Williams
  2015-10-10  0:57 ` [PATCH v2 19/20] mm, pmem: devm_memunmap_pages(), truncate and unmap ZONE_DEVICE pages Dan Williams
                   ` (2 subsequent siblings)
  20 siblings, 0 replies; 37+ messages in thread
From: Dan Williams @ 2015-10-10  0:57 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: Jens Axboe, linux-mm, ross.zwisler, hch, linux-kernel

The pmem driver arranges for references to be taken against the queue
while pages it allocated via devm_memremap_pages() are in use.  At
shutdown time, before those pages can be deallocated, they need to be
truncated, unmapped, and guaranteed to be idle.  Scanning the pages to
initiate truncation can only be done once we are certain no new page
references will be taken.  Once the blk queue percpu_ref is confirmed
dead, __get_dev_pagemap() will cease allowing new references and we can
reclaim these "device" pages.
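
Condensed from the diff below, the handshake is the standard percpu_ref
kill-and-confirm pattern:

    /* freeze side: kill the ref, get a callback once it is truly dead */
    percpu_ref_kill_and_confirm(&q->q_usage_counter,
            blk_confirm_queue_death);

    /* confirm callback: latch death and wake blk_wait_queue_dead() */
    q->q_usage_dead = 1;
    wake_up_all(&q->q_freeze_wq);

    /* pmem shutdown: no new q_usage_counter references past this point */
    blk_wait_queue_dead(q);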

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 block/blk-core.c       |   12 +++++++++---
 block/blk-mq.c         |   19 +++++++++++++++----
 include/linux/blkdev.h |    4 +++-
 3 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 9b4d735cb5b8..74aaa208a8e9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -516,6 +516,12 @@ void blk_set_queue_dying(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_set_queue_dying);
 
+void blk_wait_queue_dead(struct request_queue *q)
+{
+	wait_event(q->q_freeze_wq, q->q_usage_dead);
+}
+EXPORT_SYMBOL(blk_wait_queue_dead);
+
 /**
  * blk_cleanup_queue - shutdown a request queue
  * @q: request queue to shutdown
@@ -638,7 +644,7 @@ int blk_queue_enter(struct request_queue *q, gfp_t gfp)
 		if (!(gfp & __GFP_WAIT))
 			return -EBUSY;
 
-		ret = wait_event_interruptible(q->mq_freeze_wq,
+		ret = wait_event_interruptible(q->q_freeze_wq,
 				!atomic_read(&q->mq_freeze_depth) ||
 				blk_queue_dying(q));
 		if (blk_queue_dying(q))
@@ -658,7 +664,7 @@ static void blk_queue_usage_counter_release(struct percpu_ref *ref)
 	struct request_queue *q =
 		container_of(ref, struct request_queue, q_usage_counter);
 
-	wake_up_all(&q->mq_freeze_wq);
+	wake_up_all(&q->q_freeze_wq);
 }
 
 struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
@@ -720,7 +726,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	q->bypass_depth = 1;
 	__set_bit(QUEUE_FLAG_BYPASS, &q->queue_flags);
 
-	init_waitqueue_head(&q->mq_freeze_wq);
+	init_waitqueue_head(&q->q_freeze_wq);
 
 	/*
 	 * Init percpu_ref in atomic mode so that it's faster to shutdown.
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c371aeda2986..d52f9d91f5c1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -77,13 +77,23 @@ static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
 	clear_bit(CTX_TO_BIT(hctx, ctx), &bm->word);
 }
 
+static void blk_confirm_queue_death(struct percpu_ref *ref)
+{
+	struct request_queue *q = container_of(ref, typeof(*q),
+			q_usage_counter);
+
+	q->q_usage_dead = 1;
+	wake_up_all(&q->q_freeze_wq);
+}
+
 void blk_mq_freeze_queue_start(struct request_queue *q)
 {
 	int freeze_depth;
 
 	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
 	if (freeze_depth == 1) {
-		percpu_ref_kill(&q->q_usage_counter);
+		percpu_ref_kill_and_confirm(&q->q_usage_counter,
+				blk_confirm_queue_death);
 		blk_mq_run_hw_queues(q, false);
 	}
 }
@@ -91,7 +101,7 @@ EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_start);
 
 static void blk_mq_freeze_queue_wait(struct request_queue *q)
 {
-	wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
+	wait_event(q->q_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
 }
 
 /*
@@ -129,7 +139,8 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
 	WARN_ON_ONCE(freeze_depth < 0);
 	if (!freeze_depth) {
 		percpu_ref_reinit(&q->q_usage_counter);
-		wake_up_all(&q->mq_freeze_wq);
+		q->q_usage_dead = 0;
+		wake_up_all(&q->q_freeze_wq);
 	}
 }
 EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
@@ -148,7 +159,7 @@ void blk_mq_wake_waiters(struct request_queue *q)
 	 * dying, we need to ensure that processes currently waiting on
 	 * the queue are notified as well.
 	 */
-	wake_up_all(&q->mq_freeze_wq);
+	wake_up_all(&q->q_freeze_wq);
 }
 
 bool blk_mq_can_queue(struct blk_mq_hw_ctx *hctx)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index fb3e6886c479..a1340654e360 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -427,6 +427,7 @@ struct request_queue {
 	 */
 	unsigned int		flush_flags;
 	unsigned int		flush_not_queueable:1;
+	unsigned int		q_usage_dead:1;
 	struct blk_flush_queue	*fq;
 
 	struct list_head	requeue_list;
@@ -449,7 +450,7 @@ struct request_queue {
 	struct throtl_data *td;
 #endif
 	struct rcu_head		rcu_head;
-	wait_queue_head_t	mq_freeze_wq;
+	wait_queue_head_t	q_freeze_wq;
 	struct percpu_ref	q_usage_counter;
 	struct list_head	all_q_node;
 
@@ -949,6 +950,7 @@ extern struct request_queue *blk_init_queue_node(request_fn_proc *rfn,
 extern struct request_queue *blk_init_queue(request_fn_proc *, spinlock_t *);
 extern struct request_queue *blk_init_allocated_queue(struct request_queue *,
 						      request_fn_proc *, spinlock_t *);
+extern void blk_wait_queue_dead(struct request_queue *q);
 extern void blk_cleanup_queue(struct request_queue *);
 extern void blk_queue_make_request(struct request_queue *, make_request_fn *);
 extern void blk_queue_bounce_limit(struct request_queue *, u64);


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 19/20] mm, pmem: devm_memunmap_pages(), truncate and unmap ZONE_DEVICE pages
  2015-10-10  0:55 [PATCH v2 00/20] get_user_pages() for dax mappings Dan Williams
                   ` (17 preceding siblings ...)
  2015-10-10  0:57 ` [PATCH v2 18/20] block: notify queue death confirmation Dan Williams
@ 2015-10-10  0:57 ` Dan Williams
  2015-10-10  0:57 ` [PATCH v2 20/20] mm, x86: get_user_pages() for dax mappings Dan Williams
  2015-10-23 21:06 ` [PATCH v2 00/20] " Logan Gunthorpe
  20 siblings, 0 replies; 37+ messages in thread
From: Dan Williams @ 2015-10-10  0:57 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dave Hansen, Dave Chinner, linux-kernel, hch, linux-mm,
	Alexander Viro, Matthew Wilcox, ross.zwisler, Andrew Morton

Before we allow ZONE_DEVICE pages to be put into active use outside of
the pmem driver, we need to arrange for them to be reclaimed when the
driver is shut down.  devm_memunmap_pages() must wait for all pages to
return to the idle reference count of 1.  If a given page is mapped by a
process, we truncate it out of its inode mapping and unmap it from the
process vma.

This truncation is done while the dev_pagemap reference count is "dead",
preventing new references from being taken while the truncate+unmap scan
is in progress.
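
The resulting teardown ordering in the pmem driver (condensed from the
diff below):

    del_gendisk(pmem->pmem_disk);
    put_disk(pmem->pmem_disk);

    /* queue draining waits on page references, so clean up async */
    async_schedule_domain(async_blk_cleanup_queue, pmem, &async_pmem);

    if (pmem->pfn_flags & PFN_MAP) {
        blk_wait_queue_dead(q); /* ref dead => no new page pins */
        devm_memunmap_pages(dev, (void __force *) pmem->virt_addr);
    }

    /* wait for blk_cleanup_queue() to finish */
    async_synchronize_full_domain(&async_pmem);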

Cc: Dave Hansen <dave@sr71.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/nvdimm/pmem.c |   42 ++++++++++++++++++++++++++++++++++++------
 fs/dax.c              |    2 ++
 include/linux/mm.h    |    5 +++++
 kernel/memremap.c     |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 91 insertions(+), 6 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index f7acce594fa0..2c9aebbc3fea 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -24,12 +24,15 @@
 #include <linux/memory_hotplug.h>
 #include <linux/moduleparam.h>
 #include <linux/vmalloc.h>
+#include <linux/async.h>
 #include <linux/slab.h>
 #include <linux/pmem.h>
 #include <linux/nd.h>
 #include "pfn.h"
 #include "nd.h"
 
+static ASYNC_DOMAIN_EXCLUSIVE(async_pmem);
+
 struct pmem_device {
 	struct request_queue	*pmem_queue;
 	struct gendisk		*pmem_disk;
@@ -164,14 +167,43 @@ static struct pmem_device *pmem_alloc(struct device *dev,
 	return pmem;
 }
 
-static void pmem_detach_disk(struct pmem_device *pmem)
+
+static void async_blk_cleanup_queue(void *data, async_cookie_t cookie)
+{
+	struct pmem_device *pmem = data;
+
+	blk_cleanup_queue(pmem->pmem_queue);
+}
+
+static void pmem_detach_disk(struct device *dev)
 {
+	struct pmem_device *pmem = dev_get_drvdata(dev);
+	struct request_queue *q = pmem->pmem_queue;
+
 	if (!pmem->pmem_disk)
 		return;
 
 	del_gendisk(pmem->pmem_disk);
 	put_disk(pmem->pmem_disk);
-	blk_cleanup_queue(pmem->pmem_queue);
+	async_schedule_domain(async_blk_cleanup_queue, pmem, &async_pmem);
+
+	if (pmem->pfn_flags & PFN_MAP) {
+		/*
+		 * Wait for queue to go dead so that we know no new
+		 * references will be taken against the pages allocated
+		 * by devm_memremap_pages().
+		 */
+		blk_wait_queue_dead(q);
+
+		/*
+		 * Manually release the page mapping so that
+		 * blk_cleanup_queue() can complete queue draining.
+		 */
+		devm_memunmap_pages(dev, (void __force *) pmem->virt_addr);
+	}
+
+	/* Wait for blk_cleanup_queue() to finish */
+	async_synchronize_full_domain(&async_pmem);
 }
 
 static int pmem_attach_disk(struct device *dev,
@@ -299,11 +331,9 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
 static int nvdimm_namespace_detach_pfn(struct nd_namespace_common *ndns)
 {
 	struct nd_pfn *nd_pfn = to_nd_pfn(ndns->claim);
-	struct pmem_device *pmem;
 
 	/* free pmem disk */
-	pmem = dev_get_drvdata(&nd_pfn->dev);
-	pmem_detach_disk(pmem);
+	pmem_detach_disk(&nd_pfn->dev);
 
 	/* release nd_pfn resources */
 	kfree(nd_pfn->pfn_sb);
@@ -446,7 +476,7 @@ static int nd_pmem_remove(struct device *dev)
 	else if (is_nd_pfn(dev))
 		nvdimm_namespace_detach_pfn(pmem->ndns);
 	else
-		pmem_detach_disk(pmem);
+		pmem_detach_disk(dev);
 
 	return 0;
 }
diff --git a/fs/dax.c b/fs/dax.c
index 87a070d6e6dc..208e064fafe5 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -46,6 +46,7 @@ static void __pmem *__dax_map_atomic(struct block_device *bdev, sector_t sector,
 		blk_queue_exit(q);
 		return (void __pmem *) ERR_PTR(rc);
 	}
+	rcu_read_lock();
 	return addr;
 }
 
@@ -62,6 +63,7 @@ static void dax_unmap_atomic(struct block_device *bdev, void __pmem *addr)
 	if (IS_ERR(addr))
 		return;
 	blk_queue_exit(bdev->bd_queue);
+	rcu_read_unlock();
 }
 
 /*
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8a84bfb6fa6a..af7597410cb9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -801,6 +801,7 @@ struct dev_pagemap {
 
 #ifdef CONFIG_ZONE_DEVICE
 struct dev_pagemap *__get_dev_pagemap(resource_size_t phys);
+void devm_memunmap_pages(struct device *dev, void *addr);
 void *devm_memremap_pages(struct device *dev, struct resource *res,
 		struct percpu_ref *ref, struct vmem_altmap *altmap);
 struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start);
@@ -810,6 +811,10 @@ static inline struct dev_pagemap *__get_dev_pagemap(resource_size_t phys)
 	return NULL;
 }
 
+static inline void devm_memunmap_pages(struct device *dev, void *addr)
+{
+}
+
 static inline void *devm_memremap_pages(struct device *dev, struct resource *res,
 		struct percpu_ref *ref, struct vmem_altmap *altmap)
 {
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 246446ba6e2f..fa0cf1be2992 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -13,6 +13,7 @@
 #include <linux/rculist.h>
 #include <linux/device.h>
 #include <linux/types.h>
+#include <linux/fs.h>
 #include <linux/io.h>
 #include <linux/mm.h>
 #include <linux/memory_hotplug.h>
@@ -187,10 +188,39 @@ static unsigned long pfn_end(struct dev_pagemap *pgmap)
 
 static void devm_memremap_pages_release(struct device *dev, void *data)
 {
+	unsigned long pfn;
 	struct page_map *page_map = data;
 	struct resource *res = &page_map->res;
+	struct address_space *mapping_prev = NULL;
 	struct dev_pagemap *pgmap = &page_map->pgmap;
 
+	if (percpu_ref_tryget_live(pgmap->ref)) {
+		dev_WARN(dev, "%s: page mapping is still live!\n", __func__);
+		percpu_ref_put(pgmap->ref);
+	}
+
+	/* flush in-flight dax_map_atomic() operations */
+	synchronize_rcu();
+
+	for_each_device_pfn(pfn, pgmap) {
+		struct page *page = pfn_to_page(pfn);
+		struct address_space *mapping = page->mapping;
+		struct inode *inode = mapping ? mapping->host : NULL;
+
+		dev_WARN_ONCE(dev, atomic_read(&page->_count) < 1,
+				"%s: ZONE_DEVICE page was freed!\n", __func__);
+
+		if (!mapping || !inode || mapping == mapping_prev) {
+			dev_WARN_ONCE(dev, atomic_read(&page->_count) > 1,
+					"%s: unexpected elevated page count pfn: %lx\n",
+					__func__, pfn);
+			continue;
+		}
+
+		truncate_pagecache(inode, 0);
+		mapping_prev = mapping;
+	}
+
 	/* pages are dead and unused, undo the arch mapping */
 	arch_remove_memory(res->start, resource_size(res));
 	dev_WARN_ONCE(dev, pgmap->altmap && pgmap->altmap->alloc,
@@ -287,6 +317,24 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 }
 EXPORT_SYMBOL(devm_memremap_pages);
 
+static int page_map_match(struct device *dev, void *res, void *match_data)
+{
+	struct page_map *page_map = res;
+	resource_size_t phys = *(resource_size_t *) match_data;
+
+	return page_map->res.start == phys;
+}
+
+void devm_memunmap_pages(struct device *dev, void *addr)
+{
+	resource_size_t start = __pa(addr);
+
+	if (devres_release(dev, devm_memremap_pages_release, page_map_match,
+				&start) != 0)
+		dev_WARN(dev, "failed to find page map to release\n");
+}
+EXPORT_SYMBOL(devm_memunmap_pages);
+
 /*
  * Unconditionally retrieve a dev_pagemap associated with the given physical
  * address; this is only for use in the arch_{add|remove}_memory() for setting


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 20/20] mm, x86: get_user_pages() for dax mappings
  2015-10-10  0:55 [PATCH v2 00/20] get_user_pages() for dax mappings Dan Williams
                   ` (18 preceding siblings ...)
  2015-10-10  0:57 ` [PATCH v2 19/20] mm, pmem: devm_memunmap_pages(), truncate and unmap ZONE_DEVICE pages Dan Williams
@ 2015-10-10  0:57 ` Dan Williams
  2015-10-23 21:06 ` [PATCH v2 00/20] " Logan Gunthorpe
  20 siblings, 0 replies; 37+ messages in thread
From: Dan Williams @ 2015-10-10  0:57 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dave Hansen, Andrew Morton, Peter Zijlstra, Dave Chinner,
	linux-kernel, linux-mm, Jeff Moyer, Ingo Molnar, Thomas Gleixner,
	Alexander Viro, H. Peter Anvin, Matthew Wilcox, ross.zwisler,
	hch

A dax mapping establishes a pte with _PAGE_DEVMAP set when the driver
has established a devm_memremap_pages() mapping, i.e. when the pfn_t
returned from ->direct_access() has PFN_DEV and PFN_MAP set.  Later, when
encountering _PAGE_DEVMAP during a page table walk, we look up and pin a
struct dev_pagemap instance to keep the result of pfn_to_page() valid
until put_page().
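
Schematically, the fast gup walk becomes (condensed from the x86 diff
below; a cached @pgmap turns repeat lookups into blind increments):

    struct dev_pagemap *pgmap = NULL;

    /* for each devmap pte/pfn in the range */
    pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
    if (unlikely(!pgmap)) {
        undo_dev_pagemap(nr, nr_start, pages); /* teardown raced us */
        return 0;
    }
    get_page(page);         /* also takes a pgmap reference */
    put_dev_pagemap(pgmap); /* the page reference now pins the pgmap */
    pages[(*nr)++] = page;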

Cc: Dave Hansen <dave@sr71.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/ia64/include/asm/pgtable.h |    1 +
 arch/x86/include/asm/pgtable.h  |    2 +
 arch/x86/mm/gup.c               |   56 +++++++++++++++++++++++++++++++++++++--
 include/linux/mm.h              |   40 +++++++++++++++++++---------
 mm/gup.c                        |   11 +++++++-
 mm/hugetlb.c                    |   18 ++++++++++++-
 mm/swap.c                       |   15 ++++++++++
 7 files changed, 124 insertions(+), 19 deletions(-)

diff --git a/arch/ia64/include/asm/pgtable.h b/arch/ia64/include/asm/pgtable.h
index 9f3ed9ee8f13..81d2af23958f 100644
--- a/arch/ia64/include/asm/pgtable.h
+++ b/arch/ia64/include/asm/pgtable.h
@@ -273,6 +273,7 @@ extern unsigned long VMALLOC_END;
 #define pmd_clear(pmdp)			(pmd_val(*(pmdp)) = 0UL)
 #define pmd_page_vaddr(pmd)		((unsigned long) __va(pmd_val(pmd) & _PFN_MASK))
 #define pmd_page(pmd)			virt_to_page((pmd_val(pmd) + PAGE_OFFSET))
+#define pmd_pfn(pmd)			(pmd_val(pmd) >> PAGE_SHIFT)
 
 #define pud_none(pud)			(!pud_val(pud))
 #define pud_bad(pud)			(!ia64_phys_addr_valid(pud_val(pud)))
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 84d1346e1cda..d29dc7b4924b 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -461,7 +461,7 @@ static inline int pte_present(pte_t a)
 #define pte_devmap pte_devmap
 static inline int pte_devmap(pte_t a)
 {
-	return pte_flags(a) & _PAGE_DEVMAP;
+	return (pte_flags(a) & _PAGE_DEVMAP) == _PAGE_DEVMAP;
 }
 
 #define pte_accessible pte_accessible
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 81bf3d2af3eb..7254ba4f791d 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -63,6 +63,16 @@ retry:
 #endif
 }
 
+static void undo_dev_pagemap(int *nr, int nr_start, struct page **pages)
+{
+	while ((*nr) - nr_start) {
+		struct page *page = pages[--(*nr)];
+
+		ClearPageReferenced(page);
+		put_page(page);
+	}
+}
+
 /*
  * The performance critical leaf functions are made noinline otherwise gcc
  * inlines everything into a single function which results in too much
@@ -71,7 +81,9 @@ retry:
 static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
+	struct dev_pagemap *pgmap = NULL;
 	unsigned long mask;
+	int nr_start = *nr;
 	pte_t *ptep;
 
 	mask = _PAGE_PRESENT|_PAGE_USER;
@@ -89,13 +101,21 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 			return 0;
 		}
 
-		if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask) {
+		page = pte_page(pte);
+		if (pte_devmap(pte)) {
+			pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
+			if (unlikely(!pgmap)) {
+				undo_dev_pagemap(nr, nr_start, pages);
+				pte_unmap(ptep);
+				return 0;
+			}
+		} else if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask) {
 			pte_unmap(ptep);
 			return 0;
 		}
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
-		page = pte_page(pte);
 		get_page(page);
+		put_dev_pagemap(pgmap);
 		SetPageReferenced(page);
 		pages[*nr] = page;
 		(*nr)++;
@@ -114,6 +134,32 @@ static inline void get_head_page_multiple(struct page *page, int nr)
 	SetPageReferenced(page);
 }
 
+static int __gup_device_huge_pmd(pmd_t pmd, unsigned long addr,
+		unsigned long end, struct page **pages, int *nr)
+{
+	int nr_start = *nr;
+	unsigned long pfn = pmd_pfn(pmd);
+	struct dev_pagemap *pgmap = NULL;
+
+	pfn += (addr & ~PMD_MASK) >> PAGE_SHIFT;
+	do {
+		struct page *page = pfn_to_page(pfn);
+
+		pgmap = get_dev_pagemap(pfn, pgmap);
+		if (unlikely(!pgmap)) {
+			undo_dev_pagemap(nr, nr_start, pages);
+			return 0;
+		}
+		SetPageReferenced(page);
+		pages[*nr] = page;
+		get_page(page);
+		put_dev_pagemap(pgmap);
+		(*nr)++;
+		pfn++;
+	} while (addr += PAGE_SIZE, addr != end);
+	return 1;
+}
+
 static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
@@ -127,9 +173,13 @@ static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
 		mask |= _PAGE_RW;
 	if ((pte_flags(pte) & mask) != mask)
 		return 0;
+
+	VM_BUG_ON(!pfn_valid(pmd_pfn(pmd)));
+	if (pmd_devmap(pmd))
+		return __gup_device_huge_pmd(pmd, addr, end, pages, nr);
+
 	/* hugepages are never "special" */
 	VM_BUG_ON(pte_flags(pte) & _PAGE_SPECIAL);
-	VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 
 	refs = 0;
 	head = pte_page(pte);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index af7597410cb9..a8400652ccbf 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -522,19 +522,6 @@ static inline void get_huge_page_tail(struct page *page)
 
 extern bool __get_page_tail(struct page *page);
 
-static inline void get_page(struct page *page)
-{
-	if (unlikely(PageTail(page)))
-		if (likely(__get_page_tail(page)))
-			return;
-	/*
-	 * Getting a normal page or the head of a compound page
-	 * requires to already have an elevated page->_count.
-	 */
-	VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
-	atomic_inc(&page->_count);
-}
-
 static inline struct page *virt_to_head_page(const void *x)
 {
 	struct page *page = virt_to_page(x);
@@ -800,12 +787,22 @@ struct dev_pagemap {
 };
 
 #ifdef CONFIG_ZONE_DEVICE
+static inline bool is_zone_device_page(const struct page *page)
+{
+	return page_zonenum(page) == ZONE_DEVICE;
+}
+
 struct dev_pagemap *__get_dev_pagemap(resource_size_t phys);
 void devm_memunmap_pages(struct device *dev, void *addr);
 void *devm_memremap_pages(struct device *dev, struct resource *res,
 		struct percpu_ref *ref, struct vmem_altmap *altmap);
 struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start);
 #else
+static inline bool is_zone_device_page(const struct page *page)
+{
+	return false;
+}
+
 static inline struct dev_pagemap *__get_dev_pagemap(resource_size_t phys)
 {
 	return NULL;
@@ -864,6 +861,23 @@ static inline void put_dev_pagemap(struct dev_pagemap *pgmap)
 		percpu_ref_put(pgmap->ref);
 }
 
+static inline void get_page(struct page *page)
+{
+	if (unlikely(PageTail(page)))
+		if (likely(__get_page_tail(page)))
+			return;
+
+	if (is_zone_device_page(page))
+		percpu_ref_get(page->pgmap->ref);
+
+	/*
+	 * Getting a normal page or the head of a compound page
+	 * requires to already have an elevated page->_count.
+	 */
+	VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
+	atomic_inc(&page->_count);
+}
+
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define SECTION_IN_PAGE_FLAGS
 #endif
diff --git a/mm/gup.c b/mm/gup.c
index a798293fc648..1064e9a489a4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -98,7 +98,16 @@ retry:
 	}
 
 	page = vm_normal_page(vma, address, pte);
-	if (unlikely(!page)) {
+	if (!page && pte_devmap(pte) && (flags & FOLL_GET)) {
+		/*
+		 * Only return device mapping pages in the FOLL_GET case since
+		 * they are only valid while holding the pgmap reference.
+		 */
+		if (get_dev_pagemap(pte_pfn(pte), NULL))
+			page = pte_page(pte);
+		else
+			goto no_page;
+	} else if (unlikely(!page)) {
 		if (flags & FOLL_DUMP) {
 			/* Avoid special (like zero) pages in core dumps */
 			page = ERR_PTR(-EFAULT);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9cc773483624..6bcc7cdee5a2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4229,7 +4229,23 @@ retry:
 	 */
 	if (!pmd_huge(*pmd))
 		goto out;
-	if (pmd_present(*pmd)) {
+	if (pmd_present(*pmd) && pmd_devmap(*pmd)) {
+		unsigned long pfn = pmd_pfn(*pmd);
+		struct dev_pagemap *pgmap;
+
+		/*
+		 * device mapped pages can only be returned if the
+		 * caller will manage the page reference count.
+		 */
+		if (!(flags & FOLL_GET))
+			goto out;
+		pgmap = get_dev_pagemap(pfn, NULL);
+		if (!pgmap)
+			goto out;
+		page = pfn_to_page(pfn);
+		get_page(page);
+		put_dev_pagemap(pgmap);
+	} else if (pmd_present(*pmd)) {
 		page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
 		if (flags & FOLL_GET)
 			get_page(page);
diff --git a/mm/swap.c b/mm/swap.c
index 983f692a47fd..05a8a51c648e 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -230,6 +230,19 @@ out_put_single:
 	}
 }
 
+static bool put_device_page(struct page *page)
+{
+	/*
+	 * ZONE_DEVICE pages are never "onlined" so their reference
+	 * counts never reach zero.  They are always owned by a device
+	 * driver, not the mm core.  I.e. the page is 'idle' when the
+	 * count is 1.
+	 */
+	VM_BUG_ON_PAGE(atomic_read(&page->_count) == 1, page);
+	put_dev_pagemap(page->pgmap);
+	return atomic_dec_return(&page->_count) == 1;
+}
+
 static void put_compound_page(struct page *page)
 {
 	struct page *page_head;
@@ -273,6 +286,8 @@ void put_page(struct page *page)
 {
 	if (unlikely(PageCompound(page)))
 		put_compound_page(page);
+	else if (is_zone_device_page(page))
+		put_device_page(page);
 	else if (put_page_testzero(page))
 		__put_single_page(page);
 }
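
Net effect on the reference lifecycle of a gup-pinned ZONE_DEVICE page (a
summary sketch, not code from the patch):

    struct page *page = pfn_to_page(pfn);

    get_page(page); /* page->_count 1 -> 2, pgmap->ref gains a count */
    /* ... DMA/RDMA against the page ... */
    put_page(page); /* pgmap->ref drops, page->_count 2 -> 1 (idle) */

    /*
     * Once every page is back to the idle count of 1 and pgmap->ref is
     * dead, devm_memunmap_pages() can reclaim the range.
     */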


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 11/20] kvm: rename pfn_t to kvm_pfn_t
  2015-10-10  0:56 ` [PATCH v2 11/20] kvm: rename pfn_t to kvm_pfn_t Dan Williams
@ 2015-10-10 15:35   ` Christoffer Dall
  2015-10-10 20:35   ` Paolo Bonzini
  1 sibling, 0 replies; 37+ messages in thread
From: Christoffer Dall @ 2015-10-10 15:35 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Dave Hansen, Russell King, linux-mm, Gleb Natapov,
	Catalin Marinas, Will Deacon, linux-kernel, Ralf Baechle,
	Marc Zyngier, Paul Mackerras, Benjamin Herrenschmidt,
	Paolo Bonzini, ross.zwisler, hch, Alexander Graf

On Fri, Oct 09, 2015 at 08:56:22PM -0400, Dan Williams wrote:
> The core has developed a need for a "pfn_t" type [1].  Move the existing
> pfn_t in KVM to kvm_pfn_t [2].
> 
> [1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
> [2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html
> 
> Cc: Dave Hansen <dave@sr71.net>
> Cc: Gleb Natapov <gleb@kernel.org>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Christoffer Dall <christoffer.dall@linaro.org>
> Cc: Marc Zyngier <marc.zyngier@arm.com>
> Cc: Russell King <linux@arm.linux.org.uk>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Ralf Baechle <ralf@linux-mips.org>
> Cc: Alexander Graf <agraf@suse.com>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul Mackerras <paulus@samba.org>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  arch/arm/include/asm/kvm_mmu.h        |    5 ++--
>  arch/arm/kvm/mmu.c                    |   10 ++++---
>  arch/arm64/include/asm/kvm_mmu.h      |    3 +-
>  arch/mips/include/asm/kvm_host.h      |    6 ++--
>  arch/mips/kvm/emulate.c               |    2 +
>  arch/mips/kvm/tlb.c                   |   14 +++++-----
>  arch/powerpc/include/asm/kvm_book3s.h |    4 +--
>  arch/powerpc/include/asm/kvm_ppc.h    |    2 +
>  arch/powerpc/kvm/book3s.c             |    6 ++--
>  arch/powerpc/kvm/book3s_32_mmu_host.c |    2 +
>  arch/powerpc/kvm/book3s_64_mmu_host.c |    2 +
>  arch/powerpc/kvm/e500.h               |    2 +
>  arch/powerpc/kvm/e500_mmu_host.c      |    8 +++---
>  arch/powerpc/kvm/trace_pr.h           |    2 +
>  arch/x86/kvm/iommu.c                  |   11 ++++----
>  arch/x86/kvm/mmu.c                    |   37 +++++++++++++-------------
>  arch/x86/kvm/mmu_audit.c              |    2 +
>  arch/x86/kvm/paging_tmpl.h            |    6 ++--
>  arch/x86/kvm/vmx.c                    |    2 +
>  arch/x86/kvm/x86.c                    |    2 +
>  include/linux/kvm_host.h              |   37 +++++++++++++-------------
>  include/linux/kvm_types.h             |    2 +
>  virt/kvm/kvm_main.c                   |   47 +++++++++++++++++----------------
>  23 files changed, 110 insertions(+), 104 deletions(-)
> 
> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> index 405aa1883307..8ebd282dfc2b 100644
> --- a/arch/arm/include/asm/kvm_mmu.h
> +++ b/arch/arm/include/asm/kvm_mmu.h
> @@ -182,7 +182,8 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu *vcpu)
>  	return (vcpu->arch.cp15[c1_SCTLR] & 0b101) == 0b101;
>  }
>  
> -static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
> +static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu,
> +					       kvm_pfn_t pfn,
>  					       unsigned long size,
>  					       bool ipa_uncached)
>  {
> @@ -246,7 +247,7 @@ static inline void __kvm_flush_dcache_pte(pte_t pte)
>  static inline void __kvm_flush_dcache_pmd(pmd_t pmd)
>  {
>  	unsigned long size = PMD_SIZE;
> -	pfn_t pfn = pmd_pfn(pmd);
> +	kvm_pfn_t pfn = pmd_pfn(pmd);
>  
>  	while (size) {
>  		void *va = kmap_atomic_pfn(pfn);
> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index 6984342da13d..e2dcbfdc4a8c 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -988,9 +988,9 @@ out:
>  	return ret;
>  }
>  
> -static bool transparent_hugepage_adjust(pfn_t *pfnp, phys_addr_t *ipap)
> +static bool transparent_hugepage_adjust(kvm_pfn_t *pfnp, phys_addr_t *ipap)
>  {
> -	pfn_t pfn = *pfnp;
> +	kvm_pfn_t pfn = *pfnp;
>  	gfn_t gfn = *ipap >> PAGE_SHIFT;
>  
>  	if (PageTransCompound(pfn_to_page(pfn))) {
> @@ -1202,7 +1202,7 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
>  	kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
>  }
>  
> -static void coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
> +static void coherent_cache_guest_page(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
>  				      unsigned long size, bool uncached)
>  {
>  	__coherent_cache_guest_page(vcpu, pfn, size, uncached);
> @@ -1219,7 +1219,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	struct kvm *kvm = vcpu->kvm;
>  	struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
>  	struct vm_area_struct *vma;
> -	pfn_t pfn;
> +	kvm_pfn_t pfn;
>  	pgprot_t mem_type = PAGE_S2;
>  	bool fault_ipa_uncached;
>  	bool logging_active = memslot_is_logging(memslot);
> @@ -1347,7 +1347,7 @@ static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
>  {
>  	pmd_t *pmd;
>  	pte_t *pte;
> -	pfn_t pfn;
> +	kvm_pfn_t pfn;
>  	bool pfn_valid = false;
>  
>  	trace_kvm_access_fault(fault_ipa);
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index 61505676d085..385fc8cef82d 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -230,7 +230,8 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu *vcpu)
>  	return (vcpu_sys_reg(vcpu, SCTLR_EL1) & 0b101) == 0b101;
>  }
>  
> -static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
> +static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu,
> +					       kvm_pfn_t pfn,
>  					       unsigned long size,
>  					       bool ipa_uncached)
>  {
[...]

For the arm/arm64 part:
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 11/20] kvm: rename pfn_t to kvm_pfn_t
  2015-10-10  0:56 ` [PATCH v2 11/20] kvm: rename pfn_t to kvm_pfn_t Dan Williams
  2015-10-10 15:35   ` Christoffer Dall
@ 2015-10-10 20:35   ` Paolo Bonzini
  2015-10-10 20:57     ` Dan Williams
  1 sibling, 1 reply; 37+ messages in thread
From: Paolo Bonzini @ 2015-10-10 20:35 UTC (permalink / raw)
  To: Dan Williams, linux-nvdimm
  Cc: Dave Hansen, Russell King, linux-mm, Gleb Natapov,
	Catalin Marinas, Will Deacon, linux-kernel, Ralf Baechle,
	Marc Zyngier, Paul Mackerras, Christoffer Dall,
	Benjamin Herrenschmidt, ross.zwisler, hch, Alexander Graf

On 10/10/2015 02:56, Dan Williams wrote:
> The core has developed a need for a "pfn_t" type [1].  Move the existing
> pfn_t in KVM to kvm_pfn_t [2].
> 
> [1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
> [2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html

Can you please also change the other types in include/linux/kvm_types.h?

Thanks,

Paolo

> Cc: Dave Hansen <dave@sr71.net>
> Cc: Gleb Natapov <gleb@kernel.org>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Christoffer Dall <christoffer.dall@linaro.org>
> Cc: Marc Zyngier <marc.zyngier@arm.com>
> Cc: Russell King <linux@arm.linux.org.uk>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Ralf Baechle <ralf@linux-mips.org>
> Cc: Alexander Graf <agraf@suse.com>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul Mackerras <paulus@samba.org>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  arch/arm/include/asm/kvm_mmu.h        |    5 ++--
>  arch/arm/kvm/mmu.c                    |   10 ++++---
>  arch/arm64/include/asm/kvm_mmu.h      |    3 +-
>  arch/mips/include/asm/kvm_host.h      |    6 ++--
>  arch/mips/kvm/emulate.c               |    2 +
>  arch/mips/kvm/tlb.c                   |   14 +++++-----
>  arch/powerpc/include/asm/kvm_book3s.h |    4 +--
>  arch/powerpc/include/asm/kvm_ppc.h    |    2 +
>  arch/powerpc/kvm/book3s.c             |    6 ++--
>  arch/powerpc/kvm/book3s_32_mmu_host.c |    2 +
>  arch/powerpc/kvm/book3s_64_mmu_host.c |    2 +
>  arch/powerpc/kvm/e500.h               |    2 +
>  arch/powerpc/kvm/e500_mmu_host.c      |    8 +++---
>  arch/powerpc/kvm/trace_pr.h           |    2 +
>  arch/x86/kvm/iommu.c                  |   11 ++++----
>  arch/x86/kvm/mmu.c                    |   37 +++++++++++++-------------
>  arch/x86/kvm/mmu_audit.c              |    2 +
>  arch/x86/kvm/paging_tmpl.h            |    6 ++--
>  arch/x86/kvm/vmx.c                    |    2 +
>  arch/x86/kvm/x86.c                    |    2 +
>  include/linux/kvm_host.h              |   37 +++++++++++++-------------
>  include/linux/kvm_types.h             |    2 +
>  virt/kvm/kvm_main.c                   |   47 +++++++++++++++++----------------
>  23 files changed, 110 insertions(+), 104 deletions(-)
> 
> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> index 405aa1883307..8ebd282dfc2b 100644
> --- a/arch/arm/include/asm/kvm_mmu.h
> +++ b/arch/arm/include/asm/kvm_mmu.h
> @@ -182,7 +182,8 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu *vcpu)
>  	return (vcpu->arch.cp15[c1_SCTLR] & 0b101) == 0b101;
>  }
>  
> -static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
> +static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu,
> +					       kvm_pfn_t pfn,
>  					       unsigned long size,
>  					       bool ipa_uncached)
>  {
> @@ -246,7 +247,7 @@ static inline void __kvm_flush_dcache_pte(pte_t pte)
>  static inline void __kvm_flush_dcache_pmd(pmd_t pmd)
>  {
>  	unsigned long size = PMD_SIZE;
> -	pfn_t pfn = pmd_pfn(pmd);
> +	kvm_pfn_t pfn = pmd_pfn(pmd);
>  
>  	while (size) {
>  		void *va = kmap_atomic_pfn(pfn);
> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index 6984342da13d..e2dcbfdc4a8c 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -988,9 +988,9 @@ out:
>  	return ret;
>  }
>  
> -static bool transparent_hugepage_adjust(pfn_t *pfnp, phys_addr_t *ipap)
> +static bool transparent_hugepage_adjust(kvm_pfn_t *pfnp, phys_addr_t *ipap)
>  {
> -	pfn_t pfn = *pfnp;
> +	kvm_pfn_t pfn = *pfnp;
>  	gfn_t gfn = *ipap >> PAGE_SHIFT;
>  
>  	if (PageTransCompound(pfn_to_page(pfn))) {
> @@ -1202,7 +1202,7 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
>  	kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
>  }
>  
> -static void coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
> +static void coherent_cache_guest_page(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
>  				      unsigned long size, bool uncached)
>  {
>  	__coherent_cache_guest_page(vcpu, pfn, size, uncached);
> @@ -1219,7 +1219,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	struct kvm *kvm = vcpu->kvm;
>  	struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
>  	struct vm_area_struct *vma;
> -	pfn_t pfn;
> +	kvm_pfn_t pfn;
>  	pgprot_t mem_type = PAGE_S2;
>  	bool fault_ipa_uncached;
>  	bool logging_active = memslot_is_logging(memslot);
> @@ -1347,7 +1347,7 @@ static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
>  {
>  	pmd_t *pmd;
>  	pte_t *pte;
> -	pfn_t pfn;
> +	kvm_pfn_t pfn;
>  	bool pfn_valid = false;
>  
>  	trace_kvm_access_fault(fault_ipa);
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index 61505676d085..385fc8cef82d 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -230,7 +230,8 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu *vcpu)
>  	return (vcpu_sys_reg(vcpu, SCTLR_EL1) & 0b101) == 0b101;
>  }
>  
> -static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
> +static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu,
> +					       kvm_pfn_t pfn,
>  					       unsigned long size,
>  					       bool ipa_uncached)
>  {
> diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
> index 5a1a882e0a75..9c67f05a0a1b 100644
> --- a/arch/mips/include/asm/kvm_host.h
> +++ b/arch/mips/include/asm/kvm_host.h
> @@ -101,9 +101,9 @@
>  #define CAUSEF_DC			(_ULCAST_(1) << 27)
>  
>  extern atomic_t kvm_mips_instance;
> -extern pfn_t(*kvm_mips_gfn_to_pfn) (struct kvm *kvm, gfn_t gfn);
> -extern void (*kvm_mips_release_pfn_clean) (pfn_t pfn);
> -extern bool(*kvm_mips_is_error_pfn) (pfn_t pfn);
> +extern kvm_pfn_t (*kvm_mips_gfn_to_pfn)(struct kvm *kvm, gfn_t gfn);
> +extern void (*kvm_mips_release_pfn_clean)(kvm_pfn_t pfn);
> +extern bool (*kvm_mips_is_error_pfn)(kvm_pfn_t pfn);
>  
>  struct kvm_vm_stat {
>  	u32 remote_tlb_flush;
> diff --git a/arch/mips/kvm/emulate.c b/arch/mips/kvm/emulate.c
> index d5fa3eaf39a1..476296cf37d3 100644
> --- a/arch/mips/kvm/emulate.c
> +++ b/arch/mips/kvm/emulate.c
> @@ -1525,7 +1525,7 @@ int kvm_mips_sync_icache(unsigned long va, struct kvm_vcpu *vcpu)
>  	struct kvm *kvm = vcpu->kvm;
>  	unsigned long pa;
>  	gfn_t gfn;
> -	pfn_t pfn;
> +	kvm_pfn_t pfn;
>  
>  	gfn = va >> PAGE_SHIFT;
>  
> diff --git a/arch/mips/kvm/tlb.c b/arch/mips/kvm/tlb.c
> index aed0ac2a4972..570479c03bdc 100644
> --- a/arch/mips/kvm/tlb.c
> +++ b/arch/mips/kvm/tlb.c
> @@ -38,13 +38,13 @@ atomic_t kvm_mips_instance;
>  EXPORT_SYMBOL(kvm_mips_instance);
>  
>  /* These function pointers are initialized once the KVM module is loaded */
> -pfn_t (*kvm_mips_gfn_to_pfn)(struct kvm *kvm, gfn_t gfn);
> +kvm_pfn_t (*kvm_mips_gfn_to_pfn)(struct kvm *kvm, gfn_t gfn);
>  EXPORT_SYMBOL(kvm_mips_gfn_to_pfn);
>  
> -void (*kvm_mips_release_pfn_clean)(pfn_t pfn);
> +void (*kvm_mips_release_pfn_clean)(kvm_pfn_t pfn);
>  EXPORT_SYMBOL(kvm_mips_release_pfn_clean);
>  
> -bool (*kvm_mips_is_error_pfn)(pfn_t pfn);
> +bool (*kvm_mips_is_error_pfn)(kvm_pfn_t pfn);
>  EXPORT_SYMBOL(kvm_mips_is_error_pfn);
>  
>  uint32_t kvm_mips_get_kernel_asid(struct kvm_vcpu *vcpu)
> @@ -144,7 +144,7 @@ EXPORT_SYMBOL(kvm_mips_dump_guest_tlbs);
>  static int kvm_mips_map_page(struct kvm *kvm, gfn_t gfn)
>  {
>  	int srcu_idx, err = 0;
> -	pfn_t pfn;
> +	kvm_pfn_t pfn;
>  
>  	if (kvm->arch.guest_pmap[gfn] != KVM_INVALID_PAGE)
>  		return 0;
> @@ -262,7 +262,7 @@ int kvm_mips_handle_kseg0_tlb_fault(unsigned long badvaddr,
>  				    struct kvm_vcpu *vcpu)
>  {
>  	gfn_t gfn;
> -	pfn_t pfn0, pfn1;
> +	kvm_pfn_t pfn0, pfn1;
>  	unsigned long vaddr = 0;
>  	unsigned long entryhi = 0, entrylo0 = 0, entrylo1 = 0;
>  	int even;
> @@ -313,7 +313,7 @@ EXPORT_SYMBOL(kvm_mips_handle_kseg0_tlb_fault);
>  int kvm_mips_handle_commpage_tlb_fault(unsigned long badvaddr,
>  	struct kvm_vcpu *vcpu)
>  {
> -	pfn_t pfn0, pfn1;
> +	kvm_pfn_t pfn0, pfn1;
>  	unsigned long flags, old_entryhi = 0, vaddr = 0;
>  	unsigned long entrylo0 = 0, entrylo1 = 0;
>  
> @@ -360,7 +360,7 @@ int kvm_mips_handle_mapped_seg_tlb_fault(struct kvm_vcpu *vcpu,
>  {
>  	unsigned long entryhi = 0, entrylo0 = 0, entrylo1 = 0;
>  	struct kvm *kvm = vcpu->kvm;
> -	pfn_t pfn0, pfn1;
> +	kvm_pfn_t pfn0, pfn1;
>  
>  	if ((tlb->tlb_hi & VPN2_MASK) == 0) {
>  		pfn0 = 0;
> diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
> index 9fac01cb89c1..8f39796c9da8 100644
> --- a/arch/powerpc/include/asm/kvm_book3s.h
> +++ b/arch/powerpc/include/asm/kvm_book3s.h
> @@ -154,8 +154,8 @@ extern void kvmppc_set_bat(struct kvm_vcpu *vcpu, struct kvmppc_bat *bat,
>  			   bool upper, u32 val);
>  extern void kvmppc_giveup_ext(struct kvm_vcpu *vcpu, ulong msr);
>  extern int kvmppc_emulate_paired_single(struct kvm_run *run, struct kvm_vcpu *vcpu);
> -extern pfn_t kvmppc_gpa_to_pfn(struct kvm_vcpu *vcpu, gpa_t gpa, bool writing,
> -			bool *writable);
> +extern kvm_pfn_t kvmppc_gpa_to_pfn(struct kvm_vcpu *vcpu, gpa_t gpa,
> +			bool writing, bool *writable);
>  extern void kvmppc_add_revmap_chain(struct kvm *kvm, struct revmap_entry *rev,
>  			unsigned long *rmap, long pte_index, int realmode);
>  extern void kvmppc_update_rmap_change(unsigned long *rmap, unsigned long psize);
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index c6ef05bd0765..2241d5357129 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -515,7 +515,7 @@ void kvmppc_claim_lpid(long lpid);
>  void kvmppc_free_lpid(long lpid);
>  void kvmppc_init_lpid(unsigned long nr_lpids);
>  
> -static inline void kvmppc_mmu_flush_icache(pfn_t pfn)
> +static inline void kvmppc_mmu_flush_icache(kvm_pfn_t pfn)
>  {
>  	struct page *page;
>  	/*
> diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
> index 099c79d8c160..638c6d9be9e0 100644
> --- a/arch/powerpc/kvm/book3s.c
> +++ b/arch/powerpc/kvm/book3s.c
> @@ -366,7 +366,7 @@ int kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu)
>  }
>  EXPORT_SYMBOL_GPL(kvmppc_core_prepare_to_enter);
>  
> -pfn_t kvmppc_gpa_to_pfn(struct kvm_vcpu *vcpu, gpa_t gpa, bool writing,
> +kvm_pfn_t kvmppc_gpa_to_pfn(struct kvm_vcpu *vcpu, gpa_t gpa, bool writing,
>  			bool *writable)
>  {
>  	ulong mp_pa = vcpu->arch.magic_page_pa & KVM_PAM;
> @@ -379,9 +379,9 @@ pfn_t kvmppc_gpa_to_pfn(struct kvm_vcpu *vcpu, gpa_t gpa, bool writing,
>  	gpa &= ~0xFFFULL;
>  	if (unlikely(mp_pa) && unlikely((gpa & KVM_PAM) == mp_pa)) {
>  		ulong shared_page = ((ulong)vcpu->arch.shared) & PAGE_MASK;
> -		pfn_t pfn;
> +		kvm_pfn_t pfn;
>  
> -		pfn = (pfn_t)virt_to_phys((void*)shared_page) >> PAGE_SHIFT;
> +		pfn = (kvm_pfn_t)virt_to_phys((void*)shared_page) >> PAGE_SHIFT;
>  		get_page(pfn_to_page(pfn));
>  		if (writable)
>  			*writable = true;
> diff --git a/arch/powerpc/kvm/book3s_32_mmu_host.c b/arch/powerpc/kvm/book3s_32_mmu_host.c
> index d5c9bfeb0c9c..55c4d51ea3e2 100644
> --- a/arch/powerpc/kvm/book3s_32_mmu_host.c
> +++ b/arch/powerpc/kvm/book3s_32_mmu_host.c
> @@ -142,7 +142,7 @@ extern char etext[];
>  int kvmppc_mmu_map_page(struct kvm_vcpu *vcpu, struct kvmppc_pte *orig_pte,
>  			bool iswrite)
>  {
> -	pfn_t hpaddr;
> +	kvm_pfn_t hpaddr;
>  	u64 vpn;
>  	u64 vsid;
>  	struct kvmppc_sid_map *map;
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_host.c b/arch/powerpc/kvm/book3s_64_mmu_host.c
> index 79ad35abd196..913cd2198fa6 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_host.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_host.c
> @@ -83,7 +83,7 @@ int kvmppc_mmu_map_page(struct kvm_vcpu *vcpu, struct kvmppc_pte *orig_pte,
>  			bool iswrite)
>  {
>  	unsigned long vpn;
> -	pfn_t hpaddr;
> +	kvm_pfn_t hpaddr;
>  	ulong hash, hpteg;
>  	u64 vsid;
>  	int ret;
> diff --git a/arch/powerpc/kvm/e500.h b/arch/powerpc/kvm/e500.h
> index 72920bed3ac6..94f04fcb373e 100644
> --- a/arch/powerpc/kvm/e500.h
> +++ b/arch/powerpc/kvm/e500.h
> @@ -41,7 +41,7 @@ enum vcpu_ftr {
>  #define E500_TLB_MAS2_ATTR	(0x7f)
>  
>  struct tlbe_ref {
> -	pfn_t pfn;		/* valid only for TLB0, except briefly */
> +	kvm_pfn_t pfn;		/* valid only for TLB0, except briefly */
>  	unsigned int flags;	/* E500_TLB_* */
>  };
>  
> diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
> index 4d33e199edcc..8a5bb6dfcc2d 100644
> --- a/arch/powerpc/kvm/e500_mmu_host.c
> +++ b/arch/powerpc/kvm/e500_mmu_host.c
> @@ -163,9 +163,9 @@ void kvmppc_map_magic(struct kvm_vcpu *vcpu)
>  	struct kvm_book3e_206_tlb_entry magic;
>  	ulong shared_page = ((ulong)vcpu->arch.shared) & PAGE_MASK;
>  	unsigned int stid;
> -	pfn_t pfn;
> +	kvm_pfn_t pfn;
>  
> -	pfn = (pfn_t)virt_to_phys((void *)shared_page) >> PAGE_SHIFT;
> +	pfn = (kvm_pfn_t)virt_to_phys((void *)shared_page) >> PAGE_SHIFT;
>  	get_page(pfn_to_page(pfn));
>  
>  	preempt_disable();
> @@ -246,7 +246,7 @@ static inline int tlbe_is_writable(struct kvm_book3e_206_tlb_entry *tlbe)
>  
>  static inline void kvmppc_e500_ref_setup(struct tlbe_ref *ref,
>  					 struct kvm_book3e_206_tlb_entry *gtlbe,
> -					 pfn_t pfn, unsigned int wimg)
> +					 kvm_pfn_t pfn, unsigned int wimg)
>  {
>  	ref->pfn = pfn;
>  	ref->flags = E500_TLB_VALID;
> @@ -309,7 +309,7 @@ static void kvmppc_e500_setup_stlbe(
>  	int tsize, struct tlbe_ref *ref, u64 gvaddr,
>  	struct kvm_book3e_206_tlb_entry *stlbe)
>  {
> -	pfn_t pfn = ref->pfn;
> +	kvm_pfn_t pfn = ref->pfn;
>  	u32 pr = vcpu->arch.shared->msr & MSR_PR;
>  
>  	BUG_ON(!(ref->flags & E500_TLB_VALID));
> diff --git a/arch/powerpc/kvm/trace_pr.h b/arch/powerpc/kvm/trace_pr.h
> index 810507cb688a..d44f324184fb 100644
> --- a/arch/powerpc/kvm/trace_pr.h
> +++ b/arch/powerpc/kvm/trace_pr.h
> @@ -30,7 +30,7 @@ TRACE_EVENT(kvm_book3s_reenter,
>  #ifdef CONFIG_PPC_BOOK3S_64
>  
>  TRACE_EVENT(kvm_book3s_64_mmu_map,
> -	TP_PROTO(int rflags, ulong hpteg, ulong va, pfn_t hpaddr,
> +	TP_PROTO(int rflags, ulong hpteg, ulong va, kvm_pfn_t hpaddr,
>  		 struct kvmppc_pte *orig_pte),
>  	TP_ARGS(rflags, hpteg, va, hpaddr, orig_pte),
>  
> diff --git a/arch/x86/kvm/iommu.c b/arch/x86/kvm/iommu.c
> index 5c520ebf6343..a22a488b4622 100644
> --- a/arch/x86/kvm/iommu.c
> +++ b/arch/x86/kvm/iommu.c
> @@ -43,11 +43,11 @@ static int kvm_iommu_unmap_memslots(struct kvm *kvm);
>  static void kvm_iommu_put_pages(struct kvm *kvm,
>  				gfn_t base_gfn, unsigned long npages);
>  
> -static pfn_t kvm_pin_pages(struct kvm_memory_slot *slot, gfn_t gfn,
> +static kvm_pfn_t kvm_pin_pages(struct kvm_memory_slot *slot, gfn_t gfn,
>  			   unsigned long npages)
>  {
>  	gfn_t end_gfn;
> -	pfn_t pfn;
> +	kvm_pfn_t pfn;
>  
>  	pfn     = gfn_to_pfn_memslot(slot, gfn);
>  	end_gfn = gfn + npages;
> @@ -62,7 +62,8 @@ static pfn_t kvm_pin_pages(struct kvm_memory_slot *slot, gfn_t gfn,
>  	return pfn;
>  }
>  
> -static void kvm_unpin_pages(struct kvm *kvm, pfn_t pfn, unsigned long npages)
> +static void kvm_unpin_pages(struct kvm *kvm, kvm_pfn_t pfn,
> +		unsigned long npages)
>  {
>  	unsigned long i;
>  
> @@ -73,7 +74,7 @@ static void kvm_unpin_pages(struct kvm *kvm, pfn_t pfn, unsigned long npages)
>  int kvm_iommu_map_pages(struct kvm *kvm, struct kvm_memory_slot *slot)
>  {
>  	gfn_t gfn, end_gfn;
> -	pfn_t pfn;
> +	kvm_pfn_t pfn;
>  	int r = 0;
>  	struct iommu_domain *domain = kvm->arch.iommu_domain;
>  	int flags;
> @@ -275,7 +276,7 @@ static void kvm_iommu_put_pages(struct kvm *kvm,
>  {
>  	struct iommu_domain *domain;
>  	gfn_t end_gfn, gfn;
> -	pfn_t pfn;
> +	kvm_pfn_t pfn;
>  	u64 phys;
>  
>  	domain  = kvm->arch.iommu_domain;
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index ff606f507913..6ab963ae0427 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -259,7 +259,7 @@ static unsigned get_mmio_spte_access(u64 spte)
>  }
>  
>  static bool set_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, gfn_t gfn,
> -			  pfn_t pfn, unsigned access)
> +			  kvm_pfn_t pfn, unsigned access)
>  {
>  	if (unlikely(is_noslot_pfn(pfn))) {
>  		mark_mmio_spte(vcpu, sptep, gfn, access);
> @@ -325,7 +325,7 @@ static int is_last_spte(u64 pte, int level)
>  	return 0;
>  }
>  
> -static pfn_t spte_to_pfn(u64 pte)
> +static kvm_pfn_t spte_to_pfn(u64 pte)
>  {
>  	return (pte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
>  }
> @@ -587,7 +587,7 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
>   */
>  static int mmu_spte_clear_track_bits(u64 *sptep)
>  {
> -	pfn_t pfn;
> +	kvm_pfn_t pfn;
>  	u64 old_spte = *sptep;
>  
>  	if (!spte_has_volatile_bits(old_spte))
> @@ -1369,7 +1369,7 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp,
>  	int need_flush = 0;
>  	u64 new_spte;
>  	pte_t *ptep = (pte_t *)data;
> -	pfn_t new_pfn;
> +	kvm_pfn_t new_pfn;
>  
>  	WARN_ON(pte_huge(*ptep));
>  	new_pfn = pte_pfn(*ptep);
> @@ -2456,7 +2456,7 @@ static int mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
>  	return 0;
>  }
>  
> -static bool kvm_is_mmio_pfn(pfn_t pfn)
> +static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
>  {
>  	if (pfn_valid(pfn))
>  		return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
> @@ -2466,7 +2466,7 @@ static bool kvm_is_mmio_pfn(pfn_t pfn)
>  
>  static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
>  		    unsigned pte_access, int level,
> -		    gfn_t gfn, pfn_t pfn, bool speculative,
> +		    gfn_t gfn, kvm_pfn_t pfn, bool speculative,
>  		    bool can_unsync, bool host_writable)
>  {
>  	u64 spte;
> @@ -2546,7 +2546,7 @@ done:
>  
>  static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
>  			 unsigned pte_access, int write_fault, int *emulate,
> -			 int level, gfn_t gfn, pfn_t pfn, bool speculative,
> +			 int level, gfn_t gfn, kvm_pfn_t pfn, bool speculative,
>  			 bool host_writable)
>  {
>  	int was_rmapped = 0;
> @@ -2606,7 +2606,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
>  	kvm_release_pfn_clean(pfn);
>  }
>  
> -static pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
> +static kvm_pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
>  				     bool no_dirty_log)
>  {
>  	struct kvm_memory_slot *slot;
> @@ -2689,7 +2689,7 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
>  }
>  
>  static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
> -			int map_writable, int level, gfn_t gfn, pfn_t pfn,
> +			int map_writable, int level, gfn_t gfn, kvm_pfn_t pfn,
>  			bool prefault)
>  {
>  	struct kvm_shadow_walk_iterator iterator;
> @@ -2739,7 +2739,7 @@ static void kvm_send_hwpoison_signal(unsigned long address, struct task_struct *
>  	send_sig_info(SIGBUS, &info, tsk);
>  }
>  
> -static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, pfn_t pfn)
> +static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
>  {
>  	/*
>  	 * Do not cache the mmio info caused by writing the readonly gfn
> @@ -2759,9 +2759,10 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, pfn_t pfn)
>  }
>  
>  static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> -					gfn_t *gfnp, pfn_t *pfnp, int *levelp)
> +					gfn_t *gfnp, kvm_pfn_t *pfnp,
> +					int *levelp)
>  {
> -	pfn_t pfn = *pfnp;
> +	kvm_pfn_t pfn = *pfnp;
>  	gfn_t gfn = *gfnp;
>  	int level = *levelp;
>  
> @@ -2800,7 +2801,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
>  }
>  
>  static bool handle_abnormal_pfn(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
> -				pfn_t pfn, unsigned access, int *ret_val)
> +				kvm_pfn_t pfn, unsigned access, int *ret_val)
>  {
>  	bool ret = true;
>  
> @@ -2954,7 +2955,7 @@ exit:
>  }
>  
>  static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
> -			 gva_t gva, pfn_t *pfn, bool write, bool *writable);
> +			 gva_t gva, kvm_pfn_t *pfn, bool write, bool *writable);
>  static void make_mmu_pages_available(struct kvm_vcpu *vcpu);
>  
>  static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, u32 error_code,
> @@ -2963,7 +2964,7 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, u32 error_code,
>  	int r;
>  	int level;
>  	int force_pt_level;
> -	pfn_t pfn;
> +	kvm_pfn_t pfn;
>  	unsigned long mmu_seq;
>  	bool map_writable, write = error_code & PFERR_WRITE_MASK;
>  
> @@ -3435,7 +3436,7 @@ static bool can_do_async_pf(struct kvm_vcpu *vcpu)
>  }
>  
>  static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
> -			 gva_t gva, pfn_t *pfn, bool write, bool *writable)
> +			 gva_t gva, kvm_pfn_t *pfn, bool write, bool *writable)
>  {
>  	struct kvm_memory_slot *slot;
>  	bool async;
> @@ -3473,7 +3474,7 @@ check_hugepage_cache_consistency(struct kvm_vcpu *vcpu, gfn_t gfn, int level)
>  static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
>  			  bool prefault)
>  {
> -	pfn_t pfn;
> +	kvm_pfn_t pfn;
>  	int r;
>  	int level;
>  	int force_pt_level;
> @@ -4627,7 +4628,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>  	u64 *sptep;
>  	struct rmap_iterator iter;
>  	int need_tlb_flush = 0;
> -	pfn_t pfn;
> +	kvm_pfn_t pfn;
>  	struct kvm_mmu_page *sp;
>  
>  restart:
> diff --git a/arch/x86/kvm/mmu_audit.c b/arch/x86/kvm/mmu_audit.c
> index 03d518e499a6..37a4d14115c0 100644
> --- a/arch/x86/kvm/mmu_audit.c
> +++ b/arch/x86/kvm/mmu_audit.c
> @@ -97,7 +97,7 @@ static void audit_mappings(struct kvm_vcpu *vcpu, u64 *sptep, int level)
>  {
>  	struct kvm_mmu_page *sp;
>  	gfn_t gfn;
> -	pfn_t pfn;
> +	kvm_pfn_t pfn;
>  	hpa_t hpa;
>  
>  	sp = page_header(__pa(sptep));
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index 736e6ab8784d..9dd02cb74724 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -456,7 +456,7 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  {
>  	unsigned pte_access;
>  	gfn_t gfn;
> -	pfn_t pfn;
> +	kvm_pfn_t pfn;
>  
>  	if (FNAME(prefetch_invalid_gpte)(vcpu, sp, spte, gpte))
>  		return false;
> @@ -551,7 +551,7 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw,
>  static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
>  			 struct guest_walker *gw,
>  			 int write_fault, int hlevel,
> -			 pfn_t pfn, bool map_writable, bool prefault)
> +			 kvm_pfn_t pfn, bool map_writable, bool prefault)
>  {
>  	struct kvm_mmu_page *sp = NULL;
>  	struct kvm_shadow_walk_iterator it;
> @@ -696,7 +696,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
>  	int user_fault = error_code & PFERR_USER_MASK;
>  	struct guest_walker walker;
>  	int r;
> -	pfn_t pfn;
> +	kvm_pfn_t pfn;
>  	int level = PT_PAGE_TABLE_LEVEL;
>  	int force_pt_level;
>  	unsigned long mmu_seq;
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 06ef4908ba61..d401ed6874bd 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -4046,7 +4046,7 @@ out:
>  static int init_rmode_identity_map(struct kvm *kvm)
>  {
>  	int i, idx, r = 0;
> -	pfn_t identity_map_pfn;
> +	kvm_pfn_t identity_map_pfn;
>  	u32 tmp;
>  
>  	if (!enable_ept)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 92511d4b7236..8fc5ca584edf 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4935,7 +4935,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gva_t cr2,
>  				  int emulation_type)
>  {
>  	gpa_t gpa = cr2;
> -	pfn_t pfn;
> +	kvm_pfn_t pfn;
>  
>  	if (emulation_type & EMULTYPE_NO_REEXECUTE)
>  		return false;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 1bef9e21e725..2420b43f3acc 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -65,7 +65,7 @@
>   * error pfns indicate that the gfn is in slot but faild to
>   * translate it to pfn on host.
>   */
> -static inline bool is_error_pfn(pfn_t pfn)
> +static inline bool is_error_pfn(kvm_pfn_t pfn)
>  {
>  	return !!(pfn & KVM_PFN_ERR_MASK);
>  }
> @@ -75,13 +75,13 @@ static inline bool is_error_pfn(pfn_t pfn)
>   * translated to pfn - it is not in slot or failed to
>   * translate it to pfn.
>   */
> -static inline bool is_error_noslot_pfn(pfn_t pfn)
> +static inline bool is_error_noslot_pfn(kvm_pfn_t pfn)
>  {
>  	return !!(pfn & KVM_PFN_ERR_NOSLOT_MASK);
>  }
>  
>  /* noslot pfn indicates that the gfn is not in slot. */
> -static inline bool is_noslot_pfn(pfn_t pfn)
> +static inline bool is_noslot_pfn(kvm_pfn_t pfn)
>  {
>  	return pfn == KVM_PFN_NOSLOT;
>  }
> @@ -569,19 +569,20 @@ void kvm_release_page_clean(struct page *page);
>  void kvm_release_page_dirty(struct page *page);
>  void kvm_set_page_accessed(struct page *page);
>  
> -pfn_t gfn_to_pfn_atomic(struct kvm *kvm, gfn_t gfn);
> -pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
> -pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
> +kvm_pfn_t gfn_to_pfn_atomic(struct kvm *kvm, gfn_t gfn);
> +kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
> +kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
>  		      bool *writable);
> -pfn_t gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn);
> -pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn);
> -pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn, bool atomic,
> -			   bool *async, bool write_fault, bool *writable);
> +kvm_pfn_t gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn);
> +kvm_pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn);
> +kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
> +			       bool atomic, bool *async, bool write_fault,
> +			       bool *writable);
>  
> -void kvm_release_pfn_clean(pfn_t pfn);
> -void kvm_set_pfn_dirty(pfn_t pfn);
> -void kvm_set_pfn_accessed(pfn_t pfn);
> -void kvm_get_pfn(pfn_t pfn);
> +void kvm_release_pfn_clean(kvm_pfn_t pfn);
> +void kvm_set_pfn_dirty(kvm_pfn_t pfn);
> +void kvm_set_pfn_accessed(kvm_pfn_t pfn);
> +void kvm_get_pfn(kvm_pfn_t pfn);
>  
>  int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
>  			int len);
> @@ -607,8 +608,8 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
>  
>  struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu);
>  struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn);
> -pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn);
> -pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn);
> +kvm_pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn);
> +kvm_pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn);
>  struct page *kvm_vcpu_gfn_to_page(struct kvm_vcpu *vcpu, gfn_t gfn);
>  unsigned long kvm_vcpu_gfn_to_hva(struct kvm_vcpu *vcpu, gfn_t gfn);
>  unsigned long kvm_vcpu_gfn_to_hva_prot(struct kvm_vcpu *vcpu, gfn_t gfn, bool *writable);
> @@ -789,7 +790,7 @@ void kvm_arch_sync_events(struct kvm *kvm);
>  int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu);
>  void kvm_vcpu_kick(struct kvm_vcpu *vcpu);
>  
> -bool kvm_is_reserved_pfn(pfn_t pfn);
> +bool kvm_is_reserved_pfn(kvm_pfn_t pfn);
>  
>  struct kvm_irq_ack_notifier {
>  	struct hlist_node link;
> @@ -940,7 +941,7 @@ static inline gfn_t gpa_to_gfn(gpa_t gpa)
>  	return (gfn_t)(gpa >> PAGE_SHIFT);
>  }
>  
> -static inline hpa_t pfn_to_hpa(pfn_t pfn)
> +static inline hpa_t pfn_to_hpa(kvm_pfn_t pfn)
>  {
>  	return (hpa_t)pfn << PAGE_SHIFT;
>  }
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index 1b47a185c2f0..8bf259dae9f6 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -53,7 +53,7 @@ typedef unsigned long  hva_t;
>  typedef u64            hpa_t;
>  typedef u64            hfn_t;
>  
> -typedef hfn_t pfn_t;
> +typedef hfn_t kvm_pfn_t;
>  
>  struct gfn_to_hva_cache {
>  	u64 generation;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 8db1d9361993..02cd2eddd3ff 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -111,7 +111,7 @@ static void hardware_disable_all(void);
>  
>  static void kvm_io_bus_destroy(struct kvm_io_bus *bus);
>  
> -static void kvm_release_pfn_dirty(pfn_t pfn);
> +static void kvm_release_pfn_dirty(kvm_pfn_t pfn);
>  static void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot, gfn_t gfn);
>  
>  __visible bool kvm_rebooting;
> @@ -119,7 +119,7 @@ EXPORT_SYMBOL_GPL(kvm_rebooting);
>  
>  static bool largepages_enabled = true;
>  
> -bool kvm_is_reserved_pfn(pfn_t pfn)
> +bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
>  {
>  	if (pfn_valid(pfn))
>  		return PageReserved(pfn_to_page(pfn));
> @@ -1296,7 +1296,7 @@ static inline int check_user_page_hwpoison(unsigned long addr)
>   * true indicates success, otherwise false is returned.
>   */
>  static bool hva_to_pfn_fast(unsigned long addr, bool atomic, bool *async,
> -			    bool write_fault, bool *writable, pfn_t *pfn)
> +			    bool write_fault, bool *writable, kvm_pfn_t *pfn)
>  {
>  	struct page *page[1];
>  	int npages;
> @@ -1329,7 +1329,7 @@ static bool hva_to_pfn_fast(unsigned long addr, bool atomic, bool *async,
>   * 1 indicates success, -errno is returned if error is detected.
>   */
>  static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> -			   bool *writable, pfn_t *pfn)
> +			   bool *writable, kvm_pfn_t *pfn)
>  {
>  	struct page *page[1];
>  	int npages = 0;
> @@ -1393,11 +1393,11 @@ static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
>   * 2): @write_fault = false && @writable, @writable will tell the caller
>   *     whether the mapping is writable.
>   */
> -static pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
> +static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
>  			bool write_fault, bool *writable)
>  {
>  	struct vm_area_struct *vma;
> -	pfn_t pfn = 0;
> +	kvm_pfn_t pfn = 0;
>  	int npages;
>  
>  	/* we can do it either atomically or asynchronously, not both */
> @@ -1438,8 +1438,9 @@ exit:
>  	return pfn;
>  }
>  
> -pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn, bool atomic,
> -			   bool *async, bool write_fault, bool *writable)
> +kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
> +			       bool atomic, bool *async, bool write_fault,
> +			       bool *writable)
>  {
>  	unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
>  
> @@ -1460,7 +1461,7 @@ pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn, bool atomic,
>  }
>  EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);
>  
> -pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
> +kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
>  		      bool *writable)
>  {
>  	return __gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn, false, NULL,
> @@ -1468,37 +1469,37 @@ pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
>  }
>  EXPORT_SYMBOL_GPL(gfn_to_pfn_prot);
>  
> -pfn_t gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
> +kvm_pfn_t gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
>  {
>  	return __gfn_to_pfn_memslot(slot, gfn, false, NULL, true, NULL);
>  }
>  EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot);
>  
> -pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn)
> +kvm_pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn)
>  {
>  	return __gfn_to_pfn_memslot(slot, gfn, true, NULL, true, NULL);
>  }
>  EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic);
>  
> -pfn_t gfn_to_pfn_atomic(struct kvm *kvm, gfn_t gfn)
> +kvm_pfn_t gfn_to_pfn_atomic(struct kvm *kvm, gfn_t gfn)
>  {
>  	return gfn_to_pfn_memslot_atomic(gfn_to_memslot(kvm, gfn), gfn);
>  }
>  EXPORT_SYMBOL_GPL(gfn_to_pfn_atomic);
>  
> -pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn)
> +kvm_pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn)
>  {
>  	return gfn_to_pfn_memslot_atomic(kvm_vcpu_gfn_to_memslot(vcpu, gfn), gfn);
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_pfn_atomic);
>  
> -pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
> +kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
>  {
>  	return gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn);
>  }
>  EXPORT_SYMBOL_GPL(gfn_to_pfn);
>  
> -pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> +kvm_pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn)
>  {
>  	return gfn_to_pfn_memslot(kvm_vcpu_gfn_to_memslot(vcpu, gfn), gfn);
>  }
> @@ -1521,7 +1522,7 @@ int gfn_to_page_many_atomic(struct kvm_memory_slot *slot, gfn_t gfn,
>  }
>  EXPORT_SYMBOL_GPL(gfn_to_page_many_atomic);
>  
> -static struct page *kvm_pfn_to_page(pfn_t pfn)
> +static struct page *kvm_pfn_to_page(kvm_pfn_t pfn)
>  {
>  	if (is_error_noslot_pfn(pfn))
>  		return KVM_ERR_PTR_BAD_PAGE;
> @@ -1536,7 +1537,7 @@ static struct page *kvm_pfn_to_page(pfn_t pfn)
>  
>  struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
>  {
> -	pfn_t pfn;
> +	kvm_pfn_t pfn;
>  
>  	pfn = gfn_to_pfn(kvm, gfn);
>  
> @@ -1546,7 +1547,7 @@ EXPORT_SYMBOL_GPL(gfn_to_page);
>  
>  struct page *kvm_vcpu_gfn_to_page(struct kvm_vcpu *vcpu, gfn_t gfn)
>  {
> -	pfn_t pfn;
> +	kvm_pfn_t pfn;
>  
>  	pfn = kvm_vcpu_gfn_to_pfn(vcpu, gfn);
>  
> @@ -1562,7 +1563,7 @@ void kvm_release_page_clean(struct page *page)
>  }
>  EXPORT_SYMBOL_GPL(kvm_release_page_clean);
>  
> -void kvm_release_pfn_clean(pfn_t pfn)
> +void kvm_release_pfn_clean(kvm_pfn_t pfn)
>  {
>  	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
>  		put_page(pfn_to_page(pfn));
> @@ -1577,13 +1578,13 @@ void kvm_release_page_dirty(struct page *page)
>  }
>  EXPORT_SYMBOL_GPL(kvm_release_page_dirty);
>  
> -static void kvm_release_pfn_dirty(pfn_t pfn)
> +static void kvm_release_pfn_dirty(kvm_pfn_t pfn)
>  {
>  	kvm_set_pfn_dirty(pfn);
>  	kvm_release_pfn_clean(pfn);
>  }
>  
> -void kvm_set_pfn_dirty(pfn_t pfn)
> +void kvm_set_pfn_dirty(kvm_pfn_t pfn)
>  {
>  	if (!kvm_is_reserved_pfn(pfn)) {
>  		struct page *page = pfn_to_page(pfn);
> @@ -1594,14 +1595,14 @@ void kvm_set_pfn_dirty(pfn_t pfn)
>  }
>  EXPORT_SYMBOL_GPL(kvm_set_pfn_dirty);
>  
> -void kvm_set_pfn_accessed(pfn_t pfn)
> +void kvm_set_pfn_accessed(kvm_pfn_t pfn)
>  {
>  	if (!kvm_is_reserved_pfn(pfn))
>  		mark_page_accessed(pfn_to_page(pfn));
>  }
>  EXPORT_SYMBOL_GPL(kvm_set_pfn_accessed);
>  
> -void kvm_get_pfn(pfn_t pfn)
> +void kvm_get_pfn(kvm_pfn_t pfn)
>  {
>  	if (!kvm_is_reserved_pfn(pfn))
>  		get_page(pfn_to_page(pfn));
> 

* Re: [PATCH v2 11/20] kvm: rename pfn_t to kvm_pfn_t
  2015-10-10 20:35   ` Paolo Bonzini
@ 2015-10-10 20:57     ` Dan Williams
  2015-10-12 12:51       ` Paolo Bonzini
  0 siblings, 1 reply; 37+ messages in thread
From: Dan Williams @ 2015-10-10 20:57 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-nvdimm, Dave Hansen, Russell King, Linux MM, Gleb Natapov,
	Catalin Marinas, Will Deacon, linux-kernel, Ralf Baechle,
	Marc Zyngier, Paul Mackerras, Christoffer Dall,
	Benjamin Herrenschmidt, Ross Zwisler, Christoph Hellwig,
	Alexander Graf

On Sat, Oct 10, 2015 at 1:35 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 10/10/2015 02:56, Dan Williams wrote:
>> The core has developed a need for a "pfn_t" type [1].  Move the existing
>> pfn_t in KVM to kvm_pfn_t [2].
>>
>> [1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
>> [2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html
>
> Can you please change also the other types in include/linux/kvm_types.h?

Hmm, all those seem kvm specific already.  I'd only prefix them with
kvm_ if they collided with a "core" type.


* Re: [PATCH v2 01/20] block: generic request_queue reference counting
  2015-10-10  0:55 ` [PATCH v2 01/20] block: generic request_queue reference counting Dan Williams
@ 2015-10-11 12:59   ` Christoph Hellwig
  2015-10-13  0:09     ` Dan Williams
  0 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2015-10-11 12:59 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Jens Axboe, linux-kernel, Keith Busch, linux-mm,
	ross.zwisler

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

We could still clean up the draining, or only release the reference on
bio_done, but let's do that separately and get this infrastructure in
ASAP.
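
For readers following along, the infrastructure under review amounts to
guarding request_queue usage with a percpu_ref so that teardown can refuse
new entrants and drain in-flight users.  A minimal sketch of the pattern,
with illustrative names rather than the patch's actual helpers:

#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/kernel.h>
#include <linux/percpu-refcount.h>
#include <linux/wait.h>

/* Illustrative sketch only; not the patch's definitive API. */
struct example_queue {
	struct percpu_ref q_usage_counter;	/* one reference per in-flight user */
	wait_queue_head_t freeze_wq;
};

/* release callback: the last reference was dropped, wake the drainer */
static void example_q_release(struct percpu_ref *ref)
{
	struct example_queue *q =
		container_of(ref, struct example_queue, q_usage_counter);

	wake_up_all(&q->freeze_wq);
}

static int example_queue_init(struct example_queue *q)
{
	init_waitqueue_head(&q->freeze_wq);
	return percpu_ref_init(&q->q_usage_counter, example_q_release,
			       0, GFP_KERNEL);
}

/* take a reference before using the queue; fails once the queue is dying */
static int example_queue_enter(struct example_queue *q)
{
	return percpu_ref_tryget_live(&q->q_usage_counter) ? 0 : -ENODEV;
}

static void example_queue_exit(struct example_queue *q)
{
	percpu_ref_put(&q->q_usage_counter);
}

/* teardown: refuse new users, then wait for the outstanding ones to drain */
static void example_queue_freeze(struct example_queue *q)
{
	percpu_ref_kill(&q->q_usage_counter);
	wait_event(q->freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
}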


* Re: [PATCH v2 11/20] kvm: rename pfn_t to kvm_pfn_t
  2015-10-10 20:57     ` Dan Williams
@ 2015-10-12 12:51       ` Paolo Bonzini
  2015-10-12 16:16         ` Dan Williams
  0 siblings, 1 reply; 37+ messages in thread
From: Paolo Bonzini @ 2015-10-12 12:51 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm@lists.01.org, Dave Hansen, Russell King, Linux MM,
	Gleb Natapov, Catalin Marinas, Will Deacon, linux-kernel,
	Ralf Baechle, Marc Zyngier, Paul Mackerras, Christoffer Dall,
	Benjamin Herrenschmidt, Ross Zwisler, Christoph Hellwig,
	Alexander Graf, KVM list



On 10/10/2015 22:57, Dan Williams wrote:
> On Sat, Oct 10, 2015 at 1:35 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> On 10/10/2015 02:56, Dan Williams wrote:
>>> The core has developed a need for a "pfn_t" type [1].  Move the existing
>>> pfn_t in KVM to kvm_pfn_t [2].
>>>
>>> [1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
>>> [2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html
>>
>> Can you please change also the other types in include/linux/kvm_types.h?
> 
> Hmm, all those seem kvm specific already.  I'd only prefix them with
> kvm_ if they collided with a "core" type.

But they are all related and the code becomes uglier if you only prefix
one of them.  If you don't convert all of them, I will do it anyway as
soon as this patch gets in.

Since it touches a lot of KVM files, we should synchronize in order to
avoid conflicts and gnashing of teeth.  What tree is this patch going
in?  You could provide me a commit SHA1 for this patch (well, its
definitive version) based on Linus's tree (so that I can merge it in my
tree as well), or I could commit it and provide the SHA1 to the
maintainer of said tree.

Paolo
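
For concreteness, the broader rename Paolo is suggesting would touch the
whole family of typedefs in include/linux/kvm_types.h, roughly as sketched
below.  This is hypothetical; only the last line corresponds to the rename
the patch actually performs:

/* hypothetical full prefixing of include/linux/kvm_types.h */
typedef unsigned long  kvm_gva_t;	/* was gva_t: guest virtual address */
typedef u64            kvm_gpa_t;	/* was gpa_t: guest physical address */
typedef u64            kvm_gfn_t;	/* was gfn_t: guest frame number */
typedef unsigned long  kvm_hva_t;	/* was hva_t: host virtual address */
typedef u64            kvm_hpa_t;	/* was hpa_t: host physical address */
typedef u64            kvm_hfn_t;	/* was hfn_t: host frame number */
typedef kvm_hfn_t      kvm_pfn_t;	/* the rename this patch performs */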


* Re: [PATCH v2 11/20] kvm: rename pfn_t to kvm_pfn_t
  2015-10-12 12:51       ` Paolo Bonzini
@ 2015-10-12 16:16         ` Dan Williams
  0 siblings, 0 replies; 37+ messages in thread
From: Dan Williams @ 2015-10-12 16:16 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-nvdimm@lists.01.org, Dave Hansen, Russell King, Linux MM,
	Gleb Natapov, Catalin Marinas, Will Deacon, linux-kernel,
	Ralf Baechle, Marc Zyngier, Paul Mackerras, Christoffer Dall,
	Benjamin Herrenschmidt, Ross Zwisler, Christoph Hellwig,
	Alexander Graf, KVM list

On Mon, Oct 12, 2015 at 5:51 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 10/10/2015 22:57, Dan Williams wrote:
>> On Sat, Oct 10, 2015 at 1:35 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>> On 10/10/2015 02:56, Dan Williams wrote:
>>>> The core has developed a need for a "pfn_t" type [1].  Move the existing
>>>> pfn_t in KVM to kvm_pfn_t [2].
>>>>
>>>> [1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
>>>> [2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html
>>>
>>> Can you please change also the other types in include/linux/kvm_types.h?
>>
>> Hmm, all those seem kvm specific already.  I'd only prefix them with
>> kvm_ if they collided with a "core" type.
>
> But they are all related and the code becomes uglier if you only prefix
> one of them.  If you don't convert all of them, I will do it anyway as
> soon as this patch gets in.

Ok.

> Since it touches a lot of KVM files, we should synchronize in order to
> avoid conflicts and gnashing of teeth.  What tree is this patch going
> in?  You could provide me a commit SHA1 for this patch (well, its
> definitive version) based on Linus's tree (so that I can merge it in my
> tree as well), or I could commit it and provide the SHA1 to the
> maintainer of said tree.

The kvm_pfn_t conversion is only needed if the new pfn_t
infrastructure moves forward, and at this point it still needs some
review feedback.

How about this: care to send conversion patches for the rest? ...based on:

    https://git.kernel.org/cgit/linux/kernel/git/djbw/nvdimm.git/log/?h=libnvdimm-pending

When/if the new pfn_t bits move forward I'll carry them in the same
pull request through the nvdimm.git tree.


* Re: [PATCH v2 01/20] block: generic request_queue reference counting
  2015-10-11 12:59   ` Christoph Hellwig
@ 2015-10-13  0:09     ` Dan Williams
  0 siblings, 0 replies; 37+ messages in thread
From: Dan Williams @ 2015-10-13  0:09 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: linux-nvdimm, linux-kernel, Keith Busch, Linux MM, Ross Zwisler

On Sun, Oct 11, 2015 at 5:59 AM, Christoph Hellwig <hch@lst.de> wrote:
> Looks good,
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
>
> We could still clean up draing or only release the reference on
> bio_done, but let's do that separately and get this infrastructure in
> ASAP.

Thanks Christoph.

Jens, do you want to take this, or ok for me to take this through nvdimm.git?


* Re: [PATCH v2 05/20] x86, mm: introduce vmem_altmap to augment vmemmap_populate()
  2015-10-10  0:55 ` [PATCH v2 05/20] x86, mm: introduce vmem_altmap to augment vmemmap_populate() Dan Williams
@ 2015-10-19 22:53   ` Williams, Dan J
  0 siblings, 0 replies; 37+ messages in thread
From: Williams, Dan J @ 2015-10-19 22:53 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: linux-kernel, linux-mm, dave.hansen, hch, akpm, hpa, mingo, ross.zwisler

On Fri, 2015-10-09 at 20:55 -0400, Dan Williams wrote:
> In support of providing struct page for large persistent memory
> capacities, use struct vmem_altmap to change the default policy for
> allocating memory for the memmap array.  The default vmemmap_populate()
> allocates page table storage area from the page allocator.  Given
> persistent memory capacities relative to DRAM it may not be feasible to
> store the memmap in 'System Memory'.  Instead vmem_altmap represents
> pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf()
> requests.
> 
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---

The kbuild test robot reported a crash with this patch when
CONFIG_ZONE_DEVICE=y && CONFIG_SPARSEMEM_VMEMMAP=n.  The ability to
specify an alternate location for the vmemmap needs to be gated on
CONFIG_SPARSEMEM_VMEMMAP=y.

Here's a refreshed patch with ifdef guards and a warning message if the
@altmap arg is passed to devm_memremap_pages() on a
CONFIG_SPARSEMEM_VMEMMAP=n kernel.


8<----
Subject: x86, mm: introduce vmem_altmap to augment vmemmap_populate()

From: Dan Williams <dan.j.williams@intel.com>

In support of providing struct page for large persistent memory
capacities, use struct vmem_altmap to change the default policy for
allocating memory for the memmap array.  The default vmemmap_populate()
allocates page table storage area from the page allocator.  Given
persistent memory capacities relative to DRAM it may not be feasible to
store the memmap in 'System Memory'.  Instead vmem_altmap represents
pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf()
requests.

Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/m68k/include/asm/page_mm.h |    1 
 arch/m68k/include/asm/page_no.h |    1 
 arch/mn10300/include/asm/page.h |    1 
 arch/x86/mm/init_64.c           |   32 ++++++++++---
 drivers/nvdimm/pmem.c           |    6 ++
 include/linux/io.h              |   17 -------
 include/linux/memory_hotplug.h  |    3 +
 include/linux/mm.h              |   98 ++++++++++++++++++++++++++++++++++++++-
 kernel/memremap.c               |   77 +++++++++++++++++++++++++++----
 mm/memory_hotplug.c             |   66 +++++++++++++++++++-------
 mm/page_alloc.c                 |   10 ++++
 mm/sparse-vmemmap.c             |   37 ++++++++++++++-
 mm/sparse.c                     |    8 ++-
 13 files changed, 294 insertions(+), 63 deletions(-)

diff --git a/arch/m68k/include/asm/page_mm.h b/arch/m68k/include/asm/page_mm.h
index 5029f73e6294..884f2f7e4caf 100644
--- a/arch/m68k/include/asm/page_mm.h
+++ b/arch/m68k/include/asm/page_mm.h
@@ -125,6 +125,7 @@ static inline void *__va(unsigned long x)
  */
 #define virt_to_pfn(kaddr)	(__pa(kaddr) >> PAGE_SHIFT)
 #define pfn_to_virt(pfn)	__va((pfn) << PAGE_SHIFT)
+#define	__pfn_to_phys(pfn)	PFN_PHYS(pfn)
 
 extern int m68k_virt_to_node_shift;
 
diff --git a/arch/m68k/include/asm/page_no.h b/arch/m68k/include/asm/page_no.h
index ef209169579a..7845eca0b36d 100644
--- a/arch/m68k/include/asm/page_no.h
+++ b/arch/m68k/include/asm/page_no.h
@@ -24,6 +24,7 @@ extern unsigned long memory_end;
 
 #define virt_to_pfn(kaddr)	(__pa(kaddr) >> PAGE_SHIFT)
 #define pfn_to_virt(pfn)	__va((pfn) << PAGE_SHIFT)
+#define	__pfn_to_phys(pfn)	PFN_PHYS(pfn)
 
 #define virt_to_page(addr)	(mem_map + (((unsigned long)(addr)-PAGE_OFFSET) >> PAGE_SHIFT))
 #define page_to_virt(page)	__va(((((page) - mem_map) << PAGE_SHIFT) + PAGE_OFFSET))
diff --git a/arch/mn10300/include/asm/page.h b/arch/mn10300/include/asm/page.h
index 8288e124165b..3810a6f740fd 100644
--- a/arch/mn10300/include/asm/page.h
+++ b/arch/mn10300/include/asm/page.h
@@ -107,6 +107,7 @@ static inline int get_order(unsigned long size)
 #define pfn_to_kaddr(pfn)	__va((pfn) << PAGE_SHIFT)
 #define pfn_to_page(pfn)	(mem_map + ((pfn) - __pfn_disp))
 #define page_to_pfn(page)	((unsigned long)((page) - mem_map) + __pfn_disp)
+#define __pfn_to_phys(pfn)	PFN_PHYS(pfn)
 
 #define pfn_valid(pfn)					\
 ({							\
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index e5d42f1a2a71..cabf8ceb0a6b 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -714,6 +714,12 @@ static void __meminit free_pagetable(struct page *page, int order)
 {
 	unsigned long magic;
 	unsigned int nr_pages = 1 << order;
+	struct vmem_altmap *altmap = to_vmem_altmap((unsigned long) page);
+
+	if (altmap) {
+		vmem_altmap_free(altmap, nr_pages);
+		return;
+	}
 
 	/* bootmem page has reserved flag */
 	if (PageReserved(page)) {
@@ -1018,13 +1024,19 @@ int __ref arch_remove_memory(u64 start, u64 size)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
+	struct page *page = pfn_to_page(start_pfn);
+	struct vmem_altmap *altmap;
 	struct zone *zone;
 	int ret;
 
-	zone = page_zone(pfn_to_page(start_pfn));
-	kernel_physical_mapping_remove(start, start + size);
+	/* With altmap the first mapped page is offset from @start */
+	altmap = to_vmem_altmap((unsigned long) page);
+	if (altmap)
+		page += vmem_altmap_offset(altmap);
+	zone = page_zone(page);
 	ret = __remove_pages(zone, start_pfn, nr_pages);
 	WARN_ON_ONCE(ret);
+	kernel_physical_mapping_remove(start, start + size);
 
 	return ret;
 }
@@ -1234,7 +1246,7 @@ static void __meminitdata *p_start, *p_end;
 static int __meminitdata node_start;
 
 static int __meminit vmemmap_populate_hugepages(unsigned long start,
-						unsigned long end, int node)
+		unsigned long end, int node, struct vmem_altmap *altmap)
 {
 	unsigned long addr;
 	unsigned long next;
@@ -1257,7 +1269,7 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
 		if (pmd_none(*pmd)) {
 			void *p;
 
-			p = vmemmap_alloc_block_buf(PMD_SIZE, node);
+			p = vmemmap_alloc_block_buf(PMD_SIZE, node, altmap);
 			if (p) {
 				pte_t entry;
 
@@ -1278,7 +1290,8 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
 				addr_end = addr + PMD_SIZE;
 				p_end = p + PMD_SIZE;
 				continue;
-			}
+			} else if (altmap)
+				return -ENOMEM; /* no fallback */
 		} else if (pmd_large(*pmd)) {
 			vmemmap_verify((pte_t *)pmd, node, addr, next);
 			continue;
@@ -1292,11 +1305,16 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
 
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
 {
+	struct vmem_altmap *altmap = to_vmem_altmap(start);
 	int err;
 
 	if (cpu_has_pse)
-		err = vmemmap_populate_hugepages(start, end, node);
-	else
+		err = vmemmap_populate_hugepages(start, end, node, altmap);
+	else if (altmap) {
+		pr_err_once("%s: no cpu support for altmap allocations\n",
+				__func__);
+		err = -ENOMEM;
+	} else
 		err = vmemmap_populate_basepages(start, end, node);
 	if (!err)
 		sync_global_pgds(start, end - 1, 0);
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 349f03e7ed06..3c5b8f585441 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -151,7 +151,8 @@ static struct pmem_device *pmem_alloc(struct device *dev,
 	}
 
 	if (pmem_should_map_pages(dev))
-		pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, res);
+		pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, res,
+				NULL);
 	else
 		pmem->virt_addr = (void __pmem *) devm_memremap(dev,
 				pmem->phys_addr, pmem->size,
@@ -362,7 +363,8 @@ static int nvdimm_namespace_attach_pfn(struct nd_namespace_common *ndns)
 	/* establish pfn range for lookup, and switch to direct map */
 	pmem = dev_get_drvdata(dev);
 	devm_memunmap(dev, (void __force *) pmem->virt_addr);
-	pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, &nsio->res);
+	pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, &nsio->res,
+			NULL);
 	if (IS_ERR(pmem->virt_addr)) {
 		rc = PTR_ERR(pmem->virt_addr);
 		goto err;
diff --git a/include/linux/io.h b/include/linux/io.h
index de64c1e53612..2f2f8859abd9 100644
--- a/include/linux/io.h
+++ b/include/linux/io.h
@@ -87,23 +87,6 @@ void *devm_memremap(struct device *dev, resource_size_t offset,
 		size_t size, unsigned long flags);
 void devm_memunmap(struct device *dev, void *addr);
 
-void *__devm_memremap_pages(struct device *dev, struct resource *res);
-
-#ifdef CONFIG_ZONE_DEVICE
-void *devm_memremap_pages(struct device *dev, struct resource *res);
-#else
-static inline void *devm_memremap_pages(struct device *dev, struct resource *res)
-{
-	/*
-	 * Fail attempts to call devm_memremap_pages() without
-	 * ZONE_DEVICE support enabled, this requires callers to fall
-	 * back to plain devm_memremap() based on config
-	 */
-	WARN_ON_ONCE(1);
-	return ERR_PTR(-ENXIO);
-}
-#endif
-
 /*
  * Some systems do not have legacy ISA devices.
  * /dev/port is not a valid interface on these systems.
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 8f60e899b33c..178e000a7983 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -273,7 +273,8 @@ extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
 extern bool is_memblock_offlined(struct memory_block *mem);
 extern void remove_memory(int nid, u64 start, u64 size);
 extern int sparse_add_one_section(struct zone *zone, unsigned long start_pfn);
-extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms);
+extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms,
+		unsigned long map_offset);
 extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
 					  unsigned long pnum);
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 30c3c8764649..b8cba7d8ea28 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -718,18 +718,109 @@ static inline enum zone_type page_zonenum(const struct page *page)
 }
 
 /**
+ * struct vmem_altmap - pre-allocated storage for vmemmap_populate
+ * @base_pfn: base of the entire dev_pagemap mapping
+ * @reserve: pages mapped, but reserved for driver use (relative to @base)
+ * @free: free pages set aside in the mapping for memmap storage
+ * @align: pages reserved to meet allocation alignments
+ * @alloc: track pages consumed, private to vmemmap_populate()
+ */
+struct vmem_altmap {
+	const unsigned long base_pfn;
+	const unsigned long reserve;
+	unsigned long free;
+	unsigned long align;
+	unsigned long alloc;
+};
+
+static inline unsigned long vmem_altmap_nr_free(struct vmem_altmap *altmap)
+{
+	unsigned long allocated = altmap->alloc + altmap->align;
+
+	if (altmap->free > allocated)
+		return altmap->free - allocated;
+	return 0;
+}
+
+static inline unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
+{
+	/* number of pfns from base where pfn_to_page() is valid */
+	return altmap->reserve + altmap->free;
+}
+
+static inline unsigned long vmem_altmap_next_pfn(struct vmem_altmap *altmap)
+{
+	return altmap->base_pfn + altmap->reserve + altmap->alloc
+		+ altmap->align;
+}
+
+/**
+ * vmem_altmap_alloc - allocate pages from the vmem_altmap reservation
+ * @altmap - reserved page pool for the allocation
+ * @nr_pfns - size (in pages) of the allocation
+ *
+ * Allocations are aligned to the size of the request
+ */
+static inline unsigned long vmem_altmap_alloc(struct vmem_altmap *altmap,
+		unsigned long nr_pfns)
+{
+	unsigned long pfn = vmem_altmap_next_pfn(altmap);
+	unsigned long nr_align;
+
+	nr_align = 1UL << find_first_bit(&nr_pfns, BITS_PER_LONG);
+	nr_align = ALIGN(pfn, nr_align) - pfn;
+
+	if (nr_pfns + nr_align > vmem_altmap_nr_free(altmap))
+		return ULONG_MAX;
+	altmap->alloc += nr_pfns;
+	altmap->align += nr_align;
+	return pfn + nr_align;
+}
+
+static inline void vmem_altmap_free(struct vmem_altmap *altmap,
+		unsigned long nr_pfns)
+{
+	altmap->alloc -= nr_pfns;
+}
+
+/**
  * struct dev_pagemap - metadata for ZONE_DEVICE mappings
+ * @altmap: pre-allocated/reserved memory for vmemmap allocations
  * @dev: host device of the mapping for debug
  */
 struct dev_pagemap {
-	/* TODO: vmem_altmap and percpu_ref count */
+	struct vmem_altmap *altmap;
+	const struct resource *res;
 	struct device *dev;
 };
 
 #ifdef CONFIG_ZONE_DEVICE
 struct dev_pagemap *__get_dev_pagemap(resource_size_t phys);
+void *devm_memremap_pages(struct device *dev, struct resource *res,
+		struct vmem_altmap *altmap);
+#else
+static inline struct dev_pagemap *__get_dev_pagemap(resource_size_t phys)
+{
+	return NULL;
+}
+
+static inline void *devm_memremap_pages(struct device *dev, struct resource *res,
+		struct vmem_altmap *altmap)
+{
+	/*
+	 * Fail attempts to call devm_memremap_pages() without
+	 * ZONE_DEVICE support enabled, this requires callers to fall
+	 * back to plain devm_memremap() based on config
+	 */
+	WARN_ON_ONCE(1);
+	return ERR_PTR(-ENXIO);
+}
+#endif
+
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start);
 #else
-static inline struct dev_pagemap *get_dev_pagemap(resource_size_t phys)
+static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
 {
 	return NULL;
 }
@@ -2245,7 +2336,8 @@ pud_t *vmemmap_pud_populate(pgd_t *pgd, unsigned long addr, int node);
 pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
 pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node);
 void *vmemmap_alloc_block(unsigned long size, int node);
-void *vmemmap_alloc_block_buf(unsigned long size, int node);
+void *vmemmap_alloc_block_buf(unsigned long size, int node,
+		struct vmem_altmap *altmap);
 void vmemmap_verify(pte_t *, int, unsigned long, unsigned long);
 int vmemmap_populate_basepages(unsigned long start, unsigned long end,
 			       int node);
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 64bfd9fa93aa..79bbbea2de6a 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -146,6 +146,7 @@ struct page_map {
 	struct resource res;
 	struct dev_pagemap pgmap;
 	struct list_head list;
+	struct vmem_altmap altmap;
 };
 
 static void add_page_map(struct page_map *page_map)
@@ -162,14 +163,17 @@ static void del_page_map(struct page_map *page_map)
 	spin_unlock(&range_lock);
 }
 
-static void devm_memremap_pages_release(struct device *dev, void *res)
+static void devm_memremap_pages_release(struct device *dev, void *data)
 {
-	struct page_map *page_map = res;
-
-	del_page_map(page_map);
+	struct page_map *page_map = data;
+	struct resource *res = &page_map->res;
+	struct dev_pagemap *pgmap = &page_map->pgmap;
 
 	/* pages are dead and unused, undo the arch mapping */
-	arch_remove_memory(page_map->res.start, resource_size(&page_map->res));
+	arch_remove_memory(res->start, resource_size(res));
+	dev_WARN_ONCE(dev, pgmap->altmap && pgmap->altmap->alloc,
+			"%s: failed to free all reserved pages\n", __func__);
+	del_page_map(page_map);
 }
 
 /* assumes rcu_read_lock() held at entry */
@@ -185,10 +189,22 @@ struct dev_pagemap *__get_dev_pagemap(resource_size_t phys)
 	return NULL;
 }
 
-void *devm_memremap_pages(struct device *dev, struct resource *res)
+/**
+ * devm_memremap_pages - remap and provide memmap backing for the given resource
+ * @dev: hosting device for @res
+ * @res: "host memory" address range
+ * @altmap: optional descriptor for allocating the memmap from @res
+ *
+ * Note, the expectation is that @res is a host memory range that could
+ * feasibly be treated as a "System RAM" range, i.e. not a device mmio
+ * range, but this is not enforced.
+ */
+void *devm_memremap_pages(struct device *dev, struct resource *res,
+		struct vmem_altmap *altmap)
 {
 	int is_ram = region_intersects(res->start, resource_size(res),
 			"System RAM");
+	struct dev_pagemap *pgmap;
 	struct page_map *page_map;
 	int error, nid;
 
@@ -201,14 +217,25 @@ void *devm_memremap_pages(struct device *dev, struct resource *res)
 	if (is_ram == REGION_INTERSECTS)
 		return __va(res->start);
 
+	if (altmap && !IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP)) {
+		dev_err(dev, "%s: support for alternate vmemmap disabled\n",
+				__func__);
+		return ERR_PTR(-ENXIO);
+	}
+
 	page_map = devres_alloc_node(devm_memremap_pages_release,
 			sizeof(*page_map), GFP_KERNEL, dev_to_node(dev));
 	if (!page_map)
 		return ERR_PTR(-ENOMEM);
+	pgmap = &page_map->pgmap;
 
 	memcpy(&page_map->res, res, sizeof(*res));
-
-	page_map->pgmap.dev = dev;
+	if (altmap) {
+		memcpy(&page_map->altmap, altmap, sizeof(*altmap));
+		pgmap->altmap = &page_map->altmap;
+	}
+	pgmap->dev = dev;
+	pgmap->res = &page_map->res;
 	INIT_LIST_HEAD(&page_map->list);
 	add_page_map(page_map);
 
@@ -228,3 +255,37 @@ void *devm_memremap_pages(struct device *dev, struct resource *res)
 }
 EXPORT_SYMBOL(devm_memremap_pages);
 #endif /* CONFIG_ZONE_DEVICE */
+
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+/*
+ * Unconditionally retrieve a dev_pagemap associated with the given physical
+ * address; this is only for use in the arch_{add|remove}_memory() for setting
+ * up and tearing down the memmap.
+ */
+static struct dev_pagemap *lookup_dev_pagemap(resource_size_t phys)
+{
+	struct dev_pagemap *pgmap;
+
+	rcu_read_lock();
+	pgmap = __get_dev_pagemap(phys);
+	rcu_read_unlock();
+	return pgmap;
+}
+
+struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
+{
+	/*
+	 * 'memmap_start' is the virtual address for the first "struct
+	 * page" in this range of the vmemmap array.  In the case of
+	 * CONFIG_SPARSE_VMEMMAP a page_to_pfn conversion is simple
+	 * pointer arithmetic, so we can perform this to_vmem_altmap()
+	 * conversion without concern for the initialization state of
+	 * the struct page fields.
+	 */
+	struct page *page = (struct page *) memmap_start;
+	struct dev_pagemap *pgmap;
+
+	pgmap = lookup_dev_pagemap(__pfn_to_phys(page_to_pfn(page)));
+	return pgmap ? pgmap->altmap : NULL;
+}
+#endif /* CONFIG_SPARSEMEM_VMEMMAP */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index aa992e2df58a..3521df153de3 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -505,10 +505,25 @@ int __ref __add_pages(int nid, struct zone *zone, unsigned long phys_start_pfn,
 	unsigned long i;
 	int err = 0;
 	int start_sec, end_sec;
+	struct vmem_altmap *altmap;
+
 	/* during initialize mem_map, align hot-added range to section */
 	start_sec = pfn_to_section_nr(phys_start_pfn);
 	end_sec = pfn_to_section_nr(phys_start_pfn + nr_pages - 1);
 
+	altmap = to_vmem_altmap((unsigned long) pfn_to_page(phys_start_pfn));
+	if (altmap) {
+		/*
+		 * Validate altmap is within bounds of the total request
+		 */
+		if (altmap->base_pfn != phys_start_pfn
+				|| vmem_altmap_offset(altmap) > nr_pages) {
+			pr_warn_once("memory add fail, invalid altmap\n");
+			return -EINVAL;
+		}
+		altmap->alloc = 0;
+	}
+
 	for (i = start_sec; i <= end_sec; i++) {
 		err = __add_section(nid, zone, section_nr_to_pfn(i));
 
@@ -730,7 +745,8 @@ static void __remove_zone(struct zone *zone, unsigned long start_pfn)
 	pgdat_resize_unlock(zone->zone_pgdat, &flags);
 }
 
-static int __remove_section(struct zone *zone, struct mem_section *ms)
+static int __remove_section(struct zone *zone, struct mem_section *ms,
+		unsigned long map_offset)
 {
 	unsigned long start_pfn;
 	int scn_nr;
@@ -747,7 +763,7 @@ static int __remove_section(struct zone *zone, struct mem_section *ms)
 	start_pfn = section_nr_to_pfn(scn_nr);
 	__remove_zone(zone, start_pfn);
 
-	sparse_remove_one_section(zone, ms);
+	sparse_remove_one_section(zone, ms, map_offset);
 	return 0;
 }
 
@@ -766,9 +782,32 @@ int __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
 		 unsigned long nr_pages)
 {
 	unsigned long i;
-	int sections_to_remove;
-	resource_size_t start, size;
-	int ret = 0;
+	unsigned long map_offset = 0;
+	int sections_to_remove, ret = 0;
+
+	/* In the ZONE_DEVICE case device driver owns the memory region */
+	if (is_dev_zone(zone)) {
+		struct page *page = pfn_to_page(phys_start_pfn);
+		struct vmem_altmap *altmap;
+
+		altmap = to_vmem_altmap((unsigned long) page);
+		if (altmap)
+			map_offset = vmem_altmap_offset(altmap);
+	} else {
+		resource_size_t start, size;
+
+		start = phys_start_pfn << PAGE_SHIFT;
+		size = nr_pages * PAGE_SIZE;
+
+		ret = release_mem_region_adjustable(&iomem_resource, start,
+					size);
+		if (ret) {
+			resource_size_t endres = start + size - 1;
+
+			pr_warn("Unable to release resource <%pa-%pa> (%d)\n",
+					&start, &endres, ret);
+		}
+	}
 
 	/*
 	 * We can only remove entire sections
@@ -776,23 +815,12 @@ int __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
 	BUG_ON(phys_start_pfn & ~PAGE_SECTION_MASK);
 	BUG_ON(nr_pages % PAGES_PER_SECTION);
 
-	start = phys_start_pfn << PAGE_SHIFT;
-	size = nr_pages * PAGE_SIZE;
-
-	/* in the ZONE_DEVICE case device driver owns the memory region */
-	if (!is_dev_zone(zone))
-		ret = release_mem_region_adjustable(&iomem_resource, start, size);
-	if (ret) {
-		resource_size_t endres = start + size - 1;
-
-		pr_warn("Unable to release resource <%pa-%pa> (%d)\n",
-				&start, &endres, ret);
-	}
-
 	sections_to_remove = nr_pages / PAGES_PER_SECTION;
 	for (i = 0; i < sections_to_remove; i++) {
 		unsigned long pfn = phys_start_pfn + i*PAGES_PER_SECTION;
-		ret = __remove_section(zone, __pfn_to_section(pfn));
+
+		ret = __remove_section(zone, __pfn_to_section(pfn), map_offset);
+		map_offset = 0;
 		if (ret)
 			break;
 	}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48aaf7b9f253..9dfc431d6271 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4620,8 +4620,9 @@ static void setup_zone_migrate_reserve(struct zone *zone)
 void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 		unsigned long start_pfn, enum memmap_context context)
 {
-	pg_data_t *pgdat = NODE_DATA(nid);
+	struct vmem_altmap *altmap = to_vmem_altmap(__pfn_to_phys(start_pfn));
 	unsigned long end_pfn = start_pfn + size;
+	pg_data_t *pgdat = NODE_DATA(nid);
 	unsigned long pfn;
 	struct zone *z;
 	unsigned long nr_initialised = 0;
@@ -4629,6 +4630,13 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 	if (highest_memmap_pfn < end_pfn - 1)
 		highest_memmap_pfn = end_pfn - 1;
 
+	/*
+	 * Honor reservation requested by the driver for this ZONE_DEVICE
+	 * memory
+	 */
+	if (altmap && start_pfn == altmap->base_pfn)
+		start_pfn += altmap->reserve;
+
 	z = &pgdat->node_zones[zone];
 	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
 		/*
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 4cba9c2783a1..96c1dca4ce6a 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -70,7 +70,7 @@ void * __meminit vmemmap_alloc_block(unsigned long size, int node)
 }
 
 /* need to make sure size is all the same during early stage */
-void * __meminit vmemmap_alloc_block_buf(unsigned long size, int node)
+static void * __meminit __vmemmap_alloc_block_buf(unsigned long size, int node)
 {
 	void *ptr;
 
@@ -87,6 +87,39 @@ void * __meminit vmemmap_alloc_block_buf(unsigned long size, int node)
 	return ptr;
 }
 
+static void * __meminit altmap_alloc_block_buf(unsigned long size,
+		struct vmem_altmap *altmap)
+{
+	unsigned long pfn, nr_pfns;
+	void *ptr;
+
+	if (size & ~PAGE_MASK) {
+		pr_warn_once("%s: allocations must be multiple of PAGE_SIZE (%ld)\n",
+				__func__, size);
+		return NULL;
+	}
+
+	nr_pfns = size >> PAGE_SHIFT;
+	pfn = vmem_altmap_alloc(altmap, nr_pfns);
+	if (pfn < ULONG_MAX)
+		ptr = __va(__pfn_to_phys(pfn));
+	else
+		ptr = NULL;
+	pr_debug("%s: pfn: %#lx alloc: %ld align: %ld nr: %#lx\n",
+			__func__, pfn, altmap->alloc, altmap->align, nr_pfns);
+
+	return ptr;
+}
+
+/* need to make sure size is all the same during early stage */
+void * __meminit vmemmap_alloc_block_buf(unsigned long size, int node,
+		struct vmem_altmap *altmap)
+{
+	if (altmap)
+		return altmap_alloc_block_buf(size, altmap);
+	return __vmemmap_alloc_block_buf(size, node);
+}
+
 void __meminit vmemmap_verify(pte_t *pte, int node,
 				unsigned long start, unsigned long end)
 {
@@ -103,7 +136,7 @@ pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node)
 	pte_t *pte = pte_offset_kernel(pmd, addr);
 	if (pte_none(*pte)) {
 		pte_t entry;
-		void *p = vmemmap_alloc_block_buf(PAGE_SIZE, node);
+		void *p = __vmemmap_alloc_block_buf(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
 		entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
diff --git a/mm/sparse.c b/mm/sparse.c
index d1b48b691ac8..3717ceed4177 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -748,7 +748,7 @@ static void clear_hwpoisoned_pages(struct page *memmap, int nr_pages)
 	if (!memmap)
 		return;
 
-	for (i = 0; i < PAGES_PER_SECTION; i++) {
+	for (i = 0; i < nr_pages; i++) {
 		if (PageHWPoison(&memmap[i])) {
 			atomic_long_sub(1, &num_poisoned_pages);
 			ClearPageHWPoison(&memmap[i]);
@@ -788,7 +788,8 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap)
 		free_map_bootmem(memmap);
 }
 
-void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
+void sparse_remove_one_section(struct zone *zone, struct mem_section *ms,
+		unsigned long map_offset)
 {
 	struct page *memmap = NULL;
 	unsigned long *usemap = NULL, flags;
@@ -804,7 +805,8 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
 	}
 	pgdat_resize_unlock(pgdat, &flags);
 
-	clear_hwpoisoned_pages(memmap, PAGES_PER_SECTION);
+	clear_hwpoisoned_pages(memmap + map_offset,
+			PAGES_PER_SECTION - map_offset);
 	free_section_usemap(memmap, usemap);
 }
 #endif /* CONFIG_MEMORY_HOTREMOVE */
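
To make the calling convention concrete, here is a short sketch of how a
hypothetical ZONE_DEVICE driver could carve the front of its range out for
memmap storage via the new @altmap argument.  The reservation sizes below
are invented for illustration; the real consumer is the pmem driver in
patch 6 of this series.

#include <linux/mm.h>
#include <linux/pfn.h>
#include <linux/sizes.h>
#include <linux/device.h>
#include <linux/ioport.h>

static void *example_map_with_altmap(struct device *dev, struct resource *res)
{
	struct vmem_altmap altmap = {
		.base_pfn = PFN_DOWN(res->start),
		.reserve  = 2,			/* pages kept for a driver info block */
		.free     = PFN_DOWN(SZ_128M),	/* device pages for struct page storage */
	};

	/*
	 * vmemmap_populate() now satisfies vmemmap_alloc_block_buf()
	 * requests from altmap.free instead of the page allocator.  The
	 * first pfn with a usable struct page is base_pfn +
	 * vmem_altmap_offset(); devm_memremap_pages() copies the altmap,
	 * so a stack-local descriptor is fine.
	 */
	return devm_memremap_pages(dev, res, &altmap);
}

With such a reservation in place, the memmap_init_zone() hunk above skips
the reserved head pages, and arch_remove_memory() offsets its zone lookup
by vmem_altmap_offset() at teardown time.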


* Re: [PATCH v2 00/20] get_user_pages() for dax mappings
  2015-10-10  0:55 [PATCH v2 00/20] get_user_pages() for dax mappings Dan Williams
                   ` (19 preceding siblings ...)
  2015-10-10  0:57 ` [PATCH v2 20/20] mm, x86: get_user_pages() for dax mappings Dan Williams
@ 2015-10-23 21:06 ` Logan Gunthorpe
  2015-11-30 22:15   ` Dan Williams
  20 siblings, 1 reply; 37+ messages in thread
From: Logan Gunthorpe @ 2015-10-23 21:06 UTC (permalink / raw)
  To: Dan Williams, linux-nvdimm
  Cc: linux-mips, Dave Hansen, Boaz Harrosh, David Airlie,
	Catalin Marinas, Dave Hansen, Dave Chinner, Keith Busch,
	linux-mm, Paul Mackerras, H. Peter Anvin, hch, Russell King,
	Richard Weinberger, Peter Zijlstra, Jeff Moyer, Ingo Molnar,
	Benjamin Herrenschmidt, Matthew Wilcox, ross.zwisler,
	Gleb Natapov, Marc Zyngier, Will Deacon, Jeff Dike,
	Alexander Viro, Stephen Bates

Hi Dan,

We've tested this patch series (as pulled from your git repo) with our 
P2P work and everything is working great. The issues we found in v1 have 
been fixed and we have not found any new ones.

Tested-By: Logan Gunthorpe <logang@deltatee.com>

Thanks,

Logan



On 09/10/15 06:55 PM, Dan Williams wrote:
> Changes since v1 [1]:
> 1/ Rebased on the accepted cleanups to the memremap() api and the NUMA
>     hints for devm allocations. (see libnvdimm-for-next [2]).
>
> 2/ Rebased on DAX fixes from Ross [3], currently in -mm, and Dave [4],
>     applied locally for now.
>
> 3/ Renamed __pfn_t to pfn_t and converted KVM and UM accordingly (Dave
>     Hansen)
>
> 4/ Make pfn-to-pfn_t conversions a nop (binary identical) for typical
>     mapped pfns (Dave Hansen)
>
> 5/ Fixed up the devm_memremap_pages() api to require passing in a
>     percpu_ref object.  Addresses a crash reported-by Logan.
>
> 6/ Moved the back pointer from a page to its hosting 'struct
>     dev_pagemap' to share storage with the 'lru' field rather than
>     'mapping'.  Enables us to revoke mappings at devm_memunmap_page()
>     time and addresses a crash reported-by Logan.
>
> 7/ Rework dax_map_bh() into dax_map_atomic() to avoid proliferating
>     buffer_head usage deeper into the dax implementation.  Also addresses
>     a crash reported by Logan (Dave Chinner)
>
> 8/ Include an initial, only lightly tested, implementation of revoking
>     usages of ZONE_DEVICE pages when the driver disables the pmem device.
>     This coordinates with blk_cleanup_queue() for the pmem gendisk, see
>     patch 19.
>
> 9/ Include a cleaned up version of the vmem_altmap infrastructure
>     allowing the struct page memmap to optionally be allocated from pmem
>     itself.
>
> [1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
> [2]: https://git.kernel.org/cgit/linux/kernel/git/nvdimm/nvdimm.git/log/?h=libnvdimm-for-next
> [3]: https://git.kernel.org/cgit/linux/kernel/git/nvdimm/nvdimm.git/commit/?h=dax-fixes&id=93fdde069dce
> [4]: https://lists.01.org/pipermail/linux-nvdimm/2015-October/002286.html
>
> ---
> To date, we have implemented two I/O usage models for persistent memory,
> PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
> userspace).  This series adds a third, DAX-GUP, that allows DAX mappings
> to be the target of direct-i/o.  It allows userspace to coordinate
> DMA/RDMA from/to persitent memory.
>
> The implementation leverages the ZONE_DEVICE mm-zone that went into
> 4.3-rc1 to flag pages that are owned and dynamically mapped by a device
> driver.  The pmem driver, after mapping a persistent memory range into
> the system memmap via devm_memremap_pages(), arranges for DAX to
> distinguish pfn-only versus page-backed pmem-pfns via flags in the new
> __pfn_t type.  The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn,
> flags the resulting pte(s) inserted into the process page tables with a
> new _PAGE_DEVMAP flag.  Later, when get_user_pages() is walking ptes it
> keys off _PAGE_DEVMAP to pin the device hosting the page range active.
> Finally, get_page() and put_page() are modified to take references
> against the device driver established page mapping.
>
> This series is available via git here:
>
>    git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm libnvdimm-pending
>
> ---
>
> Dan Williams (20):
>        block: generic request_queue reference counting
>        dax: increase granularity of dax_clear_blocks() operations
>        block, dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()
>        mm: introduce __get_dev_pagemap()
>        x86, mm: introduce vmem_altmap to augment vmemmap_populate()
>        libnvdimm, pfn, pmem: allocate memmap array in persistent memory
>        avr32: convert to asm-generic/memory_model.h
>        hugetlb: fix compile error on tile
>        frv: fix compiler warning from definition of __pmd()
>        um: kill pfn_t
>        kvm: rename pfn_t to kvm_pfn_t
>        mips: fix PAGE_MASK definition
>        mm, dax, pmem: introduce pfn_t
>        mm, dax, gpu: convert vm_insert_mixed to pfn_t, introduce _PAGE_DEVMAP
>        mm, dax: convert vmf_insert_pfn_pmd() to pfn_t
>        list: introduce list_poison() and LIST_POISON3
>        mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup
>        block: notify queue death confirmation
>        mm, pmem: devm_memunmap_pages(), truncate and unmap ZONE_DEVICE pages
>        mm, x86: get_user_pages() for dax mappings
>
>
>   arch/alpha/include/asm/pgtable.h        |    1
>   arch/arm/include/asm/kvm_mmu.h          |    5 -
>   arch/arm/kvm/mmu.c                      |   10 +
>   arch/arm64/include/asm/kvm_mmu.h        |    3
>   arch/avr32/include/asm/page.h           |    8 -
>   arch/frv/include/asm/page.h             |    2
>   arch/ia64/include/asm/pgtable.h         |    1
>   arch/m68k/include/asm/page_mm.h         |    1
>   arch/m68k/include/asm/page_no.h         |    1
>   arch/mips/include/asm/kvm_host.h        |    6 -
>   arch/mips/include/asm/page.h            |    2
>   arch/mips/kvm/emulate.c                 |    2
>   arch/mips/kvm/tlb.c                     |   14 +
>   arch/parisc/include/asm/pgtable.h       |    1
>   arch/powerpc/include/asm/kvm_book3s.h   |    4
>   arch/powerpc/include/asm/kvm_ppc.h      |    2
>   arch/powerpc/include/asm/pgtable.h      |    1
>   arch/powerpc/kvm/book3s.c               |    6 -
>   arch/powerpc/kvm/book3s_32_mmu_host.c   |    2
>   arch/powerpc/kvm/book3s_64_mmu_host.c   |    2
>   arch/powerpc/kvm/e500.h                 |    2
>   arch/powerpc/kvm/e500_mmu_host.c        |    8 -
>   arch/powerpc/kvm/trace_pr.h             |    2
>   arch/powerpc/sysdev/axonram.c           |    8 -
>   arch/sparc/include/asm/pgtable_64.h     |    2
>   arch/tile/include/asm/pgtable.h         |    1
>   arch/um/include/asm/page.h              |    6 -
>   arch/um/include/asm/pgtable-3level.h    |    5 -
>   arch/um/include/asm/pgtable.h           |    2
>   arch/x86/include/asm/pgtable.h          |   24 ++
>   arch/x86/include/asm/pgtable_types.h    |    7 +
>   arch/x86/kvm/iommu.c                    |   11 +
>   arch/x86/kvm/mmu.c                      |   37 ++--
>   arch/x86/kvm/mmu_audit.c                |    2
>   arch/x86/kvm/paging_tmpl.h              |    6 -
>   arch/x86/kvm/vmx.c                      |    2
>   arch/x86/kvm/x86.c                      |    2
>   arch/x86/mm/gup.c                       |   56 +++++-
>   arch/x86/mm/init_64.c                   |   32 +++
>   arch/x86/mm/pat.c                       |    4
>   block/blk-core.c                        |   79 +++++++-
>   block/blk-mq-sysfs.c                    |    6 -
>   block/blk-mq.c                          |   87 +++------
>   block/blk-sysfs.c                       |    3
>   block/blk.h                             |   12 +
>   drivers/block/brd.c                     |    4
>   drivers/gpu/drm/exynos/exynos_drm_gem.c |    3
>   drivers/gpu/drm/gma500/framebuffer.c    |    3
>   drivers/gpu/drm/msm/msm_gem.c           |    3
>   drivers/gpu/drm/omapdrm/omap_gem.c      |    6 -
>   drivers/gpu/drm/ttm/ttm_bo_vm.c         |    3
>   drivers/nvdimm/pfn_devs.c               |    3
>   drivers/nvdimm/pmem.c                   |  128 +++++++++----
>   drivers/s390/block/dcssblk.c            |   10 -
>   fs/block_dev.c                          |    2
>   fs/dax.c                                |  199 +++++++++++++--------
>   include/asm-generic/pgtable.h           |    6 -
>   include/linux/blk-mq.h                  |    1
>   include/linux/blkdev.h                  |   12 +
>   include/linux/huge_mm.h                 |    2
>   include/linux/hugetlb.h                 |    1
>   include/linux/io.h                      |   17 --
>   include/linux/kvm_host.h                |   37 ++--
>   include/linux/kvm_types.h               |    2
>   include/linux/list.h                    |   14 +
>   include/linux/memory_hotplug.h          |    3
>   include/linux/mm.h                      |  300 +++++++++++++++++++++++++++++--
>   include/linux/mm_types.h                |    5 +
>   include/linux/pfn.h                     |    9 +
>   include/linux/poison.h                  |    1
>   kernel/memremap.c                       |  187 +++++++++++++++++++
>   lib/list_debug.c                        |    2
>   mm/gup.c                                |   11 +
>   mm/huge_memory.c                        |   10 +
>   mm/hugetlb.c                            |   18 ++
>   mm/memory.c                             |   17 +-
>   mm/memory_hotplug.c                     |   66 +++++--
>   mm/page_alloc.c                         |   10 +
>   mm/sparse-vmemmap.c                     |   37 ++++
>   mm/sparse.c                             |    8 +
>   mm/swap.c                               |   15 ++
>   virt/kvm/kvm_main.c                     |   47 ++---
>   82 files changed, 1264 insertions(+), 418 deletions(-)
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 00/20] get_user_pages() for dax mappings
  2015-10-23 21:06 ` [PATCH v2 00/20] " Logan Gunthorpe
@ 2015-11-30 22:15   ` Dan Williams
  2015-12-02 22:02     ` Logan Gunthorpe
  0 siblings, 1 reply; 37+ messages in thread
From: Dan Williams @ 2015-11-30 22:15 UTC (permalink / raw)
  To: Logan Gunthorpe; +Cc: linux-nvdimm, Linux MM

On Fri, Oct 23, 2015 at 2:06 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
> Hi Dan,
>
> We've tested this patch series (as pulled from your git repo) with our P2P
> work and everything is working great. The issues we found in v1 have been
> fixed and we have not found any new ones.
>
> Tested-By: Logan Gunthorpe <logang@deltatee.com>
>
>

Hi Logan,

I appreciate the test report.  I appreciate it so much I wonder if
you'd be willing to re-test the current state of:

git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm libnvdimm-pending

...with the revised approach that I'm proposing for 4.5 inclusion.

The main changes are fixes for supporting huge-page mappings.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 00/20] get_user_pages() for dax mappings
  2015-11-30 22:15   ` Dan Williams
@ 2015-12-02 22:02     ` Logan Gunthorpe
  2015-12-02 22:04       ` Dan Williams
  2015-12-04  2:16       ` Dan Williams
  0 siblings, 2 replies; 37+ messages in thread
From: Logan Gunthorpe @ 2015-12-02 22:02 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-nvdimm, Linux MM, Stephen Bates

On 30/11/15 03:15 PM, Dan Williams wrote:
> I appreciate the test report.  I appreciate it so much I wonder if
> you'd be willing to re-test the current state of:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm libnvdimm-pending


Hi Dan,

I've had some mixed success with the above branch. Many of my tests are 
working but I have the following two issues which I didn't see previously:

* When trying to do RDMA transfers to an mmapped DAX file I get a kernel 
panic while de-registering the memory region. (The panic message is at 
the end of this email.) addr2line puts it around dax.c:723 for the first 
line in the call trace; the address where the failure occurs doesn't 
seem to map to a line of code.

* Less important: my tests no longer work inside qemu because I'm using 
a region in the PCI BAR space which is not on a section boundary. The 
latest code enforces that restriction, which makes it harder to use with 
PCI memory. (I'm talking about the check around memremap.c:311; a rough 
sketch follows below.) Presently, if I comment out the check, my VM 
tests work fine. This hasn't been a problem on real hardware, as we are 
using a 64-bit address space and thus the BAR addresses are better 
aligned.
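
For reference, the check I'm hitting looks roughly like this (quoting 
from memory, so this is not the exact code in your branch):

	/* in devm_memremap_pages(): reject ranges that do not cover
	 * whole memory sections */
	resource_size_t align_start, align_size;

	align_start = res->start & ~(SECTION_SIZE - 1);
	align_size = ALIGN(resource_size(res), SECTION_SIZE);
	if (align_start != res->start || align_size != resource_size(res))
		return ERR_PTR(-EINVAL);	/* my BAR trips this */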


I don't have much time at the moment to dig into the kernel panic myself 
so hopefully what I've provided will help you find the issue. If you 
need any more information let me know.

Thanks,

Logan





> [ 1542.406591] BUG: unable to handle kernel paging request at 00000000300000d1
> [ 1542.406627] IP: [<ffffffffa033be40>] ext4_end_io_unwritten+0x10/0x60 [ext4]
> [ 1542.406661] PGD 260d27067 PUD 2602aa067 PMD 0
> [ 1542.406701] Oops: 0000 [#1] SMP
> [ 1542.406729] Modules linked in: mem_map(O) mtramonb(O) xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables x_tables br_netfilter nf_nat nf_conntrack bridge stp llc dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c binfmt_misc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc ext2 x86_pkg_temp_thermal coretemp kvm_intel kvm irqbypass sha256_generic iTCO_wdt iTCO_vendor_support hmac drbg ansi_cprng aesni_intel aes_x86_64 ablk_helper cryptd lrw sb_edac nvme gf128mul glue_helper ipmi_si edac_core psmouse joydev evdev lpc_ich i2c_i801 pcspkr serio_raw mfd_core wmi ipmi_msghandler acpi_cpufreq tpm_tis tpm ioatdma processor shpchp button iw_cxgb4 cxgb4 rdma_ucm ib_uverbs rdma_cm
> [ 1542.407293] iw_cm ib_ipoib ib_cm ib_umad mlx4_ib ib_sa ib_mad ib_core ib_addr msr loop fuse autofs4 ext4 mbcache jbd2 btrfs xor raid6_pq dm_mod md_mod ohci_hcd uhci_hcd xhci_hcd sg sd_mod hid_generic usbhid hid isci ahci libsas libahci igb i2c_algo_bit dca scsi_transport_sas ehci_pci libata ehci_hcd ptp crc32c_intel pps_core scsi_mod usbcore i2c_core usb_common mlx4_core
> [ 1542.407612] CPU: 5 PID: 4740 Comm: client Tainted: G O 4.4.0-rc3+donard2.4+ #78
> [ 1542.407682] Hardware name: Supermicro SYS-7047GR-TRF/X9DRG-QF, BIOS 3.0a 12/05/2013
> [ 1542.407749] task: ffff8802767445c0 ti: ffff8802601fc000 task.ti: ffff8802601fc000
> [ 1542.407816] RIP: 0010:[<ffffffffa033be40>] [<ffffffffa033be40>] ext4_end_io_unwritten+0x10/0x60 [ext4]
> [ 1542.407895] RSP: 0000:ffff8802601ffcf8 EFLAGS: 00010246
> [ 1542.407935] RAX: 00000000300000d1 RBX: 0000000000000800 RCX: 0000000000000000
> [ 1542.407981] RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffff8802601ffd68
> [ 1542.408025] RBP: 0000000000000000 R08: ffffffffa0342a9c R09: ffffffffa033be30
> [ 1542.408070] R10: 0000000000000001 R11: 0000000000000246 R12: ffff880464dd4230
> [ 1542.408114] R13: ffff88026030b858 R14: ffff880464dd40c8 R15: 0000000000000800
> [ 1542.408160] FS: 00007f5662bde740(0000) GS:ffff88047fc80000(0000) knlGS:0000000000000000
> [ 1542.408228] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1542.408269] CR2: 00000000300000d1 CR3: 0000000260dbb000 CR4: 00000000000406e0
> [ 1542.408314] Stack:
> [ 1542.408347] 0000000000000800 ffff8804716af018 ffff880464dd4230 ffff88026030b858
> [ 1542.408431] ffffffff81169652 000000008113ac16 00007f56617ee000 ffff8802601ffde0
> [ 1542.408516] ffffffff8113df6c ffffffffa033be30 00000024767445c0 ffff880200000041
> [ 1542.408601] Call Trace:
> [ 1542.408641] [<ffffffff81169652>] ? __dax_pmd_fault+0x3c0/0x3eb
> [ 1542.408686] [<ffffffff8113df6c>] ? path_openat+0xb33/0xc16
> [ 1542.408731] [<ffffffffa033be30>] ? ext4_dax_mkwrite+0x13/0x13 [ext4]
> [ 1542.408776] [<ffffffffa041dcbf>] ? ib_uverbs_dereg_mr+0xad/0xbb [ib_uverbs]
> [ 1542.408823] [<ffffffff8110795f>] ? vma_gap_callbacks_propagate+0x16/0x2c
> [ 1542.408868] [<ffffffff81108475>] ? vma_link+0x71/0x7e
> [ 1542.408910] [<ffffffff81109010>] ? vma_set_page_prot+0x33/0x50
> [ 1542.408955] [<ffffffffa033c081>] ? ext4_dax_pmd_fault+0xa7/0xee [ext4]
> [ 1542.409000] [<ffffffff8110560b>] ? handle_mm_fault+0x236/0xe95
> [ 1542.409043] [<ffffffff810f6fdc>] ? vm_mmap_pgoff+0x80/0xab
> [ 1542.409086] [<ffffffff81039c12>] ? __do_page_fault+0x239/0x3f2
> [ 1542.409131] [<ffffffff813e0722>] ? page_fault+0x22/0x30
> [ 1542.409171] Code: 5c 41 5d 41 5e 41 5f c3 48 c7 c1 30 be 33 a0 48 c7 c2 9c 2a 34 a0 e9 86 de e2 e0 41 55 41 54 85 f6 55 53 48 8b 47 58 48 8b 6f 40 <4c> 8b 20 45 8b ac 24 90 00 00 00 74 3c 48 8b 07 48 89 fb f6 c4
> [ 1542.409609] RIP [<ffffffffa033be40>] ext4_end_io_unwritten+0x10/0x60 [ext4]
> [ 1542.409661] RSP <ffff8802601ffcf8>
> [ 1542.409696] CR2: 00000000300000d1
> [ 1542.410353] ---[ end trace c43bed51af8ba585 ]---


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 00/20] get_user_pages() for dax mappings
  2015-12-02 22:02     ` Logan Gunthorpe
@ 2015-12-02 22:04       ` Dan Williams
  2015-12-04  2:16       ` Dan Williams
  1 sibling, 0 replies; 37+ messages in thread
From: Dan Williams @ 2015-12-02 22:04 UTC (permalink / raw)
  To: Logan Gunthorpe; +Cc: linux-nvdimm, Linux MM, Stephen Bates

On Wed, Dec 2, 2015 at 2:02 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
> On 30/11/15 03:15 PM, Dan Williams wrote:
>>
>> I appreciate the test report.  I appreciate it so much I wonder if
>> you'd be willing to re-test the current state of:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm
>> libnvdimm-pending
>
>
>
> Hi Dan,
>
> I've had some mixed success with the above branch. Many of my tests are
> working but I have the following two issues which I didn't see previously:
>
> * When trying to do RDMA transfers to an mmapped DAX file I get a kernel panic
> while de-registering the memory region. (The panic message is at the end of
> this email.) addr2line puts it around dax.c:723 for the first line in the
> call trace; the address where the failure occurs doesn't seem to map to a
> line of code.
>
> * Less important: my tests no longer work inside qemu because I'm using a
> region in the PCI BAR space which is not on a section boundary. The latest
> code enforces that restriction, which makes it harder to use with PCI memory.
> (I'm talking about the check around memremap.c:311.) Presently, if I comment
> out the check, my VM tests work fine. This hasn't been a problem on real
> hardware, as we are using a 64-bit address space and thus the BAR addresses
> are better aligned.
>
>
> I don't have much time at the moment to dig into the kernel panic myself so
> hopefully what I've provided will help you find the issue. If you need any
> more information let me know.

This is great. Thank you!  I'll take a look.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 00/20] get_user_pages() for dax mappings
  2015-12-02 22:02     ` Logan Gunthorpe
  2015-12-02 22:04       ` Dan Williams
@ 2015-12-04  2:16       ` Dan Williams
  2015-12-05  1:58         ` Logan Gunthorpe
  1 sibling, 1 reply; 37+ messages in thread
From: Dan Williams @ 2015-12-04  2:16 UTC (permalink / raw)
  To: Logan Gunthorpe; +Cc: linux-nvdimm, Linux MM, Stephen Bates

On Wed, Dec 2, 2015 at 2:02 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
> On 30/11/15 03:15 PM, Dan Williams wrote:
>>
>> I appreciate the test report.  I appreciate it so much I wonder if
>> you'd be willing to re-test the current state of:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm
>> libnvdimm-pending
>
>
>
> Hi Dan,
>
> I've had some mixed success with the above branch. Many of my tests are
> working but I have the following two issues which I didn't see previously:
>
> * When trying to do RDMA transfers to an mmapped DAX file I get a kernel panic
> while de-registering the memory region. (The panic message is at the end of
> this email.) addr2line puts it around dax.c:723 for the first line in the
> call trace; the address where the failure occurs doesn't seem to map to a
> line of code.
>
> * Less important: my tests no longer work inside qemu because I'm using a
> region in the PCI BAR space which is not on a section boundary. The latest
> code enforces that restriction, which makes it harder to use with PCI memory.
> (I'm talking about the check around memremap.c:311.) Presently, if I comment
> out the check, my VM tests work fine. This hasn't been a problem on real
> hardware, as we are using a 64-bit address space and thus the BAR addresses
> are better aligned.
>

I could loosen the restriction a bit to allow one unaligned mapping
per section.  However, if another mapping request came along that
tried to map a free part of the section it would fail, because the code
depends on a "1 dev_pagemap per section" relationship.  Seems like an OK
compromise to me...
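
Roughly what I have in mind (a sketch only; section_has_pagemap() is a
hypothetical stand-in for whatever lookup the real code ends up using):

	/* Instead of requiring section alignment, permit one
	 * (possibly unaligned) dev_pagemap per memory section. */
	unsigned long pfn;

	for (pfn = SECTION_ALIGN_DOWN(PFN_DOWN(res->start));
	     pfn <= PFN_DOWN(res->end); pfn += PAGES_PER_SECTION) {
		if (section_has_pagemap(pfn))
			/* section already claimed by another mapping */
			return ERR_PTR(-EBUSY);
	}
	/* the first claimant owns its section(s), aligned or not */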

> I don't have much time at the moment to dig into the kernel panic myself so
> hopefully what I've provided will help you find the issue. If you need any
> more information let me know.

Could you share the test setup for this one so I can try to reproduce?
As far as I can see this looks like an ext4 internals issue.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 00/20] get_user_pages() for dax mappings
  2015-12-04  2:16       ` Dan Williams
@ 2015-12-05  1:58         ` Logan Gunthorpe
  2015-12-08  0:00           ` Logan Gunthorpe
  0 siblings, 1 reply; 37+ messages in thread
From: Logan Gunthorpe @ 2015-12-05  1:58 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-nvdimm, Linux MM, Stephen Bates

Hey,

On 03/12/15 07:16 PM, Dan Williams wrote:
> I could loosen the restriction a bit to allow one unaligned mapping
> per section.  However, if another mapping request came along that
> tried to map a free part of the section it would fail, because the code
> depends on a "1 dev_pagemap per section" relationship.  Seems like an OK
> compromise to me...

Sure, that would work fine for us. I think it would be very unusual to 
need to map two adjacent BARs in this way.

> Could you share the test setup for this one so I can try to reproduce?
>   As far as I can see this looks like an ext4 internals issue.

Ok, well it's somewhat specialized and I can't run the failing test in a 
VM because it requires InfiniBand hardware. We have a PCI card that has 
a large memory-backed BAR space. To use that with ZONE_DEVICE we have a 
kernel patch that allows doing the ZONE_DEVICE mapping with I/O memory 
that has write combining enabled. Then we have an out-of-tree kernel 
module that creates a block device from the PCI BAR (similar to the pmem 
code).
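
In outline, the module's mapping step is just this (a sketch with 
illustrative names, using the devm_memremap_pages() signature from your 
series; the write-combining patch and all error handling are elided):

	/* Map a memory-backed PCI BAR as ZONE_DEVICE pages. */
	static void *map_pci_bar(struct pci_dev *pdev, int bar,
				 struct percpu_ref *ref)
	{
		struct resource *res = &pdev->resource[bar];

		/* devm_memremap_pages() currently requires 'res' to be
		 * section-aligned, hence the qemu trouble above */
		return devm_memremap_pages(&pdev->dev, res, ref, NULL);
	}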

I could send you all of that, assuming you have a suitable PCI device. 
However, I'm hoping none of the above has anything to do with the failure.

The test that is failing is a very simple RDMA test with an mmapped DAX 
file, so hopefully it has nothing to do with the fact that a PCI device 
backs it. If you have some IB hardware available you could try our 
simple test code from here:

https://github.com/sbates130272/io_peer_mem/tree/master/test

The server must be run with no arguments. Then the client can be run 
with the address of the server as the first argument and a file that's 
in a DAX fs (with a size greater than 4MB). The client and server should 
be able to run on the same node, if necessary.

Let me know if this helps or if there's anything else I can provide. I 
can probably dig into it some more on Monday on our setup.

Logan


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 00/20] get_user_pages() for dax mappings
  2015-12-05  1:58         ` Logan Gunthorpe
@ 2015-12-08  0:00           ` Logan Gunthorpe
  2015-12-08  0:48             ` Dan Williams
  0 siblings, 1 reply; 37+ messages in thread
From: Logan Gunthorpe @ 2015-12-08  0:00 UTC (permalink / raw)
  To: Dan Williams; +Cc: Stephen Bates, Linux MM, linux-nvdimm

Hi Dan,

I've done a bit of digging and here's some more information:

* The crash occurs in ext4_end_io_unwritten when it tries to dereference 
bh->b_assoc_map, which is not necessarily NULL.

* That function is called by __dax_pmd_fault, as the argument 
complete_unwritten.

* Looking in __dax_pmd_fault, the bug occurs if we hit either of the 
first two 'goto fallback' lines. (In my case, it's hitting the first one.)

* After the fallback code, it goes back to 'out', then checks bh 
for the unwritten flag. But bh hasn't been initialized yet and, on my 
setup, the unwritten flag happens to be set. So it then calls 
complete_unwritten with a garbage bh and crashes.

If I move the memset(&bh) up in the code, before the goto fallbacks can 
occur, I can fix the crash (see the sketch below).  I don't know if this 
is really the best way to fix the problem though.
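
The shape of the change, approximately (this is a sketch against my 
tree, not a literal diff, so the context will differ):

	/* in __dax_pmd_fault(), fs/dax.c: */
	struct buffer_head bh;
	...
	memset(&bh, 0, sizeof(bh));	/* hoisted: was below the first
					 * 'goto fallback' sites */
	...
	if (...)
		goto fallback;		/* previously jumped here with
					 * bh still uninitialized */
	...
 out:
	if (buffer_unwritten(&bh))	/* no longer reads stack garbage
					 * on the fallback path */
		complete_unwritten(&bh, ...);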

--

However, unfortunately, fixing the above just uncovered another issue. 
Now the MR de-registration seems to have completed but the task hangs 
when it's trying to munmap the memory. (Stack trace at the end of this 
email.)

It looks like the task is hanging on i_mmap_lock_write in 
unlink_file_vma. I'm not really sure how to go about debugging this lock 
issue. If you have any steps I can try to get you more information, let 
me know. I'm also happy to re-test if you have any other changes you'd 
like me to try.

Thanks,

Logan


> [ 240.520522] INFO: task client:1997 blocked for more than 120 seconds.
> [ 240.520638] Tainted: G O 4.4.0-rc3+donard2.5+ #87
> [ 240.520741] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 240.520847] client D ffff88047fd14800 0 1997 1912 0x00000004
> [ 240.520856] ffff88026bc7b240 0000000000000000 ffff88026bd38000 ffff88026bd37d30
> [ 240.520861] fffffffeffffffff ffff88026bc7b240 00007f4297513000 ffff880473aba240
> [ 240.520866] ffffffff81422896 ffff880470b34e40 ffffffff814242f1 ffff880476deddc0
> [ 240.520871] Call Trace:
> [ 240.520886] [<ffffffff81422896>] ? schedule+0x6c/0x79
> [ 240.520893] [<ffffffff814242f1>] ? rwsem_down_write_failed+0x285/0x2cb
> [ 240.520903] [<ffffffff8124d833>] ? call_rwsem_down_write_failed+0x13/0x20
> [ 240.520907] [<ffffffff8124d833>] ? call_rwsem_down_write_failed+0x13/0x20
> [ 240.520913] [<ffffffff81423b22>] ? down_write+0x24/0x33
> [ 240.520923] [<ffffffff8110836e>] ? unlink_file_vma+0x28/0x4b
> [ 240.520928] [<ffffffff811033e4>] ? free_pgtables+0x3c/0xba
> [ 240.520933] [<ffffffff81107c15>] ? unmap_region+0xa4/0xc1
> [ 240.520941] [<ffffffff8106c60c>] ? pick_next_task_fair+0x11b/0x347
> [ 240.520947] [<ffffffff8110795f>] ? vma_gap_callbacks_propagate+0x16/0x2c
> [ 240.520951] [<ffffffff81108101>] ? vma_rb_erase+0x161/0x18f
> [ 240.520957] [<ffffffff81109524>] ? do_munmap+0x271/0x2e6
> [ 240.520962] [<ffffffff811095d0>] ? vm_munmap+0x37/0x4f
> [ 240.520967] [<ffffffff81109602>] ? SyS_munmap+0x1a/0x1f
> [ 240.520971] [<ffffffff81424d57>] ? entry_SYSCALL_64_fastpath+0x12/0x6a


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 00/20] get_user_pages() for dax mappings
  2015-12-08  0:00           ` Logan Gunthorpe
@ 2015-12-08  0:48             ` Dan Williams
  0 siblings, 0 replies; 37+ messages in thread
From: Dan Williams @ 2015-12-08  0:48 UTC (permalink / raw)
  To: Logan Gunthorpe; +Cc: Stephen Bates, Linux MM, linux-nvdimm

On Mon, Dec 7, 2015 at 4:00 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
> Hi Dan,
>
> I've done a bit of digging and here's some more information:
>
> * The crash occurs in ext4_end_io_unwritten when it tries to dereference
> bh->b_assoc_map, which is not necessarily NULL.
>
> * That function is called by __dax_pmd_fault, as the argument
> complete_unwritten.
>
> * Looking in __dax_pmd_fault, the bug occurs if we hit either of the first
> two 'goto fallback' lines. (In my case, it's hitting the first one.)
>
> * After the fallback code, it goes back to 'out', then checks bh
> for the unwritten flag. But bh hasn't been initialized yet and, on my setup,
> the unwritten flag happens to be set. So it then calls complete_unwritten
> with a garbage bh and crashes.
>
> If I move the memset(&bh) up in the code, before the goto fallbacks can
> occur, I can fix the crash.  I don't know if this is really the best way to
> fix the problem though.

I believe you are hitting the same issue that Matthew hit here:

https://patchwork.kernel.org/patch/7763851/

I have it fixed up in the latest that I pushed out last night to
libnvdimm-pending.  Note the libnvdimm-pending branch is now based on
linux-next as I needed to resolve collisions with transparent huge
page work pending in the -mm tree.

> However, unfortunately, fixing the above just uncovered another issue. Now
> the MR de-registration seems to have completed but the task hangs when it's
> trying to munmap the memory. (Stack trace at the end of this email.)
>
> It looks like the task is hanging on i_mmap_lock_write in unlink_file_vma.
> I'm not really sure how to go about debugging this lock issue. If you have
> any steps I can try to get you more information, let me know. I'm also happy
> to re-test if you have any other changes you'd like me to try.

I worked through a crop of hangs and crashes triggered by Toshi's mmap
test.  Give the latest a try if you get a chance and I'll fix it up if
it still occurs.  I'll be pushing an updated branch again tonight with
fixes for issues uncovered while running the nvml test suite.

Thanks Logan!


^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2015-12-08  0:48 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-10  0:55 [PATCH v2 00/20] get_user_pages() for dax mappings Dan Williams
2015-10-10  0:55 ` [PATCH v2 01/20] block: generic request_queue reference counting Dan Williams
2015-10-11 12:59   ` Christoph Hellwig
2015-10-13  0:09     ` Dan Williams
2015-10-10  0:55 ` [PATCH v2 02/20] dax: increase granularity of dax_clear_blocks() operations Dan Williams
2015-10-10  0:55 ` [PATCH v2 03/20] block, dax: fix lifetime of in-kernel dax mappings with dax_map_atomic() Dan Williams
2015-10-10  0:55 ` [PATCH v2 04/20] mm: introduce __get_dev_pagemap() Dan Williams
2015-10-10  0:55 ` [PATCH v2 05/20] x86, mm: introduce vmem_altmap to augment vmemmap_populate() Dan Williams
2015-10-19 22:53   ` Williams, Dan J
2015-10-10  0:55 ` [PATCH v2 06/20] libnvdimm, pfn, pmem: allocate memmap array in persistent memory Dan Williams
2015-10-10  0:56 ` [PATCH v2 07/20] avr32: convert to asm-generic/memory_model.h Dan Williams
2015-10-10  0:56 ` [PATCH v2 08/20] hugetlb: fix compile error on tile Dan Williams
2015-10-10  0:56 ` [PATCH v2 09/20] frv: fix compiler warning from definition of __pmd() Dan Williams
2015-10-10  0:56 ` [PATCH v2 10/20] um: kill pfn_t Dan Williams
2015-10-10  0:56 ` [PATCH v2 11/20] kvm: rename pfn_t to kvm_pfn_t Dan Williams
2015-10-10 15:35   ` Christoffer Dall
2015-10-10 20:35   ` Paolo Bonzini
2015-10-10 20:57     ` Dan Williams
2015-10-12 12:51       ` Paolo Bonzini
2015-10-12 16:16         ` Dan Williams
2015-10-10  0:56 ` [PATCH v2 12/20] mips: fix PAGE_MASK definition Dan Williams
2015-10-10  0:56 ` [PATCH v2 13/20] mm, dax, pmem: introduce pfn_t Dan Williams
2015-10-10  0:56 ` [PATCH v2 14/20] mm, dax, gpu: convert vm_insert_mixed to pfn_t, introduce _PAGE_DEVMAP Dan Williams
2015-10-10  0:56 ` [PATCH v2 15/20] mm, dax: convert vmf_insert_pfn_pmd() to pfn_t Dan Williams
2015-10-10  0:56 ` [PATCH v2 16/20] list: introduce list_poison() and LIST_POISON3 Dan Williams
2015-10-10  0:56 ` [PATCH v2 17/20] mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup Dan Williams
2015-10-10  0:57 ` [PATCH v2 18/20] block: notify queue death confirmation Dan Williams
2015-10-10  0:57 ` [PATCH v2 19/20] mm, pmem: devm_memunmap_pages(), truncate and unmap ZONE_DEVICE pages Dan Williams
2015-10-10  0:57 ` [PATCH v2 20/20] mm, x86: get_user_pages() for dax mappings Dan Williams
2015-10-23 21:06 ` [PATCH v2 00/20] " Logan Gunthorpe
2015-11-30 22:15   ` Dan Williams
2015-12-02 22:02     ` Logan Gunthorpe
2015-12-02 22:04       ` Dan Williams
2015-12-04  2:16       ` Dan Williams
2015-12-05  1:58         ` Logan Gunthorpe
2015-12-08  0:00           ` Logan Gunthorpe
2015-12-08  0:48             ` Dan Williams

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).