linux-mm.kvack.org archive mirror
* [PATCH 00/15] get_user_pages() for dax mappings
@ 2015-09-23  4:41 Dan Williams
  2015-09-23  4:41 ` [PATCH 01/15] avr32: convert to asm-generic/memory_model.h Dan Williams
                   ` (14 more replies)
  0 siblings, 15 replies; 39+ messages in thread
From: Dan Williams @ 2015-09-23  4:41 UTC (permalink / raw)
  To: akpm
  Cc: Jens Axboe, Boaz Harrosh, Dave Hansen, linux-nvdimm,
	Peter Zijlstra, Dave Chinner, linux-kernel, Christoph Hellwig,
	linux-mm, David Airlie, Jeff Moyer, Ingo Molnar, Alexander Viro,
	H. Peter Anvin, linux-fsdevel, Matthew Wilcox, Thomas Gleixner,
	Ross Zwisler

To date, we have implemented two I/O usage models for persistent memory,
PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
userspace).  This series adds a third, DAX-GUP, that allows DAX mappings
to be the target of direct-I/O.  It allows userspace to coordinate
DMA/RDMA from/to persistent memory.

The implementation leverages the ZONE_DEVICE mm-zone that went into
4.3-rc1 to flag pages that are owned and dynamically mapped by a device
driver.  The pmem driver, after mapping a persistent memory range into
the system memmap via devm_memremap_pages(), arranges for DAX to
distinguish pfn-only versus page-backed pmem-pfns via flags in the new
__pfn_t type.  The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn,
flags the resulting pte(s) inserted into the process page tables with a
new _PAGE_DEVMAP flag.  Later, when get_user_pages() walks the ptes, it
keys off _PAGE_DEVMAP to keep the device hosting the page range pinned
active.  Finally, get_page() and put_page() are modified to take
references against the device-driver-established page mapping.
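
As an illustration (a rough sketch, not the patch contents themselves) of
how the fast gup path can key off the new pte flag, assuming helpers along
the lines of pte_devmap() and get_dev_pagemap() introduced later in the
series:

	/* sketch: inside the pte walk of the x86 fast gup path */
	if (pte_devmap(pte)) {
		/* pin the hosting device before taking page references */
		pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
		if (!pgmap)
			return 0;	/* fall back to the slow path */
	}
	page = pte_page(pte);
	get_page(page);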

Next step: more testing, specifically DAX-get_user_pages() vs truncate.

Patches 1 - 3 are general compilation fixups from 0day-kbuild reports
while developing this series.

Patches 4 - 7 are minor cleanups and reworks of the devm_memremap_* api.

Patches 8 - 10 add a reference counter for pinning the pmem driver
active while it is in use.  It turns out, prior to these changes, you
can reliably crash the kernel on shutdown if the pmem device is unbound
while hosting a mounted filesystem.

Patches 11 - 15 use __pfn_t and the _PAGE_DEVMAP flag to implement the
dax-gup path.

This series is built on 4.3-rc2 plus the __dax_pmd_fault fix from Ross:
https://patchwork.kernel.org/patch/7244961/

---

Dan Williams (15):
      avr32: convert to asm-generic/memory_model.h
      hugetlb: fix compile error on tile
      frv: fix compiler warning from definition of __pmd()
      x86, mm: quiet arch_add_memory()
      pmem: kill memremap_pmem()
      devm_memunmap: use devres_release()
      devm_memremap: convert to return ERR_PTR
      block, dax, pmem: reference counting infrastructure
      block, pmem: fix null pointer de-reference on shutdown, check for queue death
      block, dax: fix lifetime of in-kernel dax mappings
      mm, dax, pmem: introduce __pfn_t
      mm, dax, gpu: convert vm_insert_mixed to __pfn_t, introduce _PAGE_DEVMAP
      mm, dax: convert vmf_insert_pfn_pmd() to __pfn_t
      mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup
      mm, x86: get_user_pages() for dax mappings


 arch/alpha/include/asm/pgtable.h        |    1 
 arch/avr32/include/asm/page.h           |    8 +
 arch/frv/include/asm/page.h             |    2 
 arch/ia64/include/asm/pgtable.h         |    1 
 arch/m68k/include/asm/page_no.h         |    1 
 arch/parisc/include/asm/pgtable.h       |    1 
 arch/powerpc/include/asm/pgtable.h      |    1 
 arch/powerpc/sysdev/axonram.c           |   10 +
 arch/sparc/include/asm/pgtable_64.h     |    2 
 arch/tile/include/asm/pgtable.h         |    1 
 arch/um/include/asm/pgtable-3level.h    |    1 
 arch/x86/include/asm/pgtable.h          |   24 ++++
 arch/x86/include/asm/pgtable_types.h    |    7 +
 arch/x86/mm/gup.c                       |   56 ++++++++
 arch/x86/mm/init.c                      |    4 -
 arch/x86/mm/init_64.c                   |    4 -
 arch/x86/mm/pat.c                       |    4 -
 block/blk-core.c                        |   86 ++++++++++++-
 block/blk-mq-sysfs.c                    |    2 
 block/blk-mq.c                          |   48 ++-----
 block/blk-sysfs.c                       |    9 +
 block/blk.h                             |    3 
 drivers/block/brd.c                     |    6 -
 drivers/gpu/drm/exynos/exynos_drm_gem.c |    3 
 drivers/gpu/drm/gma500/framebuffer.c    |    2 
 drivers/gpu/drm/msm/msm_gem.c           |    3 
 drivers/gpu/drm/omapdrm/omap_gem.c      |    6 +
 drivers/gpu/drm/ttm/ttm_bo_vm.c         |    3 
 drivers/nvdimm/pmem.c                   |   57 +++++---
 drivers/s390/block/dcssblk.c            |   12 +-
 fs/block_dev.c                          |    2 
 fs/dax.c                                |  140 +++++++++++++-------
 include/asm-generic/pgtable.h           |    6 +
 include/linux/blkdev.h                  |   24 +++-
 include/linux/huge_mm.h                 |    2 
 include/linux/hugetlb.h                 |    1 
 include/linux/io.h                      |   17 --
 include/linux/mm.h                      |  212 +++++++++++++++++++++++++++++--
 include/linux/mm_types.h                |    6 +
 include/linux/pfn.h                     |    9 +
 include/linux/pmem.h                    |   26 ----
 kernel/memremap.c                       |   78 +++++++++++
 mm/gup.c                                |   11 +-
 mm/huge_memory.c                        |   10 +
 mm/hugetlb.c                            |   18 ++-
 mm/memory.c                             |   17 +-
 mm/swap.c                               |   15 ++
 47 files changed, 729 insertions(+), 233 deletions(-)

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 01/15] avr32: convert to asm-generic/memory_model.h
  2015-09-23  4:41 [PATCH 00/15] get_user_pages() for dax mappings Dan Williams
@ 2015-09-23  4:41 ` Dan Williams
  2015-09-24 15:10   ` Christoph Hellwig
  2015-09-23  4:41 ` [PATCH 02/15] hugetlb: fix compile error on tile Dan Williams
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 39+ messages in thread
From: Dan Williams @ 2015-09-23  4:41 UTC (permalink / raw)
  To: akpm; +Cc: linux-fsdevel, linux-mm, linux-kernel, linux-nvdimm

Switch avr32/include/asm/page.h to use the common definitions for
pfn_to_page(), page_to_pfn(), and ARCH_PFN_OFFSET.
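
With the flat memory model this picks up roughly the same definitions from
the common header (a sketch of the asm-generic/memory_model.h flatmem
case, for reference):

	#define __pfn_to_page(pfn)	(mem_map + ((pfn) - ARCH_PFN_OFFSET))
	#define __page_to_pfn(page)	\
		((unsigned long)((page) - mem_map) + ARCH_PFN_OFFSET)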

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/avr32/include/asm/page.h |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/avr32/include/asm/page.h b/arch/avr32/include/asm/page.h
index f805d1cb11bc..c5d2a3e2c62f 100644
--- a/arch/avr32/include/asm/page.h
+++ b/arch/avr32/include/asm/page.h
@@ -83,11 +83,9 @@ static inline int get_order(unsigned long size)
 
 #ifndef CONFIG_NEED_MULTIPLE_NODES
 
-#define PHYS_PFN_OFFSET		(CONFIG_PHYS_OFFSET >> PAGE_SHIFT)
+#define ARCH_PFN_OFFSET		(CONFIG_PHYS_OFFSET >> PAGE_SHIFT)
 
-#define pfn_to_page(pfn)	(mem_map + ((pfn) - PHYS_PFN_OFFSET))
-#define page_to_pfn(page)	((unsigned long)((page) - mem_map) + PHYS_PFN_OFFSET)
-#define pfn_valid(pfn)		((pfn) >= PHYS_PFN_OFFSET && (pfn) < (PHYS_PFN_OFFSET + max_mapnr))
+#define pfn_valid(pfn)		((pfn) >= ARCH_PFN_OFFSET && (pfn) < (ARCH_PFN_OFFSET + max_mapnr))
 #endif /* CONFIG_NEED_MULTIPLE_NODES */
 
 #define virt_to_page(kaddr)	pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
@@ -101,4 +99,6 @@ static inline int get_order(unsigned long size)
  */
 #define HIGHMEM_START		0x20000000UL
 
+#include <asm-generic/memory_model.h>
+
 #endif /* __ASM_AVR32_PAGE_H */

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 02/15] hugetlb: fix compile error on tile
  2015-09-23  4:41 [PATCH 00/15] get_user_pages() for dax mappings Dan Williams
  2015-09-23  4:41 ` [PATCH 01/15] avr32: convert to asm-generic/memory_model.h Dan Williams
@ 2015-09-23  4:41 ` Dan Williams
  2015-09-23  4:41 ` [PATCH 03/15] frv: fix compiler warning from definition of __pmd() Dan Williams
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 39+ messages in thread
From: Dan Williams @ 2015-09-23  4:41 UTC (permalink / raw)
  To: akpm; +Cc: linux-fsdevel, linux-mm, linux-kernel, linux-nvdimm

Include asm/pgtable.h to get the definition of pud_t and fix:

include/linux/hugetlb.h:203:29: error: unknown type name 'pud_t'

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/hugetlb.h |    1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 5e35379f58a5..ad5539cf52bf 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -8,6 +8,7 @@
 #include <linux/cgroup.h>
 #include <linux/list.h>
 #include <linux/kref.h>
+#include <asm/pgtable.h>
 
 struct ctl_table;
 struct user_struct;

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 03/15] frv: fix compiler warning from definition of __pmd()
  2015-09-23  4:41 [PATCH 00/15] get_user_pages() for dax mappings Dan Williams
  2015-09-23  4:41 ` [PATCH 01/15] avr32: convert to asm-generic/memory_model.h Dan Williams
  2015-09-23  4:41 ` [PATCH 02/15] hugetlb: fix compile error on tile Dan Williams
@ 2015-09-23  4:41 ` Dan Williams
  2015-09-23  4:41 ` [PATCH 04/15] x86, mm: quiet arch_add_memory() Dan Williams
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 39+ messages in thread
From: Dan Williams @ 2015-09-23  4:41 UTC (permalink / raw)
  To: akpm; +Cc: linux-fsdevel, linux-mm, linux-kernel, linux-nvdimm

Take into account that the pmd_t type is an array inside a struct, so it
needs two levels of braces to initialize.  Otherwise, a usage of __pmd()
generates a warning:

include/linux/mm.h:986:2: warning: missing braces around initializer [-Wmissing-braces]
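
For reference, frv's pmd_t is roughly a struct wrapping an array of ste
entries, so the initializer needs one set of braces per level (sketch,
the exact definition is in arch/frv headers):

	typedef struct { unsigned long ste[64]; } pmd_t;	/* approximate */
	#define __pmd(x)	((pmd_t) { { (x) } } )		/* struct + array */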

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/frv/include/asm/page.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/frv/include/asm/page.h b/arch/frv/include/asm/page.h
index 8c97068ac8fc..688d8076a43a 100644
--- a/arch/frv/include/asm/page.h
+++ b/arch/frv/include/asm/page.h
@@ -34,7 +34,7 @@ typedef struct page *pgtable_t;
 #define pgprot_val(x)	((x).pgprot)
 
 #define __pte(x)	((pte_t) { (x) } )
-#define __pmd(x)	((pmd_t) { (x) } )
+#define __pmd(x)	((pmd_t) { { (x) } } )
 #define __pud(x)	((pud_t) { (x) } )
 #define __pgd(x)	((pgd_t) { (x) } )
 #define __pgprot(x)	((pgprot_t) { (x) } )

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 04/15] x86, mm: quiet arch_add_memory()
  2015-09-23  4:41 [PATCH 00/15] get_user_pages() for dax mappings Dan Williams
                   ` (2 preceding siblings ...)
  2015-09-23  4:41 ` [PATCH 03/15] frv: fix compiler warning from definition of __pmd() Dan Williams
@ 2015-09-23  4:41 ` Dan Williams
  2015-09-24 15:10   ` Christoph Hellwig
  2015-09-23  4:41 ` [PATCH 05/15] pmem: kill memremap_pmem() Dan Williams
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 39+ messages in thread
From: Dan Williams @ 2015-09-23  4:41 UTC (permalink / raw)
  To: akpm; +Cc: linux-fsdevel, linux-mm, linux-kernel, linux-nvdimm

Switch to pr_debug() so that dynamic-debug can disable these messages by
default.  They get noisy in the presence of devm_memremap_pages().

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/mm/init.c    |    4 ++--
 arch/x86/mm/init_64.c |    4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 1d8a83df153a..4b9ea3f27de4 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -354,7 +354,7 @@ static int __meminit split_mem_range(struct map_range *mr, int nr_range,
 	}
 
 	for (i = 0; i < nr_range; i++)
-		printk(KERN_DEBUG " [mem %#010lx-%#010lx] page %s\n",
+		pr_debug(" [mem %#010lx-%#010lx] page %s\n",
 				mr[i].start, mr[i].end - 1,
 				page_size_string(&mr[i]));
 
@@ -401,7 +401,7 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
 	unsigned long ret = 0;
 	int nr_range, i;
 
-	pr_info("init_memory_mapping: [mem %#010lx-%#010lx]\n",
+	pr_debug("init_memory_mapping: [mem %#010lx-%#010lx]\n",
 	       start, end - 1);
 
 	memset(mr, 0, sizeof(mr));
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 30564e2752d3..bf827f231470 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1268,7 +1268,7 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
 				/* check to see if we have contiguous blocks */
 				if (p_end != p || node_start != node) {
 					if (p_start)
-						printk(KERN_DEBUG " [%lx-%lx] PMD -> [%p-%p] on node %d\n",
+						pr_debug(" [%lx-%lx] PMD -> [%p-%p] on node %d\n",
 						       addr_start, addr_end-1, p_start, p_end-1, node_start);
 					addr_start = addr;
 					node_start = node;
@@ -1366,7 +1366,7 @@ void register_page_bootmem_memmap(unsigned long section_nr,
 void __meminit vmemmap_populate_print_last(void)
 {
 	if (p_start) {
-		printk(KERN_DEBUG " [%lx-%lx] PMD -> [%p-%p] on node %d\n",
+		pr_debug(" [%lx-%lx] PMD -> [%p-%p] on node %d\n",
 			addr_start, addr_end-1, p_start, p_end-1, node_start);
 		p_start = NULL;
 		p_end = NULL;

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 05/15] pmem: kill memremap_pmem()
  2015-09-23  4:41 [PATCH 00/15] get_user_pages() for dax mappings Dan Williams
                   ` (3 preceding siblings ...)
  2015-09-23  4:41 ` [PATCH 04/15] x86, mm: quiet arch_add_memory() Dan Williams
@ 2015-09-23  4:41 ` Dan Williams
  2015-09-24 15:11   ` Christoph Hellwig
  2015-09-23  4:41 ` [PATCH 06/15] devm_memunmap: use devres_release() Dan Williams
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 39+ messages in thread
From: Dan Williams @ 2015-09-23  4:41 UTC (permalink / raw)
  To: akpm; +Cc: linux-fsdevel, linux-mm, linux-kernel, linux-nvdimm

Now that the pmem-api is defined as "a set of APIs that enables access
to WB-mapped pmem", the mapping type is implied.  Remove the wrapper
and push the functionality down into the pmem driver in preparation for
adding support for direct-mapped pmem.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/nvdimm/pmem.c |    9 +++++----
 include/linux/pmem.h  |   26 +-------------------------
 2 files changed, 6 insertions(+), 29 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 0ba6a978f227..0680affae04a 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -157,8 +157,9 @@ static struct pmem_device *pmem_alloc(struct device *dev,
 			return addr;
 		pmem->virt_addr = (void __pmem *) addr;
 	} else {
-		pmem->virt_addr = memremap_pmem(dev, pmem->phys_addr,
-				pmem->size);
+		pmem->virt_addr = (void __pmem *) devm_memremap(dev,
+				pmem->phys_addr, pmem->size,
+				ARCH_MEMREMAP_PMEM);
 		if (!pmem->virt_addr)
 			return ERR_PTR(-ENXIO);
 	}
@@ -363,8 +364,8 @@ static int nvdimm_namespace_attach_pfn(struct nd_namespace_common *ndns)
 
 	/* establish pfn range for lookup, and switch to direct map */
 	pmem = dev_get_drvdata(dev);
-	memunmap_pmem(dev, pmem->virt_addr);
-	pmem->virt_addr = (void __pmem *)devm_memremap_pages(dev, &nsio->res);
+	devm_memunmap(dev, (void __force *) pmem->virt_addr);
+	pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, &nsio->res);
 	if (IS_ERR(pmem->virt_addr)) {
 		rc = PTR_ERR(pmem->virt_addr);
 		goto err;
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index 85f810b33917..acfea8ce4a07 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -65,11 +65,6 @@ static inline void memcpy_from_pmem(void *dst, void __pmem const *src, size_t si
 	memcpy(dst, (void __force const *) src, size);
 }
 
-static inline void memunmap_pmem(struct device *dev, void __pmem *addr)
-{
-	devm_memunmap(dev, (void __force *) addr);
-}
-
 static inline bool arch_has_pmem_api(void)
 {
 	return IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API);
@@ -93,7 +88,7 @@ static inline bool arch_has_wmb_pmem(void)
  * These defaults seek to offer decent performance and minimize the
  * window between i/o completion and writes being durable on media.
  * However, it is undefined / architecture specific whether
- * default_memremap_pmem + default_memcpy_to_pmem is sufficient for
+ * ARCH_MEMREMAP_PMEM + default_memcpy_to_pmem is sufficient for
  * making data durable relative to i/o completion.
  */
 static inline void default_memcpy_to_pmem(void __pmem *dst, const void *src,
@@ -117,25 +112,6 @@ static inline void default_clear_pmem(void __pmem *addr, size_t size)
 }
 
 /**
- * memremap_pmem - map physical persistent memory for pmem api
- * @offset: physical address of persistent memory
- * @size: size of the mapping
- *
- * Establish a mapping of the architecture specific memory type expected
- * by memcpy_to_pmem() and wmb_pmem().  For example, it may be
- * the case that an uncacheable or writethrough mapping is sufficient,
- * or a writeback mapping provided memcpy_to_pmem() and
- * wmb_pmem() arrange for the data to be written through the
- * cache to persistent media.
- */
-static inline void __pmem *memremap_pmem(struct device *dev,
-		resource_size_t offset, unsigned long size)
-{
-	return (void __pmem *) devm_memremap(dev, offset, size,
-			ARCH_MEMREMAP_PMEM);
-}
-
-/**
  * memcpy_to_pmem - copy data to persistent memory
  * @dst: destination buffer for the copy
  * @src: source buffer for the copy

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 06/15] devm_memunmap: use devres_release()
  2015-09-23  4:41 [PATCH 00/15] get_user_pages() for dax mappings Dan Williams
                   ` (4 preceding siblings ...)
  2015-09-23  4:41 ` [PATCH 05/15] pmem: kill memremap_pmem() Dan Williams
@ 2015-09-23  4:41 ` Dan Williams
  2015-09-24 15:13   ` Christoph Hellwig
  2015-09-23  4:41 ` [PATCH 07/15] devm_memremap: convert to return ERR_PTR Dan Williams
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 39+ messages in thread
From: Dan Williams @ 2015-09-23  4:41 UTC (permalink / raw)
  To: akpm
  Cc: linux-nvdimm, linux-kernel, linux-mm, linux-fsdevel,
	Ross Zwisler, Christoph Hellwig

Remove the open-coded call to memunmap().  Unlike devres_destroy(),
devres_release() invokes the devres release routine
(devm_memremap_release(), which performs the unmap) after removing the
entry, so the explicit memunmap() is no longer needed.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 kernel/memremap.c |    5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/kernel/memremap.c b/kernel/memremap.c
index 72b0c66628b6..0756273437e0 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -131,9 +131,8 @@ EXPORT_SYMBOL(devm_memremap);
 
 void devm_memunmap(struct device *dev, void *addr)
 {
-	WARN_ON(devres_destroy(dev, devm_memremap_release, devm_memremap_match,
-			       addr));
-	memunmap(addr);
+	WARN_ON(devres_release(dev, devm_memremap_release,
+				devm_memremap_match, addr));
 }
 EXPORT_SYMBOL(devm_memunmap);
 

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 07/15] devm_memremap: convert to return ERR_PTR
  2015-09-23  4:41 [PATCH 00/15] get_user_pages() for dax mappings Dan Williams
                   ` (5 preceding siblings ...)
  2015-09-23  4:41 ` [PATCH 06/15] devm_memunmap: use devres_release() Dan Williams
@ 2015-09-23  4:41 ` Dan Williams
  2015-09-24 15:13   ` Christoph Hellwig
  2015-09-23  4:41 ` [PATCH 08/15] block, dax, pmem: reference counting infrastructure Dan Williams
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 39+ messages in thread
From: Dan Williams @ 2015-09-23  4:41 UTC (permalink / raw)
  To: akpm
  Cc: linux-nvdimm, linux-kernel, linux-mm, linux-fsdevel,
	Ross Zwisler, Christoph Hellwig

Make devm_memremap consistent with the error return scheme of
devm_memremap_pages to remove special casing in the pmem driver.
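
With both interfaces returning ERR_PTR() encodings on failure, a caller
can use a single error check regardless of which mapping path was taken;
roughly (a sketch, not the exact driver code):

	addr = devm_memremap(dev, offset, size, ARCH_MEMREMAP_PMEM);
	if (IS_ERR(addr))
		return PTR_ERR(addr);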

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/nvdimm/pmem.c |   16 ++++++----------
 kernel/memremap.c     |    2 +-
 2 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 0680affae04a..9805d311b1d1 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -150,19 +150,15 @@ static struct pmem_device *pmem_alloc(struct device *dev,
 		return ERR_PTR(-EBUSY);
 	}
 
-	if (pmem_should_map_pages(dev)) {
-		void *addr = devm_memremap_pages(dev, res);
-
-		if (IS_ERR(addr))
-			return addr;
-		pmem->virt_addr = (void __pmem *) addr;
-	} else {
+	if (pmem_should_map_pages(dev))
+		pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, res);
+	else
 		pmem->virt_addr = (void __pmem *) devm_memremap(dev,
 				pmem->phys_addr, pmem->size,
 				ARCH_MEMREMAP_PMEM);
-		if (!pmem->virt_addr)
-			return ERR_PTR(-ENXIO);
-	}
+
+	if (IS_ERR(pmem->virt_addr))
+		return (void __force *) pmem->virt_addr;
 
 	return pmem;
 }
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 0756273437e0..0d818ce04129 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -116,7 +116,7 @@ void *devm_memremap(struct device *dev, resource_size_t offset,
 
 	ptr = devres_alloc(devm_memremap_release, sizeof(*ptr), GFP_KERNEL);
 	if (!ptr)
-		return NULL;
+		return ERR_PTR(-ENOMEM);
 
 	addr = memremap(offset, size, flags);
 	if (addr) {

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 08/15] block, dax, pmem: reference counting infrastructure
  2015-09-23  4:41 [PATCH 00/15] get_user_pages() for dax mappings Dan Williams
                   ` (6 preceding siblings ...)
  2015-09-23  4:41 ` [PATCH 07/15] devm_memremap: convert to return ERR_PTR Dan Williams
@ 2015-09-23  4:41 ` Dan Williams
  2015-09-24 15:15   ` Christoph Hellwig
  2015-09-23  4:42 ` [PATCH 09/15] block, pmem: fix null pointer de-reference on shutdown, check for queue death Dan Williams
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 39+ messages in thread
From: Dan Williams @ 2015-09-23  4:41 UTC (permalink / raw)
  To: akpm
  Cc: Jens Axboe, linux-nvdimm, linux-kernel, linux-mm, linux-fsdevel,
	Ross Zwisler, Christoph Hellwig

Enable DAX to use a reference count to keep the virtual address
returned by ->direct_access() valid for the duration of its usage in
fs/dax.c, or otherwise hold off blk_cleanup_queue() while
pmem_make_request() is active.  The blk-mq code already needs
low-overhead reference counting to guard against races with
request_queue destruction (blk_cleanup_queue()).  Given that DAX-enabled
block drivers do not enable blk-mq, share the storage in 'struct
request_queue' between the two implementations.
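
The usage pattern this enables for a DAX consumer is, roughly (sketch):

	if (blk_dax_get(q) < 0)
		return -ENODEV;	/* queue is dying / being torn down */
	/* ... use the address returned by ->direct_access() ... */
	blk_dax_put(q);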

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/powerpc/sysdev/axonram.c |    2 -
 block/blk-core.c              |   84 +++++++++++++++++++++++++++++++++++++++++
 block/blk-mq-sysfs.c          |    2 -
 block/blk-mq.c                |   48 ++++++-----------------
 block/blk-sysfs.c             |    9 ++++
 block/blk.h                   |    3 +
 drivers/block/brd.c           |    2 -
 drivers/nvdimm/pmem.c         |    3 +
 drivers/s390/block/dcssblk.c  |    2 -
 include/linux/blkdev.h        |   20 ++++++++--
 10 files changed, 130 insertions(+), 45 deletions(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index d2b79bc336c1..24ffab2572e8 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -228,7 +228,7 @@ static int axon_ram_probe(struct platform_device *device)
 	sprintf(bank->disk->disk_name, "%s%d",
 			AXON_RAM_DEVICE_NAME, axon_ram_bank_id);
 
-	bank->disk->queue = blk_alloc_queue(GFP_KERNEL);
+	bank->disk->queue = blk_dax_init_queue(NUMA_NO_NODE);
 	if (bank->disk->queue == NULL) {
 		dev_err(&device->dev, "Cannot register disk queue\n");
 		rc = -EFAULT;
diff --git a/block/blk-core.c b/block/blk-core.c
index 2eb722d48773..13764f8b22e0 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -26,6 +26,7 @@
 #include <linux/slab.h>
 #include <linux/swap.h>
 #include <linux/writeback.h>
+#include <linux/percpu-refcount.h>
 #include <linux/task_io_accounting_ops.h>
 #include <linux/fault-inject.h>
 #include <linux/list_sort.h>
@@ -497,6 +498,84 @@ void blk_queue_bypass_end(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_queue_bypass_end);
 
+int blk_qref_enter(struct request_queue_ref *qref, gfp_t gfp)
+{
+	struct request_queue *q = container_of(qref, typeof(*q), mq_ref);
+
+	while (true) {
+		int ret;
+
+		if (percpu_ref_tryget_live(&qref->count))
+			return 0;
+
+		if (!(gfp & __GFP_WAIT))
+			return -EBUSY;
+
+		ret = wait_event_interruptible(qref->freeze_wq,
+				!atomic_read(&qref->freeze_depth) ||
+				blk_queue_dying(q));
+		if (blk_queue_dying(q))
+			return -ENODEV;
+		if (ret)
+			return ret;
+	}
+}
+
+void blk_qref_release(struct percpu_ref *ref)
+{
+	struct request_queue_ref *qref = container_of(ref, typeof(*qref), count);
+
+	wake_up_all(&qref->freeze_wq);
+}
+
+int blk_dax_get(struct request_queue *q)
+{
+	return blk_qref_enter(&q->dax_ref, GFP_NOWAIT);
+}
+
+void blk_dax_put(struct request_queue *q)
+{
+	percpu_ref_put(&q->dax_ref.count);
+}
+
+static void blk_dax_freeze(struct request_queue *q)
+{
+	if (!blk_queue_dax(q))
+		return;
+
+	if (atomic_inc_return(&q->dax_ref.freeze_depth) == 1)
+		percpu_ref_kill(&q->dax_ref.count);
+
+	wait_event(q->dax_ref.freeze_wq, percpu_ref_is_zero(&q->dax_ref.count));
+}
+
+struct request_queue *blk_dax_init_queue(int nid)
+{
+	struct request_queue *q;
+	int rc;
+
+	q = blk_alloc_queue_node(GFP_KERNEL, nid);
+	if (!q)
+		return ERR_PTR(-ENOMEM);
+	queue_flag_set_unlocked(QUEUE_FLAG_DAX, q);
+
+	rc = percpu_ref_init(&q->dax_ref.count, blk_qref_release, 0,
+			GFP_KERNEL);
+	if (rc) {
+		blk_cleanup_queue(q);
+		return ERR_PTR(rc);
+	}
+	return q;
+}
+EXPORT_SYMBOL(blk_dax_init_queue);
+
+static void blk_dax_exit(struct request_queue *q)
+{
+	if (!blk_queue_dax(q))
+		return;
+	percpu_ref_exit(&q->dax_ref.count);
+}
+
 void blk_set_queue_dying(struct request_queue *q)
 {
 	queue_flag_set_unlocked(QUEUE_FLAG_DYING, q);
@@ -558,6 +637,7 @@ void blk_cleanup_queue(struct request_queue *q)
 		blk_mq_freeze_queue(q);
 		spin_lock_irq(lock);
 	} else {
+		blk_dax_freeze(q);
 		spin_lock_irq(lock);
 		__blk_drain_queue(q, true);
 	}
@@ -570,6 +650,7 @@ void blk_cleanup_queue(struct request_queue *q)
 
 	if (q->mq_ops)
 		blk_mq_free_queue(q);
+	blk_dax_exit(q);
 
 	spin_lock_irq(lock);
 	if (q->queue_lock != &q->__queue_lock)
@@ -688,7 +769,8 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	q->bypass_depth = 1;
 	__set_bit(QUEUE_FLAG_BYPASS, &q->queue_flags);
 
-	init_waitqueue_head(&q->mq_freeze_wq);
+	/* this also inits q->dax_ref.freeze_wq in the union */
+	init_waitqueue_head(&q->mq_ref.freeze_wq);
 
 	if (blkcg_init_queue(q))
 		goto fail_bdi;
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 279c5d674edf..b0fdffa0d4c6 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -415,7 +415,7 @@ static void blk_mq_sysfs_init(struct request_queue *q)
 /* see blk_register_queue() */
 void blk_mq_finish_init(struct request_queue *q)
 {
-	percpu_ref_switch_to_percpu(&q->mq_usage_counter);
+	percpu_ref_switch_to_percpu(&q->mq_ref.count);
 }
 
 int blk_mq_register_disk(struct gendisk *disk)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f2d67b4047a0..494c6e267c9d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -79,45 +79,21 @@ static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
 
 static int blk_mq_queue_enter(struct request_queue *q, gfp_t gfp)
 {
-	while (true) {
-		int ret;
-
-		if (percpu_ref_tryget_live(&q->mq_usage_counter))
-			return 0;
-
-		if (!(gfp & __GFP_WAIT))
-			return -EBUSY;
-
-		ret = wait_event_interruptible(q->mq_freeze_wq,
-				!atomic_read(&q->mq_freeze_depth) ||
-				blk_queue_dying(q));
-		if (blk_queue_dying(q))
-			return -ENODEV;
-		if (ret)
-			return ret;
-	}
+	return blk_qref_enter(&q->mq_ref, gfp);
 }
 
 static void blk_mq_queue_exit(struct request_queue *q)
 {
-	percpu_ref_put(&q->mq_usage_counter);
-}
-
-static void blk_mq_usage_counter_release(struct percpu_ref *ref)
-{
-	struct request_queue *q =
-		container_of(ref, struct request_queue, mq_usage_counter);
-
-	wake_up_all(&q->mq_freeze_wq);
+	percpu_ref_put(&q->mq_ref.count);
 }
 
 void blk_mq_freeze_queue_start(struct request_queue *q)
 {
 	int freeze_depth;
 
-	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
+	freeze_depth = atomic_inc_return(&q->mq_ref.freeze_depth);
 	if (freeze_depth == 1) {
-		percpu_ref_kill(&q->mq_usage_counter);
+		percpu_ref_kill(&q->mq_ref.count);
 		blk_mq_run_hw_queues(q, false);
 	}
 }
@@ -125,7 +101,7 @@ EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_start);
 
 static void blk_mq_freeze_queue_wait(struct request_queue *q)
 {
-	wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->mq_usage_counter));
+	wait_event(q->mq_ref.freeze_wq, percpu_ref_is_zero(&q->mq_ref.count));
 }
 
 /*
@@ -143,11 +119,11 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
 {
 	int freeze_depth;
 
-	freeze_depth = atomic_dec_return(&q->mq_freeze_depth);
+	freeze_depth = atomic_dec_return(&q->mq_ref.freeze_depth);
 	WARN_ON_ONCE(freeze_depth < 0);
 	if (!freeze_depth) {
-		percpu_ref_reinit(&q->mq_usage_counter);
-		wake_up_all(&q->mq_freeze_wq);
+		percpu_ref_reinit(&q->mq_ref.count);
+		wake_up_all(&q->mq_ref.freeze_wq);
 	}
 }
 EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
@@ -166,7 +142,7 @@ void blk_mq_wake_waiters(struct request_queue *q)
 	 * dying, we need to ensure that processes currently waiting on
 	 * the queue are notified as well.
 	 */
-	wake_up_all(&q->mq_freeze_wq);
+	wake_up_all(&q->mq_ref.freeze_wq);
 }
 
 bool blk_mq_can_queue(struct blk_mq_hw_ctx *hctx)
@@ -1983,7 +1959,7 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	 * Init percpu_ref in atomic mode so that it's faster to shutdown.
 	 * See blk_register_queue() for details.
 	 */
-	if (percpu_ref_init(&q->mq_usage_counter, blk_mq_usage_counter_release,
+	if (percpu_ref_init(&q->mq_ref.count, blk_qref_release,
 			    PERCPU_REF_INIT_ATOMIC, GFP_KERNEL))
 		goto err_hctxs;
 
@@ -2062,7 +2038,7 @@ void blk_mq_free_queue(struct request_queue *q)
 	blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
 	blk_mq_free_hw_queues(q, set);
 
-	percpu_ref_exit(&q->mq_usage_counter);
+	percpu_ref_exit(&q->mq_ref.count);
 
 	kfree(q->mq_map);
 
@@ -2076,7 +2052,7 @@ void blk_mq_free_queue(struct request_queue *q)
 /* Basically redo blk_mq_init_queue with queue frozen */
 static void blk_mq_queue_reinit(struct request_queue *q)
 {
-	WARN_ON_ONCE(!atomic_read(&q->mq_freeze_depth));
+	WARN_ON_ONCE(!atomic_read(&q->mq_ref.freeze_depth));
 
 	blk_mq_sysfs_unregister(q);
 
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 3e44a9da2a13..5126a97825de 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -616,6 +616,15 @@ int blk_register_queue(struct gendisk *disk)
 
 	kobject_uevent(&q->kobj, KOBJ_ADD);
 
+	if (q->mq_ops && blk_queue_dax(q)) {
+		/*
+		 * mq_ref and dax_ref share storage in request_queue, so
+		 * we can't have both enabled.
+		 */
+		WARN_ON_ONCE(1);
+		return -EINVAL;
+	}
+
 	if (q->mq_ops)
 		blk_mq_register_disk(disk);
 
diff --git a/block/blk.h b/block/blk.h
index 98614ad37c81..0b898d89e0dd 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -54,6 +54,9 @@ static inline void __blk_get_queue(struct request_queue *q)
 	kobject_get(&q->kobj);
 }
 
+int blk_qref_enter(struct request_queue_ref *qref, gfp_t gfp);
+void blk_qref_release(struct percpu_ref *percpu_ref);
+
 struct blk_flush_queue *blk_alloc_flush_queue(struct request_queue *q,
 		int node, int cmd_size);
 void blk_free_flush_queue(struct blk_flush_queue *q);
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index b9794aeeb878..f645a71ae827 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -482,7 +482,7 @@ static struct brd_device *brd_alloc(int i)
 	spin_lock_init(&brd->brd_lock);
 	INIT_RADIX_TREE(&brd->brd_pages, GFP_ATOMIC);
 
-	brd->brd_queue = blk_alloc_queue(GFP_KERNEL);
+	brd->brd_queue = blk_dax_init_queue(NUMA_NO_NODE);
 	if (!brd->brd_queue)
 		goto out_free_dev;
 
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 9805d311b1d1..a01611d8f351 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -176,9 +176,10 @@ static void pmem_detach_disk(struct pmem_device *pmem)
 static int pmem_attach_disk(struct device *dev,
 		struct nd_namespace_common *ndns, struct pmem_device *pmem)
 {
+	int nid = dev_to_node(dev);
 	struct gendisk *disk;
 
-	pmem->pmem_queue = blk_alloc_queue(GFP_KERNEL);
+	pmem->pmem_queue = blk_dax_init_queue(nid);
 	if (!pmem->pmem_queue)
 		return -ENOMEM;
 
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 5ed44fe21380..c212ce925ee6 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -610,7 +610,7 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char
 	}
 	dev_info->gd->major = dcssblk_major;
 	dev_info->gd->fops = &dcssblk_devops;
-	dev_info->dcssblk_queue = blk_alloc_queue(GFP_KERNEL);
+	dev_info->dcssblk_queue = blk_dax_init_queue(NUMA_NO_NODE);
 	dev_info->gd->queue = dev_info->dcssblk_queue;
 	dev_info->gd->private_data = dev_info;
 	dev_info->gd->driverfs_dev = &dev_info->dev;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 99da9ebc7377..363d7df8d65c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -277,6 +277,13 @@ struct queue_limits {
 	unsigned char		raid_partial_stripes_expensive;
 };
 
+
+struct request_queue_ref {
+	wait_queue_head_t	freeze_wq;
+	struct percpu_ref	count;
+	atomic_t		freeze_depth;
+};
+
 struct request_queue {
 	/*
 	 * Together with queue_head for cacheline sharing
@@ -436,7 +443,6 @@ struct request_queue {
 	struct mutex		sysfs_lock;
 
 	int			bypass_depth;
-	atomic_t		mq_freeze_depth;
 
 #if defined(CONFIG_BLK_DEV_BSG)
 	bsg_job_fn		*bsg_job_fn;
@@ -449,8 +455,10 @@ struct request_queue {
 	struct throtl_data *td;
 #endif
 	struct rcu_head		rcu_head;
-	wait_queue_head_t	mq_freeze_wq;
-	struct percpu_ref	mq_usage_counter;
+	union {
+		struct request_queue_ref mq_ref;
+		struct request_queue_ref dax_ref;
+	};
 	struct list_head	all_q_node;
 
 	struct blk_mq_tag_set	*tag_set;
@@ -480,6 +488,7 @@ struct request_queue {
 #define QUEUE_FLAG_DEAD        19	/* queue tear-down finished */
 #define QUEUE_FLAG_INIT_DONE   20	/* queue is initialized */
 #define QUEUE_FLAG_NO_SG_MERGE 21	/* don't attempt to merge SG segments*/
+#define QUEUE_FLAG_DAX         22	/* capacity may be direct-mapped */
 
 #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_STACKABLE)	|	\
@@ -568,6 +577,7 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
 #define blk_queue_discard(q)	test_bit(QUEUE_FLAG_DISCARD, &(q)->queue_flags)
 #define blk_queue_secdiscard(q)	(blk_queue_discard(q) && \
 	test_bit(QUEUE_FLAG_SECDISCARD, &(q)->queue_flags))
+#define blk_queue_dax(q)	test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags)
 
 #define blk_noretry_request(rq) \
 	((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
@@ -1003,6 +1013,10 @@ struct request_queue *blk_alloc_queue_node(gfp_t, int);
 extern void blk_put_queue(struct request_queue *);
 extern void blk_set_queue_dying(struct request_queue *);
 
+struct request_queue *blk_dax_init_queue(int nid);
+int blk_dax_get(struct request_queue *q);
+void blk_dax_put(struct request_queue *q);
+
 /*
  * block layer runtime pm functions
  */

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 09/15] block, pmem: fix null pointer de-reference on shutdown, check for queue death
  2015-09-23  4:41 [PATCH 00/15] get_user_pages() for dax mappings Dan Williams
                   ` (7 preceding siblings ...)
  2015-09-23  4:41 ` [PATCH 08/15] block, dax, pmem: reference counting infrastructure Dan Williams
@ 2015-09-23  4:42 ` Dan Williams
  2015-09-23  4:42 ` [PATCH 10/15] block, dax: fix lifetime of in-kernel dax mappings Dan Williams
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 39+ messages in thread
From: Dan Williams @ 2015-09-23  4:42 UTC (permalink / raw)
  To: akpm
  Cc: Jens Axboe, linux-nvdimm, Dave Chinner, linux-kernel, linux-mm,
	linux-fsdevel, Ross Zwisler, Christoph Hellwig

After the driver has been unbound, the queue is dead and the private
data pointer is invalid.  Check that the queue is still alive and pin it
active before using queuedata.

Fixes crash signatures like the following.

 BUG: unable to handle kernel paging request at ffff880140000000
 [..]
 Call Trace:
  [<ffffffff8145e8bf>] ? copy_user_handle_tail+0x5f/0x70
  [<ffffffffa004e1e0>] pmem_do_bvec.isra.11+0x70/0xf0 [nd_pmem]
  [<ffffffffa004e331>] pmem_make_request+0xd1/0x200 [nd_pmem]
  [<ffffffff811c3162>] ? mempool_alloc+0x72/0x1a0
  [<ffffffff8141f8b6>] generic_make_request+0xd6/0x110
  [<ffffffff8141f966>] submit_bio+0x76/0x170
  [<ffffffff81286dff>] submit_bh_wbc+0x12f/0x160
  [<ffffffff81286e62>] submit_bh+0x12/0x20
  [<ffffffff813395bd>] jbd2_write_superblock+0x8d/0x170
  [<ffffffff8133974d>] jbd2_mark_journal_empty+0x5d/0x90
  [<ffffffff813399cb>] jbd2_journal_destroy+0x24b/0x270
  [<ffffffff810bc4ca>] ? put_pwq_unlocked+0x2a/0x30
  [<ffffffff810bc6f5>] ? destroy_workqueue+0x225/0x250
  [<ffffffff81303494>] ext4_put_super+0x64/0x360
  [<ffffffff8124ab1a>] generic_shutdown_super+0x6a/0xf0

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 block/blk-core.c      |    2 ++
 drivers/nvdimm/pmem.c |    8 ++++++++
 2 files changed, 10 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 13764f8b22e0..0ea7d285b886 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -532,11 +532,13 @@ int blk_dax_get(struct request_queue *q)
 {
 	return blk_qref_enter(&q->dax_ref, GFP_NOWAIT);
 }
+EXPORT_SYMBOL(blk_dax_get);
 
 void blk_dax_put(struct request_queue *q)
 {
 	percpu_ref_put(&q->dax_ref.count);
 }
+EXPORT_SYMBOL(blk_dax_put);
 
 static void blk_dax_freeze(struct request_queue *q)
 {
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index a01611d8f351..3ee02af73ad0 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -73,6 +73,12 @@ static void pmem_make_request(struct request_queue *q, struct bio *bio)
 	struct block_device *bdev = bio->bi_bdev;
 	struct pmem_device *pmem = bdev->bd_disk->private_data;
 
+	if (blk_dax_get(q) != 0) {
+		bio->bi_error = -ENODEV;
+		bio_endio(bio);
+		return;
+	}
+
 	do_acct = nd_iostat_start(bio, &start);
 	bio_for_each_segment(bvec, bio, iter)
 		pmem_do_bvec(pmem, bvec.bv_page, bvec.bv_len, bvec.bv_offset,
@@ -84,6 +90,8 @@ static void pmem_make_request(struct request_queue *q, struct bio *bio)
 		wmb_pmem();
 
 	bio_endio(bio);
+
+	blk_dax_put(q);
 }
 
 static int pmem_rw_page(struct block_device *bdev, sector_t sector,

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 10/15] block, dax: fix lifetime of in-kernel dax mappings
  2015-09-23  4:41 [PATCH 00/15] get_user_pages() for dax mappings Dan Williams
                   ` (8 preceding siblings ...)
  2015-09-23  4:42 ` [PATCH 09/15] block, pmem: fix null pointer de-reference on shutdown, check for queue death Dan Williams
@ 2015-09-23  4:42 ` Dan Williams
  2015-10-07 22:56   ` Logan Gunthorpe
  2015-09-23  4:42 ` [PATCH 11/15] mm, dax, pmem: introduce __pfn_t Dan Williams
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 39+ messages in thread
From: Dan Williams @ 2015-09-23  4:42 UTC (permalink / raw)
  To: akpm
  Cc: Jens Axboe, Boaz Harrosh, linux-nvdimm, linux-kernel, linux-mm,
	linux-fsdevel, Ross Zwisler, Christoph Hellwig

The DAX implementation needs to protect new calls to ->direct_access()
and usage of its return value against unbind of the underlying block
device.  Use blk_dax_{get|put}() to either prevent blk_cleanup_queue()
from proceeding, or fail the dax_map_bh() if the request_queue is being
torn down.
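
In fs/dax.c this takes the form of map/unmap pairs around each use of the
returned address, e.g. (simplified from the patch below):

	addr = dax_map_bh(bh, blkbits);		/* takes a dax reference */
	if (IS_ERR(addr))
		return PTR_ERR(addr);
	/* ... access persistent memory through addr ... */
	dax_unmap_bh(bh, addr);			/* drops the reference */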

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c |  131 ++++++++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 84 insertions(+), 47 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index bcfb14bfc1e4..358eea39e982 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -63,12 +63,43 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 }
 EXPORT_SYMBOL_GPL(dax_clear_blocks);
 
-static long dax_get_addr(struct buffer_head *bh, void __pmem **addr,
-		unsigned blkbits)
+static void __pmem *__dax_map_bh(const struct buffer_head *bh, unsigned blkbits,
+		unsigned long *pfn, long *len)
 {
-	unsigned long pfn;
+	long rc;
+	void __pmem *addr;
+	struct block_device *bdev = bh->b_bdev;
+	struct request_queue *q = bdev->bd_queue;
 	sector_t sector = bh->b_blocknr << (blkbits - 9);
-	return bdev_direct_access(bh->b_bdev, sector, addr, &pfn, bh->b_size);
+
+	rc = blk_dax_get(q);
+	if (rc < 0)
+		return (void __pmem *) ERR_PTR(rc);
+	rc = bdev_direct_access(bdev, sector, &addr, pfn, bh->b_size);
+	if (len)
+		*len = rc;
+	if (rc < 0) {
+		blk_dax_put(q);
+		return (void __pmem *) ERR_PTR(rc);
+	}
+	return addr;
+}
+
+static void __pmem *dax_map_bh(const struct buffer_head *bh, unsigned blkbits)
+{
+	unsigned long pfn;
+
+	return __dax_map_bh(bh, blkbits, &pfn, NULL);
+}
+
+static void dax_unmap_bh(const struct buffer_head *bh, void __pmem *addr)
+{
+	struct block_device *bdev = bh->b_bdev;
+	struct request_queue *q = bdev->bd_queue;
+
+	if (IS_ERR(addr))
+		return;
+	blk_dax_put(q);
 }
 
 /* the clear_pmem() calls are ordered by a wmb_pmem() in the caller */
@@ -104,15 +135,16 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 		      loff_t start, loff_t end, get_block_t get_block,
 		      struct buffer_head *bh)
 {
-	ssize_t retval = 0;
-	loff_t pos = start;
-	loff_t max = start;
-	loff_t bh_max = start;
-	void __pmem *addr;
+	loff_t pos = start, max = start, bh_max = start;
+	int rw = iov_iter_rw(iter), rc;
+	long map_len = 0;
+	unsigned long pfn;
+	void __pmem *addr = NULL;
+	void __pmem *kmap = (void __pmem *) ERR_PTR(-EIO);
 	bool hole = false;
 	bool need_wmb = false;
 
-	if (iov_iter_rw(iter) != WRITE)
+	if (rw == READ)
 		end = min(end, i_size_read(inode));
 
 	while (pos < end) {
@@ -127,9 +159,8 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 			if (pos == bh_max) {
 				bh->b_size = PAGE_ALIGN(end - pos);
 				bh->b_state = 0;
-				retval = get_block(inode, block, bh,
-						   iov_iter_rw(iter) == WRITE);
-				if (retval)
+				rc = get_block(inode, block, bh, rw == WRITE);
+				if (rc)
 					break;
 				if (!buffer_size_valid(bh))
 					bh->b_size = 1 << blkbits;
@@ -141,21 +172,25 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 				bh->b_size -= done;
 			}
 
-			hole = iov_iter_rw(iter) != WRITE && !buffer_written(bh);
+			hole = rw == READ && !buffer_written(bh);
 			if (hole) {
 				addr = NULL;
 				size = bh->b_size - first;
 			} else {
-				retval = dax_get_addr(bh, &addr, blkbits);
-				if (retval < 0)
+				dax_unmap_bh(bh, kmap);
+				kmap = __dax_map_bh(bh, blkbits, &pfn, &map_len);
+				if (IS_ERR(kmap)) {
+					rc = PTR_ERR(kmap);
 					break;
+				}
+				addr = kmap;
 				if (buffer_unwritten(bh) || buffer_new(bh)) {
-					dax_new_buf(addr, retval, first, pos,
-									end);
+					dax_new_buf(addr, map_len, first, pos,
+							end);
 					need_wmb = true;
 				}
 				addr += first;
-				size = retval - first;
+				size = map_len - first;
 			}
 			max = min(pos + size, end);
 		}
@@ -178,8 +213,9 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 
 	if (need_wmb)
 		wmb_pmem();
+	dax_unmap_bh(bh, kmap);
 
-	return (pos == start) ? retval : pos - start;
+	return (pos == start) ? rc : pos - start;
 }
 
 /**
@@ -274,21 +310,22 @@ static int copy_user_bh(struct page *to, struct buffer_head *bh,
 	void __pmem *vfrom;
 	void *vto;
 
-	if (dax_get_addr(bh, &vfrom, blkbits) < 0)
-		return -EIO;
+	vfrom = dax_map_bh(bh, blkbits);
+	if (IS_ERR(vfrom))
+		return PTR_ERR(vfrom);
 	vto = kmap_atomic(to);
 	copy_user_page(vto, (void __force *)vfrom, vaddr, to);
 	kunmap_atomic(vto);
+	dax_unmap_bh(bh, vfrom);
 	return 0;
 }
 
 static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 			struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
 	unsigned long vaddr = (unsigned long)vmf->virtual_address;
-	void __pmem *addr;
 	unsigned long pfn;
+	void __pmem *addr;
 	pgoff_t size;
 	int error;
 
@@ -305,11 +342,9 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 		goto out;
 	}
 
-	error = bdev_direct_access(bh->b_bdev, sector, &addr, &pfn, bh->b_size);
-	if (error < 0)
-		goto out;
-	if (error < PAGE_SIZE) {
-		error = -EIO;
+	addr = __dax_map_bh(bh, inode->i_blkbits, &pfn, NULL);
+	if (IS_ERR(addr)) {
+		error = PTR_ERR(addr);
 		goto out;
 	}
 
@@ -317,6 +352,7 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 		clear_pmem(addr, PAGE_SIZE);
 		wmb_pmem();
 	}
+	dax_unmap_bh(bh, addr);
 
 	error = vm_insert_mixed(vma, vaddr, pfn);
 
@@ -528,11 +564,8 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	unsigned blkbits = inode->i_blkbits;
 	unsigned long pmd_addr = address & PMD_MASK;
 	bool write = flags & FAULT_FLAG_WRITE;
-	long length;
-	void __pmem *kaddr;
 	pgoff_t size, pgoff;
-	sector_t block, sector;
-	unsigned long pfn;
+	sector_t block;
 	int result = 0;
 
 	/* Fall back to PTEs if we're going to COW */
@@ -557,8 +590,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 
 	bh.b_size = PMD_SIZE;
 	i_mmap_lock_write(mapping);
-	length = get_block(inode, block, &bh, write);
-	if (length)
+	if (get_block(inode, block, &bh, write) != 0)
 		return VM_FAULT_SIGBUS;
 
 	/*
@@ -569,17 +601,17 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE)
 		goto fallback;
 
-	sector = bh.b_blocknr << (blkbits - 9);
-
 	if (buffer_unwritten(&bh) || buffer_new(&bh)) {
 		int i;
+		long length;
+		unsigned long pfn;
+		void __pmem *kaddr = __dax_map_bh(&bh, blkbits, &pfn, &length);
 
-		length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn,
-						bh.b_size);
-		if (length < 0) {
+		if (IS_ERR(kaddr)) {
 			result = VM_FAULT_SIGBUS;
 			goto out;
 		}
+
 		if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR))
 			goto fallback;
 
@@ -589,6 +621,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		count_vm_event(PGMAJFAULT);
 		mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
 		result |= VM_FAULT_MAJOR;
+		dax_unmap_bh(&bh, kaddr);
 	}
 
 	/*
@@ -635,12 +668,15 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		result = VM_FAULT_NOPAGE;
 		spin_unlock(ptl);
 	} else {
-		length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn,
-						bh.b_size);
-		if (length < 0) {
+		long length;
+		unsigned long pfn;
+		void __pmem *kaddr = __dax_map_bh(&bh, blkbits, &pfn, &length);
+
+		if (IS_ERR(kaddr)) {
 			result = VM_FAULT_SIGBUS;
 			goto out;
 		}
+		dax_unmap_bh(&bh, kaddr);
 		if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR))
 			goto fallback;
 
@@ -746,12 +782,13 @@ int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
 	if (err < 0)
 		return err;
 	if (buffer_written(&bh)) {
-		void __pmem *addr;
-		err = dax_get_addr(&bh, &addr, inode->i_blkbits);
-		if (err < 0)
-			return err;
+		void __pmem *addr = dax_map_bh(&bh, inode->i_blkbits);
+
+		if (IS_ERR(addr))
+			return PTR_ERR(addr);
 		clear_pmem(addr + offset, length);
 		wmb_pmem();
+		dax_unmap_bh(&bh, addr);
 	}
 
 	return 0;

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 11/15] mm, dax, pmem: introduce __pfn_t
  2015-09-23  4:41 [PATCH 00/15] get_user_pages() for dax mappings Dan Williams
                   ` (9 preceding siblings ...)
  2015-09-23  4:42 ` [PATCH 10/15] block, dax: fix lifetime of in-kernel dax mappings Dan Williams
@ 2015-09-23  4:42 ` Dan Williams
  2015-09-23 16:02   ` Dave Hansen
  2015-09-23  4:42 ` [PATCH 12/15] mm, dax, gpu: convert vm_insert_mixed to __pfn_t, introduce _PAGE_DEVMAP Dan Williams
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 39+ messages in thread
From: Dan Williams @ 2015-09-23  4:42 UTC (permalink / raw)
  To: akpm
  Cc: Dave Hansen, linux-nvdimm, linux-kernel, linux-mm, linux-fsdevel,
	Ross Zwisler, Christoph Hellwig

In preparation for enabling get_user_pages() operations on dax mappings,
introduce a type that encapsulates a page-frame-number that can also be
used to encode other information.  This other information is the
historical "page_link" encoding in a scatterlist, but can also denote
"device memory".  Where "device memory" is a set of pfns that are not
part of the kernel's linear mapping by default, but are accessed via the
same memory controller as ram.  The motivation for this new type is
large capacity persistent memory that optionally has struct page entries
in the 'memmap'.

When a driver, like pmem, has established a devm_memremap_pages()
mapping, it needs to communicate to upper layers that the pfn has a page
backing.  This property will be leveraged in a later patch to enable
dax-gup.  For now, update all the ->direct_access() implementations to
communicate whether the returned pfn range is mapped.
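
A rough sketch of the encoding (the exact layout is in the
include/linux/mm.h hunk of this patch): the value carries flags such as
PFN_DEV and PFN_MAP in its low bits, with the remaining bits holding the
pfn itself:

	typedef struct {
		union {
			unsigned long data;	/* low-bit flags + pfn */
			struct page *page;
		};
	} __pfn_t;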

Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Hansen <dave@sr71.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/powerpc/sysdev/axonram.c |    8 ++---
 drivers/block/brd.c           |    4 +-
 drivers/nvdimm/pmem.c         |   27 ++++++++-------
 drivers/s390/block/dcssblk.c  |   10 ++----
 fs/block_dev.c                |    2 +
 fs/dax.c                      |   23 +++++++------
 include/linux/blkdev.h        |    4 +-
 include/linux/mm.h            |   72 +++++++++++++++++++++++++++++++++++++++++
 8 files changed, 110 insertions(+), 40 deletions(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 24ffab2572e8..35eff52c0a38 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -141,15 +141,13 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
  */
 static long
 axon_ram_direct_access(struct block_device *device, sector_t sector,
-		       void __pmem **kaddr, unsigned long *pfn)
+		       void __pmem **kaddr, __pfn_t *pfn)
 {
 	struct axon_ram_bank *bank = device->bd_disk->private_data;
 	loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;
-	void *addr = (void *)(bank->ph_addr + offset);
-
-	*kaddr = (void __pmem *)addr;
-	*pfn = virt_to_phys(addr) >> PAGE_SHIFT;
 
+	*kaddr = (void __pmem __force *) bank->io_addr + offset;
+	*pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV);
 	return bank->size - offset;
 }
 
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index f645a71ae827..50e78b1ea26c 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -374,7 +374,7 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
 
 #ifdef CONFIG_BLK_DEV_RAM_DAX
 static long brd_direct_access(struct block_device *bdev, sector_t sector,
-			void __pmem **kaddr, unsigned long *pfn)
+			void __pmem **kaddr, __pfn_t *pfn)
 {
 	struct brd_device *brd = bdev->bd_disk->private_data;
 	struct page *page;
@@ -385,7 +385,7 @@ static long brd_direct_access(struct block_device *bdev, sector_t sector,
 	if (!page)
 		return -ENOSPC;
 	*kaddr = (void __pmem *)page_address(page);
-	*pfn = page_to_pfn(page);
+	*pfn = page_to_pfn_t(page);
 
 	return PAGE_SIZE;
 }
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 3ee02af73ad0..1c670775129b 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -39,6 +39,7 @@ struct pmem_device {
 	phys_addr_t		phys_addr;
 	/* when non-zero this device is hosting a 'pfn' instance */
 	phys_addr_t		data_offset;
+	unsigned long		pfn_flags;
 	void __pmem		*virt_addr;
 	size_t			size;
 };
@@ -108,25 +109,22 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
 }
 
 static long pmem_direct_access(struct block_device *bdev, sector_t sector,
-		      void __pmem **kaddr, unsigned long *pfn)
+		      void __pmem **kaddr, __pfn_t *pfn)
 {
 	struct pmem_device *pmem = bdev->bd_disk->private_data;
 	resource_size_t offset = sector * 512 + pmem->data_offset;
-	resource_size_t size;
+	resource_size_t size = pmem->size - offset;
 
-	if (pmem->data_offset) {
+	*kaddr = pmem->virt_addr + offset;
+	*pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);
+
+	if (__pfn_t_has_page(*pfn)) {
 		/*
 		 * Limit the direct_access() size to what is covered by
 		 * the memmap
 		 */
-		size = (pmem->size - offset) & ~ND_PFN_MASK;
-	} else
-		size = pmem->size - offset;
-
-	/* FIXME convert DAX to comprehend that this mapping has a lifetime */
-	*kaddr = pmem->virt_addr + offset;
-	*pfn = (pmem->phys_addr + offset) >> PAGE_SHIFT;
-
+		size &= ~ND_PFN_MASK;
+	}
 	return size;
 }
 
@@ -158,9 +156,11 @@ static struct pmem_device *pmem_alloc(struct device *dev,
 		return ERR_PTR(-EBUSY);
 	}
 
-	if (pmem_should_map_pages(dev))
+	pmem->pfn_flags = PFN_DEV;
+	if (pmem_should_map_pages(dev)) {
 		pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, res);
-	else
+		pmem->pfn_flags |= PFN_MAP;
+	} else
 		pmem->virt_addr = (void __pmem *) devm_memremap(dev,
 				pmem->phys_addr, pmem->size,
 				ARCH_MEMREMAP_PMEM);
@@ -371,6 +371,7 @@ static int nvdimm_namespace_attach_pfn(struct nd_namespace_common *ndns)
 	pmem = dev_get_drvdata(dev);
 	devm_memunmap(dev, (void __force *) pmem->virt_addr);
 	pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, &nsio->res);
+	pmem->pfn_flags |= PFN_MAP;
 	if (IS_ERR(pmem->virt_addr)) {
 		rc = PTR_ERR(pmem->virt_addr);
 		goto err;
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index c212ce925ee6..4dfa8cfbdd9a 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -29,7 +29,7 @@ static int dcssblk_open(struct block_device *bdev, fmode_t mode);
 static void dcssblk_release(struct gendisk *disk, fmode_t mode);
 static void dcssblk_make_request(struct request_queue *q, struct bio *bio);
 static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
-			 void __pmem **kaddr, unsigned long *pfn);
+			 void __pmem **kaddr, __pfn_t *pfn);
 
 static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";
 
@@ -881,20 +881,18 @@ fail:
 
 static long
 dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
-			void __pmem **kaddr, unsigned long *pfn)
+			void __pmem **kaddr, __pfn_t *pfn)
 {
 	struct dcssblk_dev_info *dev_info;
 	unsigned long offset, dev_sz;
-	void *addr;
 
 	dev_info = bdev->bd_disk->private_data;
 	if (!dev_info)
 		return -ENODEV;
 	dev_sz = dev_info->end - dev_info->start;
 	offset = secnum * 512;
-	addr = (void *) (dev_info->start + offset);
-	*pfn = virt_to_phys(addr) >> PAGE_SHIFT;
-	*kaddr = (void __pmem *) addr;
+	*kaddr = (void __pmem *) (dev_info->start + offset);
+	*pfn = phys_to_pfn_t(dev_info->start + offset, PFN_DEV);
 
 	return dev_sz - offset;
 }
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 073bb57adab1..74c507059f8d 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -442,7 +442,7 @@ EXPORT_SYMBOL_GPL(bdev_write_page);
  * accessible at this address.
  */
 long bdev_direct_access(struct block_device *bdev, sector_t sector,
-			void __pmem **addr, unsigned long *pfn, long size)
+			void __pmem **addr, __pfn_t *pfn, long size)
 {
 	long avail;
 	const struct block_device_operations *ops = bdev->bd_disk->fops;
diff --git a/fs/dax.c b/fs/dax.c
index 358eea39e982..41d4f76e93ef 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -37,7 +37,7 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 	might_sleep();
 	do {
 		void __pmem *addr;
-		unsigned long pfn;
+		__pfn_t pfn;
 		long count;
 
 		count = bdev_direct_access(bdev, sector, &addr, &pfn, size);
@@ -64,7 +64,7 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 EXPORT_SYMBOL_GPL(dax_clear_blocks);
 
 static void __pmem *__dax_map_bh(const struct buffer_head *bh, unsigned blkbits,
-		unsigned long *pfn, long *len)
+		__pfn_t *pfn, long *len)
 {
 	long rc;
 	void __pmem *addr;
@@ -87,7 +87,7 @@ static void __pmem *__dax_map_bh(const struct buffer_head *bh, unsigned blkbits,
 
 static void __pmem *dax_map_bh(const struct buffer_head *bh, unsigned blkbits)
 {
-	unsigned long pfn;
+	__pfn_t pfn;
 
 	return __dax_map_bh(bh, blkbits, &pfn, NULL);
 }
@@ -138,7 +138,7 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 	loff_t pos = start, max = start, bh_max = start;
 	int rw = iov_iter_rw(iter), rc;
 	long map_len = 0;
-	unsigned long pfn;
+	__pfn_t pfn;
 	void __pmem *addr = NULL;
 	void __pmem *kmap = (void __pmem *) ERR_PTR(-EIO);
 	bool hole = false;
@@ -324,9 +324,9 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 			struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	unsigned long vaddr = (unsigned long)vmf->virtual_address;
-	unsigned long pfn;
 	void __pmem *addr;
 	pgoff_t size;
+	__pfn_t pfn;
 	int error;
 
 	/*
@@ -354,7 +354,7 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	}
 	dax_unmap_bh(bh, addr);
 
-	error = vm_insert_mixed(vma, vaddr, pfn);
+	error = vm_insert_mixed(vma, vaddr, __pfn_t_to_pfn(pfn));
 
  out:
 	return error;
@@ -604,7 +604,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	if (buffer_unwritten(&bh) || buffer_new(&bh)) {
 		int i;
 		long length;
-		unsigned long pfn;
+		__pfn_t pfn;
 		void __pmem *kaddr = __dax_map_bh(&bh, blkbits, &pfn, &length);
 
 		if (IS_ERR(kaddr)) {
@@ -612,7 +612,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 			goto out;
 		}
 
-		if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR))
+		if ((length < PMD_SIZE) || (__pfn_t_to_pfn(pfn) & PG_PMD_COLOUR))
 			goto fallback;
 
 		for (i = 0; i < PTRS_PER_PMD; i++)
@@ -668,8 +668,8 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		result = VM_FAULT_NOPAGE;
 		spin_unlock(ptl);
 	} else {
+		__pfn_t pfn;
 		long length;
-		unsigned long pfn;
 		void __pmem *kaddr = __dax_map_bh(&bh, blkbits, &pfn, &length);
 
 		if (IS_ERR(kaddr)) {
@@ -677,10 +677,11 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 			goto out;
 		}
 		dax_unmap_bh(&bh, kaddr);
-		if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR))
+		if ((length < PMD_SIZE) || (__pfn_t_to_pfn(pfn) & PG_PMD_COLOUR))
 			goto fallback;
 
-		result |= vmf_insert_pfn_pmd(vma, address, pmd, pfn, write);
+		result |= vmf_insert_pfn_pmd(vma, address, pmd,
+				__pfn_t_to_pfn(pfn), write);
 	}
 
  out:
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 363d7df8d65c..e893f30ad520 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1634,7 +1634,7 @@ struct block_device_operations {
 	int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
 	int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
 	long (*direct_access)(struct block_device *, sector_t, void __pmem **,
-			unsigned long *pfn);
+			__pfn_t *);
 	unsigned int (*check_events) (struct gendisk *disk,
 				      unsigned int clearing);
 	/* ->media_changed() is DEPRECATED, use ->check_events() instead */
@@ -1653,7 +1653,7 @@ extern int bdev_read_page(struct block_device *, sector_t, struct page *);
 extern int bdev_write_page(struct block_device *, sector_t, struct page *,
 						struct writeback_control *);
 extern long bdev_direct_access(struct block_device *, sector_t,
-		void __pmem **addr, unsigned long *pfn, long size);
+		void __pmem **addr, __pfn_t *pfn, long size);
 #else /* CONFIG_BLOCK */
 
 struct block_device;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 91c08f6f0dc9..6ea922de6870 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -906,6 +906,78 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
 }
 
 /*
+ * __pfn_t: encapsulates a page-frame number that is optionally backed
+ * by memmap (struct page).  Whether a __pfn_t has a 'struct page'
+ * backing is indicated by flags in the low bits of the value;
+ */
+typedef struct {
+	unsigned long val;
+} __pfn_t;
+
+/*
+ * PFN_SG_CHAIN - pfn is a pointer to the next scatterlist entry
+ * PFN_SG_LAST - pfn references a page and is the last scatterlist entry
+ * PFN_DEV - pfn is not covered by system memmap by default
+ * PFN_MAP - pfn has a dynamic page mapping established by a device driver
+ */
+enum {
+	PFN_SHIFT = 4,
+	PFN_MASK = (1UL << PFN_SHIFT) - 1,
+	PFN_SG_CHAIN = (1UL << 0),
+	PFN_SG_LAST = (1UL << 1),
+	PFN_DEV = (1UL << 2),
+	PFN_MAP = (1UL << 3),
+};
+
+static inline __pfn_t pfn_to_pfn_t(unsigned long pfn, unsigned long flags)
+{
+	__pfn_t pfn_t = { .val = (pfn << PFN_SHIFT) | (flags & PFN_MASK), };
+
+	return pfn_t;
+}
+
+static inline __pfn_t phys_to_pfn_t(dma_addr_t addr, unsigned long flags)
+{
+	return pfn_to_pfn_t(addr >> PAGE_SHIFT, flags);
+}
+
+static inline bool __pfn_t_has_page(__pfn_t pfn)
+{
+	return (pfn.val & PFN_MAP) == PFN_MAP || (pfn.val & PFN_DEV) == 0;
+}
+
+static inline unsigned long __pfn_t_to_pfn(__pfn_t pfn)
+{
+	return pfn.val >> PFN_SHIFT;
+}
+
+static inline struct page *__pfn_t_to_page(__pfn_t pfn)
+{
+	if (__pfn_t_has_page(pfn))
+		return pfn_to_page(__pfn_t_to_pfn(pfn));
+	return NULL;
+}
+
+static inline dma_addr_t __pfn_t_to_phys(__pfn_t pfn)
+{
+	return PFN_PHYS(__pfn_t_to_pfn(pfn));
+}
+
+static inline void *__pfn_t_to_virt(__pfn_t pfn)
+{
+	if (__pfn_t_has_page(pfn))
+		return __va(__pfn_t_to_phys(pfn));
+	return NULL;
+}
+
+static inline __pfn_t page_to_pfn_t(struct page *page)
+{
+	__pfn_t pfn = { .val = page_to_pfn(page) << PFN_SHIFT, };
+
+	return pfn;
+}
+
+/*
  * Some inline functions in vmstat.h depend on page_zone()
  */
 #include <linux/vmstat.h>


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 12/15] mm, dax, gpu: convert vm_insert_mixed to __pfn_t, introduce _PAGE_DEVMAP
  2015-09-23  4:41 [PATCH 00/15] get_user_pages() for dax mappings Dan Williams
                   ` (10 preceding siblings ...)
  2015-09-23  4:42 ` [PATCH 11/15] mm, dax, pmem: introduce __pfn_t Dan Williams
@ 2015-09-23  4:42 ` Dan Williams
  2015-09-23 13:47   ` Geert Uytterhoeven
  2015-09-23  4:42 ` [PATCH 13/15] mm, dax: convert vmf_insert_pfn_pmd() to __pfn_t Dan Williams
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 39+ messages in thread
From: Dan Williams @ 2015-09-23  4:42 UTC (permalink / raw)
  To: akpm
  Cc: Dave Hansen, linux-nvdimm, David Airlie, linux-kernel, linux-mm,
	linux-fsdevel

Convert the raw unsigned long 'pfn' argument to __pfn_t for the purpose of
evaluating the PFN_MAP and PFN_DEV flags.  When both are set the it
triggers _PAGE_DEVMAP to be set in the resulting pte.  This flag will
later be used in the get_user_pages() path to pin the page mapping,
dynamically allocated by devm_memremap_pages(), until all the resulting
pages are released.

There are no functional changes to the gpu drivers as a result of this
conversion.

This uncovered several architectures with no local definition for
pfn_pte(), in response __pfn_t_pte() is only defined when an arch
opts-in by "#define pfn_pte pfn_pte".
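
As a rough illustration of the call-site change (not taken from any driver in
this series; example_gem_fault() and example_obj_pfn() are made-up names), a
fault handler that previously passed a raw pfn now wraps it:

static int example_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	/* made-up helper standing in for the driver's pfn lookup */
	unsigned long pfn = example_obj_pfn(vma, vmf);
	int rc;

	/*
	 * PFN_DEV alone keeps today's behaviour; a driver whose range
	 * came from devm_memremap_pages() would pass PFN_DEV | PFN_MAP
	 * and get _PAGE_DEVMAP set in the resulting pte.
	 */
	rc = vm_insert_mixed(vma, (unsigned long) vmf->virtual_address,
			pfn_to_pfn_t(pfn, PFN_DEV));
	if (rc == -ENOMEM)
		return VM_FAULT_OOM;
	if (rc < 0 && rc != -EBUSY)
		return VM_FAULT_SIGBUS;

	return VM_FAULT_NOPAGE;
}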

Cc: Dave Hansen <dave@sr71.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Airlie <airlied@linux.ie>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/alpha/include/asm/pgtable.h        |    1 +
 arch/m68k/include/asm/page_no.h         |    1 +
 arch/parisc/include/asm/pgtable.h       |    1 +
 arch/powerpc/include/asm/pgtable.h      |    1 +
 arch/tile/include/asm/pgtable.h         |    1 +
 arch/um/include/asm/pgtable-3level.h    |    1 +
 arch/x86/include/asm/pgtable.h          |   18 ++++++++++++++++++
 arch/x86/include/asm/pgtable_types.h    |    7 ++++++-
 drivers/gpu/drm/exynos/exynos_drm_gem.c |    3 ++-
 drivers/gpu/drm/gma500/framebuffer.c    |    2 +-
 drivers/gpu/drm/msm/msm_gem.c           |    3 ++-
 drivers/gpu/drm/omapdrm/omap_gem.c      |    6 ++++--
 drivers/gpu/drm/ttm/ttm_bo_vm.c         |    3 ++-
 fs/dax.c                                |    2 +-
 include/linux/mm.h                      |   29 ++++++++++++++++++++++++++++-
 mm/memory.c                             |   15 +++++++++------
 16 files changed, 79 insertions(+), 15 deletions(-)

diff --git a/arch/alpha/include/asm/pgtable.h b/arch/alpha/include/asm/pgtable.h
index a9a119592372..a54050fe867e 100644
--- a/arch/alpha/include/asm/pgtable.h
+++ b/arch/alpha/include/asm/pgtable.h
@@ -216,6 +216,7 @@ extern unsigned long __zero_page(void);
 })
 #endif
 
+#define pfn_pte pfn_pte
 extern inline pte_t pfn_pte(unsigned long physpfn, pgprot_t pgprot)
 { pte_t pte; pte_val(pte) = (PHYS_TWIDDLE(physpfn) << 32) | pgprot_val(pgprot); return pte; }
 
diff --git a/arch/m68k/include/asm/page_no.h b/arch/m68k/include/asm/page_no.h
index ef209169579a..930a42f6db44 100644
--- a/arch/m68k/include/asm/page_no.h
+++ b/arch/m68k/include/asm/page_no.h
@@ -34,6 +34,7 @@ extern unsigned long memory_end;
 
 #define	virt_addr_valid(kaddr)	(((void *)(kaddr) >= (void *)PAGE_OFFSET) && \
 				((void *)(kaddr) < (void *)memory_end))
+#define __pfn_to_phys(pfn)	PFN_PHYS(pfn)
 
 #endif /* __ASSEMBLY__ */
 
diff --git a/arch/parisc/include/asm/pgtable.h b/arch/parisc/include/asm/pgtable.h
index f93c4a4e6580..dde7dd7200bd 100644
--- a/arch/parisc/include/asm/pgtable.h
+++ b/arch/parisc/include/asm/pgtable.h
@@ -377,6 +377,7 @@ static inline pte_t pte_mkspecial(pte_t pte)	{ return pte; }
 
 #define mk_pte(page, pgprot)	pfn_pte(page_to_pfn(page), (pgprot))
 
+#define pfn_pte pfn_pte
 static inline pte_t pfn_pte(unsigned long pfn, pgprot_t pgprot)
 {
 	pte_t pte;
diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index 0717693c8428..8448ff1542e0 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -67,6 +67,7 @@ static inline int pte_present(pte_t pte)
  * Even if PTEs can be unsigned long long, a PFN is always an unsigned
  * long for now.
  */
+#define pfn_pte pfn_pte
 static inline pte_t pfn_pte(unsigned long pfn, pgprot_t pgprot) {
 	return __pte(((pte_basic_t)(pfn) << PTE_RPN_SHIFT) |
 		     pgprot_val(pgprot)); }
diff --git a/arch/tile/include/asm/pgtable.h b/arch/tile/include/asm/pgtable.h
index 2b05ccbebed9..37c9aa3a3f0c 100644
--- a/arch/tile/include/asm/pgtable.h
+++ b/arch/tile/include/asm/pgtable.h
@@ -275,6 +275,7 @@ static inline unsigned long pte_pfn(pte_t pte)
 extern pgprot_t set_remote_cache_cpu(pgprot_t prot, int cpu);
 extern int get_remote_cache_cpu(pgprot_t prot);
 
+#define pfn_pte pfn_pte
 static inline pte_t pfn_pte(unsigned long pfn, pgprot_t prot)
 {
 	return hv_pte_set_pa(prot, PFN_PHYS(pfn));
diff --git a/arch/um/include/asm/pgtable-3level.h b/arch/um/include/asm/pgtable-3level.h
index 2b4274e7c095..4de681d15911 100644
--- a/arch/um/include/asm/pgtable-3level.h
+++ b/arch/um/include/asm/pgtable-3level.h
@@ -98,6 +98,7 @@ static inline unsigned long pte_pfn(pte_t pte)
 	return phys_to_pfn(pte_val(pte));
 }
 
+#define pfn_pte pfn_pte
 static inline pte_t pfn_pte(pfn_t page_nr, pgprot_t pgprot)
 {
 	pte_t pte;
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 867da5bbb4a3..02a54e5b7930 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -248,6 +248,11 @@ static inline pte_t pte_mkspecial(pte_t pte)
 	return pte_set_flags(pte, _PAGE_SPECIAL);
 }
 
+static inline pte_t pte_mkdevmap(pte_t pte)
+{
+	return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP);
+}
+
 static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set)
 {
 	pmdval_t v = native_pmd_val(pmd);
@@ -334,6 +339,7 @@ static inline pgprotval_t massage_pgprot(pgprot_t pgprot)
 	return protval;
 }
 
+#define pfn_pte pfn_pte
 static inline pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot)
 {
 	return __pte(((phys_addr_t)page_nr << PAGE_SHIFT) |
@@ -446,6 +452,12 @@ static inline int pte_present(pte_t a)
 	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
 }
 
+#define pte_devmap pte_devmap
+static inline int pte_devmap(pte_t a)
+{
+	return pte_flags(a) & _PAGE_DEVMAP;
+}
+
 #define pte_accessible pte_accessible
 static inline bool pte_accessible(struct mm_struct *mm, pte_t a)
 {
@@ -464,6 +476,12 @@ static inline int pte_hidden(pte_t pte)
 	return pte_flags(pte) & _PAGE_HIDDEN;
 }
 
+#define pmd_devmap pmd_devmap
+static inline int pmd_devmap(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_DEVMAP;
+}
+
 static inline int pmd_present(pmd_t pmd)
 {
 	/*
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 13f310bfc09a..42d34e795123 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -25,7 +25,9 @@
 #define _PAGE_BIT_SPLITTING	_PAGE_BIT_SOFTW2 /* only valid on a PSE pmd */
 #define _PAGE_BIT_HIDDEN	_PAGE_BIT_SOFTW3 /* hidden by kmemcheck */
 #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
-#define _PAGE_BIT_NX           63       /* No execute: only valid after cpuid check */
+#define _PAGE_BIT_SOFTW4	58	/* available for programmer */
+#define _PAGE_BIT_DEVMAP		_PAGE_BIT_SOFTW4
+#define _PAGE_BIT_NX		63	/* No execute: only valid after cpuid check */
 
 /* If _PAGE_BIT_PRESENT is clear, we use these: */
 /* - if the user mapped it with PROT_NONE; pte_present gives true */
@@ -85,8 +87,11 @@
 
 #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
 #define _PAGE_NX	(_AT(pteval_t, 1) << _PAGE_BIT_NX)
+#define _PAGE_DEVMAP	(_AT(pteval_t, 1) << _PAGE_BIT_DEVMAP)
+#define __HAVE_ARCH_PTE_DEVMAP
 #else
 #define _PAGE_NX	(_AT(pteval_t, 0))
+#define _PAGE_DEVMAP	(_AT(pteval_t, 0))
 #endif
 
 #define _PAGE_PROTNONE	(_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
diff --git a/drivers/gpu/drm/exynos/exynos_drm_gem.c b/drivers/gpu/drm/exynos/exynos_drm_gem.c
index f12fbc36b120..3df44766a72f 100644
--- a/drivers/gpu/drm/exynos/exynos_drm_gem.c
+++ b/drivers/gpu/drm/exynos/exynos_drm_gem.c
@@ -492,7 +492,8 @@ int exynos_drm_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	}
 
 	pfn = page_to_pfn(exynos_gem_obj->pages[page_offset]);
-	ret = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address, pfn);
+	ret = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
+			pfn_to_pfn_t(pfn, PFN_DEV));
 
 out:
 	switch (ret) {
diff --git a/drivers/gpu/drm/gma500/framebuffer.c b/drivers/gpu/drm/gma500/framebuffer.c
index 2eaf1b31c7bd..a3e64aca9dce 100644
--- a/drivers/gpu/drm/gma500/framebuffer.c
+++ b/drivers/gpu/drm/gma500/framebuffer.c
@@ -132,7 +132,7 @@ static int psbfb_vm_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	for (i = 0; i < page_num; i++) {
 		pfn = (phys_addr >> PAGE_SHIFT);
 
-		ret = vm_insert_mixed(vma, address, pfn);
+		ret = vm_insert_mixed(vma, address, pfn_to_pfn_t(pfn, PFN_DEV));
 		if (unlikely((ret == -EBUSY) || (ret != 0 && i > 0)))
 			break;
 		else if (unlikely(ret != 0)) {
diff --git a/drivers/gpu/drm/msm/msm_gem.c b/drivers/gpu/drm/msm/msm_gem.c
index c76cc853b08a..0f4ed5bfda83 100644
--- a/drivers/gpu/drm/msm/msm_gem.c
+++ b/drivers/gpu/drm/msm/msm_gem.c
@@ -222,7 +222,8 @@ int msm_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	VERB("Inserting %p pfn %lx, pa %lx", vmf->virtual_address,
 			pfn, pfn << PAGE_SHIFT);
 
-	ret = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address, pfn);
+	ret = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
+			pfn_to_pfn_t(pfn, PFN_DEV));
 
 out_unlock:
 	mutex_unlock(&dev->struct_mutex);
diff --git a/drivers/gpu/drm/omapdrm/omap_gem.c b/drivers/gpu/drm/omapdrm/omap_gem.c
index 7ed08fdc4c42..910cb276a7ea 100644
--- a/drivers/gpu/drm/omapdrm/omap_gem.c
+++ b/drivers/gpu/drm/omapdrm/omap_gem.c
@@ -385,7 +385,8 @@ static int fault_1d(struct drm_gem_object *obj,
 	VERB("Inserting %p pfn %lx, pa %lx", vmf->virtual_address,
 			pfn, pfn << PAGE_SHIFT);
 
-	return vm_insert_mixed(vma, (unsigned long)vmf->virtual_address, pfn);
+	return vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
+			pfn_to_pfn_t(pfn, PFN_DEV));
 }
 
 /* Special handling for the case of faulting in 2d tiled buffers */
@@ -478,7 +479,8 @@ static int fault_2d(struct drm_gem_object *obj,
 			pfn, pfn << PAGE_SHIFT);
 
 	for (i = n; i > 0; i--) {
-		vm_insert_mixed(vma, (unsigned long)vaddr, pfn);
+		vm_insert_mixed(vma, (unsigned long)vaddr,
+				pfn_to_pfn_t(pfn, PFN_DEV));
 		pfn += usergart[fmt].stride_pfn;
 		vaddr += PAGE_SIZE * m;
 	}
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index 8fb7213277cc..f27fa1534687 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -229,7 +229,8 @@ static int ttm_bo_vm_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 		}
 
 		if (vma->vm_flags & VM_MIXEDMAP)
-			ret = vm_insert_mixed(&cvma, address, pfn);
+			ret = vm_insert_mixed(&cvma, address,
+					pfn_to_pfn_t(pfn, PFN_DEV));
 		else
 			ret = vm_insert_pfn(&cvma, address, pfn);
 
diff --git a/fs/dax.c b/fs/dax.c
index 41d4f76e93ef..b93dbf363dc2 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -354,7 +354,7 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	}
 	dax_unmap_bh(bh, addr);
 
-	error = vm_insert_mixed(vma, vaddr, __pfn_t_to_pfn(pfn));
+	error = vm_insert_mixed(vma, vaddr, pfn);
 
  out:
 	return error;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6ea922de6870..54fbeda5b896 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -977,6 +977,33 @@ static inline __pfn_t page_to_pfn_t(struct page *page)
 	return pfn;
 }
 
+static inline int __pfn_t_valid(__pfn_t pfn)
+{
+	return pfn_valid(__pfn_t_to_pfn(pfn));
+}
+
+#ifdef pfn_pte
+static inline pte_t __pfn_t_pte(__pfn_t pfn, pgprot_t pgprot)
+{
+	return pfn_pte(__pfn_t_to_pfn(pfn), pgprot);
+}
+#endif
+
+#ifdef __HAVE_ARCH_PTE_DEVICE
+static inline bool __pfn_t_has_dev_pagemap(__pfn_t pfn)
+{
+	const unsigned long flags = PFN_DEV|PFN_MAP;
+
+	return (pfn.val & flags) == flags;
+}
+#else
+static inline bool __pfn_t_has_dev_pagemap(__pfn_t pfn)
+{
+	return false;
+}
+pte_t pte_mkdevmap(pte_t pte);
+#endif
+
 /*
  * Some inline functions in vmstat.h depend on page_zone()
  */
@@ -2160,7 +2187,7 @@ int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *);
 int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 			unsigned long pfn);
 int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
-			unsigned long pfn);
+			__pfn_t pfn);
 int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len);
 
 
diff --git a/mm/memory.c b/mm/memory.c
index 9cb27470fee9..6ec61c120289 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1517,7 +1517,7 @@ int vm_insert_page(struct vm_area_struct *vma, unsigned long addr,
 EXPORT_SYMBOL(vm_insert_page);
 
 static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
-			unsigned long pfn, pgprot_t prot)
+			__pfn_t pfn, pgprot_t prot)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	int retval;
@@ -1533,7 +1533,10 @@ static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 		goto out_unlock;
 
 	/* Ok, finally just insert the thing.. */
-	entry = pte_mkspecial(pfn_pte(pfn, prot));
+	if (__pfn_t_has_dev_pagemap(pfn))
+		entry = pte_mkdevmap(__pfn_t_pte(pfn, prot));
+	else
+		entry = pte_mkspecial(__pfn_t_pte(pfn, prot));
 	set_pte_at(mm, addr, pte, entry);
 	update_mmu_cache(vma, addr, pte); /* XXX: why not for insert_page? */
 
@@ -1583,14 +1586,14 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 	if (track_pfn_insert(vma, &pgprot, pfn))
 		return -EINVAL;
 
-	ret = insert_pfn(vma, addr, pfn, pgprot);
+	ret = insert_pfn(vma, addr, pfn_to_pfn_t(pfn, PFN_DEV), pgprot);
 
 	return ret;
 }
 EXPORT_SYMBOL(vm_insert_pfn);
 
 int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
-			unsigned long pfn)
+			__pfn_t pfn)
 {
 	BUG_ON(!(vma->vm_flags & VM_MIXEDMAP));
 
@@ -1604,10 +1607,10 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 	 * than insert_pfn).  If a zero_pfn were inserted into a VM_MIXEDMAP
 	 * without pte special, it would there be refcounted as a normal page.
 	 */
-	if (!HAVE_PTE_SPECIAL && pfn_valid(pfn)) {
+	if (!HAVE_PTE_SPECIAL && __pfn_t_valid(pfn)) {
 		struct page *page;
 
-		page = pfn_to_page(pfn);
+		page = __pfn_t_to_page(pfn);
 		return insert_page(vma, addr, page, vma->vm_page_prot);
 	}
 	return insert_pfn(vma, addr, pfn, vma->vm_page_prot);


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 13/15] mm, dax: convert vmf_insert_pfn_pmd() to __pfn_t
  2015-09-23  4:41 [PATCH 00/15] get_user_pages() for dax mappings Dan Williams
                   ` (11 preceding siblings ...)
  2015-09-23  4:42 ` [PATCH 12/15] mm, dax, gpu: convert vm_insert_mixed to __pfn_t, introduce _PAGE_DEVMAP Dan Williams
@ 2015-09-23  4:42 ` Dan Williams
  2015-09-23  4:42 ` [PATCH 14/15] mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup Dan Williams
  2015-09-23  4:42 ` [PATCH 15/15] mm, x86: get_user_pages() for dax mappings Dan Williams
  14 siblings, 0 replies; 39+ messages in thread
From: Dan Williams @ 2015-09-23  4:42 UTC (permalink / raw)
  To: akpm
  Cc: Dave Hansen, linux-nvdimm, linux-kernel, linux-mm,
	Alexander Viro, linux-fsdevel, Matthew Wilcox

Similar to the conversion of vm_insert_mixed(), use __pfn_t in
vmf_insert_pfn_pmd() to tag the resulting pmd with _PAGE_DEVMAP when the
pfn is backed by a devm_memremap_pages() mapping.
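
For reference, the resulting insertion rule boils down to the sketch below
(a condensed restatement of the mm/huge_memory.c change further down, not a
verbatim copy; example_mk_huge_entry() is a name used only for illustration):

static pmd_t example_mk_huge_entry(__pfn_t pfn, pgprot_t prot)
{
	pmd_t entry = pmd_mkhuge(__pfn_t_pmd(pfn, prot));

	/* only a PFN_DEV | PFN_MAP pfn is tagged with _PAGE_DEVMAP */
	if (__pfn_t_has_dev_pagemap(pfn))
		entry = pmd_mkdevmap(entry);

	return entry;
}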

Cc: Dave Hansen <dave@sr71.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/sparc/include/asm/pgtable_64.h |    2 ++
 arch/x86/include/asm/pgtable.h      |    6 ++++++
 arch/x86/mm/pat.c                   |    4 ++--
 fs/dax.c                            |    2 +-
 include/asm-generic/pgtable.h       |    6 ++++--
 include/linux/huge_mm.h             |    2 +-
 include/linux/mm.h                  |   27 +++++++++++++++++----------
 include/linux/pfn.h                 |    9 +++++++++
 mm/huge_memory.c                    |   10 ++++++----
 mm/memory.c                         |    2 +-
 10 files changed, 49 insertions(+), 21 deletions(-)

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 131d36fcd07a..496ef783c68c 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -234,6 +234,7 @@ extern struct page *mem_map_zero;
  * the first physical page in the machine is at some huge physical address,
  * such as 4GB.   This is common on a partitioned E10000, for example.
  */
+#define pfn_pte pfn_pte
 static inline pte_t pfn_pte(unsigned long pfn, pgprot_t prot)
 {
 	unsigned long paddr = pfn << PAGE_SHIFT;
@@ -244,6 +245,7 @@ static inline pte_t pfn_pte(unsigned long pfn, pgprot_t prot)
 #define mk_pte(page, pgprot)	pfn_pte(page_to_pfn(page), (pgprot))
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define pfn_pmd pfn_pmd
 static inline pmd_t pfn_pmd(unsigned long page_nr, pgprot_t pgprot)
 {
 	pte_t pte = pfn_pte(page_nr, pgprot);
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 02a54e5b7930..84d1346e1cda 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -282,6 +282,11 @@ static inline pmd_t pmd_mkdirty(pmd_t pmd)
 	return pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
 }
 
+static inline pmd_t pmd_mkdevmap(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_DEVMAP);
+}
+
 static inline pmd_t pmd_mkhuge(pmd_t pmd)
 {
 	return pmd_set_flags(pmd, _PAGE_PSE);
@@ -346,6 +351,7 @@ static inline pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot)
 		     massage_pgprot(pgprot));
 }
 
+#define pfn_pmd pfn_pmd
 static inline pmd_t pfn_pmd(unsigned long page_nr, pgprot_t pgprot)
 {
 	return __pmd(((phys_addr_t)page_nr << PAGE_SHIFT) |
diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c
index 188e3e07eeeb..2e02064dbe45 100644
--- a/arch/x86/mm/pat.c
+++ b/arch/x86/mm/pat.c
@@ -949,7 +949,7 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
 }
 
 int track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
-		     unsigned long pfn)
+		     __pfn_t pfn)
 {
 	enum page_cache_mode pcm;
 
@@ -957,7 +957,7 @@ int track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
 		return 0;
 
 	/* Set prot based on lookup */
-	pcm = lookup_memtype((resource_size_t)pfn << PAGE_SHIFT);
+	pcm = lookup_memtype(__pfn_t_to_phys(pfn));
 	*prot = __pgprot((pgprot_val(vma->vm_page_prot) & (~_PAGE_CACHE_MASK)) |
 			 cachemode2protval(pcm));
 
diff --git a/fs/dax.c b/fs/dax.c
index b93dbf363dc2..321966335f33 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -681,7 +681,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 			goto fallback;
 
 		result |= vmf_insert_pfn_pmd(vma, address, pmd,
-				__pfn_t_to_pfn(pfn), write);
+				pfn, write);
 	}
 
  out:
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 29c57b2cb344..a65f86061563 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -1,6 +1,8 @@
 #ifndef _ASM_GENERIC_PGTABLE_H
 #define _ASM_GENERIC_PGTABLE_H
 
+#include <linux/pfn.h>
+
 #ifndef __ASSEMBLY__
 #ifdef CONFIG_MMU
 
@@ -521,7 +523,7 @@ static inline int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
  * by vm_insert_pfn().
  */
 static inline int track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
-				   unsigned long pfn)
+				   __pfn_t pfn)
 {
 	return 0;
 }
@@ -549,7 +551,7 @@ extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
 			   unsigned long pfn, unsigned long addr,
 			   unsigned long size);
 extern int track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
-			    unsigned long pfn);
+			    __pfn_t pfn);
 extern int track_pfn_copy(struct vm_area_struct *vma);
 extern void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
 			unsigned long size);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ecb080d6ff42..824eb98c53fc 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -34,7 +34,7 @@ extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			unsigned long addr, pgprot_t newprot,
 			int prot_numa);
 int vmf_insert_pfn_pmd(struct vm_area_struct *, unsigned long addr, pmd_t *,
-			unsigned long pfn, bool write);
+			__pfn_t pfn, bool write);
 
 enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_FLAG,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 54fbeda5b896..989c5459bee7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -906,15 +906,6 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
 }
 
 /*
- * __pfn_t: encapsulates a page-frame number that is optionally backed
- * by memmap (struct page).  Whether a __pfn_t has a 'struct page'
- * backing is indicated by flags in the low bits of the value;
- */
-typedef struct {
-	unsigned long val;
-} __pfn_t;
-
-/*
  * PFN_SG_CHAIN - pfn is a pointer to the next scatterlist entry
  * PFN_SG_LAST - pfn references a page and is the last scatterlist entry
  * PFN_DEV - pfn is not covered by system memmap by default
@@ -989,7 +980,14 @@ static inline pte_t __pfn_t_pte(__pfn_t pfn, pgprot_t pgprot)
 }
 #endif
 
-#ifdef __HAVE_ARCH_PTE_DEVICE
+#ifdef pfn_pmd
+static inline pmd_t __pfn_t_pmd(__pfn_t pfn, pgprot_t pgprot)
+{
+	return pfn_pmd(__pfn_t_to_pfn(pfn), pgprot);
+}
+#endif
+
+#ifdef __HAVE_ARCH_PTE_DEVMAP
 static inline bool __pfn_t_has_dev_pagemap(__pfn_t pfn)
 {
 	const unsigned long flags = PFN_DEV|PFN_MAP;
@@ -1002,6 +1000,7 @@ static inline bool __pfn_t_has_dev_pagemap(__pfn_t pfn)
 	return false;
 }
 pte_t pte_mkdevmap(pte_t pte);
+pmd_t pmd_mkdevmap(pmd_t pmd);
 #endif
 
 /*
@@ -1767,6 +1766,14 @@ static inline void pgtable_pmd_page_dtor(struct page *page) {}
 
 #endif
 
+#ifndef pmd_devmap
+#define pmd_devmap(x) (0)
+#endif
+
+#ifndef pte_devmap
+#define pte_devmap(x) (0)
+#endif
+
 static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
 {
 	spinlock_t *ptl = pmd_lockptr(mm, pmd);
diff --git a/include/linux/pfn.h b/include/linux/pfn.h
index 7646637221f3..ebe7b30ff912 100644
--- a/include/linux/pfn.h
+++ b/include/linux/pfn.h
@@ -3,6 +3,15 @@
 
 #ifndef __ASSEMBLY__
 #include <linux/types.h>
+
+/*
+ * __pfn_t: encapsulates a page-frame number that is optionally backed
+ * by memmap (struct page).  Whether a __pfn_t has a 'struct page'
+ * backing is indicated by flags in the low bits of the value;
+ */
+typedef struct {
+	unsigned long val;
+} __pfn_t;
 #endif
 
 #define PFN_ALIGN(x)	(((unsigned long)(x) + (PAGE_SIZE - 1)) & PAGE_MASK)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4b06b8db9df2..d8f01783ab88 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -870,7 +870,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 }
 
 static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
-		pmd_t *pmd, unsigned long pfn, pgprot_t prot, bool write)
+		pmd_t *pmd, __pfn_t pfn, pgprot_t prot, bool write)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pmd_t entry;
@@ -878,7 +878,9 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 
 	ptl = pmd_lock(mm, pmd);
 	if (pmd_none(*pmd)) {
-		entry = pmd_mkhuge(pfn_pmd(pfn, prot));
+		entry = pmd_mkhuge(__pfn_t_pmd(pfn, prot));
+		if (__pfn_t_has_dev_pagemap(pfn))
+			entry = pmd_mkdevmap(entry);
 		if (write) {
 			entry = pmd_mkyoung(pmd_mkdirty(entry));
 			entry = maybe_pmd_mkwrite(entry, vma);
@@ -890,7 +892,7 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 }
 
 int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
-			pmd_t *pmd, unsigned long pfn, bool write)
+			pmd_t *pmd, __pfn_t pfn, bool write)
 {
 	pgprot_t pgprot = vma->vm_page_prot;
 	/*
@@ -902,7 +904,7 @@ int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 	BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
 						(VM_PFNMAP|VM_MIXEDMAP));
 	BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
-	BUG_ON((vma->vm_flags & VM_MIXEDMAP) && pfn_valid(pfn));
+	BUG_ON((vma->vm_flags & VM_MIXEDMAP) && __pfn_t_valid(pfn));
 
 	if (addr < vma->vm_start || addr >= vma->vm_end)
 		return VM_FAULT_SIGBUS;
diff --git a/mm/memory.c b/mm/memory.c
index 6ec61c120289..2e178d8fdeba 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1583,7 +1583,7 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 
 	if (addr < vma->vm_start || addr >= vma->vm_end)
 		return -EFAULT;
-	if (track_pfn_insert(vma, &pgprot, pfn))
+	if (track_pfn_insert(vma, &pgprot, pfn_to_pfn_t(pfn, PFN_DEV)))
 		return -EINVAL;
 
 	ret = insert_pfn(vma, addr, pfn_to_pfn_t(pfn, PFN_DEV), pgprot);


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 14/15] mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup
  2015-09-23  4:41 [PATCH 00/15] get_user_pages() for dax mappings Dan Williams
                   ` (12 preceding siblings ...)
  2015-09-23  4:42 ` [PATCH 13/15] mm, dax: convert vmf_insert_pfn_pmd() to __pfn_t Dan Williams
@ 2015-09-23  4:42 ` Dan Williams
  2015-10-02 21:21   ` Logan Gunthorpe
  2015-09-23  4:42 ` [PATCH 15/15] mm, x86: get_user_pages() for dax mappings Dan Williams
  14 siblings, 1 reply; 39+ messages in thread
From: Dan Williams @ 2015-09-23  4:42 UTC (permalink / raw)
  To: akpm
  Cc: Dave Hansen, linux-nvdimm, linux-kernel, linux-mm,
	Alexander Viro, linux-fsdevel, Matthew Wilcox, Ross Zwisler

get_dev_pagemap() enables paths like get_user_pages() to pin a dynamically
mapped pfn-range (devm_memremap_pages()) while the resulting struct page
objects are in use.  Unlike get_page() it may fail if the device is, or
is in the process of being, disabled.  While the initial lookup of the
range may be an expensive list walk, the result is cached to speed up
subsequent lookups, which are likely to be in the same mapped range.
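
The intended usage pattern is roughly the following (a sketch only;
example_pin_pfns() is not part of the series, error unwinding of
already-pinned pages is omitted, and the real users are the gup paths added
in the next patch, where the get_page() change keeps the pgmap pinned via
the page reference):

static int example_pin_pfns(unsigned long pfn, unsigned int nr,
		struct page **pages)
{
	struct dev_pagemap *pgmap = NULL;
	unsigned int i;

	for (i = 0; i < nr; i++) {
		/* pass back the last result so in-range lookups are cheap */
		pgmap = get_dev_pagemap(pfn + i, pgmap);
		if (!pgmap)
			return -ENXIO;	/* device gone or being disabled */

		pages[i] = pfn_to_page(pfn + i);
		get_page(pages[i]);	/* with patch 15, also pins the pgmap */
		put_dev_pagemap(pgmap);
	}

	return 0;
}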

Cc: Dave Hansen <dave@sr71.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/nvdimm/pmem.c    |    2 +
 include/linux/io.h       |   17 -----------
 include/linux/mm.h       |   62 ++++++++++++++++++++++++++++++++++++++++
 include/linux/mm_types.h |    6 +++-
 kernel/memremap.c        |   71 ++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 140 insertions(+), 18 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 1c670775129b..ac581a2e20e2 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -184,6 +184,7 @@ static void pmem_detach_disk(struct pmem_device *pmem)
 static int pmem_attach_disk(struct device *dev,
 		struct nd_namespace_common *ndns, struct pmem_device *pmem)
 {
+	struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
 	int nid = dev_to_node(dev);
 	struct gendisk *disk;
 
@@ -191,6 +192,7 @@ static int pmem_attach_disk(struct device *dev,
 	if (!pmem->pmem_queue)
 		return -ENOMEM;
 
+	devm_register_pagemap(dev, &nsio->res, &pmem->pmem_queue->dax_ref.count);
 	blk_queue_make_request(pmem->pmem_queue, pmem_make_request);
 	blk_queue_physical_block_size(pmem->pmem_queue, PAGE_SIZE);
 	blk_queue_max_hw_sectors(pmem->pmem_queue, UINT_MAX);
diff --git a/include/linux/io.h b/include/linux/io.h
index de64c1e53612..2f2f8859abd9 100644
--- a/include/linux/io.h
+++ b/include/linux/io.h
@@ -87,23 +87,6 @@ void *devm_memremap(struct device *dev, resource_size_t offset,
 		size_t size, unsigned long flags);
 void devm_memunmap(struct device *dev, void *addr);
 
-void *__devm_memremap_pages(struct device *dev, struct resource *res);
-
-#ifdef CONFIG_ZONE_DEVICE
-void *devm_memremap_pages(struct device *dev, struct resource *res);
-#else
-static inline void *devm_memremap_pages(struct device *dev, struct resource *res)
-{
-	/*
-	 * Fail attempts to call devm_memremap_pages() without
-	 * ZONE_DEVICE support enabled, this requires callers to fall
-	 * back to plain devm_memremap() based on config
-	 */
-	WARN_ON_ONCE(1);
-	return ERR_PTR(-ENXIO);
-}
-#endif
-
 /*
  * Some systems do not have legacy ISA devices.
  * /dev/port is not a valid interface on these systems.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 989c5459bee7..6183549a854c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -15,12 +15,14 @@
 #include <linux/debug_locks.h>
 #include <linux/mm_types.h>
 #include <linux/range.h>
+#include <linux/percpu-refcount.h>
 #include <linux/pfn.h>
 #include <linux/bit_spinlock.h>
 #include <linux/shrinker.h>
 #include <linux/resource.h>
 #include <linux/page_ext.h>
 #include <linux/err.h>
+#include <linux/ioport.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -558,6 +560,28 @@ static inline void init_page_count(struct page *page)
 void put_page(struct page *page);
 void put_pages_list(struct list_head *pages);
 
+#ifdef CONFIG_ZONE_DEVICE
+void *devm_memremap_pages(struct device *dev, struct resource *res);
+void devm_register_pagemap(struct device *dev, struct resource *res,
+		struct percpu_ref *ref);
+#else
+static inline void *devm_memremap_pages(struct device *dev, struct resource *res)
+{
+	/*
+	 * Fail attempts to call devm_memremap_pages() without
+	 * ZONE_DEVICE support enabled, this requires callers to fall
+	 * back to plain devm_memremap() based on config
+	 */
+	WARN_ON_ONCE(1);
+	return ERR_PTR(-ENXIO);
+}
+
+static inline void devm_register_pagemap(struct device *dev, struct resource *res,
+		struct percpu_ref *ref)
+{
+}
+#endif
+
 void split_page(struct page *page, unsigned int order);
 int split_free_page(struct page *page);
 
@@ -717,6 +741,44 @@ static inline enum zone_type page_zonenum(const struct page *page)
 	return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
 }
 
+/**
+ * struct dev_pagemap - reference count for a devm_memremap_pages mapping
+ * @res: physical address range covered by @ref
+ * @ref: reference count that pins the devm_memremap_pages() mapping
+ * @dev: host device of the mapping for debug
+ */
+struct dev_pagemap {
+	const struct resource *res;
+	struct percpu_ref *ref;
+	struct device *dev;
+};
+
+struct dev_pagemap *__get_dev_pagemap(resource_size_t phys);
+
+static inline struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
+		struct dev_pagemap *pgmap)
+{
+	resource_size_t phys = PFN_PHYS(pfn);
+
+	/*
+	 * In the cached case we're already holding a reference so we can
+	 * simply do a blind increment
+	 */
+	if (pgmap && phys >= pgmap->res->start && phys <= pgmap->res->end) {
+		percpu_ref_get(pgmap->ref);
+		return pgmap;
+	}
+
+	/* fall back to slow path lookup */
+	return __get_dev_pagemap(phys);
+}
+
+static inline void put_dev_pagemap(struct dev_pagemap *pgmap)
+{
+	if (pgmap)
+		percpu_ref_put(pgmap->ref);
+}
+
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define SECTION_IN_PAGE_FLAGS
 #endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3d6baa7d4534..20097e7b679a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -49,12 +49,16 @@ struct page {
 					 * updated asynchronously */
 	union {
 		struct address_space *mapping;	/* If low bit clear, points to
-						 * inode address_space, or NULL.
+						 * inode address_space, unless
+						 * the page is in ZONE_DEVICE
+						 * then it points to its parent
+						 * dev_pagemap, otherwise NULL.
 						 * If page mapped as anonymous
 						 * memory, low bit is set, and
 						 * it points to anon_vma object:
 						 * see PAGE_MAPPING_ANON below.
 						 */
+		struct dev_pagemap *pgmap;
 		void *s_mem;			/* slab first object */
 	};
 
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 0d818ce04129..74344dc8c31e 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -10,6 +10,7 @@
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  * General Public License for more details.
  */
+#include <linux/rculist.h>
 #include <linux/device.h>
 #include <linux/types.h>
 #include <linux/io.h>
@@ -137,16 +138,86 @@ void devm_memunmap(struct device *dev, void *addr)
 EXPORT_SYMBOL(devm_memunmap);
 
 #ifdef CONFIG_ZONE_DEVICE
+static LIST_HEAD(ranges);
+static DEFINE_SPINLOCK(range_lock);
+
 struct page_map {
 	struct resource res;
+	struct dev_pagemap pgmap;
+	struct list_head list;
 };
 
 static void devm_memremap_pages_release(struct device *dev, void *res)
 {
 	struct page_map *page_map = res;
+	struct dev_pagemap *pgmap = &page_map->pgmap;
 
 	/* pages are dead and unused, undo the arch mapping */
 	arch_remove_memory(page_map->res.start, resource_size(&page_map->res));
+
+	if (pgmap->res) {
+		spin_lock(&range_lock);
+		list_del_rcu(&page_map->list);
+		spin_unlock(&range_lock);
+		dev_WARN_ONCE(dev, !percpu_ref_is_zero(pgmap->ref),
+				"page mapping not idle in %s\n", __func__);
+	}
+}
+
+static int page_map_match(struct device *dev, void *res, void *match_data)
+{
+	struct page_map *page_map = res;
+	resource_size_t phys = *(resource_size_t *) match_data;
+
+	return page_map->res.start == phys;
+}
+
+void devm_register_pagemap(struct device *dev, struct resource *res,
+		struct percpu_ref *ref)
+{
+	struct page_map *page_map;
+	struct dev_pagemap *pgmap;
+	unsigned long pfn;
+
+	page_map = devres_find(dev, devm_memremap_pages_release,
+			page_map_match, &res->start);
+	dev_WARN_ONCE(dev, !page_map, "%s: no mapping found for %pa\n",
+			__func__, &res->start);
+	if (!page_map)
+		return;
+
+	pgmap = &page_map->pgmap;
+	pgmap->dev = dev;
+	pgmap->res = &page_map->res;
+	pgmap->ref = ref;
+	INIT_LIST_HEAD(&page_map->list);
+	spin_lock(&range_lock);
+	list_add_rcu(&page_map->list, &ranges);
+	spin_unlock(&range_lock);
+
+	for (pfn = res->start >> PAGE_SHIFT;
+			pfn < res->end >> PAGE_SHIFT; pfn++) {
+		struct page *page = pfn_to_page(pfn);
+
+		page->pgmap = pgmap;
+	}
+}
+EXPORT_SYMBOL(devm_register_pagemap);
+
+struct dev_pagemap *__get_dev_pagemap(resource_size_t phys)
+{
+	struct page_map *page_map;
+	struct dev_pagemap *pgmap = NULL;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(page_map, &ranges, list)
+		if (phys >= page_map->res.start && phys <= page_map->res.end) {
+			if (percpu_ref_tryget_live(page_map->pgmap.ref))
+				pgmap = &page_map->pgmap;
+			break;
+		}
+	rcu_read_unlock();
+	return pgmap;
 }
 
 void *devm_memremap_pages(struct device *dev, struct resource *res)


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 15/15] mm, x86: get_user_pages() for dax mappings
  2015-09-23  4:41 [PATCH 00/15] get_user_pages() for dax mappings Dan Williams
                   ` (13 preceding siblings ...)
  2015-09-23  4:42 ` [PATCH 14/15] mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup Dan Williams
@ 2015-09-23  4:42 ` Dan Williams
  14 siblings, 0 replies; 39+ messages in thread
From: Dan Williams @ 2015-09-23  4:42 UTC (permalink / raw)
  To: akpm
  Cc: Dave Hansen, linux-nvdimm, Peter Zijlstra, Dave Chinner,
	linux-kernel, linux-mm, Jeff Moyer, Ingo Molnar, Thomas Gleixner,
	Alexander Viro, H. Peter Anvin, linux-fsdevel, Matthew Wilcox,
	Ross Zwisler, Christoph Hellwig

A dax mapping establishes a pte with _PAGE_DEVMAP set when the driver
has established a devm_memremap_pages() mapping, i.e. when the __pfn_t
returned from ->direct_access() has PFN_DEV and PFN_MAP set.  Later, when
encountering _PAGE_DEVMAP during a page table walk, we look up and pin a
struct dev_pagemap instance to keep the result of pfn_to_page() valid
until put_page().
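
From a consumer's point of view the end result is that something like the
following now works against a DAX mapping (illustrative only;
example_pin_dax_range() is a made-up name and error handling is trimmed):

/* pin pages of a DAX mapping for DMA; pages stay valid until put_page() */
static int example_pin_dax_range(unsigned long uaddr, int nr,
		struct page **pages)
{
	int pinned = get_user_pages_fast(uaddr, nr, 1 /* write */, pages);

	if (pinned < 0)
		return pinned;

	/* ... program DMA/RDMA against the pinned pages ... */

	while (pinned--)
		put_page(pages[pinned]);	/* drops the device pin */

	return 0;
}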

Cc: Dave Hansen <dave@sr71.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/ia64/include/asm/pgtable.h |    1 +
 arch/x86/include/asm/pgtable.h  |    2 +
 arch/x86/mm/gup.c               |   56 +++++++++++++++++++++++++++++++++++++--
 include/linux/mm.h              |   42 ++++++++++++++++++++---------
 mm/gup.c                        |   11 +++++++-
 mm/hugetlb.c                    |   18 ++++++++++++-
 mm/swap.c                       |   15 ++++++++++
 7 files changed, 126 insertions(+), 19 deletions(-)

diff --git a/arch/ia64/include/asm/pgtable.h b/arch/ia64/include/asm/pgtable.h
index 9f3ed9ee8f13..81d2af23958f 100644
--- a/arch/ia64/include/asm/pgtable.h
+++ b/arch/ia64/include/asm/pgtable.h
@@ -273,6 +273,7 @@ extern unsigned long VMALLOC_END;
 #define pmd_clear(pmdp)			(pmd_val(*(pmdp)) = 0UL)
 #define pmd_page_vaddr(pmd)		((unsigned long) __va(pmd_val(pmd) & _PFN_MASK))
 #define pmd_page(pmd)			virt_to_page((pmd_val(pmd) + PAGE_OFFSET))
+#define pmd_pfn(pmd)			(pmd_val(pmd) >> PAGE_SHIFT)
 
 #define pud_none(pud)			(!pud_val(pud))
 #define pud_bad(pud)			(!ia64_phys_addr_valid(pud_val(pud)))
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 84d1346e1cda..d29dc7b4924b 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -461,7 +461,7 @@ static inline int pte_present(pte_t a)
 #define pte_devmap pte_devmap
 static inline int pte_devmap(pte_t a)
 {
-	return pte_flags(a) & _PAGE_DEVMAP;
+	return (pte_flags(a) & _PAGE_DEVMAP) == _PAGE_DEVMAP;
 }
 
 #define pte_accessible pte_accessible
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 81bf3d2af3eb..7254ba4f791d 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -63,6 +63,16 @@ retry:
 #endif
 }
 
+static void undo_dev_pagemap(int *nr, int nr_start, struct page **pages)
+{
+	while ((*nr) - nr_start) {
+		struct page *page = pages[--(*nr)];
+
+		ClearPageReferenced(page);
+		put_page(page);
+	}
+}
+
 /*
  * The performance critical leaf functions are made noinline otherwise gcc
  * inlines everything into a single function which results in too much
@@ -71,7 +81,9 @@ retry:
 static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
+	struct dev_pagemap *pgmap = NULL;
 	unsigned long mask;
+	int nr_start = *nr;
 	pte_t *ptep;
 
 	mask = _PAGE_PRESENT|_PAGE_USER;
@@ -89,13 +101,21 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 			return 0;
 		}
 
-		if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask) {
+		page = pte_page(pte);
+		if (pte_devmap(pte)) {
+			pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
+			if (unlikely(!pgmap)) {
+				undo_dev_pagemap(nr, nr_start, pages);
+				pte_unmap(ptep);
+				return 0;
+			}
+		} else if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask) {
 			pte_unmap(ptep);
 			return 0;
 		}
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
-		page = pte_page(pte);
 		get_page(page);
+		put_dev_pagemap(pgmap);
 		SetPageReferenced(page);
 		pages[*nr] = page;
 		(*nr)++;
@@ -114,6 +134,32 @@ static inline void get_head_page_multiple(struct page *page, int nr)
 	SetPageReferenced(page);
 }
 
+static int __gup_device_huge_pmd(pmd_t pmd, unsigned long addr,
+		unsigned long end, struct page **pages, int *nr)
+{
+	int nr_start = *nr;
+	unsigned long pfn = pmd_pfn(pmd);
+	struct dev_pagemap *pgmap = NULL;
+
+	pfn += (addr & ~PMD_MASK) >> PAGE_SHIFT;
+	do {
+		struct page *page = pfn_to_page(pfn);
+
+		pgmap = get_dev_pagemap(pfn, pgmap);
+		if (unlikely(!pgmap)) {
+			undo_dev_pagemap(nr, nr_start, pages);
+			return 0;
+		}
+		SetPageReferenced(page);
+		pages[*nr] = page;
+		get_page(page);
+		put_dev_pagemap(pgmap);
+		(*nr)++;
+		pfn++;
+	} while (addr += PAGE_SIZE, addr != end);
+	return 1;
+}
+
 static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
@@ -127,9 +173,13 @@ static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
 		mask |= _PAGE_RW;
 	if ((pte_flags(pte) & mask) != mask)
 		return 0;
+
+	VM_BUG_ON(!pfn_valid(pmd_pfn(pmd)));
+	if (pmd_devmap(pmd))
+		return __gup_device_huge_pmd(pmd, addr, end, pages, nr);
+
 	/* hugepages are never "special" */
 	VM_BUG_ON(pte_flags(pte) & _PAGE_SPECIAL);
-	VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 
 	refs = 0;
 	head = pte_page(pte);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6183549a854c..2aea87d8e702 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -522,19 +522,6 @@ static inline void get_huge_page_tail(struct page *page)
 
 extern bool __get_page_tail(struct page *page);
 
-static inline void get_page(struct page *page)
-{
-	if (unlikely(PageTail(page)))
-		if (likely(__get_page_tail(page)))
-			return;
-	/*
-	 * Getting a normal page or the head of a compound page
-	 * requires to already have an elevated page->_count.
-	 */
-	VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
-	atomic_inc(&page->_count);
-}
-
 static inline struct page *virt_to_head_page(const void *x)
 {
 	struct page *page = virt_to_page(x);
@@ -741,6 +728,18 @@ static inline enum zone_type page_zonenum(const struct page *page)
 	return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
 }
 
+#ifdef CONFIG_ZONE_DEVICE
+static inline bool is_zone_device_page(const struct page *page)
+{
+	return page_zonenum(page) == ZONE_DEVICE;
+}
+#else
+static inline bool is_zone_device_page(const struct page *page)
+{
+	return false;
+}
+#endif
+
 /**
  * struct dev_pagemap - reference count for a devm_memremap_pages mapping
  * @res: physical address range covered by @ref
@@ -753,6 +752,23 @@ struct dev_pagemap {
 	struct device *dev;
 };
 
+static inline void get_page(struct page *page)
+{
+	if (unlikely(PageTail(page)))
+		if (likely(__get_page_tail(page)))
+			return;
+
+	if (is_zone_device_page(page))
+		percpu_ref_get(page->pgmap->ref);
+
+	/*
+	 * Getting a normal page or the head of a compound page
+	 * requires to already have an elevated page->_count.
+	 */
+	VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
+	atomic_inc(&page->_count);
+}
+
 struct dev_pagemap *__get_dev_pagemap(resource_size_t phys);
 
 static inline struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
diff --git a/mm/gup.c b/mm/gup.c
index a798293fc648..1064e9a489a4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -98,7 +98,16 @@ retry:
 	}
 
 	page = vm_normal_page(vma, address, pte);
-	if (unlikely(!page)) {
+	if (!page && pte_devmap(pte) && (flags & FOLL_GET)) {
+		/*
+		 * Only return device mapping pages in the FOLL_GET case since
+		 * they are only valid while holding the pgmap reference.
+		 */
+		if (get_dev_pagemap(pte_pfn(pte), NULL))
+			page = pte_page(pte);
+		else
+			goto no_page;
+	} else if (unlikely(!page)) {
 		if (flags & FOLL_DUMP) {
 			/* Avoid special (like zero) pages in core dumps */
 			page = ERR_PTR(-EFAULT);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 999fb0aef8f1..0abacb331ae2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4221,7 +4221,23 @@ retry:
 	 */
 	if (!pmd_huge(*pmd))
 		goto out;
-	if (pmd_present(*pmd)) {
+	if (pmd_present(*pmd) && pmd_devmap(*pmd)) {
+		unsigned long pfn = pmd_pfn(*pmd);
+		struct dev_pagemap *pgmap;
+
+		/*
+		 * device mapped pages can only be returned if the
+		 * caller will manage the page reference count.
+		 */
+		if (!(flags & FOLL_GET))
+			goto out;
+		pgmap = get_dev_pagemap(pfn, NULL);
+		if (!pgmap)
+			goto out;
+		page = pfn_to_page(pfn);
+		get_page(page);
+		put_dev_pagemap(pgmap);
+	} else if (pmd_present(*pmd)) {
 		page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
 		if (flags & FOLL_GET)
 			get_page(page);
diff --git a/mm/swap.c b/mm/swap.c
index 983f692a47fd..05a8a51c648e 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -230,6 +230,19 @@ out_put_single:
 	}
 }
 
+static bool put_device_page(struct page *page)
+{
+	/*
+	 * ZONE_DEVICE pages are never "onlined" so their reference
+	 * counts never reach zero.  They are always owned by a device
+	 * driver, not the mm core.  I.e. the page is 'idle' when the
+	 * count is 1.
+	 */
+	VM_BUG_ON_PAGE(atomic_read(&page->_count) == 1, page);
+	put_dev_pagemap(page->pgmap);
+	return atomic_dec_return(&page->_count) == 1;
+}
+
 static void put_compound_page(struct page *page)
 {
 	struct page *page_head;
@@ -273,6 +286,8 @@ void put_page(struct page *page)
 {
 	if (unlikely(PageCompound(page)))
 		put_compound_page(page);
+	else if (is_zone_device_page(page))
+		put_device_page(page);
 	else if (put_page_testzero(page))
 		__put_single_page(page);
 }


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH 12/15] mm, dax, gpu: convert vm_insert_mixed to __pfn_t, introduce _PAGE_DEVMAP
  2015-09-23  4:42 ` [PATCH 12/15] mm, dax, gpu: convert vm_insert_mixed to __pfn_t, introduce _PAGE_DEVMAP Dan Williams
@ 2015-09-23 13:47   ` Geert Uytterhoeven
  2015-09-23 16:59     ` Dan Williams
  0 siblings, 1 reply; 39+ messages in thread
From: Geert Uytterhoeven @ 2015-09-23 13:47 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Dave Hansen, linux-nvdimm, David Airlie,
	linux-kernel, Linux MM, Linux FS Devel

Hi Dan,

On Wed, Sep 23, 2015 at 6:42 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> Convert the raw unsigned long 'pfn' argument to __pfn_t for the purpose of
> evaluating the PFN_MAP and PFN_DEV flags.  When both are set the it

s/the it/it/

> triggers _PAGE_DEVMAP to be set in the resulting pte.  This flag will
> later be used in the get_user_pages() path to pin the page mapping,
> dynamically allocated by devm_memremap_pages(), until all the resulting
> pages are released.
>
> There are no functional changes to the gpu drivers as a result of this
> conversion.
>
> This uncovered several architectures with no local definition for
> pfn_pte(), in response __pfn_t_pte() is only defined when an arch
> opts-in by "#define pfn_pte pfn_pte".
>
> Cc: Dave Hansen <dave@sr71.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Airlie <airlied@linux.ie>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

> diff --git a/arch/m68k/include/asm/page_no.h b/arch/m68k/include/asm/page_no.h
> index ef209169579a..930a42f6db44 100644
> --- a/arch/m68k/include/asm/page_no.h
> +++ b/arch/m68k/include/asm/page_no.h
> @@ -34,6 +34,7 @@ extern unsigned long memory_end;
>
>  #define        virt_addr_valid(kaddr)  (((void *)(kaddr) >= (void *)PAGE_OFFSET) && \
>                                 ((void *)(kaddr) < (void *)memory_end))
> +#define __pfn_to_phys(pfn)     PFN_PHYS(pfn)

The above change doesn't match the patch description?

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 11/15] mm, dax, pmem: introduce __pfn_t
  2015-09-23  4:42 ` [PATCH 11/15] mm, dax, pmem: introduce __pfn_t Dan Williams
@ 2015-09-23 16:02   ` Dave Hansen
  2015-09-23 23:36     ` Williams, Dan J
  0 siblings, 1 reply; 39+ messages in thread
From: Dave Hansen @ 2015-09-23 16:02 UTC (permalink / raw)
  To: Dan Williams, akpm
  Cc: linux-nvdimm, linux-kernel, linux-mm, linux-fsdevel,
	Ross Zwisler, Christoph Hellwig

On 09/22/2015 09:42 PM, Dan Williams wrote:
>  /*
> + * __pfn_t: encapsulates a page-frame number that is optionally backed
> + * by memmap (struct page).  Whether a __pfn_t has a 'struct page'
> + * backing is indicated by flags in the low bits of the value;
> + */
> +typedef struct {
> +	unsigned long val;
> +} __pfn_t;
> +
> +/*
> + * PFN_SG_CHAIN - pfn is a pointer to the next scatterlist entry
> + * PFN_SG_LAST - pfn references a page and is the last scatterlist entry
> + * PFN_DEV - pfn is not covered by system memmap by default
> + * PFN_MAP - pfn has a dynamic page mapping established by a device driver
> + */
> +enum {
> +	PFN_SHIFT = 4,
> +	PFN_MASK = (1UL << PFN_SHIFT) - 1,
> +	PFN_SG_CHAIN = (1UL << 0),
> +	PFN_SG_LAST = (1UL << 1),
> +	PFN_DEV = (1UL << 2),
> +	PFN_MAP = (1UL << 3),
> +};

Please forgive a little bikeshedding here...

Why __pfn_t?  Because the KVM code has a pfn_t?  If so, I think we
should rescue pfn_t from KVM and give them a kvm_pfn_t.

I think you should do one of two things:  Make PFN_SHIFT 12 so that a
physical addr can be stored in a __pfn_t with no work.  Or, use the
*high* 12 bits of __pfn_t.val.

If you use the high bits, *and* make it store a plain pfn when all the
bits are 0, then you get a zero-cost pfn<->__pfn_t conversion which will
hopefully generate the exact same code which is there today.

The one disadvantage here is that it makes it more likely that somebody
that's just setting __pfn_t.val=foo will get things subtly wrong
somehow, but that it will work most of the time.

Also, about naming...  PFN_SHIFT is a pretty awful name for this.  It
probably needs to be __PFN_T_SOMETHING.  We don't want folks doing
craziness like:

	unsigned long phys_addr = pfn << PFN_SHIFT.

Which *looks* OK.
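
To make the high-bits idea concrete, here is a minimal sketch (purely
illustrative names, not taken from any patch in this thread): the flags
occupy the top PAGE_SHIFT bits, so a value with no flags set is already
the plain pfn and the conversions reduce to a mask (or to nothing at all).

	#define SKETCH_PFN_FLAGS_MASK	(~PAGE_MASK << (BITS_PER_LONG - PAGE_SHIFT))
	#define SKETCH_PFN_DEV		(1UL << (BITS_PER_LONG - 3))
	#define SKETCH_PFN_MAP		(1UL << (BITS_PER_LONG - 4))

	typedef struct { unsigned long val; } sketch_pfn_t;

	/* a plain pfn is stored as-is, so this direction is free */
	static inline sketch_pfn_t sketch_pfn_to_pfn_t(unsigned long pfn)
	{
		return (sketch_pfn_t) { .val = pfn };
	}

	/* strip the (possibly zero) flag bits to recover the pfn */
	static inline unsigned long sketch_pfn_t_to_pfn(sketch_pfn_t pfn)
	{
		return pfn.val & ~SKETCH_PFN_FLAGS_MASK;
	}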

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 12/15] mm, dax, gpu: convert vm_insert_mixed to __pfn_t, introduce _PAGE_DEVMAP
  2015-09-23 13:47   ` Geert Uytterhoeven
@ 2015-09-23 16:59     ` Dan Williams
  0 siblings, 0 replies; 39+ messages in thread
From: Dan Williams @ 2015-09-23 16:59 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Andrew Morton, Dave Hansen, linux-nvdimm, David Airlie,
	linux-kernel, Linux MM, Linux FS Devel

On Wed, Sep 23, 2015 at 6:47 AM, Geert Uytterhoeven
<geert@linux-m68k.org> wrote:
> Hi Dan,
>
> On Wed, Sep 23, 2015 at 6:42 AM, Dan Williams <dan.j.williams@intel.com> wrote:
>> Convert the raw unsigned long 'pfn' argument to __pfn_t for the purpose of
>> evaluating the PFN_MAP and PFN_DEV flags.  When both are set the it
>
> s/the it/it/

yes.

>> triggers _PAGE_DEVMAP to be set in the resulting pte.  This flag will
>> later be used in the get_user_pages() path to pin the page mapping,
>> dynamically allocated by devm_memremap_pages(), until all the resulting
>> pages are released.
>>
>> There are no functional changes to the gpu drivers as a result of this
>> conversion.
>>
>> This uncovered several architectures with no local definition for
>> pfn_pte(), in response __pfn_t_pte() is only defined when an arch
>> opts-in by "#define pfn_pte pfn_pte".
>>
>> Cc: Dave Hansen <dave@sr71.net>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: David Airlie <airlied@linux.ie>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>
>> diff --git a/arch/m68k/include/asm/page_no.h b/arch/m68k/include/asm/page_no.h
>> index ef209169579a..930a42f6db44 100644
>> --- a/arch/m68k/include/asm/page_no.h
>> +++ b/arch/m68k/include/asm/page_no.h
>> @@ -34,6 +34,7 @@ extern unsigned long memory_end;
>>
>>  #define        virt_addr_valid(kaddr)  (((void *)(kaddr) >= (void *)PAGE_OFFSET) && \
>>                                 ((void *)(kaddr) < (void *)memory_end))
>> +#define __pfn_to_phys(pfn)     PFN_PHYS(pfn)
>
> The above change doesn't match the patch description?
>

I should have noted that this is a new compile error introduced by
this patch since include/linux/mm.h now calls __pfn_to_phys() and m68k
does not always have it defined.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 11/15] mm, dax, pmem: introduce __pfn_t
  2015-09-23 16:02   ` Dave Hansen
@ 2015-09-23 23:36     ` Williams, Dan J
  0 siblings, 0 replies; 39+ messages in thread
From: Williams, Dan J @ 2015-09-23 23:36 UTC (permalink / raw)
  To: dave
  Cc: kvm, linux-kernel, linux-mm, hch, akpm, linux-nvdimm,
	linux-fsdevel, ross.zwisler


[ adding kvm@vger.kernel.org ]

On Wed, 2015-09-23 at 09:02 -0700, Dave Hansen wrote:
> On 09/22/2015 09:42 PM, Dan Williams wrote:
> >  /*
> > + * __pfn_t: encapsulates a page-frame number that is optionally backed
> > + * by memmap (struct page).  Whether a __pfn_t has a 'struct page'
> > + * backing is indicated by flags in the low bits of the value;
> > + */
> > +typedef struct {
> > +	unsigned long val;
> > +} __pfn_t;
> > +
> > +/*
> > + * PFN_SG_CHAIN - pfn is a pointer to the next scatterlist entry
> > + * PFN_SG_LAST - pfn references a page and is the last scatterlist entry
> > + * PFN_DEV - pfn is not covered by system memmap by default
> > + * PFN_MAP - pfn has a dynamic page mapping established by a device driver
> > + */
> > +enum {
> > +	PFN_SHIFT = 4,
> > +	PFN_MASK = (1UL << PFN_SHIFT) - 1,
> > +	PFN_SG_CHAIN = (1UL << 0),
> > +	PFN_SG_LAST = (1UL << 1),
> > +	PFN_DEV = (1UL << 2),
> > +	PFN_MAP = (1UL << 3),
> > +};
> 
> Please forgive a little bikeshedding here...
> 
> Why __pfn_t?  Because the KVM code has a pfn_t?  If so, I think we
> should rescue pfn_t from KVM and give them a kvm_pfn_t.

Sounds good, once 0day has a chance to chew on the conversion I'll send
it out.

> I think you should do one of two things:  Make PFN_SHIFT 12 so that a
> physical addr can be stored in a __pfn_t with no work.  Or, use the
> *high* 12 bits of __pfn_t.val.
> 

pfn to pfn_t conversions are more likely, so the high bits sounds
better.

> If you use the high bits, *and* make it store a plain pfn when all the
> bits are 0, then you get a zero-cost pfn<->__pfn_t conversion which will
> hopefully generate the exact same code which is there today.
> 
> The one disadvantage here is that it makes it more likely that somebody
> that's just setting __pfn_t.val=foo will get things subtly wrong
> somehow, but that it will work most of the time.
> 
> Also, about naming...  PFN_SHIFT is a pretty awful name for this.  It
> probably needs to be __PFN_T_SOMETHING.  We don't want folks doing
> craziness like:
> 
> 	unsigned long phys_addr = pfn << PFN_SHIFT.
> 
> Which *looks* OK.

Heh, true.  It's now PFN_FLAGS_MASK in the reworked patch...


8<---
Subject: mm, dax, pmem: introduce pfn_t

From: Dan Williams <dan.j.williams@intel.com>

In preparation for enabling get_user_pages() operations on dax mappings,
introduce a type that encapsulates a page-frame-number that can also be
used to encode other information.  This other information is the
historical "page_link" encoding in a scatterlist, but can also denote
"device memory".  Where "device memory" is a set of pfns that are not
part of the kernel's linear mapping by default, but are accessed via the
same memory controller as ram.  The motivation for this new type is
large capacity persistent memory that optionally has struct page entries
in the 'memmap'.

When a driver, like pmem, has established a devm_memremap_pages()
mapping it needs to communicate to upper layers that the pfn has a page
backing.  This property will be leveraged in a later patch to enable
dax-gup.  For now, update all the ->direct_access() implementations to
communicate whether the returned pfn range is mapped.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Hansen <dave@sr71.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/powerpc/sysdev/axonram.c |    8 ++---
 drivers/block/brd.c           |    4 +--
 drivers/nvdimm/pmem.c         |   27 +++++++++--------
 drivers/s390/block/dcssblk.c  |   10 +++---
 fs/block_dev.c                |    2 +
 fs/dax.c                      |   23 ++++++++-------
 include/linux/blkdev.h        |    4 +--
 include/linux/mm.h            |   65 +++++++++++++++++++++++++++++++++++++++++
 include/linux/pfn.h           |    9 ++++++
 9 files changed, 112 insertions(+), 40 deletions(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 24ffab2572e8..ec3b072d20cf 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -141,15 +141,13 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
  */
 static long
 axon_ram_direct_access(struct block_device *device, sector_t sector,
-		       void __pmem **kaddr, unsigned long *pfn)
+		       void __pmem **kaddr, pfn_t *pfn)
 {
 	struct axon_ram_bank *bank = device->bd_disk->private_data;
 	loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;
-	void *addr = (void *)(bank->ph_addr + offset);
-
-	*kaddr = (void __pmem *)addr;
-	*pfn = virt_to_phys(addr) >> PAGE_SHIFT;
 
+	*kaddr = (void __pmem __force *) bank->io_addr + offset;
+	*pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV);
 	return bank->size - offset;
 }
 
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index f645a71ae827..796112e9174b 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -374,7 +374,7 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
 
 #ifdef CONFIG_BLK_DEV_RAM_DAX
 static long brd_direct_access(struct block_device *bdev, sector_t sector,
-			void __pmem **kaddr, unsigned long *pfn)
+			void __pmem **kaddr, pfn_t *pfn)
 {
 	struct brd_device *brd = bdev->bd_disk->private_data;
 	struct page *page;
@@ -385,7 +385,7 @@ static long brd_direct_access(struct block_device *bdev, sector_t sector,
 	if (!page)
 		return -ENOSPC;
 	*kaddr = (void __pmem *)page_address(page);
-	*pfn = page_to_pfn(page);
+	*pfn = page_to_pfn_t(page);
 
 	return PAGE_SIZE;
 }
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 3ee02af73ad0..af80c8ec2cc9 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -39,6 +39,7 @@ struct pmem_device {
 	phys_addr_t		phys_addr;
 	/* when non-zero this device is hosting a 'pfn' instance */
 	phys_addr_t		data_offset;
+	unsigned long		pfn_flags;
 	void __pmem		*virt_addr;
 	size_t			size;
 };
@@ -108,25 +109,22 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
 }
 
 static long pmem_direct_access(struct block_device *bdev, sector_t sector,
-		      void __pmem **kaddr, unsigned long *pfn)
+		      void __pmem **kaddr, pfn_t *pfn)
 {
 	struct pmem_device *pmem = bdev->bd_disk->private_data;
 	resource_size_t offset = sector * 512 + pmem->data_offset;
-	resource_size_t size;
+	resource_size_t size = pmem->size - offset;
 
-	if (pmem->data_offset) {
+	*kaddr = pmem->virt_addr + offset;
+	*pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);
+
+	if (pfn_t_has_page(*pfn)) {
 		/*
 		 * Limit the direct_access() size to what is covered by
 		 * the memmap
 		 */
-		size = (pmem->size - offset) & ~ND_PFN_MASK;
-	} else
-		size = pmem->size - offset;
-
-	/* FIXME convert DAX to comprehend that this mapping has a lifetime */
-	*kaddr = pmem->virt_addr + offset;
-	*pfn = (pmem->phys_addr + offset) >> PAGE_SHIFT;
-
+		size &= ~ND_PFN_MASK;
+	}
 	return size;
 }
 
@@ -158,9 +156,11 @@ static struct pmem_device *pmem_alloc(struct device *dev,
 		return ERR_PTR(-EBUSY);
 	}
 
-	if (pmem_should_map_pages(dev))
+	pmem->pfn_flags = PFN_DEV;
+	if (pmem_should_map_pages(dev)) {
 		pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, res);
-	else
+		pmem->pfn_flags |= PFN_MAP;
+	} else
 		pmem->virt_addr = (void __pmem *) devm_memremap(dev,
 				pmem->phys_addr, pmem->size,
 				ARCH_MEMREMAP_PMEM);
@@ -371,6 +371,7 @@ static int nvdimm_namespace_attach_pfn(struct nd_namespace_common *ndns)
 	pmem = dev_get_drvdata(dev);
 	devm_memunmap(dev, (void __force *) pmem->virt_addr);
 	pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, &nsio->res);
+	pmem->pfn_flags |= PFN_MAP;
 	if (IS_ERR(pmem->virt_addr)) {
 		rc = PTR_ERR(pmem->virt_addr);
 		goto err;
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index c212ce925ee6..f35d6c8f5fbb 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -29,7 +29,7 @@ static int dcssblk_open(struct block_device *bdev, fmode_t mode);
 static void dcssblk_release(struct gendisk *disk, fmode_t mode);
 static void dcssblk_make_request(struct request_queue *q, struct bio *bio);
 static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
-			 void __pmem **kaddr, unsigned long *pfn);
+			 void __pmem **kaddr, pfn_t *pfn);
 
 static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";
 
@@ -881,20 +881,18 @@ fail:
 
 static long
 dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
-			void __pmem **kaddr, unsigned long *pfn)
+			void __pmem **kaddr, pfn_t *pfn)
 {
 	struct dcssblk_dev_info *dev_info;
 	unsigned long offset, dev_sz;
-	void *addr;
 
 	dev_info = bdev->bd_disk->private_data;
 	if (!dev_info)
 		return -ENODEV;
 	dev_sz = dev_info->end - dev_info->start;
 	offset = secnum * 512;
-	addr = (void *) (dev_info->start + offset);
-	*pfn = virt_to_phys(addr) >> PAGE_SHIFT;
-	*kaddr = (void __pmem *) addr;
+	*kaddr = (void __pmem *) (dev_info->start + offset);
+	*pfn = phys_to_pfn_t(dev_info->start + offset, PFN_DEV);
 
 	return dev_sz - offset;
 }
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 073bb57adab1..c37a193695d4 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -442,7 +442,7 @@ EXPORT_SYMBOL_GPL(bdev_write_page);
  * accessible at this address.
  */
 long bdev_direct_access(struct block_device *bdev, sector_t sector,
-			void __pmem **addr, unsigned long *pfn, long size)
+			void __pmem **addr, pfn_t *pfn, long size)
 {
 	long avail;
 	const struct block_device_operations *ops = bdev->bd_disk->fops;
diff --git a/fs/dax.c b/fs/dax.c
index 358eea39e982..7bd81b48a71d 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -37,7 +37,7 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 	might_sleep();
 	do {
 		void __pmem *addr;
-		unsigned long pfn;
+		pfn_t pfn;
 		long count;
 
 		count = bdev_direct_access(bdev, sector, &addr, &pfn, size);
@@ -64,7 +64,7 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 EXPORT_SYMBOL_GPL(dax_clear_blocks);
 
 static void __pmem *__dax_map_bh(const struct buffer_head *bh, unsigned blkbits,
-		unsigned long *pfn, long *len)
+		pfn_t *pfn, long *len)
 {
 	long rc;
 	void __pmem *addr;
@@ -87,7 +87,7 @@ static void __pmem *__dax_map_bh(const struct buffer_head *bh, unsigned blkbits,
 
 static void __pmem *dax_map_bh(const struct buffer_head *bh, unsigned blkbits)
 {
-	unsigned long pfn;
+	pfn_t pfn;
 
 	return __dax_map_bh(bh, blkbits, &pfn, NULL);
 }
@@ -138,7 +138,7 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 	loff_t pos = start, max = start, bh_max = start;
 	int rw = iov_iter_rw(iter), rc;
 	long map_len = 0;
-	unsigned long pfn;
+	pfn_t pfn;
 	void __pmem *addr = NULL;
 	void __pmem *kmap = (void __pmem *) ERR_PTR(-EIO);
 	bool hole = false;
@@ -324,9 +324,9 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 			struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	unsigned long vaddr = (unsigned long)vmf->virtual_address;
-	unsigned long pfn;
 	void __pmem *addr;
 	pgoff_t size;
+	pfn_t pfn;
 	int error;
 
 	/*
@@ -354,7 +354,7 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	}
 	dax_unmap_bh(bh, addr);
 
-	error = vm_insert_mixed(vma, vaddr, pfn);
+	error = vm_insert_mixed(vma, vaddr, pfn_t_to_pfn(pfn));
 
  out:
 	return error;
@@ -604,7 +604,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	if (buffer_unwritten(&bh) || buffer_new(&bh)) {
 		int i;
 		long length;
-		unsigned long pfn;
+		pfn_t pfn;
 		void __pmem *kaddr = __dax_map_bh(&bh, blkbits, &pfn, &length);
 
 		if (IS_ERR(kaddr)) {
@@ -612,7 +612,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 			goto out;
 		}
 
-		if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR))
+		if ((length < PMD_SIZE) || (pfn_t_to_pfn(pfn) & PG_PMD_COLOUR))
 			goto fallback;
 
 		for (i = 0; i < PTRS_PER_PMD; i++)
@@ -668,8 +668,8 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		result = VM_FAULT_NOPAGE;
 		spin_unlock(ptl);
 	} else {
+		pfn_t pfn;
 		long length;
-		unsigned long pfn;
 		void __pmem *kaddr = __dax_map_bh(&bh, blkbits, &pfn, &length);
 
 		if (IS_ERR(kaddr)) {
@@ -677,10 +677,11 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 			goto out;
 		}
 		dax_unmap_bh(&bh, kaddr);
-		if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR))
+		if ((length < PMD_SIZE) || (pfn_t_to_pfn(pfn) & PG_PMD_COLOUR))
 			goto fallback;
 
-		result |= vmf_insert_pfn_pmd(vma, address, pmd, pfn, write);
+		result |= vmf_insert_pfn_pmd(vma, address, pmd,
+				pfn_t_to_pfn(pfn), write);
 	}
 
  out:
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 363d7df8d65c..0be8dbb982f3 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1634,7 +1634,7 @@ struct block_device_operations {
 	int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
 	int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
 	long (*direct_access)(struct block_device *, sector_t, void __pmem **,
-			unsigned long *pfn);
+			pfn_t *);
 	unsigned int (*check_events) (struct gendisk *disk,
 				      unsigned int clearing);
 	/* ->media_changed() is DEPRECATED, use ->check_events() instead */
@@ -1653,7 +1653,7 @@ extern int bdev_read_page(struct block_device *, sector_t, struct page *);
 extern int bdev_write_page(struct block_device *, sector_t, struct page *,
 						struct writeback_control *);
 extern long bdev_direct_access(struct block_device *, sector_t,
-		void __pmem **addr, unsigned long *pfn, long size);
+		void __pmem **addr, pfn_t *pfn, long size);
 #else /* CONFIG_BLOCK */
 
 struct block_device;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 91c08f6f0dc9..abe60eadc8ae 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -906,6 +906,71 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
 }
 
 /*
+ * PFN_FLAGS_MASK - mask of all the possible valid pfn_t flags
+ * PFN_SG_CHAIN - pfn is a pointer to the next scatterlist entry
+ * PFN_SG_LAST - pfn references a page and is the last scatterlist entry
+ * PFN_DEV - pfn is not covered by system memmap by default
+ * PFN_MAP - pfn has a dynamic page mapping established by a device driver
+ */
+#define PFN_FLAGS_MASK (~PAGE_MASK << (BITS_PER_LONG - PAGE_SHIFT))
+#define PFN_SG_CHAIN (1UL << (BITS_PER_LONG - 1))
+#define PFN_SG_LAST (1UL << (BITS_PER_LONG - 2))
+#define PFN_DEV (1UL << (BITS_PER_LONG - 3))
+#define PFN_MAP (1UL << (BITS_PER_LONG - 4))
+
+static inline pfn_t __pfn_to_pfn_t(unsigned long pfn, unsigned long flags)
+{
+	pfn_t pfn_t = { .val = pfn | (flags & PFN_FLAGS_MASK), };
+
+	return pfn_t;
+}
+
+/* a default pfn to pfn_t conversion assumes that @pfn is pfn_valid() */
+static inline pfn_t pfn_to_pfn_t(unsigned long pfn)
+{
+	return __pfn_to_pfn_t(pfn, 0);
+}
+
+static inline pfn_t phys_to_pfn_t(dma_addr_t addr, unsigned long flags)
+{
+	return __pfn_to_pfn_t(addr >> PAGE_SHIFT, flags);
+}
+
+static inline bool pfn_t_has_page(pfn_t pfn)
+{
+	return (pfn.val & PFN_MAP) == PFN_MAP || (pfn.val & PFN_DEV) == 0;
+}
+
+static inline unsigned long pfn_t_to_pfn(pfn_t pfn)
+{
+	return pfn.val & ~PFN_FLAGS_MASK;
+}
+
+static inline struct page *pfn_t_to_page(pfn_t pfn)
+{
+	if (pfn_t_has_page(pfn))
+		return pfn_to_page(pfn_t_to_pfn(pfn));
+	return NULL;
+}
+
+static inline dma_addr_t pfn_t_to_phys(pfn_t pfn)
+{
+	return PFN_PHYS(pfn_t_to_pfn(pfn));
+}
+
+static inline void *pfn_t_to_virt(pfn_t pfn)
+{
+	if (pfn_t_has_page(pfn))
+		return __va(pfn_t_to_phys(pfn));
+	return NULL;
+}
+
+static inline pfn_t page_to_pfn_t(struct page *page)
+{
+	return pfn_to_pfn_t(page_to_pfn(page));
+}
+
+/*
  * Some inline functions in vmstat.h depend on page_zone()
  */
 #include <linux/vmstat.h>
diff --git a/include/linux/pfn.h b/include/linux/pfn.h
index 7646637221f3..96df85985f16 100644
--- a/include/linux/pfn.h
+++ b/include/linux/pfn.h
@@ -3,6 +3,15 @@
 
 #ifndef __ASSEMBLY__
 #include <linux/types.h>
+
+/*
+ * pfn_t: encapsulates a page-frame number that is optionally backed
+ * by memmap (struct page).  Whether a pfn_t has a 'struct page'
+ * backing is indicated by flags in the high bits of the value.
+ */
+typedef struct {
+	unsigned long val;
+} pfn_t;
 #endif
 
 #define PFN_ALIGN(x)	(((unsigned long)(x) + (PAGE_SIZE - 1)) & PAGE_MASK)
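
A rough caller-side sketch of how the new helpers compose (hypothetical
function, not part of the patch): a ->direct_access() consumer can ask
whether the returned pfn_t is memmap-backed before touching struct page.

	static long sketch_map_sector(struct block_device *bdev, sector_t sector,
			long size, struct page **pagep)
	{
		void __pmem *addr;
		pfn_t pfn;
		long len;

		len = bdev_direct_access(bdev, sector, &addr, &pfn, size);
		if (len < 0)
			return len;

		/* only PFN_MAP (or ordinary ram) pfns have a struct page */
		*pagep = pfn_t_has_page(pfn) ? pfn_t_to_page(pfn) : NULL;
		return len;
	}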



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH 01/15] avr32: convert to asm-generic/memory_model.h
  2015-09-23  4:41 ` [PATCH 01/15] avr32: convert to asm-generic/memory_model.h Dan Williams
@ 2015-09-24 15:10   ` Christoph Hellwig
  2015-09-26  0:36     ` Dan Williams
  0 siblings, 1 reply; 39+ messages in thread
From: Christoph Hellwig @ 2015-09-24 15:10 UTC (permalink / raw)
  To: Dan Williams; +Cc: akpm, linux-fsdevel, linux-mm, linux-kernel, linux-nvdimm

On Wed, Sep 23, 2015 at 12:41:18AM -0400, Dan Williams wrote:
> Switch avr32/include/asm/page.h to use the common definitions for
> pfn_to_page(), page_to_pfn(), and ARCH_PFN_OFFSET.

This was the last architecture not using asm-generic/memory_model.h,
so it might be time to move it to linux/ or even fold it into an
existing header.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 04/15] x86, mm: quiet arch_add_memory()
  2015-09-23  4:41 ` [PATCH 04/15] x86, mm: quiet arch_add_memory() Dan Williams
@ 2015-09-24 15:10   ` Christoph Hellwig
  0 siblings, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2015-09-24 15:10 UTC (permalink / raw)
  To: Dan Williams; +Cc: akpm, linux-fsdevel, linux-mm, linux-kernel, linux-nvdimm

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 05/15] pmem: kill memremap_pmem()
  2015-09-23  4:41 ` [PATCH 05/15] pmem: kill memremap_pmem() Dan Williams
@ 2015-09-24 15:11   ` Christoph Hellwig
  0 siblings, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2015-09-24 15:11 UTC (permalink / raw)
  To: Dan Williams; +Cc: akpm, linux-fsdevel, linux-mm, linux-kernel, linux-nvdimm

Yes, please!

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 06/15] devm_memunmap: use devres_release()
  2015-09-23  4:41 ` [PATCH 06/15] devm_memunmap: use devres_release() Dan Williams
@ 2015-09-24 15:13   ` Christoph Hellwig
  0 siblings, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2015-09-24 15:13 UTC (permalink / raw)
  To: Dan Williams
  Cc: akpm, linux-nvdimm, linux-kernel, linux-mm, linux-fsdevel, Ross Zwisler

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 07/15] devm_memremap: convert to return ERR_PTR
  2015-09-23  4:41 ` [PATCH 07/15] devm_memremap: convert to return ERR_PTR Dan Williams
@ 2015-09-24 15:13   ` Christoph Hellwig
  0 siblings, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2015-09-24 15:13 UTC (permalink / raw)
  To: Dan Williams
  Cc: akpm, linux-nvdimm, linux-kernel, linux-mm, linux-fsdevel, Ross Zwisler

On Wed, Sep 23, 2015 at 12:41:50AM -0400, Dan Williams wrote:
> Make devm_memremap consistent with the error return scheme of
> devm_memremap_pages to remove special casing in the pmem driver.

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 08/15] block, dax, pmem: reference counting infrastructure
  2015-09-23  4:41 ` [PATCH 08/15] block, dax, pmem: reference counting infrastructure Dan Williams
@ 2015-09-24 15:15   ` Christoph Hellwig
  2015-09-25  0:03     ` Dan Williams
  0 siblings, 1 reply; 39+ messages in thread
From: Christoph Hellwig @ 2015-09-24 15:15 UTC (permalink / raw)
  To: Dan Williams
  Cc: akpm, Jens Axboe, linux-nvdimm, linux-kernel, linux-mm,
	linux-fsdevel, Ross Zwisler

On Wed, Sep 23, 2015 at 12:41:55AM -0400, Dan Williams wrote:
> Enable DAX to use a reference count for keeping the virtual address
> returned by ->direct_access() valid for the duration of its usage in
> fs/dax.c, or otherwise hold off blk_cleanup_queue() while
> pmem_make_request is active.  The blk-mq code is already in a position
> to need low overhead referece counting for races against request_queue
> destruction (blk_cleanup_queue()).  Given DAX-enabled block drivers do
> not enable blk-mq, share the storage in 'struct request_queue' between
> the two implementations.

Can we just move the refcounting to common code with the same field
name, and even initialize it for non-mq, non-dax queues but just never
> take a reference there (for now)?


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 08/15] block, dax, pmem: reference counting infrastructure
  2015-09-24 15:15   ` Christoph Hellwig
@ 2015-09-25  0:03     ` Dan Williams
  2015-09-25 11:32       ` Christoph Hellwig
  0 siblings, 1 reply; 39+ messages in thread
From: Dan Williams @ 2015-09-25  0:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Jens Axboe, linux-nvdimm, linux-kernel, Linux MM,
	linux-fsdevel, Ross Zwisler

On Thu, Sep 24, 2015 at 8:15 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Wed, Sep 23, 2015 at 12:41:55AM -0400, Dan Williams wrote:
>> Enable DAX to use a reference count for keeping the virtual address
>> returned by ->direct_access() valid for the duration of its usage in
>> fs/dax.c, or otherwise hold off blk_cleanup_queue() while
>> pmem_make_request is active.  The blk-mq code is already in a position
>> to need low overhead reference counting for races against request_queue
>> destruction (blk_cleanup_queue()).  Given DAX-enabled block drivers do
>> not enable blk-mq, share the storage in 'struct request_queue' between
>> the two implementations.
>
> Can we just move the refcounting to common code with the same field
> name, and even initialize it for non-mq, non-dax queues but just never
> take a reference there (for now)?

That makes sense to me, especially because drivers/nvdimm/blk.c is
broken in the same way as drivers/nvdimm/pmem.c and it would be
awkward to have it use blk_dax_get() / blk_dax_put().  The
percpu_refcount should be valid for all queues and it will only ever
be > 1 in the blk_mq and libnvdimm cases (for now).  Will fix.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 08/15] block, dax, pmem: reference counting infrastructure
  2015-09-25  0:03     ` Dan Williams
@ 2015-09-25 11:32       ` Christoph Hellwig
  2015-09-25 21:08         ` Williams, Dan J
  0 siblings, 1 reply; 39+ messages in thread
From: Christoph Hellwig @ 2015-09-25 11:32 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Andrew Morton, Jens Axboe, linux-nvdimm,
	linux-kernel, Linux MM, linux-fsdevel, Ross Zwisler

On Thu, Sep 24, 2015 at 05:03:18PM -0700, Dan Williams wrote:
> That makes sense to me, especially because drivers/nvdimm/blk.c is
> broken in the same way as drivers/nvdimm/pmem.c and it would be
> awkward to have it use blk_dax_get() / blk_dax_put().  The
> percpu_refcount should be valid for all queues and it will only ever
> be > 1 in the blk_mq and libnvdimm cases (for now).  Will fix.

Looking at this a bit more it might actually make sense to grab the
reference in common code before calling into ->make_request.

Jens, any opinion on that?


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 08/15] block, dax, pmem: reference counting infrastructure
  2015-09-25 11:32       ` Christoph Hellwig
@ 2015-09-25 21:08         ` Williams, Dan J
  0 siblings, 0 replies; 39+ messages in thread
From: Williams, Dan J @ 2015-09-25 21:08 UTC (permalink / raw)
  To: hch
  Cc: linux-kernel, linux-mm, ross.zwisler, linux-fsdevel, akpm, axboe,
	linux-nvdimm

On Fri, 2015-09-25 at 04:32 -0700, Christoph Hellwig wrote:
> On Thu, Sep 24, 2015 at 05:03:18PM -0700, Dan Williams wrote:
> > That makes sense to me, especially because drivers/nvdimm/blk.c is
> > broken in the same way as drivers/nvdimm/pmem.c and it would be
> > awkward to have it use blk_dax_get() / blk_dax_put().  The
> > percpu_refcount should be valid for all queues and it will only ever
> > be > 1 in the blk_mq and libnvdimm cases (for now).  Will fix.
> 
> Looking at this a bit more it might actually make sense to grab the
> reference in common code before calling into ->make_request.
> 
> Jens, any opinion on that?

...this works for me:

8<-----
Subject: [PATCH] block: generic request_queue reference counting

Allow pmem, and other synchronous/bio-based block drivers, to fallback
on a per-cpu reference count managed by the core for tracking queue
live/dead state.

The existing per-cpu reference count for the blk_mq case is promoted to
be used in all block i/o scenarios.  This involves initializing it by
default, waiting for it to drop to zero at exit, and holding a live
reference over the invocation of q->make_request_fn() in
generic_make_request().  The blk_mq code continues to take its own
reference per blk_mq request and retains the ability to freeze the
queue, but the check that the queue is frozen is moved to
generic_make_request().

This fixes crash signatures like the following:

 BUG: unable to handle kernel paging request at ffff880140000000
 [..]
 Call Trace:
  [<ffffffff8145e8bf>] ? copy_user_handle_tail+0x5f/0x70
  [<ffffffffa004e1e0>] pmem_do_bvec.isra.11+0x70/0xf0 [nd_pmem]
  [<ffffffffa004e331>] pmem_make_request+0xd1/0x200 [nd_pmem]
  [<ffffffff811c3162>] ? mempool_alloc+0x72/0x1a0
  [<ffffffff8141f8b6>] generic_make_request+0xd6/0x110
  [<ffffffff8141f966>] submit_bio+0x76/0x170
  [<ffffffff81286dff>] submit_bh_wbc+0x12f/0x160
  [<ffffffff81286e62>] submit_bh+0x12/0x20
  [<ffffffff813395bd>] jbd2_write_superblock+0x8d/0x170
  [<ffffffff8133974d>] jbd2_mark_journal_empty+0x5d/0x90
  [<ffffffff813399cb>] jbd2_journal_destroy+0x24b/0x270
  [<ffffffff810bc4ca>] ? put_pwq_unlocked+0x2a/0x30
  [<ffffffff810bc6f5>] ? destroy_workqueue+0x225/0x250
  [<ffffffff81303494>] ext4_put_super+0x64/0x360
  [<ffffffff8124ab1a>] generic_shutdown_super+0x6a/0xf0

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 block/blk-core.c       | 71 ++++++++++++++++++++++++++++++++++++++------
 block/blk-mq-sysfs.c   |  6 ----
 block/blk-mq.c         | 80 +++++++++++++++-----------------------------------
 block/blk-sysfs.c      |  3 +-
 block/blk.h            | 14 +++++++++
 include/linux/blk-mq.h |  1 -
 include/linux/blkdev.h |  2 +-
 7 files changed, 102 insertions(+), 75 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 2eb722d48773..6062550baaef 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -554,19 +554,17 @@ void blk_cleanup_queue(struct request_queue *q)
 	 * Drain all requests queued before DYING marking. Set DEAD flag to
 	 * prevent that q->request_fn() gets invoked after draining finished.
 	 */
-	if (q->mq_ops) {
-		blk_mq_freeze_queue(q);
-		spin_lock_irq(lock);
-	} else {
-		spin_lock_irq(lock);
+	blk_freeze_queue(q);
+	spin_lock_irq(lock);
+	if (!q->mq_ops)
 		__blk_drain_queue(q, true);
-	}
 	queue_flag_set(QUEUE_FLAG_DEAD, q);
 	spin_unlock_irq(lock);
 
 	/* @q won't process any more request, flush async actions */
 	del_timer_sync(&q->backing_dev_info.laptop_mode_wb_timer);
 	blk_sync_queue(q);
+	percpu_ref_exit(&q->q_usage_counter);
 
 	if (q->mq_ops)
 		blk_mq_free_queue(q);
@@ -629,6 +627,40 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(blk_alloc_queue);
 
+int blk_queue_enter(struct request_queue *q, gfp_t gfp)
+{
+	while (true) {
+		int ret;
+
+		if (percpu_ref_tryget_live(&q->q_usage_counter))
+			return 0;
+
+		if (!(gfp & __GFP_WAIT))
+			return -EBUSY;
+
+		ret = wait_event_interruptible(q->mq_freeze_wq,
+				!atomic_read(&q->mq_freeze_depth) ||
+				blk_queue_dying(q));
+		if (blk_queue_dying(q))
+			return -ENODEV;
+		if (ret)
+			return ret;
+	}
+}
+
+void blk_queue_exit(struct request_queue *q)
+{
+	percpu_ref_put(&q->q_usage_counter);
+}
+
+static void blk_queue_usage_counter_release(struct percpu_ref *ref)
+{
+	struct request_queue *q =
+		container_of(ref, struct request_queue, q_usage_counter);
+
+	wake_up_all(&q->mq_freeze_wq);
+}
+
 struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 {
 	struct request_queue *q;
@@ -690,11 +722,22 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 
 	init_waitqueue_head(&q->mq_freeze_wq);
 
-	if (blkcg_init_queue(q))
+	/*
+	 * Init percpu_ref in atomic mode so that it's faster to shutdown.
+	 * See blk_register_queue() for details.
+	 */
+	if (percpu_ref_init(&q->q_usage_counter,
+				blk_queue_usage_counter_release,
+				PERCPU_REF_INIT_ATOMIC, GFP_KERNEL))
 		goto fail_bdi;
 
+	if (blkcg_init_queue(q))
+		goto fail_ref;
+
 	return q;
 
+fail_ref:
+	percpu_ref_exit(&q->q_usage_counter);
 fail_bdi:
 	bdi_destroy(&q->backing_dev_info);
 fail_split:
@@ -1966,9 +2009,19 @@ void generic_make_request(struct bio *bio)
 	do {
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
-		q->make_request_fn(q, bio);
+		if (likely(blk_queue_enter(q, __GFP_WAIT) == 0)) {
+
+			q->make_request_fn(q, bio);
+
+			blk_queue_exit(q);
 
-		bio = bio_list_pop(current->bio_list);
+			bio = bio_list_pop(current->bio_list);
+		} else {
+			struct bio *bio_next = bio_list_pop(current->bio_list);
+
+			bio_io_error(bio);
+			bio = bio_next;
+		}
 	} while (bio);
 	current->bio_list = NULL; /* deactivate */
 }
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 279c5d674edf..731b6eecce82 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -412,12 +412,6 @@ static void blk_mq_sysfs_init(struct request_queue *q)
 		kobject_init(&ctx->kobj, &blk_mq_ctx_ktype);
 }
 
-/* see blk_register_queue() */
-void blk_mq_finish_init(struct request_queue *q)
-{
-	percpu_ref_switch_to_percpu(&q->mq_usage_counter);
-}
-
 int blk_mq_register_disk(struct gendisk *disk)
 {
 	struct device *dev = disk_to_dev(disk);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f2d67b4047a0..6d91894cf85e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -77,47 +77,13 @@ static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
 	clear_bit(CTX_TO_BIT(hctx, ctx), &bm->word);
 }
 
-static int blk_mq_queue_enter(struct request_queue *q, gfp_t gfp)
-{
-	while (true) {
-		int ret;
-
-		if (percpu_ref_tryget_live(&q->mq_usage_counter))
-			return 0;
-
-		if (!(gfp & __GFP_WAIT))
-			return -EBUSY;
-
-		ret = wait_event_interruptible(q->mq_freeze_wq,
-				!atomic_read(&q->mq_freeze_depth) ||
-				blk_queue_dying(q));
-		if (blk_queue_dying(q))
-			return -ENODEV;
-		if (ret)
-			return ret;
-	}
-}
-
-static void blk_mq_queue_exit(struct request_queue *q)
-{
-	percpu_ref_put(&q->mq_usage_counter);
-}
-
-static void blk_mq_usage_counter_release(struct percpu_ref *ref)
-{
-	struct request_queue *q =
-		container_of(ref, struct request_queue, mq_usage_counter);
-
-	wake_up_all(&q->mq_freeze_wq);
-}
-
 void blk_mq_freeze_queue_start(struct request_queue *q)
 {
 	int freeze_depth;
 
 	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
 	if (freeze_depth == 1) {
-		percpu_ref_kill(&q->mq_usage_counter);
+		percpu_ref_kill(&q->q_usage_counter);
 		blk_mq_run_hw_queues(q, false);
 	}
 }
@@ -125,18 +91,34 @@ EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_start);
 
 static void blk_mq_freeze_queue_wait(struct request_queue *q)
 {
-	wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->mq_usage_counter));
+	wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
 }
 
 /*
  * Guarantee no request is in use, so we can change any data structure of
  * the queue afterward.
  */
-void blk_mq_freeze_queue(struct request_queue *q)
+void blk_freeze_queue(struct request_queue *q)
 {
+	/*
+	 * In the !blk_mq case we are only calling this to kill the
+	 * q_usage_counter, otherwise this increases the freeze depth
+	 * and waits for it to return to zero.  For this reason there is
+	 * no blk_unfreeze_queue(), and blk_freeze_queue() is not
+	 * exported to drivers as the only user for unfreeze is blk_mq.
+	 */
 	blk_mq_freeze_queue_start(q);
 	blk_mq_freeze_queue_wait(q);
 }
+
+void blk_mq_freeze_queue(struct request_queue *q)
+{
+	/*
+	 * ...just an alias to keep freeze and unfreeze actions balanced
+	 * in the blk_mq_* namespace
+	 */
+	blk_freeze_queue(q);
+}
 EXPORT_SYMBOL_GPL(blk_mq_freeze_queue);
 
 void blk_mq_unfreeze_queue(struct request_queue *q)
@@ -146,7 +128,7 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
 	freeze_depth = atomic_dec_return(&q->mq_freeze_depth);
 	WARN_ON_ONCE(freeze_depth < 0);
 	if (!freeze_depth) {
-		percpu_ref_reinit(&q->mq_usage_counter);
+		percpu_ref_reinit(&q->q_usage_counter);
 		wake_up_all(&q->mq_freeze_wq);
 	}
 }
@@ -255,7 +237,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, int rw, gfp_t gfp,
 	struct blk_mq_alloc_data alloc_data;
 	int ret;
 
-	ret = blk_mq_queue_enter(q, gfp);
+	ret = blk_queue_enter(q, gfp);
 	if (ret)
 		return ERR_PTR(ret);
 
@@ -278,7 +260,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, int rw, gfp_t gfp,
 	}
 	blk_mq_put_ctx(ctx);
 	if (!rq) {
-		blk_mq_queue_exit(q);
+		blk_queue_exit(q);
 		return ERR_PTR(-EWOULDBLOCK);
 	}
 	return rq;
@@ -297,7 +279,7 @@ static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx,
 
 	clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
 	blk_mq_put_tag(hctx, tag, &ctx->last_tag);
-	blk_mq_queue_exit(q);
+	blk_queue_exit(q);
 }
 
 void blk_mq_free_hctx_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
@@ -1184,11 +1166,7 @@ static struct request *blk_mq_map_request(struct request_queue *q,
 	int rw = bio_data_dir(bio);
 	struct blk_mq_alloc_data alloc_data;
 
-	if (unlikely(blk_mq_queue_enter(q, GFP_KERNEL))) {
-		bio_io_error(bio);
-		return NULL;
-	}
-
+	blk_queue_enter_live(q);
 	ctx = blk_mq_get_ctx(q);
 	hctx = q->mq_ops->map_queue(q, ctx->cpu);
 
@@ -1979,14 +1957,6 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 		hctxs[i]->queue_num = i;
 	}
 
-	/*
-	 * Init percpu_ref in atomic mode so that it's faster to shutdown.
-	 * See blk_register_queue() for details.
-	 */
-	if (percpu_ref_init(&q->mq_usage_counter, blk_mq_usage_counter_release,
-			    PERCPU_REF_INIT_ATOMIC, GFP_KERNEL))
-		goto err_hctxs;
-
 	setup_timer(&q->timeout, blk_mq_rq_timer, (unsigned long) q);
 	blk_queue_rq_timeout(q, set->timeout ? set->timeout : 30 * HZ);
 
@@ -2062,8 +2032,6 @@ void blk_mq_free_queue(struct request_queue *q)
 	blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
 	blk_mq_free_hw_queues(q, set);
 
-	percpu_ref_exit(&q->mq_usage_counter);
-
 	kfree(q->mq_map);
 
 	q->mq_map = NULL;
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 3e44a9da2a13..61fc2633bbea 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -599,9 +599,8 @@ int blk_register_queue(struct gendisk *disk)
 	 */
 	if (!blk_queue_init_done(q)) {
 		queue_flag_set_unlocked(QUEUE_FLAG_INIT_DONE, q);
+		percpu_ref_switch_to_percpu(&q->q_usage_counter);
 		blk_queue_bypass_end(q);
-		if (q->mq_ops)
-			blk_mq_finish_init(q);
 	}
 
 	ret = blk_trace_init_sysfs(dev);
diff --git a/block/blk.h b/block/blk.h
index 98614ad37c81..5b2cd393afbe 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -72,6 +72,20 @@ void blk_dequeue_request(struct request *rq);
 void __blk_queue_free_tags(struct request_queue *q);
 bool __blk_end_bidi_request(struct request *rq, int error,
 			    unsigned int nr_bytes, unsigned int bidi_bytes);
+int blk_queue_enter(struct request_queue *q, gfp_t gfp);
+void blk_queue_exit(struct request_queue *q);
+void blk_freeze_queue(struct request_queue *q);
+
+static inline void blk_queue_enter_live(struct request_queue *q)
+{
+	/*
+	 * Given that running in generic_make_request() context
+	 * guarantees that a live reference against q_usage_counter has
+	 * been established, further references under that same context
+	 * need not check that the queue has been frozen (marked dead).
+	 */
+	percpu_ref_get(&q->q_usage_counter);
+}
 
 void blk_rq_timed_out_timer(unsigned long data);
 unsigned long blk_rq_timeout(unsigned long timeout);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 37d1602c4f7a..25ce763fbb81 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -167,7 +167,6 @@ enum {
 struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *);
 struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 						  struct request_queue *q);
-void blk_mq_finish_init(struct request_queue *q);
 int blk_mq_register_disk(struct gendisk *);
 void blk_mq_unregister_disk(struct gendisk *);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 99da9ebc7377..0a0def66c61e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -450,7 +450,7 @@ struct request_queue {
 #endif
 	struct rcu_head		rcu_head;
 	wait_queue_head_t	mq_freeze_wq;
-	struct percpu_ref	mq_usage_counter;
+	struct percpu_ref	q_usage_counter;
 	struct list_head	all_q_node;
 
 	struct blk_mq_tag_set	*tag_set;
-- 
2.1.0




^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH 01/15] avr32: convert to asm-generic/memory_model.h
  2015-09-24 15:10   ` Christoph Hellwig
@ 2015-09-26  0:36     ` Dan Williams
  2015-09-26 20:10       ` Christoph Hellwig
  0 siblings, 1 reply; 39+ messages in thread
From: Dan Williams @ 2015-09-26  0:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, linux-fsdevel, Linux MM, linux-kernel,
	linux-nvdimm, Tony Luck

On Thu, Sep 24, 2015 at 8:10 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Wed, Sep 23, 2015 at 12:41:18AM -0400, Dan Williams wrote:
>> Switch avr32/include/asm/page.h to use the common definitions for
>> pfn_to_page(), page_to_pfn(), and ARCH_PFN_OFFSET.
>
> This was the last architecture not using asm-generic/memory_model.h,
> so it might be time to move it to linux/ or even fold it into an
> existing header.

I went to go attempt this, but ia64 is still a holdout, as its
DISCONTIGMEM setup can't use the generic memory_model definitions.

#ifdef CONFIG_DISCONTIGMEM
# define page_to_pfn(page)      ((unsigned long) (page - vmem_map))
# define pfn_to_page(pfn)       (vmem_map + (pfn))
#else
# include <asm-generic/memory_model.h>
#endif
#else
# include <asm-generic/memory_model.h>
#endif


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 01/15] avr32: convert to asm-generic/memory_model.h
  2015-09-26  0:36     ` Dan Williams
@ 2015-09-26 20:10       ` Christoph Hellwig
  2015-09-28 18:44         ` Luck, Tony
  0 siblings, 1 reply; 39+ messages in thread
From: Christoph Hellwig @ 2015-09-26 20:10 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Andrew Morton, linux-fsdevel, Linux MM,
	linux-kernel, linux-nvdimm, Tony Luck

On Fri, Sep 25, 2015 at 05:36:36PM -0700, Dan Williams wrote:
> I went to go attempt this, but ia64 is still a holdout, as its
> DISCONTIGMEM setup can't use the generic memory_model definitions.
> 
> #ifdef CONFIG_DISCONTIGMEM
> # define page_to_pfn(page)      ((unsigned long) (page - vmem_map))
> # define pfn_to_page(pfn)       (vmem_map + (pfn))
> #else
> # include <asm-generic/memory_model.h>
> #endif
> #else
> # include <asm-generic/memory_model.h>
> #endif

Seems like we should simply introduce a CONFIG_VMEM_MAP for ia64
to get this started.  Does my memory trick me or did we use to have
vmem_map on other architectures as well but managed to get rid of it
everywhere but on ia64?


^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: [PATCH 01/15] avr32: convert to asm-generic/memory_model.h
  2015-09-26 20:10       ` Christoph Hellwig
@ 2015-09-28 18:44         ` Luck, Tony
  0 siblings, 0 replies; 39+ messages in thread
From: Luck, Tony @ 2015-09-28 18:44 UTC (permalink / raw)
  To: Christoph Hellwig, Williams, Dan J
  Cc: Andrew Morton, linux-fsdevel, Linux MM, linux-kernel, linux-nvdimm

> Seems like we should simply introduce a CONFIG_VMEM_MAP for ia64
> to get this started.  Does my memory trick me or did we used to have
> vmem_map on other architectures as well but managed to get rid of it
> everywhere but on ia64?

I think ia64 hung onto this because of the SGI sn1 platforms. They were
even more discontiguous than even SPARSEMEM_EXTREME could handle.

-Tony


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 14/15] mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup
  2015-09-23  4:42 ` [PATCH 14/15] mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup Dan Williams
@ 2015-10-02 21:21   ` Logan Gunthorpe
  2015-10-02 21:53     ` Dan Williams
  0 siblings, 1 reply; 39+ messages in thread
From: Logan Gunthorpe @ 2015-10-02 21:21 UTC (permalink / raw)
  To: Dan Williams
  Cc: akpm, Dave Hansen, linux-nvdimm, linux-kernel, linux-mm,
	Alexander Viro, linux-fsdevel, Matthew Wilcox, Ross Zwisler,
	Stephen Bates

Hi Dan,

We've been doing some experimenting and testing with this patchset.
Specifically, we are trying to use your ZONE_DEVICE work to enable
peer to peer PCIe transfers. This is actually working pretty well
(though we're still testing and working through some things).

However, we've found a couple of issues:

On Wed, Sep 23, 2015 at 12:42:27AM -0400, Dan Williams wrote:
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 3d6baa7d4534..20097e7b679a 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -49,12 +49,16 @@ struct page {
>  					 * updated asynchronously */
>  	union {
>  		struct address_space *mapping;	/* If low bit clear, points to
> -						 * inode address_space, or NULL.
> +						 * inode address_space, unless
> +						 * the page is in ZONE_DEVICE
> +						 * then it points to its parent
> +						 * dev_pagemap, otherwise NULL.
>  						 * If page mapped as anonymous
>  						 * memory, low bit is set, and
>  						 * it points to anon_vma object:
>  						 * see PAGE_MAPPING_ANON below.
>  						 */
> +		struct dev_pagemap *pgmap;
>  		void *s_mem;			/* slab first object */
>  	};


When you add to this union and override the mapping value, we see bugs
in calls to set_page_dirty when it tries to dereference mapping. I believe
a change to page_mapping is required such as the patch that's at the end of
this email.


> diff --git a/mm/gup.c b/mm/gup.c
> index a798293fc648..1064e9a489a4 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -98,7 +98,16 @@ retry:
>  	}
>
>  	page = vm_normal_page(vma, address, pte);
> -	if (unlikely(!page)) {
> +	if (!page && pte_devmap(pte) && (flags & FOLL_GET)) {
> +		/*
> +		 * Only return device mapping pages in the FOLL_GET case since
> +		 * they are only valid while holding the pgmap reference.
> +		 */
> +		if (get_dev_pagemap(pte_pfn(pte), NULL))
> +			page = pte_page(pte);
> +		else
> +			goto no_page;
> +	} else if (unlikely(!page)) {

I've found that if a driver creates a ZONE_DEVICE mapping but doesn't
create the pagemap (using devm_register_pagemap) then the get_user_pages code
will go into an infinite loop. I'm not really sure if this is an issue or
not but it seems a bit undesirable for a buggy driver to be able to cause this.

My thoughts are that either devm_register_pagemap needs to be done by
devm_memremap_pages so a driver cannot use one without the other,
or the GUP code needs to return EFAULT if no pagemap was registered so
it doesn't loop forever.

Thanks!

Logan



diff --git a/mm/util.c b/mm/util.c
index 68ff8a5..19af683 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -368,6 +368,9 @@ struct address_space *page_mapping(struct page *page)
 		return swap_address_space(entry);
 	}

+	if (unlikely(is_zone_device_page(page)))
+		return NULL;
+
 	mapping = (unsigned long)page->mapping;
 	if (mapping & PAGE_MAPPING_FLAGS)
 		return NULL;


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH 14/15] mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup
  2015-10-02 21:21   ` Logan Gunthorpe
@ 2015-10-02 21:53     ` Dan Williams
  2015-10-02 22:14       ` Logan Gunthorpe
  2015-10-02 22:42       ` Logan Gunthorpe
  0 siblings, 2 replies; 39+ messages in thread
From: Dan Williams @ 2015-10-02 21:53 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Andrew Morton, Dave Hansen, linux-nvdimm, linux-kernel, Linux MM,
	Alexander Viro, linux-fsdevel, Matthew Wilcox, Ross Zwisler,
	Stephen Bates

On Fri, Oct 2, 2015 at 2:21 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
> Hi Dan,
>
> We've been doing some experimenting and testing with this patchset.
> Specifically, we are trying to use your ZONE_DEVICE work to enable
> peer to peer PCIe transfers. This is actually working pretty well
> (though we're still testing and working through some things).

Hmm, I didn't have peer-to-peer PCI-E in mind for this mechanism, but
the test report is welcome nonetheless.  The definition of dma_addr_t
is the device view of host memory, not necessarily the device view of
a peer device's memory range, so I expect you'll run into issues with
IOMMUs and other parts of the kernel that assume this definition.

>
> However, we've found a couple of issues:
>
> On Wed, Sep 23, 2015 at 12:42:27AM -0400, Dan Williams wrote:
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 3d6baa7d4534..20097e7b679a 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -49,12 +49,16 @@ struct page {
>>                                        * updated asynchronously */
>>       union {
>>               struct address_space *mapping;  /* If low bit clear, points to
>> -                                              * inode address_space, or NULL.
>> +                                              * inode address_space, unless
>> +                                              * the page is in ZONE_DEVICE
>> +                                              * then it points to its parent
>> +                                              * dev_pagemap, otherwise NULL.
>>                                                * If page mapped as anonymous
>>                                                * memory, low bit is set, and
>>                                                * it points to anon_vma object:
>>                                                * see PAGE_MAPPING_ANON below.
>>                                                */
>> +             struct dev_pagemap *pgmap;
>>               void *s_mem;                    /* slab first object */
>>       };
>
>
> When you add to this union and override the mapping value, we see bugs
> in calls to set_page_dirty when it tries to dereference mapping. I believe
> a change to page_mapping is required such as the patch that's at the end of
> this email.

Yes, this location for dev_pagemap will not work.  I've since moved it
to a union with the lru list_head since ZONE_DEVICE pages
should always have an elevated page count and never land on a slab
allocator lru.

>> diff --git a/mm/gup.c b/mm/gup.c
>> index a798293fc648..1064e9a489a4 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -98,7 +98,16 @@ retry:
>>       }
>>
>>       page = vm_normal_page(vma, address, pte);
>> -     if (unlikely(!page)) {
>> +     if (!page && pte_devmap(pte) && (flags & FOLL_GET)) {
>> +             /*
>> +              * Only return device mapping pages in the FOLL_GET case since
>> +              * they are only valid while holding the pgmap reference.
>> +              */
>> +             if (get_dev_pagemap(pte_pfn(pte), NULL))
>> +                     page = pte_page(pte);
>> +             else
>> +                     goto no_page;
>> +     } else if (unlikely(!page)) {
>
> I've found that if a driver creates a ZONE_DEVICE mapping but doesn't
> create the pagemap (using devm_register_pagemap) then the get_user_pages code
> will go into an infinite loop. I'm not really sure if this is an issue or
> not but it seems a bit undesirable for a buggy driver to be able to cause this.
>
> My thoughts are that either devm_register_pagemap needs to be done by
> devm_memremap_pages so a driver cannot use one without the other,
> or the GUP code needs to return EFAULT if no pagemap was registered so
> it doesn't loop forever.

Exactly, we should fail get_user_pages() with -EFAULT in that case,
since we don't have a mechanism to pin down the mapping.  I'll track
down what's causing the loop.
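
(For concreteness, a sketch of how that branch in follow_page_pte()
could fail instead of retrying -- using the helpers from this patch;
not the final fix:)

	if (!page && pte_devmap(pte) && (flags & FOLL_GET)) {
		/*
		 * Device pages are only valid while the hosting
		 * dev_pagemap is pinned; if no pagemap was registered
		 * there is nothing to pin, so fail the fault instead
		 * of retrying forever.
		 */
		if (get_dev_pagemap(pte_pfn(pte), NULL))
			page = pte_page(pte);
		else
			return ERR_PTR(-EFAULT);
	} else if (unlikely(!page)) {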


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 14/15] mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup
  2015-10-02 21:53     ` Dan Williams
@ 2015-10-02 22:14       ` Logan Gunthorpe
  2015-10-02 22:42       ` Logan Gunthorpe
  1 sibling, 0 replies; 39+ messages in thread
From: Logan Gunthorpe @ 2015-10-02 22:14 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Dave Hansen, linux-nvdimm, linux-kernel, Linux MM,
	Alexander Viro, linux-fsdevel, Matthew Wilcox, Ross Zwisler,
	Stephen Bates

Hi Dan,

Good to know you've already addressed the struct page issue. We'll watch 
out for an updated patchset to try.


On 02/10/15 03:53 PM, Dan Williams wrote:
> Hmm, I didn't have peer-to-peer PCI-E in mind for this mechanism, but
> the test report is welcome nonetheless.  The definition of dma_addr_t
> is the device view of host memory, not necessarily the device view of
> a peer device's memory range, so I expect you'll run into issues with
> IOMMUs and other parts of the kernel that assume this definition.

Yeah, we've actually been doing this with a number of more "hacky" 
techniques for some time. ZONE_DEVICE just provides us with a much 
cleaner way to set this up that doesn't require patching around 
get_user_pages in various places in the kernel.

We've never had any issues with the IOMMU getting in the way (at least 
on Intel x86). My understanding has always been that the IOMMU sits 
between a PCI card and main memory; it doesn't get in the way of 
peer-to-peer transfers. Though admittedly, I don't have a complete 
understanding of how the IOMMU works in the kernel, and I'm only 
speaking from experimental experience. We've never actually tried this 
on other architectures.

Thanks,

Logan


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 14/15] mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup
  2015-10-02 21:53     ` Dan Williams
  2015-10-02 22:14       ` Logan Gunthorpe
@ 2015-10-02 22:42       ` Logan Gunthorpe
  2015-10-02 22:55         ` Dan Williams
  1 sibling, 1 reply; 39+ messages in thread
From: Logan Gunthorpe @ 2015-10-02 22:42 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Dave Hansen, linux-nvdimm, linux-kernel, Linux MM,
	Alexander Viro, linux-fsdevel, Matthew Wilcox, Ross Zwisler,
	Stephen Bates



On 02/10/15 03:53 PM, Dan Williams wrote:
> Yes, this location for dev_pagemap will not work.  I've since moved it
> to a union with the lru list_head, since ZONE_DEVICE pages should
> always have an elevated page count and never land on a slab allocator
> lru.

Oh, also, I was actually hoping to make use of the lru list_head in the 
future with ZONE_DEVICE memory. One thought I had was that, once we have 
a PCIe device with a BAR space, we'd then need a way of allocating these 
buffers when user space needs them. The simple approach I had in mind 
was to use the lru list head to store lists of used and unused 
pages -- though there are probably other solutions to this that don't 
require using struct pages.
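
(Very roughly, the idea was something like the following -- hypothetical 
helper names, locking omitted, just to illustrate the scheme, assuming 
page->lru stays available for ZONE_DEVICE pages:)

	#include <linux/list.h>
	#include <linux/mm_types.h>

	/* unused BAR-backed pages, threaded through page->lru */
	static LIST_HEAD(p2p_free_pages);

	/* hand out an unused page to back a userspace buffer */
	static struct page *p2p_alloc_page(void)
	{
		struct page *page;

		if (list_empty(&p2p_free_pages))
			return NULL;
		page = list_first_entry(&p2p_free_pages, struct page, lru);
		list_del(&page->lru);
		return page;
	}

	/* return a page to the unused list when userspace is done with it */
	static void p2p_free_page(struct page *page)
	{
		list_add(&page->lru, &p2p_free_pages);
	}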

Thanks,

Logan


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 14/15] mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup
  2015-10-02 22:42       ` Logan Gunthorpe
@ 2015-10-02 22:55         ` Dan Williams
  0 siblings, 0 replies; 39+ messages in thread
From: Dan Williams @ 2015-10-02 22:55 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Andrew Morton, Dave Hansen, linux-nvdimm, linux-kernel, Linux MM,
	Alexander Viro, linux-fsdevel, Matthew Wilcox, Ross Zwisler,
	Stephen Bates

On Fri, Oct 2, 2015 at 3:42 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
> On 02/10/15 03:53 PM, Dan Williams wrote:
>>
>> Yes, this location for dev_pagemap will not work.  I've since moved it
>> to a union with the lru list_head, since ZONE_DEVICE pages should
>> always have an elevated page count and never land on a slab allocator
>> lru.
>
>
> Oh, also, I was actually hoping to make use of the lru list_head in the
> future with ZONE_DEVICE memory. One thought I had was once we have a PCIe
> device with a BAR space, we'd then need to have a way of allocating these
> buffers when user space needs them. The simple way I was thinking was to
> just use the lru list head to store lists of used and unused pages -- though
> there are probably other solutions to this that don't require using struct
> pages.
>

The current assumption is that ZONE_DEVICE ranges are being managed by
a physical address allocator.  In the case of persistent memory this
is the block allocator of the filesystem sitting on top of a pmem
block device.  The struct page is really only there to facilitate
in-flight I/O requests.  If it weren't for the complexity we'd allocate
them on demand.  So your "unused" case should be a raw pfn, and then,
for the time-limited duration it is in use as a struct page, it should
hold a reference against the mapping.
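
(In other words, very roughly -- hypothetical helper names, only meant
to sketch the lifetime described above, assuming the
{get|put}_dev_pagemap() calls from this patch and the page->pgmap
link:)

	#include <linux/mm.h>

	/* promote a raw pfn to a struct page for the duration of an I/O */
	static struct page *devmem_use_pfn(unsigned long pfn)
	{
		/* fails once the hosting device mapping is being torn down */
		if (!get_dev_pagemap(pfn, NULL))
			return NULL;
		return pfn_to_page(pfn);
	}

	/* drop the pin when the I/O that needed the struct page completes */
	static void devmem_release_page(struct page *page)
	{
		put_dev_pagemap(page->pgmap);
	}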


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 10/15] block, dax: fix lifetime of in-kernel dax mappings
  2015-09-23  4:42 ` [PATCH 10/15] block, dax: fix lifetime of in-kernel dax mappings Dan Williams
@ 2015-10-07 22:56   ` Logan Gunthorpe
  2015-10-09 21:12     ` Dan Williams
  0 siblings, 1 reply; 39+ messages in thread
From: Logan Gunthorpe @ 2015-10-07 22:56 UTC (permalink / raw)
  To: Dan Williams, akpm
  Cc: Jens Axboe, Boaz Harrosh, linux-nvdimm, linux-kernel, linux-mm,
	linux-fsdevel, Ross Zwisler, Christoph Hellwig, Stephen Bates

Hi Dan,

We've uncovered another issue during testing with these patches. We get 
a kernel panic sometimes just while using a DAX filesystem. I've traced 
the issue back to this patch. (There's a stack trace at the end of this 
email.)

On 22/09/15 10:42 PM, Dan Williams wrote:
> +static void dax_unmap_bh(const struct buffer_head *bh, void __pmem *addr)
> +{
> +	struct block_device *bdev = bh->b_bdev;
> +	struct request_queue *q = bdev->bd_queue;
> +
> +	if (IS_ERR(addr))
> +		return;
> +	blk_dax_put(q);
>   }
>
> @@ -127,9 +159,8 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
>   			if (pos == bh_max) {
>   				bh->b_size = PAGE_ALIGN(end - pos);
>   				bh->b_state = 0;
> -				retval = get_block(inode, block, bh,
> -						   iov_iter_rw(iter) == WRITE);
> -				if (retval)
> +				rc = get_block(inode, block, bh, rw == WRITE);
> +				if (rc)
>   					break;
>   				if (!buffer_size_valid(bh))
>   					bh->b_size = 1 << blkbits;
> @@ -178,8 +213,9 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
>
>   	if (need_wmb)
>   		wmb_pmem();
> +	dax_unmap_bh(bh, kmap);
>
> -	return (pos == start) ? retval : pos - start;
> +	return (pos == start) ? rc : pos - start;
>   }

The problem is that if get_block fails and returns an error code, the 
code will still call dax_unmap_bh, which tries to dereference 
bh->b_bdev. However, because get_block failed, that pointer is NULL. 
Maybe a null check in dax_unmap_bh would be sufficient?
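
Something along these lines is what I mean -- only a sketch of one
possible shape of the check, not a tested fix:

	static void dax_unmap_bh(const struct buffer_head *bh, void __pmem *addr)
	{
		struct block_device *bdev = bh->b_bdev;

		/* get_block() may have failed before filling in bh->b_bdev */
		if (IS_ERR(addr) || !bdev)
			return;
		blk_dax_put(bdev->bd_queue);
	}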


Thanks,

Logan

--

> [   35.391790] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a0
> [   35.393306] IP: [<ffffffff811d1731>] dax_unmap_bh+0x41/0x70
> [   35.394253] PGD 7c7ed067 PUD 7c01f067 PMD 0
> [   35.395020] Oops: 0000 [#1] SMP
> [   35.395597] Modules linked in: mtramonb(O)
> [   35.396320] CPU: 0 PID: 1501 Comm: dd Tainted: G           O    4.3.0-rc2+ #51
> [   35.397500] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
> [   35.399728] task: ffff88007bc6d600 ti: ffff880079250000 task.ti: ffff880079250000
> [   35.401006] RIP: 0010:[<ffffffff811d1731>]  [<ffffffff811d1731>] dax_unmap_bh+0x41/0x70
> [   35.402402] RSP: 0018:ffff880079253ab8  EFLAGS: 00010246
> [   35.403321] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000012
> [   35.404550] RDX: 0000000000000007 RSI: 0000000000000246 RDI: ffffffff8181f630
> [   35.405783] RBP: ffff880079253ac8 R08: 000000000000000a R09: 000000000000fffe
> [   35.407011] R10: ffff88007fc1ad90 R11: 0000000000006abc R12: fffffffffffffffb
> [   35.408237] R13: 000000000038e000 R14: 000000000038e000 R15: 000000000038e000
> [   35.409466] FS:  00007f0e81476700(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
> [   35.410830] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   35.411796] CR2: 00000000000000a0 CR3: 000000007926f000 CR4: 00000000000006f0
> [   35.412990] Stack:
> [   35.413340]  00000000ffffffe4 ffff880079253cf0 ffff880079253be0 ffffffff811d2259
> [   35.414670]  0000000000000246 ffffffff81203120 ffff880079253e68 0000000000000000
> [   35.415981]  000000017c394800 000000000038e000 0000000100000000 fffffffffffffffb
> [   35.417298] Call Trace:
> [   35.417727]  [<ffffffff811d2259>] dax_do_io+0x199/0x700
> [   35.418615]  [<ffffffff81203120>] ? _ext4_get_block+0x200/0x200
> [   35.419819]  [<ffffffff81249700>] ? jbd2_journal_stop+0x60/0x390
> [   35.420886]  [<ffffffff8123e7fd>] ext4_ind_direct_IO+0x8d/0x410
> [   35.421908]  [<ffffffff8120205a>] ext4_direct_IO+0x2da/0x540
> [   35.422859]  [<ffffffff8120887c>] ? ext4_dirty_inode+0x5c/0x70
> [   35.423841]  [<ffffffff8113306a>] generic_file_direct_write+0xaa/0x170
> [   35.424931]  [<ffffffff811331f2>] __generic_file_write_iter+0xc2/0x1f0
> [   35.426027]  [<ffffffff811fcdcb>] ext4_file_write_iter+0x13b/0x420
> [   35.427066]  [<ffffffff81084ca2>] ? pick_next_entity+0xb2/0x190
> [   35.428061]  [<ffffffff811891f7>] __vfs_write+0xa7/0xf0
> [   35.428940]  [<ffffffff81189b59>] vfs_write+0xa9/0x190
> [   35.429810]  [<ffffffff81189a94>] ? vfs_read+0x114/0x130
> [   35.430706]  [<ffffffff8118a786>] SyS_write+0x46/0xa0
> [   35.431561]  [<ffffffff815f742e>] entry_SYSCALL_64_fastpath+0x12/0x71
> [   35.432637] Code: fe 48 c7 c7 4c 63 7e 81 e8 11 ec f5 ff 48 8b 5b 30 48 c7 c7 52 63 7e 81 31 c0 48 89 de e8 fc eb f5 ff 31 c0 48 c7 c7 30 f6 81 81 <48> 8b 9b a0 00 00 00 e8 e7 eb f5 ff 49 81 fc 00 f0 ff ff 77 08
> [   35.437029] RIP  [<ffffffff811d1731>] dax_unmap_bh+0x41/0x70
> [   35.438005]  RSP <ffff880079253ab8>
> [   35.438608] CR2: 00000000000000a0
> [   35.439194] ---[ end trace 323695b29b46dd96 ]---


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 10/15] block, dax: fix lifetime of in-kernel dax mappings
  2015-10-07 22:56   ` Logan Gunthorpe
@ 2015-10-09 21:12     ` Dan Williams
  0 siblings, 0 replies; 39+ messages in thread
From: Dan Williams @ 2015-10-09 21:12 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Andrew Morton, Jens Axboe, Boaz Harrosh, linux-nvdimm,
	linux-kernel, Linux MM, linux-fsdevel, Ross Zwisler,
	Christoph Hellwig, Stephen Bates

On Wed, Oct 7, 2015 at 3:56 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
> Hi Dan,
>
> We've uncovered another issue during testing with these patches. We get a
> kernel panic sometimes just while using a DAX filesystem. I've traced the
> issue back to this patch. (There's a stack trace at the end of this email.)
>
> On 22/09/15 10:42 PM, Dan Williams wrote:
>>
>> +static void dax_unmap_bh(const struct buffer_head *bh, void __pmem *addr)
>> +{
>> +       struct block_device *bdev = bh->b_bdev;
>> +       struct request_queue *q = bdev->bd_queue;
>> +
>> +       if (IS_ERR(addr))
>> +               return;
>> +       blk_dax_put(q);
>>   }
>>
>> @@ -127,9 +159,8 @@ static ssize_t dax_io(struct inode *inode, struct
>> iov_iter *iter,
>>                         if (pos == bh_max) {
>>                                 bh->b_size = PAGE_ALIGN(end - pos);
>>                                 bh->b_state = 0;
>> -                               retval = get_block(inode, block, bh,
>> -                                                  iov_iter_rw(iter) ==
>> WRITE);
>> -                               if (retval)
>> +                               rc = get_block(inode, block, bh, rw ==
>> WRITE);
>> +                               if (rc)
>>                                         break;
>>                                 if (!buffer_size_valid(bh))
>>                                         bh->b_size = 1 << blkbits;
>> @@ -178,8 +213,9 @@ static ssize_t dax_io(struct inode *inode, struct
>> iov_iter *iter,
>>
>>         if (need_wmb)
>>                 wmb_pmem();
>> +       dax_unmap_bh(bh, kmap);
>>
>> -       return (pos == start) ? retval : pos - start;
>> +       return (pos == start) ? rc : pos - start;
>>   }
>
>
> The problem is that if get_block fails and returns an error code, the code
> will still call dax_unmap_bh, which tries to dereference bh->b_bdev.
> However, because get_block failed, that pointer is NULL. Maybe a null check
> in dax_unmap_bh would be sufficient?

Thanks for the report, I have this fixed up in v2.  Will post shortly.


^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2015-10-09 21:12 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-23  4:41 [PATCH 00/15] get_user_pages() for dax mappings Dan Williams
2015-09-23  4:41 ` [PATCH 01/15] avr32: convert to asm-generic/memory_model.h Dan Williams
2015-09-24 15:10   ` Christoph Hellwig
2015-09-26  0:36     ` Dan Williams
2015-09-26 20:10       ` Christoph Hellwig
2015-09-28 18:44         ` Luck, Tony
2015-09-23  4:41 ` [PATCH 02/15] hugetlb: fix compile error on tile Dan Williams
2015-09-23  4:41 ` [PATCH 03/15] frv: fix compiler warning from definition of __pmd() Dan Williams
2015-09-23  4:41 ` [PATCH 04/15] x86, mm: quiet arch_add_memory() Dan Williams
2015-09-24 15:10   ` Christoph Hellwig
2015-09-23  4:41 ` [PATCH 05/15] pmem: kill memremap_pmem() Dan Williams
2015-09-24 15:11   ` Christoph Hellwig
2015-09-23  4:41 ` [PATCH 06/15] devm_memunmap: use devres_release() Dan Williams
2015-09-24 15:13   ` Christoph Hellwig
2015-09-23  4:41 ` [PATCH 07/15] devm_memremap: convert to return ERR_PTR Dan Williams
2015-09-24 15:13   ` Christoph Hellwig
2015-09-23  4:41 ` [PATCH 08/15] block, dax, pmem: reference counting infrastructure Dan Williams
2015-09-24 15:15   ` Christoph Hellwig
2015-09-25  0:03     ` Dan Williams
2015-09-25 11:32       ` Christoph Hellwig
2015-09-25 21:08         ` Williams, Dan J
2015-09-23  4:42 ` [PATCH 09/15] block, pmem: fix null pointer de-reference on shutdown, check for queue death Dan Williams
2015-09-23  4:42 ` [PATCH 10/15] block, dax: fix lifetime of in-kernel dax mappings Dan Williams
2015-10-07 22:56   ` Logan Gunthorpe
2015-10-09 21:12     ` Dan Williams
2015-09-23  4:42 ` [PATCH 11/15] mm, dax, pmem: introduce __pfn_t Dan Williams
2015-09-23 16:02   ` Dave Hansen
2015-09-23 23:36     ` Williams, Dan J
2015-09-23  4:42 ` [PATCH 12/15] mm, dax, gpu: convert vm_insert_mixed to __pfn_t, introduce _PAGE_DEVMAP Dan Williams
2015-09-23 13:47   ` Geert Uytterhoeven
2015-09-23 16:59     ` Dan Williams
2015-09-23  4:42 ` [PATCH 13/15] mm, dax: convert vmf_insert_pfn_pmd() to __pfn_t Dan Williams
2015-09-23  4:42 ` [PATCH 14/15] mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup Dan Williams
2015-10-02 21:21   ` Logan Gunthorpe
2015-10-02 21:53     ` Dan Williams
2015-10-02 22:14       ` Logan Gunthorpe
2015-10-02 22:42       ` Logan Gunthorpe
2015-10-02 22:55         ` Dan Williams
2015-09-23  4:42 ` [PATCH 15/15] mm, x86: get_user_pages() for dax mappings Dan Williams
