linux-kernel.vger.kernel.org archive mirror
* [PATCH v5 0/5] introduce __pfn_t for unmapped pfn I/O and DAX lifetime
@ 2015-08-13  3:00 Dan Williams
  2015-08-13  3:01 ` [PATCH v5 1/5] mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h Dan Williams
                   ` (4 more replies)
  0 siblings, 5 replies; 21+ messages in thread
From: Dan Williams @ 2015-08-13  3:00 UTC (permalink / raw)
  To: linux-kernel; +Cc: axboe, riel, linux-nvdimm, linux-mm, mgorman, torvalds, hch

Changes since v4 [1]:

1/ Allow up to PAGE_SHIFT bits in PFN_ flags.  Previously the __pfn_t
   value was a union with a 'struct page *', but now __pfn_t_to_page()
   internally does a pfn_to_page() instead of type-punning the value.
   (Linus, Matthew)

2/ Move the definition to include/linux/mm.h and squash the
   kmap_atomic_pfn_t() definition into the same patch. (Christoph)

3/ Kill dax_get_pfn().  Now replaced with dax_map_bh() (Matthew)

4/ The scatterlist cleanup patches are moved to their own series being
   carried by Christoph.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-June/001094.html

---

We want persistent memory to have 4 modes of access:

1/ Block device: persistent memory treated as a ram disk (done)

2/ DAX: userspace mmap (done)

3/ Kernel "page-less". (this series)

4/ Kernel and userspace references to page-mapped persistent memory
   (future series)

The "kernel 'page-less'" case leverages the fact that a 'struct page'
object is not necessarily required for describing a DMA transfer from a
device to a persistent memory address.  A pfn will do, but code needs to
be careful not to perform a pfn_to_page() operation on unmapped
persistent memory.  The __pfn_t type enforces that safety, and
kmap_atomic_pfn_t() covers the cases where the I/O stack needs to touch
the buffer on its way to the low-level device driver (i.e. the current
usages of kmap_atomic() in the block layer).

A subsequent patch series will add struct page coverage for persistent
"device" memory.

We also use kmap_atomic_pfn_t() to solve the race between pmem driver
unbind and in-flight usage in DAX.  rcu_read_lock() protects the driver
from unbinding while a mapping is held.

---

Christoph Hellwig (1):
      mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h

Dan Williams (4):
      allow mapping page-less memremaped areas into KVA
      dax: drop size parameter to ->direct_access()
      dax: fix mapping lifetime handling, convert to __pfn_t + kmap_atomic_pfn_t()
      scatterlist: convert to __pfn_t


 arch/arm/include/asm/memory.h       |    6 --
 arch/arm64/include/asm/memory.h     |    6 --
 arch/powerpc/platforms/Kconfig      |    1 
 arch/powerpc/sysdev/axonram.c       |   24 +++++--
 arch/unicore32/include/asm/memory.h |    6 --
 drivers/block/brd.c                 |    9 +--
 drivers/nvdimm/Kconfig              |    1 
 drivers/nvdimm/pmem.c               |   24 ++++---
 drivers/s390/block/Kconfig          |    1 
 drivers/s390/block/dcssblk.c        |   23 ++++++-
 fs/Kconfig                          |    1 
 fs/block_dev.c                      |    4 +
 fs/dax.c                            |   79 +++++++++++++++++-------
 include/asm-generic/memory_model.h  |    6 ++
 include/linux/blkdev.h              |    7 +-
 include/linux/kmap_pfn.h            |   31 +++++++++
 include/linux/mm.h                  |   78 +++++++++++++++++++++++
 include/linux/scatterlist.h         |  111 +++++++++++++++++++++++----------
 mm/Kconfig                          |    3 +
 mm/Makefile                         |    1 
 mm/kmap_pfn.c                       |  117 +++++++++++++++++++++++++++++++++++
 samples/kfifo/dma-example.c         |    8 +-
 22 files changed, 435 insertions(+), 112 deletions(-)
 create mode 100644 include/linux/kmap_pfn.h
 create mode 100644 mm/kmap_pfn.c


* [PATCH v5 1/5] mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h
  2015-08-13  3:00 [PATCH v5 0/5] introduce __pfn_t for unmapped pfn I/O and DAX lifetime Dan Williams
@ 2015-08-13  3:01 ` Dan Williams
  2015-08-13  3:01 ` [PATCH v5 2/5] allow mapping page-less memremaped areas into KVA Dan Williams
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2015-08-13  3:01 UTC (permalink / raw)
  To: linux-kernel; +Cc: axboe, riel, linux-nvdimm, linux-mm, mgorman, torvalds, hch

From: Christoph Hellwig <hch@lst.de>

Three architectures already define these, and we'll need them
generically soon.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/arm/include/asm/memory.h       |    6 ------
 arch/arm64/include/asm/memory.h     |    6 ------
 arch/unicore32/include/asm/memory.h |    6 ------
 include/asm-generic/memory_model.h  |    6 ++++++
 4 files changed, 6 insertions(+), 18 deletions(-)

diff --git a/arch/arm/include/asm/memory.h b/arch/arm/include/asm/memory.h
index b7f6fb462ea0..98d58bb04ac5 100644
--- a/arch/arm/include/asm/memory.h
+++ b/arch/arm/include/asm/memory.h
@@ -119,12 +119,6 @@
 #endif
 
 /*
- * Convert a physical address to a Page Frame Number and back
- */
-#define	__phys_to_pfn(paddr)	((unsigned long)((paddr) >> PAGE_SHIFT))
-#define	__pfn_to_phys(pfn)	((phys_addr_t)(pfn) << PAGE_SHIFT)
-
-/*
  * Convert a page to/from a physical address
  */
 #define page_to_phys(page)	(__pfn_to_phys(page_to_pfn(page)))
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index f800d45ea226..d808bb688751 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -81,12 +81,6 @@
 #define __phys_to_virt(x)	((unsigned long)((x) - PHYS_OFFSET + PAGE_OFFSET))
 
 /*
- * Convert a physical address to a Page Frame Number and back
- */
-#define	__phys_to_pfn(paddr)	((unsigned long)((paddr) >> PAGE_SHIFT))
-#define	__pfn_to_phys(pfn)	((phys_addr_t)(pfn) << PAGE_SHIFT)
-
-/*
  * Convert a page to/from a physical address
  */
 #define page_to_phys(page)	(__pfn_to_phys(page_to_pfn(page)))
diff --git a/arch/unicore32/include/asm/memory.h b/arch/unicore32/include/asm/memory.h
index debafc40200a..3bb0a29fd2d7 100644
--- a/arch/unicore32/include/asm/memory.h
+++ b/arch/unicore32/include/asm/memory.h
@@ -61,12 +61,6 @@
 #endif
 
 /*
- * Convert a physical address to a Page Frame Number and back
- */
-#define	__phys_to_pfn(paddr)	((paddr) >> PAGE_SHIFT)
-#define	__pfn_to_phys(pfn)	((pfn) << PAGE_SHIFT)
-
-/*
  * Convert a page to/from a physical address
  */
 #define page_to_phys(page)	(__pfn_to_phys(page_to_pfn(page)))
diff --git a/include/asm-generic/memory_model.h b/include/asm-generic/memory_model.h
index 14909b0b9cae..f20f407ce45d 100644
--- a/include/asm-generic/memory_model.h
+++ b/include/asm-generic/memory_model.h
@@ -69,6 +69,12 @@
 })
 #endif /* CONFIG_FLATMEM/DISCONTIGMEM/SPARSEMEM */
 
+/*
+ * Convert a physical address to a Page Frame Number and back
+ */
+#define	__phys_to_pfn(paddr)	((unsigned long)((paddr) >> PAGE_SHIFT))
+#define	__pfn_to_phys(pfn)	((phys_addr_t)(pfn) << PAGE_SHIFT)
+
 #define page_to_pfn __page_to_pfn
 #define pfn_to_page __pfn_to_page
 



* [PATCH v5 2/5] allow mapping page-less memremaped areas into KVA
  2015-08-13  3:00 [PATCH v5 0/5] introduce __pfn_t for unmapped pfn I/O and DAX lifetime Dan Williams
  2015-08-13  3:01 ` [PATCH v5 1/5] mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h Dan Williams
@ 2015-08-13  3:01 ` Dan Williams
  2015-08-13  5:58   ` Boaz Harrosh
  2015-08-13 17:35   ` Matthew Wilcox
  2015-08-13  3:01 ` [PATCH v5 3/5] dax: drop size parameter to ->direct_access() Dan Williams
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 21+ messages in thread
From: Dan Williams @ 2015-08-13  3:01 UTC (permalink / raw)
  To: linux-kernel; +Cc: axboe, riel, linux-nvdimm, linux-mm, mgorman, torvalds, hch

Introduce a type that encapsulates a page-frame-number that can also be
used to encode other information.  This other information is the
traditional "page_link" encoding in a scatterlist, but can also denote
"device memory", i.e. a set of pfns that are not part of the kernel's
linear mapping but are accessed via the same memory controller as ram.
The motivation for this conversion is large-capacity persistent memory
that, by default, does not enjoy struct page coverage (entries in
'memmap').

This type will be used to replace usage of 'struct page *' in cases
where only a pfn is required, i.e. scatterlists for drivers, the dma
mapping api, and potentially biovecs for the block layer.  The
operations in those i/o paths that formerly required a 'struct page *'
are converted to use __pfn_t aware equivalent helpers.

It turns out that while 'struct page' references are used broadly in the
kernel I/O stacks the usage of 'struct page' based capabilities is very
shallow for block-i/o.  It is only used for populating bio_vecs and
scatterlists for the retrieval of dma addresses, and for temporary
kernel mappings (kmap).  Aside from kmap, these usages can be trivially
converted to operate on a pfn.

In contrast, kmap_atomic() is more problematic as it uses mm
infrastructure, via struct page, to set up and track temporary kernel
mappings.  It would be unfortunate if the kmap infrastructure escaped
its 32-bit/HIGHMEM bonds and leaked into 64-bit code.  Thankfully, it
seems all that is needed here is to convert the kmap_atomic() callers
that want to opt in to supporting persistent memory to use a new
kmap_atomic_pfn_t() that re-uses the existing ioremap() mapping
established by the driver for persistent memory.

Note that, as far as conceptually understanding __pfn_t is concerned,
'persistent memory' is really any address range in host memory not
covered by memmap.  Contrast this with pure iomem that is on an mmio
mapped bus like PCI and cannot be converted to a dma_addr_t by "pfn <<
PAGE_SHIFT".

Rather than let the kmap infrastructure leak into 64-bit code, when the
user has enabled CONFIG_KMAP_PFN the kmap_atomic_pfn_t() implementation
is directed to scan a list of pre-mapped persistent memory address
ranges inserted by the pmem driver.

The __pfn_t-to-resource lookup is indeed an inefficient walk of a linked
list, but there are two mitigating factors:

1/ The number of persistent memory ranges is bounded by the number of
   DIMMs which is on the order of 10s of DIMMs, not hundreds.

2/ The lookup yields the entire range, so if it becomes inefficient to
   do a kmap_atomic_pfn_t() one PAGE_SIZE at a time, the caller can
   amortize a single lookup across all the kmap operations it needs to
   perform in a given range (see the sketch below).
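
   The sketch (condensed from the dax_clear_blocks() conversion in
   patch 4, error handling elided):

       __pfn_t pfn;
       void *addr;
       long count;

       count = bdev_direct_access(bdev, sector, &pfn, size);
       addr = kmap_atomic_pfn_t(pfn);  /* one lookup for the whole range */
       while (count > 0) {
               unsigned pgsz = PAGE_SIZE - offset_in_page(addr);

               if (pgsz > count)
                       pgsz = count;
               memset(addr, 0, pgsz);  /* amortized over many pages */
               addr += pgsz;
               count -= pgsz;
       }
       kunmap_atomic_pfn_t(addr);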

[hch: various changes]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/kmap_pfn.h |   31 ++++++++++++
 include/linux/mm.h       |   57 ++++++++++++++++++++++
 mm/Kconfig               |    3 +
 mm/Makefile              |    1 
 mm/kmap_pfn.c            |  117 ++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 209 insertions(+)
 create mode 100644 include/linux/kmap_pfn.h
 create mode 100644 mm/kmap_pfn.c

diff --git a/include/linux/kmap_pfn.h b/include/linux/kmap_pfn.h
new file mode 100644
index 000000000000..fa44971d8e95
--- /dev/null
+++ b/include/linux/kmap_pfn.h
@@ -0,0 +1,31 @@
+#ifndef _LINUX_KMAP_PFN_H
+#define _LINUX_KMAP_PFN_H 1
+
+#include <linux/highmem.h>
+
+struct device;
+struct resource;
+#ifdef CONFIG_KMAP_PFN
+extern void *kmap_atomic_pfn_t(__pfn_t pfn);
+extern void kunmap_atomic_pfn_t(void *addr);
+extern int devm_register_kmap_pfn_range(struct device *dev,
+		struct resource *res, void *base);
+#else
+static inline void *kmap_atomic_pfn_t(__pfn_t pfn)
+{
+	return kmap_atomic(__pfn_t_to_page(pfn));
+}
+
+static inline void kunmap_atomic_pfn_t(void *addr)
+{
+	__kunmap_atomic(addr);
+}
+
+static inline int devm_register_kmap_pfn_range(struct device *dev,
+		struct resource *res, void *base)
+{
+	return 0;
+}
+#endif /* CONFIG_KMAP_PFN */
+
+#endif /* _LINUX_KMAP_PFN_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 84b05ebedb2d..57ba5ca6be72 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -924,6 +924,63 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
 }
 
 /*
+ * __pfn_t: encapsulates a page-frame number that is optionally backed
+ * by memmap (struct page).  This type will be used in place of a
+ * 'struct page *' instance in contexts where unmapped memory (usually
+ * persistent memory) is being referenced (scatterlists for drivers,
+ * biovecs for the block layer, etc).  Whether a __pfn_t has a struct
+ * page backing is indicated by flags in the low bits of the value;
+ */
+typedef struct {
+	unsigned long val;
+} __pfn_t;
+
+/*
+ * PFN_SG_CHAIN - pfn is a pointer to the next scatterlist entry
+ * PFN_SG_LAST - pfn references a page and is the last scatterlist entry
+ * PFN_DEV - pfn is not covered by system memmap
+ */
+enum {
+	PFN_MASK = (1UL << PAGE_SHIFT) - 1,
+	PFN_SG_CHAIN = (1UL << 0),
+	PFN_SG_LAST = (1UL << 1),
+#ifdef CONFIG_KMAP_PFN
+	PFN_DEV = (1UL << 2),
+#else
+	PFN_DEV = 0,
+#endif
+};
+
+static inline bool __pfn_t_has_page(__pfn_t pfn)
+{
+	return (pfn.val & PFN_DEV) == 0;
+}
+
+static inline unsigned long __pfn_t_to_pfn(__pfn_t pfn)
+{
+	return pfn.val >> PAGE_SHIFT;
+}
+
+static inline struct page *__pfn_t_to_page(__pfn_t pfn)
+{
+	if (!__pfn_t_has_page(pfn))
+		return NULL;
+	return pfn_to_page(__pfn_t_to_pfn(pfn));
+}
+
+static inline dma_addr_t __pfn_t_to_phys(__pfn_t pfn)
+{
+	return __pfn_to_phys(__pfn_t_to_pfn(pfn));
+}
+
+static inline __pfn_t page_to_pfn_t(struct page *page)
+{
+	__pfn_t pfn = { .val = page_to_pfn(page) << PAGE_SHIFT, };
+
+	return pfn;
+}
+
+/*
  * Some inline functions in vmstat.h depend on page_zone()
  */
 #include <linux/vmstat.h>
diff --git a/mm/Kconfig b/mm/Kconfig
index e79de2bd12cd..ed1be8ff982e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -654,3 +654,6 @@ config DEFERRED_STRUCT_PAGE_INIT
 	  when kswapd starts. This has a potential performance impact on
 	  processes running early in the lifetime of the systemm until kswapd
 	  finishes the initialisation.
+
+config KMAP_PFN
+	bool
diff --git a/mm/Makefile b/mm/Makefile
index 98c4eaeabdcb..f7b27958ea69 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -78,3 +78,4 @@ obj-$(CONFIG_CMA)	+= cma.o
 obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
 obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
 obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
+obj-$(CONFIG_KMAP_PFN) += kmap_pfn.o
diff --git a/mm/kmap_pfn.c b/mm/kmap_pfn.c
new file mode 100644
index 000000000000..2d58e167dfbc
--- /dev/null
+++ b/mm/kmap_pfn.c
@@ -0,0 +1,117 @@
+/*
+ * Copyright(c) 2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/rcupdate.h>
+#include <linux/rculist.h>
+#include <linux/highmem.h>
+#include <linux/device.h>
+#include <linux/mutex.h>
+#include <linux/slab.h>
+#include <linux/mm.h>
+
+static LIST_HEAD(ranges);
+static DEFINE_MUTEX(register_lock);
+
+struct kmap {
+	struct list_head list;
+	struct resource *res;
+	struct device *dev;
+	void *base;
+};
+
+static void teardown_kmap(void *data)
+{
+	struct kmap *kmap = data;
+
+	dev_dbg(kmap->dev, "kmap unregister %pr\n", kmap->res);
+	mutex_lock(&register_lock);
+	list_del_rcu(&kmap->list);
+	mutex_unlock(&register_lock);
+	synchronize_rcu();
+	kfree(kmap);
+}
+
+int devm_register_kmap_pfn_range(struct device *dev, struct resource *res,
+		void *base)
+{
+	struct kmap *kmap = kzalloc(sizeof(*kmap), GFP_KERNEL);
+	int rc;
+
+	if (!kmap)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&kmap->list);
+	kmap->res = res;
+	kmap->base = base;
+	kmap->dev = dev;
+	rc = devm_add_action(dev, teardown_kmap, kmap);
+	if (rc) {
+		kfree(kmap);
+		return rc;
+	}
+	dev_dbg(kmap->dev, "kmap register %pr\n", kmap->res);
+
+	mutex_lock(&register_lock);
+	list_add_rcu(&kmap->list, &ranges);
+	mutex_unlock(&register_lock);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(devm_register_kmap_pfn_range);
+
+void *kmap_atomic_pfn_t(__pfn_t pfn)
+{
+	struct page *page = __pfn_t_to_page(pfn);
+	resource_size_t addr;
+	struct kmap *kmap;
+
+	rcu_read_lock();
+	if (page)
+		return kmap_atomic(page);
+	addr = __pfn_t_to_phys(pfn);
+	list_for_each_entry_rcu(kmap, &ranges, list)
+		if (addr >= kmap->res->start && addr <= kmap->res->end)
+			return kmap->base + addr - kmap->res->start;
+
+	/* only unlock in the error case */
+	rcu_read_unlock();
+	return NULL;
+}
+EXPORT_SYMBOL(kmap_atomic_pfn_t);
+
+void kunmap_atomic_pfn_t(void *addr)
+{
+	struct kmap *kmap;
+	bool dev_pfn = false;
+
+	if (!addr)
+		return;
+
+	/*
+	 * If the original __pfn_t had an entry in the memmap (i.e.
+	 * !PFN_DEV) then 'addr' will be outside of the registered
+	 * ranges and we'll need to kunmap_atomic() it.
+	 */
+	list_for_each_entry_rcu(kmap, &ranges, list)
+		if (addr < kmap->base + resource_size(kmap->res)
+				&& addr >= kmap->base) {
+			dev_pfn = true;
+			break;
+		}
+
+	if (!dev_pfn)
+		kunmap_atomic(addr);
+
+	/* signal that we are done with the range */
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL(kunmap_atomic_pfn_t);



* [PATCH v5 3/5] dax: drop size parameter to ->direct_access()
  2015-08-13  3:00 [PATCH v5 0/5] introduce __pfn_t for unmapped pfn I/O and DAX lifetime Dan Williams
  2015-08-13  3:01 ` [PATCH v5 1/5] mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h Dan Williams
  2015-08-13  3:01 ` [PATCH v5 2/5] allow mapping page-less memremaped areas into KVA Dan Williams
@ 2015-08-13  3:01 ` Dan Williams
  2015-08-13  3:01 ` [PATCH v5 4/5] dax: fix mapping lifetime handling, convert to __pfn_t + kmap_atomic_pfn_t() Dan Williams
  2015-08-13  3:01 ` [PATCH v5 5/5] scatterlist: convert to __pfn_t Dan Williams
  4 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2015-08-13  3:01 UTC (permalink / raw)
  To: linux-kernel; +Cc: axboe, riel, linux-nvdimm, linux-mm, mgorman, torvalds, hch

None of the implementations currently use it.  The common
bdev_direct_access() entry point handles all the size checks before
calling ->direct_access().
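
For reference, the checks in question live in the common entry point
(excerpted from the fs/block_dev.c context below), so the size argument
that was passed down to the drivers was never needed:

    sector += get_start_sect(bdev);
    if (sector % (PAGE_SIZE / 512))
            return -EINVAL;
    avail = ops->direct_access(bdev, sector, addr, pfn);
    if (!avail)
            return -ERANGE;
    return min(avail, size);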

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/powerpc/sysdev/axonram.c |    2 +-
 drivers/block/brd.c           |    6 +-----
 drivers/nvdimm/pmem.c         |    2 +-
 drivers/s390/block/dcssblk.c  |    4 ++--
 fs/block_dev.c                |    2 +-
 include/linux/blkdev.h        |    2 +-
 6 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index ee90db17b097..e8657d3bc588 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -141,7 +141,7 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
  */
 static long
 axon_ram_direct_access(struct block_device *device, sector_t sector,
-		       void **kaddr, unsigned long *pfn, long size)
+		       void **kaddr, unsigned long *pfn)
 {
 	struct axon_ram_bank *bank = device->bd_disk->private_data;
 	loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 64ab4951e9d6..41528857c70d 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -371,7 +371,7 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
 
 #ifdef CONFIG_BLK_DEV_RAM_DAX
 static long brd_direct_access(struct block_device *bdev, sector_t sector,
-			void **kaddr, unsigned long *pfn, long size)
+			void **kaddr, unsigned long *pfn)
 {
 	struct brd_device *brd = bdev->bd_disk->private_data;
 	struct page *page;
@@ -384,10 +384,6 @@ static long brd_direct_access(struct block_device *bdev, sector_t sector,
 	*kaddr = page_address(page);
 	*pfn = page_to_pfn(page);
 
-	/*
-	 * TODO: If size > PAGE_SIZE, we could look to see if the next page in
-	 * the file happens to be mapped to the next page of physical RAM.
-	 */
 	return PAGE_SIZE;
 }
 #else
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index eb7552d939e1..5e019a6942ce 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -92,7 +92,7 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
 }
 
 static long pmem_direct_access(struct block_device *bdev, sector_t sector,
-			      void **kaddr, unsigned long *pfn, long size)
+			      void **kaddr, unsigned long *pfn)
 {
 	struct pmem_device *pmem = bdev->bd_disk->private_data;
 	size_t offset = sector << 9;
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index da212813f2d5..2f1734ba0e22 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -29,7 +29,7 @@ static int dcssblk_open(struct block_device *bdev, fmode_t mode);
 static void dcssblk_release(struct gendisk *disk, fmode_t mode);
 static void dcssblk_make_request(struct request_queue *q, struct bio *bio);
 static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
-				 void **kaddr, unsigned long *pfn, long size);
+				 void **kaddr, unsigned long *pfn);
 
 static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";
 
@@ -879,7 +879,7 @@ fail:
 
 static long
 dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
-			void **kaddr, unsigned long *pfn, long size)
+			void **kaddr, unsigned long *pfn)
 {
 	struct dcssblk_dev_info *dev_info;
 	unsigned long offset, dev_sz;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 198243717da5..3a8ac7edfbf4 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -462,7 +462,7 @@ long bdev_direct_access(struct block_device *bdev, sector_t sector,
 	sector += get_start_sect(bdev);
 	if (sector % (PAGE_SIZE / 512))
 		return -EINVAL;
-	avail = ops->direct_access(bdev, sector, addr, pfn, size);
+	avail = ops->direct_access(bdev, sector, addr, pfn);
 	if (!avail)
 		return -ERANGE;
 	return min(avail, size);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d4068c17d0df..ff47d5498133 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1556,7 +1556,7 @@ struct block_device_operations {
 	int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
 	int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
 	long (*direct_access)(struct block_device *, sector_t,
-					void **, unsigned long *pfn, long size);
+					void **, unsigned long *pfn);
 	unsigned int (*check_events) (struct gendisk *disk,
 				      unsigned int clearing);
 	/* ->media_changed() is DEPRECATED, use ->check_events() instead */



* [PATCH v5 4/5] dax: fix mapping lifetime handling, convert to __pfn_t + kmap_atomic_pfn_t()
  2015-08-13  3:00 [PATCH v5 0/5] introduce __pfn_t for unmapped pfn I/O and DAX lifetime Dan Williams
                   ` (2 preceding siblings ...)
  2015-08-13  3:01 ` [PATCH v5 3/5] dax: drop size parameter to ->direct_access() Dan Williams
@ 2015-08-13  3:01 ` Dan Williams
  2015-08-13  6:26   ` Boaz Harrosh
  2015-08-13  3:01 ` [PATCH v5 5/5] scatterlist: convert to __pfn_t Dan Williams
  4 siblings, 1 reply; 21+ messages in thread
From: Dan Williams @ 2015-08-13  3:01 UTC (permalink / raw)
  To: linux-kernel; +Cc: axboe, riel, linux-nvdimm, linux-mm, mgorman, torvalds, hch

The primary way for non-page-backed page-frames to enter the system is
via the pmem driver's ->direct_access() method.  The pfns returned by
the top-level bdev_direct_access() may be passed to any other subsystem
in the kernel, and those subsystems either need to assume that the pfn
is page backed (CONFIG_KMAP_PFN=n) or be prepared to handle the
non-page-backed case (CONFIG_KMAP_PFN=y).  Currently the pfns returned
by ->direct_access() are only ever used by vm_insert_mixed(), which does
not care whether the pfn is page backed.  As we go to add more usages of
these pfns, add the type-safety of __pfn_t.

This also simplifies the calling convention of ->direct_access() by no
longer returning the virtual address in the same call; kernel users must
instead map the returned __pfn_t with kmap_atomic_pfn_t().  This
annotates the cases where the kernel is directly accessing pmem outside
the driver, and makes the valid lifetime of the reference explicit.
This property may be useful in the future for invalidating mappings to
pmem, but for now it provides some protection against the "pmem disable
vs still-in-use" race.
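
Schematically (as reflected in the include/linux/blkdev.h hunk below),
the method signature changes from:

    long (*direct_access)(struct block_device *, sector_t,
                          void **, unsigned long *pfn);

to:

    long (*direct_access)(struct block_device *, sector_t, __pfn_t *pfn);

...and in-kernel users obtain a short-lived kernel address via
kmap_atomic_pfn_t() / kunmap_atomic_pfn_t() rather than consuming a
returned kaddr directly.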

Note that axon_ram_direct_access and dcssblk_direct_access were
previously making potentially incorrect assumptions about the addresses
they passed to virt_to_phys().

[hch: various minor updates]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/powerpc/platforms/Kconfig |    1 +
 arch/powerpc/sysdev/axonram.c  |   24 ++++++++----
 drivers/block/brd.c            |    5 +--
 drivers/nvdimm/Kconfig         |    1 +
 drivers/nvdimm/pmem.c          |   24 +++++++-----
 drivers/s390/block/Kconfig     |    1 +
 drivers/s390/block/dcssblk.c   |   23 ++++++++++--
 fs/Kconfig                     |    1 +
 fs/block_dev.c                 |    4 +-
 fs/dax.c                       |   79 +++++++++++++++++++++++++++++-----------
 include/linux/blkdev.h         |    7 ++--
 include/linux/mm.h             |   12 ++++++
 12 files changed, 129 insertions(+), 53 deletions(-)

diff --git a/arch/powerpc/platforms/Kconfig b/arch/powerpc/platforms/Kconfig
index b7f9c408bf24..6b1c2f2e5fb4 100644
--- a/arch/powerpc/platforms/Kconfig
+++ b/arch/powerpc/platforms/Kconfig
@@ -307,6 +307,7 @@ config CPM2
 config AXON_RAM
 	tristate "Axon DDR2 memory device driver"
 	depends on PPC_IBM_CELL_BLADE && BLOCK
+	select KMAP_PFN
 	default m
 	help
 	  It registers one block device per Axon's DDR2 memory bank found
diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index e8657d3bc588..7c5a1563c0fd 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -43,6 +43,7 @@
 #include <linux/types.h>
 #include <linux/of_device.h>
 #include <linux/of_platform.h>
+#include <linux/kmap_pfn.h>
 
 #include <asm/page.h>
 #include <asm/prom.h>
@@ -141,14 +142,12 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
  */
 static long
 axon_ram_direct_access(struct block_device *device, sector_t sector,
-		       void **kaddr, unsigned long *pfn)
+		       __pfn_t *pfn)
 {
 	struct axon_ram_bank *bank = device->bd_disk->private_data;
 	loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;
 
-	*kaddr = (void *)(bank->ph_addr + offset);
-	*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
-
+	*pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV);
 	return bank->size - offset;
 }
 
@@ -165,9 +164,13 @@ static int axon_ram_probe(struct platform_device *device)
 {
 	static int axon_ram_bank_id = -1;
 	struct axon_ram_bank *bank;
-	struct resource resource;
+	struct resource *resource;
 	int rc = 0;
 
+	resource = devm_kzalloc(&device->dev, sizeof(*resource), GFP_KERNEL);
+	if (!resource)
+		return -ENOMEM;
+
 	axon_ram_bank_id++;
 
 	dev_info(&device->dev, "Found memory controller on %s\n",
@@ -184,13 +187,13 @@ static int axon_ram_probe(struct platform_device *device)
 
 	bank->device = device;
 
-	if (of_address_to_resource(device->dev.of_node, 0, &resource) != 0) {
+	if (of_address_to_resource(device->dev.of_node, 0, resource) != 0) {
 		dev_err(&device->dev, "Cannot access device tree\n");
 		rc = -EFAULT;
 		goto failed;
 	}
 
-	bank->size = resource_size(&resource);
+	bank->size = resource_size(resource);
 
 	if (bank->size == 0) {
 		dev_err(&device->dev, "No DDR2 memory found for %s%d\n",
@@ -202,7 +205,7 @@ static int axon_ram_probe(struct platform_device *device)
 	dev_info(&device->dev, "Register DDR2 memory device %s%d with %luMB\n",
 			AXON_RAM_DEVICE_NAME, axon_ram_bank_id, bank->size >> 20);
 
-	bank->ph_addr = resource.start;
+	bank->ph_addr = resource->start;
 	bank->io_addr = (unsigned long) ioremap_prot(
 			bank->ph_addr, bank->size, _PAGE_NO_CACHE);
 	if (bank->io_addr == 0) {
@@ -211,6 +214,11 @@ static int axon_ram_probe(struct platform_device *device)
 		goto failed;
 	}
 
+	rc = devm_register_kmap_pfn_range(&device->dev, resource,
+			(void *) bank->io_addr);
+	if (rc)
+		goto failed;
+
 	bank->disk = alloc_disk(AXON_RAM_MINORS_PER_DISK);
 	if (bank->disk == NULL) {
 		dev_err(&device->dev, "Cannot register disk\n");
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 41528857c70d..6c4b21a4e915 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -371,7 +371,7 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
 
 #ifdef CONFIG_BLK_DEV_RAM_DAX
 static long brd_direct_access(struct block_device *bdev, sector_t sector,
-			void **kaddr, unsigned long *pfn)
+		__pfn_t *pfn)
 {
 	struct brd_device *brd = bdev->bd_disk->private_data;
 	struct page *page;
@@ -381,8 +381,7 @@ static long brd_direct_access(struct block_device *bdev, sector_t sector,
 	page = brd_insert_page(brd, sector);
 	if (!page)
 		return -ENOSPC;
-	*kaddr = page_address(page);
-	*pfn = page_to_pfn(page);
+	*pfn = page_to_pfn_t(page);
 
 	return PAGE_SIZE;
 }
diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index 72226acb5c0f..0d8c6bda7a41 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -20,6 +20,7 @@ config BLK_DEV_PMEM
 	tristate "PMEM: Persistent memory block device support"
 	default LIBNVDIMM
 	depends on HAS_IOMEM
+	select KMAP_PFN
 	select ND_BTT if BTT
 	help
 	  Memory ranges for PMEM are described by either an NFIT
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 5e019a6942ce..85d4101bb821 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -25,6 +25,8 @@
 #include <linux/slab.h>
 #include <linux/pmem.h>
 #include <linux/nd.h>
+#include <linux/mm.h>
+#include <linux/kmap_pfn.h>
 #include "nd.h"
 
 struct pmem_device {
@@ -92,18 +94,12 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
 }
 
 static long pmem_direct_access(struct block_device *bdev, sector_t sector,
-			      void **kaddr, unsigned long *pfn)
+		__pfn_t *pfn)
 {
 	struct pmem_device *pmem = bdev->bd_disk->private_data;
 	size_t offset = sector << 9;
 
-	if (!pmem)
-		return -ENODEV;
-
-	/* FIXME convert DAX to comprehend that this mapping has a lifetime */
-	*kaddr = (void __force *) pmem->virt_addr + offset;
-	*pfn = (pmem->phys_addr + offset) >> PAGE_SHIFT;
-
+	*pfn = phys_to_pfn_t(pmem->phys_addr + offset, PFN_DEV);
 	return pmem->size - offset;
 }
 
@@ -149,10 +145,17 @@ static void pmem_detach_disk(struct pmem_device *pmem)
 	blk_cleanup_queue(pmem->pmem_queue);
 }
 
-static int pmem_attach_disk(struct nd_namespace_common *ndns,
+static int pmem_attach_disk(struct device *dev,
+		struct nd_namespace_common *ndns,
 		struct pmem_device *pmem)
 {
 	struct gendisk *disk;
+	struct resource *res = &(to_nd_namespace_io(&ndns->dev)->res);
+	int err;
+
+	err = devm_register_kmap_pfn_range(dev, res, pmem->virt_addr);
+	if (err)
+		return err;
 
 	pmem->pmem_queue = blk_alloc_queue(GFP_KERNEL);
 	if (!pmem->pmem_queue)
@@ -232,7 +235,8 @@ static int nd_pmem_probe(struct device *dev)
 	if (nd_btt_probe(ndns, pmem) == 0)
 		/* we'll come back as btt-pmem */
 		return -ENXIO;
-	return pmem_attach_disk(ndns, pmem);
+
+	return pmem_attach_disk(dev, ndns, pmem);
 }
 
 static int nd_pmem_remove(struct device *dev)
diff --git a/drivers/s390/block/Kconfig b/drivers/s390/block/Kconfig
index 4a3b62326183..06c7a1c90d88 100644
--- a/drivers/s390/block/Kconfig
+++ b/drivers/s390/block/Kconfig
@@ -14,6 +14,7 @@ config BLK_DEV_XPRAM
 
 config DCSSBLK
 	def_tristate m
+	select KMAP_PFN
 	prompt "DCSSBLK support"
 	depends on S390 && BLOCK
 	help
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 2f1734ba0e22..42f1546d7b03 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -16,6 +16,7 @@
 #include <linux/blkdev.h>
 #include <linux/completion.h>
 #include <linux/interrupt.h>
+#include <linux/kmap_pfn.h>
 #include <linux/platform_device.h>
 #include <asm/extmem.h>
 #include <asm/io.h>
@@ -29,7 +30,7 @@ static int dcssblk_open(struct block_device *bdev, fmode_t mode);
 static void dcssblk_release(struct gendisk *disk, fmode_t mode);
 static void dcssblk_make_request(struct request_queue *q, struct bio *bio);
 static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
-				 void **kaddr, unsigned long *pfn);
+		__pfn_t *pfn);
 
 static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";
 
@@ -520,12 +521,18 @@ static const struct attribute_group *dcssblk_dev_attr_groups[] = {
 static ssize_t
 dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count)
 {
+	struct resource *res = devm_kzalloc(dev, sizeof(*res), GFP_KERNEL);
 	int rc, i, j, num_of_segments;
 	struct dcssblk_dev_info *dev_info;
 	struct segment_info *seg_info, *temp;
 	char *local_buf;
 	unsigned long seg_byte_size;
 
+	if (!res) {
+		rc = -ENOMEM;
+		goto out_nobuf;
+	}
+
 	dev_info = NULL;
 	seg_info = NULL;
 	if (dev != dcssblk_root_dev) {
@@ -652,6 +659,13 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char
 	if (rc)
 		goto put_dev;
 
+	res->start = dev_info->start;
+	res->end = dev_info->end - 1;
+	rc = devm_register_kmap_pfn_range(&dev_info->dev, res,
+			(void *) dev_info->start);
+	if (rc)
+		goto put_dev;
+
 	get_device(&dev_info->dev);
 	add_disk(dev_info->gd);
 
@@ -699,6 +713,8 @@ seg_list_del:
 out:
 	kfree(local_buf);
 out_nobuf:
+	if (res)
+		devm_kfree(dev, res);
 	return rc;
 }
 
@@ -879,7 +895,7 @@ fail:
 
 static long
 dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
-			void **kaddr, unsigned long *pfn)
+		__pfn_t *pfn)
 {
 	struct dcssblk_dev_info *dev_info;
 	unsigned long offset, dev_sz;
@@ -889,8 +905,7 @@ dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
 		return -ENODEV;
 	dev_sz = dev_info->end - dev_info->start;
 	offset = secnum * 512;
-	*kaddr = (void *) (dev_info->start + offset);
-	*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
+	*pfn = phys_to_pfn_t(dev_info->start + offset, PFN_DEV);
 
 	return dev_sz - offset;
 }
diff --git a/fs/Kconfig b/fs/Kconfig
index 011f43365d7b..bd37234e71a8 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -38,6 +38,7 @@ config FS_DAX
 	bool "Direct Access (DAX) support"
 	depends on MMU
 	depends on !(ARM || MIPS || SPARC)
+	depends on KMAP_PFN
 	help
 	  Direct Access (DAX) can be used on memory-backed block devices.
 	  If the block device supports DAX and the filesystem supports DAX,
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 3a8ac7edfbf4..73fbc57b6e6d 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -441,7 +441,7 @@ EXPORT_SYMBOL_GPL(bdev_write_page);
  * accessible at this address.
  */
 long bdev_direct_access(struct block_device *bdev, sector_t sector,
-			void **addr, unsigned long *pfn, long size)
+			__pfn_t *pfn, long size)
 {
 	long avail;
 	const struct block_device_operations *ops = bdev->bd_disk->fops;
@@ -462,7 +462,7 @@ long bdev_direct_access(struct block_device *bdev, sector_t sector,
 	sector += get_start_sect(bdev);
 	if (sector % (PAGE_SIZE / 512))
 		return -EINVAL;
-	avail = ops->direct_access(bdev, sector, addr, pfn);
+	avail = ops->direct_access(bdev, sector, pfn);
 	if (!avail)
 		return -ERANGE;
 	return min(avail, size);
diff --git a/fs/dax.c b/fs/dax.c
index c3e21ccfc358..94611f480091 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -26,6 +26,7 @@
 #include <linux/sched.h>
 #include <linux/uio.h>
 #include <linux/vmstat.h>
+#include <linux/kmap_pfn.h>
 
 int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 {
@@ -35,13 +36,16 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 	might_sleep();
 	do {
 		void *addr;
-		unsigned long pfn;
+		__pfn_t pfn;
 		long count;
 
-		count = bdev_direct_access(bdev, sector, &addr, &pfn, size);
+		count = bdev_direct_access(bdev, sector, &pfn, size);
 		if (count < 0)
 			return count;
 		BUG_ON(size < count);
+		addr = kmap_atomic_pfn_t(pfn);
+		if (!addr)
+			return -EIO;
 		while (count > 0) {
 			unsigned pgsz = PAGE_SIZE - offset_in_page(addr);
 			if (pgsz > count)
@@ -57,17 +61,39 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 			sector += pgsz / 512;
 			cond_resched();
 		}
+		kunmap_atomic_pfn_t(addr);
 	} while (size);
 
 	return 0;
 }
 EXPORT_SYMBOL_GPL(dax_clear_blocks);
 
-static long dax_get_addr(struct buffer_head *bh, void **addr, unsigned blkbits)
+static void *__dax_map_bh(struct buffer_head *bh, unsigned blkbits, __pfn_t *pfn)
 {
-	unsigned long pfn;
 	sector_t sector = bh->b_blocknr << (blkbits - 9);
-	return bdev_direct_access(bh->b_bdev, sector, addr, &pfn, bh->b_size);
+	void *addr;
+	long rc;
+
+	rc = bdev_direct_access(bh->b_bdev, sector, pfn, bh->b_size);
+	if (rc < 0)
+		return ERR_PTR(rc);
+	addr = kmap_atomic_pfn_t(*pfn);
+	if (!addr)
+		return ERR_PTR(-EIO);
+	return addr;
+}
+
+static void *dax_map_bh(struct buffer_head *bh, unsigned blkbits)
+{
+	__pfn_t pfn;
+
+	return __dax_map_bh(bh, blkbits, &pfn);
+}
+
+static void dax_unmap_bh(void *addr)
+{
+	if (!IS_ERR(addr))
+		kunmap_atomic_pfn_t(addr);
 }
 
 static void dax_new_buf(void *addr, unsigned size, unsigned first, loff_t pos,
@@ -106,7 +132,7 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 	loff_t pos = start;
 	loff_t max = start;
 	loff_t bh_max = start;
-	void *addr;
+	void *addr = NULL, *kmap = ERR_PTR(-EIO);
 	bool hole = false;
 
 	if (iov_iter_rw(iter) != WRITE)
@@ -142,9 +168,13 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 				addr = NULL;
 				size = bh->b_size - first;
 			} else {
-				retval = dax_get_addr(bh, &addr, blkbits);
-				if (retval < 0)
+				dax_unmap_bh(kmap);
+				kmap = dax_map_bh(bh, blkbits);
+				if (IS_ERR(kmap)) {
+					retval = PTR_ERR(kmap);
 					break;
+				}
+				addr = kmap;
 				if (buffer_unwritten(bh) || buffer_new(bh))
 					dax_new_buf(addr, retval, first, pos,
 									end);
@@ -168,6 +198,8 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 		addr += len;
 	}
 
+	dax_unmap_bh(kmap);
+
 	return (pos == start) ? retval : pos - start;
 }
 
@@ -261,11 +293,14 @@ static int copy_user_bh(struct page *to, struct buffer_head *bh,
 			unsigned blkbits, unsigned long vaddr)
 {
 	void *vfrom, *vto;
-	if (dax_get_addr(bh, &vfrom, blkbits) < 0)
-		return -EIO;
+
+	vfrom = dax_map_bh(bh, blkbits);
+	if (IS_ERR(vfrom))
+		return PTR_ERR(vfrom);
 	vto = kmap_atomic(to);
 	copy_user_page(vto, vfrom, vaddr, to);
 	kunmap_atomic(vto);
+	dax_unmap_bh(vfrom);
 	return 0;
 }
 
@@ -273,11 +308,10 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 			struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	struct address_space *mapping = inode->i_mapping;
-	sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
 	unsigned long vaddr = (unsigned long)vmf->virtual_address;
-	void *addr;
-	unsigned long pfn;
 	pgoff_t size;
+	__pfn_t pfn;
+	void *addr;
 	int error;
 
 	i_mmap_lock_read(mapping);
@@ -295,18 +329,17 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 		goto out;
 	}
 
-	error = bdev_direct_access(bh->b_bdev, sector, &addr, &pfn, bh->b_size);
-	if (error < 0)
-		goto out;
-	if (error < PAGE_SIZE) {
-		error = -EIO;
+	addr = __dax_map_bh(bh, inode->i_blkbits, &pfn);
+	if (IS_ERR(addr)) {
+		error = PTR_ERR(addr);
 		goto out;
 	}
 
 	if (buffer_unwritten(bh) || buffer_new(bh))
 		clear_page(addr);
+	dax_unmap_bh(addr);
 
-	error = vm_insert_mixed(vma, vaddr, pfn);
+	error = vm_insert_mixed(vma, vaddr, __pfn_t_to_pfn(pfn));
 
  out:
 	i_mmap_unlock_read(mapping);
@@ -539,10 +572,12 @@ int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
 		return err;
 	if (buffer_written(&bh)) {
 		void *addr;
-		err = dax_get_addr(&bh, &addr, inode->i_blkbits);
-		if (err < 0)
-			return err;
+
+		addr = dax_map_bh(&bh, inode->i_blkbits);
+		if (IS_ERR(addr))
+			return PTR_ERR(addr);
 		memset(addr + offset, 0, length);
+		dax_unmap_bh(addr);
 	}
 
 	return 0;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ff47d5498133..ae59778d8076 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1555,8 +1555,7 @@ struct block_device_operations {
 	int (*rw_page)(struct block_device *, sector_t, struct page *, int rw);
 	int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
 	int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
-	long (*direct_access)(struct block_device *, sector_t,
-					void **, unsigned long *pfn);
+	long (*direct_access)(struct block_device *, sector_t, __pfn_t *pfn);
 	unsigned int (*check_events) (struct gendisk *disk,
 				      unsigned int clearing);
 	/* ->media_changed() is DEPRECATED, use ->check_events() instead */
@@ -1574,8 +1573,8 @@ extern int __blkdev_driver_ioctl(struct block_device *, fmode_t, unsigned int,
 extern int bdev_read_page(struct block_device *, sector_t, struct page *);
 extern int bdev_write_page(struct block_device *, sector_t, struct page *,
 						struct writeback_control *);
-extern long bdev_direct_access(struct block_device *, sector_t, void **addr,
-						unsigned long *pfn, long size);
+extern long bdev_direct_access(struct block_device *, sector_t,
+		__pfn_t *pfn, long size);
 #else /* CONFIG_BLOCK */
 
 struct block_device;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 57ba5ca6be72..c4683ea2fcab 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -951,6 +951,18 @@ enum {
 #endif
 };
 
+static inline __pfn_t pfn_to_pfn_t(unsigned long pfn, unsigned long flags)
+{
+	__pfn_t pfn_t = { .val = (pfn << PAGE_SHIFT) | (flags & PFN_MASK), };
+
+	return pfn_t;
+}
+
+static inline __pfn_t phys_to_pfn_t(dma_addr_t addr, unsigned long flags)
+{
+	return pfn_to_pfn_t(addr >> PAGE_SHIFT, flags);
+}
+
 static inline bool __pfn_t_has_page(__pfn_t pfn)
 {
 	return (pfn.val & PFN_DEV) == 0;



* [PATCH v5 5/5] scatterlist: convert to __pfn_t
  2015-08-13  3:00 [PATCH v5 0/5] introduce __pfn_t for unmapped pfn I/O and DAX lifetime Dan Williams
                   ` (3 preceding siblings ...)
  2015-08-13  3:01 ` [PATCH v5 4/5] dax: fix mapping lifetime handling, convert to __pfn_t + kmap_atomic_pfn_t() Dan Williams
@ 2015-08-13  3:01 ` Dan Williams
  4 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2015-08-13  3:01 UTC (permalink / raw)
  To: linux-kernel; +Cc: axboe, riel, linux-nvdimm, linux-mm, mgorman, torvalds, hch

__pfn_t has flags for sg_chain and sg_last; use it to replace the
'page_link' (an encoded 'struct page *') entry in struct scatterlist.
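
For illustration, a sketch of how a single entry would be populated and
consumed with the helpers below (the wrapper function is made up; the
sg_* and __pfn_t helpers are as defined in this patch and patch 2):

    /* populate one sg entry from a __pfn_t and return its bus address */
    static dma_addr_t example_sg_fill(struct scatterlist *sg, __pfn_t pfn,
                    unsigned int len)
    {
            sg_set_pfn(sg, pfn, len, 0);    /* stores pfn.val, keeps PFN_SG_LAST */
            sg_mark_end(sg);                /* sets PFN_SG_LAST, clears PFN_SG_CHAIN */

            if (sg_has_page(sg))            /* false when the pfn carries PFN_DEV */
                    flush_dcache_page(sg_page(sg));

            return sg_phys(sg);             /* __pfn_t_to_phys(sg->pfn) + sg->offset */
    }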

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/mm.h          |    9 +++
 include/linux/scatterlist.h |  111 ++++++++++++++++++++++++++++++-------------
 samples/kfifo/dma-example.c |    8 ++-
 3 files changed, 91 insertions(+), 37 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c4683ea2fcab..348f69467f54 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -992,6 +992,15 @@ static inline __pfn_t page_to_pfn_t(struct page *page)
 	return pfn;
 }
 
+static inline __pfn_t nth_pfn(__pfn_t pfn, unsigned int n)
+{
+	__pfn_t ret;
+
+	ret.val = (__pfn_t_to_pfn(pfn) + n) << PAGE_SHIFT
+		| (pfn.val & PFN_MASK);
+	return ret;
+}
+
 /*
  * Some inline functions in vmstat.h depend on page_zone()
  */
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 698e906ca730..c612599bb155 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -11,7 +11,7 @@ struct scatterlist {
 #ifdef CONFIG_DEBUG_SG
 	unsigned long	sg_magic;
 #endif
-	unsigned long	page_link;
+	__pfn_t		pfn;
 	unsigned int	offset;
 	unsigned int	length;
 	dma_addr_t	dma_address;
@@ -44,14 +44,14 @@ struct sg_table {
 /*
  * Notes on SG table design.
  *
- * We use the unsigned long page_link field in the scatterlist struct to place
+ * We use the __pfn_t pfn field in the scatterlist struct to place
  * the page pointer AND encode information about the sg table as well. The two
  * lower bits are reserved for this information.
  *
- * If bit 0 is set, then the page_link contains a pointer to the next sg
+ * If PFN_SG_CHAIN is set, then the pfn contains a pointer to the next sg
  * table list. Otherwise the next entry is at sg + 1.
  *
- * If bit 1 is set, then this sg entry is the last element in a list.
+ * If PFN_SG_LAST is set, then this sg entry is the last element in a list.
  *
  * See sg_next().
  *
@@ -64,10 +64,31 @@ struct sg_table {
  * a valid sg entry, or whether it points to the start of a new scatterlist.
  * Those low bits are there for everyone! (thanks mason :-)
  */
-#define sg_is_chain(sg)		((sg)->page_link & 0x01)
-#define sg_is_last(sg)		((sg)->page_link & 0x02)
-#define sg_chain_ptr(sg)	\
-	((struct scatterlist *) ((sg)->page_link & ~0x03))
+static inline bool sg_is_chain(struct scatterlist *sg)
+{
+	return (sg->pfn.val & PFN_SG_CHAIN) == PFN_SG_CHAIN;
+}
+
+static inline bool sg_is_last(struct scatterlist *sg)
+{
+	return (sg->pfn.val & PFN_SG_LAST) == PFN_SG_LAST;
+}
+
+static inline struct scatterlist *sg_chain_ptr(struct scatterlist *sg)
+{
+	return (struct scatterlist *) (sg->pfn.val
+		& ~(PFN_SG_CHAIN | PFN_SG_LAST));
+}
+
+static inline void sg_assign_pfn(struct scatterlist *sg, __pfn_t pfn)
+{
+#ifdef CONFIG_DEBUG_SG
+	BUG_ON(sg->sg_magic != SG_MAGIC);
+	BUG_ON(sg_is_chain(sg));
+#endif
+	pfn.val &= ~PFN_SG_LAST;
+	sg->pfn.val = (sg->pfn.val & PFN_SG_LAST) | pfn.val;
+}
 
 /**
  * sg_assign_page - Assign a given page to an SG entry
@@ -81,18 +102,20 @@ struct sg_table {
  **/
 static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
 {
-	unsigned long page_link = sg->page_link & 0x3;
+	__pfn_t pfn = page_to_pfn_t(page);
 
-	/*
-	 * In order for the low bit stealing approach to work, pages
-	 * must be aligned at a 32-bit boundary as a minimum.
-	 */
-	BUG_ON((unsigned long) page & 0x03);
-#ifdef CONFIG_DEBUG_SG
-	BUG_ON(sg->sg_magic != SG_MAGIC);
-	BUG_ON(sg_is_chain(sg));
-#endif
-	sg->page_link = page_link | (unsigned long) page;
+	/* check that a __pfn_t has enough bits to encode a page */
+	BUG_ON(pfn.val & (PFN_SG_LAST | PFN_SG_CHAIN));
+
+	sg_assign_pfn(sg, pfn);
+}
+
+static inline void sg_set_pfn(struct scatterlist *sg, __pfn_t pfn,
+	unsigned int len, unsigned int offset)
+{
+	sg_assign_pfn(sg, pfn);
+	sg->offset = offset;
+	sg->length = len;
 }
 
 /**
@@ -112,18 +135,34 @@ static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
 static inline void sg_set_page(struct scatterlist *sg, struct page *page,
 			       unsigned int len, unsigned int offset)
 {
-	sg_assign_page(sg, page);
-	sg->offset = offset;
-	sg->length = len;
+	sg_set_pfn(sg, page_to_pfn_t(page), len, offset);
+}
+
+static inline bool sg_has_page(struct scatterlist *sg)
+{
+	return __pfn_t_has_page(sg->pfn);
 }
 
 static inline struct page *sg_page(struct scatterlist *sg)
 {
+	__pfn_t pfn = sg->pfn;
+	struct page *page;
+
+	WARN_ONCE(!sg_has_page(sg), "scatterlist references unmapped memory\n");
+
 #ifdef CONFIG_DEBUG_SG
 	BUG_ON(sg->sg_magic != SG_MAGIC);
 	BUG_ON(sg_is_chain(sg));
 #endif
-	return (struct page *)((sg)->page_link & ~0x3);
+
+	page = __pfn_t_to_page(pfn);
+
+	return page;
+}
+
+static inline unsigned long sg_pfn(struct scatterlist *sg)
+{
+	return __pfn_t_to_pfn(sg->pfn);
 }
 
 /**
@@ -171,7 +210,8 @@ static inline void sg_chain(struct scatterlist *prv, unsigned int prv_nents,
 	 * Set lowest bit to indicate a link pointer, and make sure to clear
 	 * the termination bit if it happens to be set.
 	 */
-	prv[prv_nents - 1].page_link = ((unsigned long) sgl | 0x01) & ~0x02;
+	prv[prv_nents - 1].pfn.val = ((unsigned long) sgl | PFN_SG_CHAIN)
+		& ~PFN_SG_LAST;
 }
 
 /**
@@ -191,8 +231,8 @@ static inline void sg_mark_end(struct scatterlist *sg)
 	/*
 	 * Set termination bit, clear potential chain bit
 	 */
-	sg->page_link |= 0x02;
-	sg->page_link &= ~0x01;
+	sg->pfn.val |= PFN_SG_LAST;
+	sg->pfn.val &= ~PFN_SG_CHAIN;
 }
 
 /**
@@ -208,7 +248,7 @@ static inline void sg_unmark_end(struct scatterlist *sg)
 #ifdef CONFIG_DEBUG_SG
 	BUG_ON(sg->sg_magic != SG_MAGIC);
 #endif
-	sg->page_link &= ~0x02;
+	sg->pfn.val &= ~PFN_SG_LAST;
 }
 
 /**
@@ -216,14 +256,13 @@ static inline void sg_unmark_end(struct scatterlist *sg)
  * @sg:	     SG entry
  *
  * Description:
- *   This calls page_to_phys() on the page in this sg entry, and adds the
- *   sg offset. The caller must know that it is legal to call page_to_phys()
- *   on the sg page.
+ *   This calls __pfn_t_to_phys() on the pfn in this sg entry, and adds the
+ *   sg offset.
  *
  **/
 static inline dma_addr_t sg_phys(struct scatterlist *sg)
 {
-	return page_to_phys(sg_page(sg)) + sg->offset;
+	return __pfn_t_to_phys(sg->pfn) + sg->offset;
 }
 
 /**
@@ -281,7 +320,7 @@ size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents,
 #define SG_MAX_SINGLE_ALLOC		(PAGE_SIZE / sizeof(struct scatterlist))
 
 /*
- * sg page iterator
+ * sg page / pfn iterator
  *
  * Iterates over sg entries page-by-page.  On each successful iteration,
  * you can call sg_page_iter_page(@piter) and sg_page_iter_dma_address(@piter)
@@ -304,13 +343,19 @@ bool __sg_page_iter_next(struct sg_page_iter *piter);
 void __sg_page_iter_start(struct sg_page_iter *piter,
 			  struct scatterlist *sglist, unsigned int nents,
 			  unsigned long pgoffset);
+
+static inline __pfn_t sg_page_iter_pfn(struct sg_page_iter *piter)
+{
+	return nth_pfn(piter->sg->pfn, piter->sg_pgoffset);
+}
+
 /**
  * sg_page_iter_page - get the current page held by the page iterator
  * @piter:	page iterator holding the page
  */
 static inline struct page *sg_page_iter_page(struct sg_page_iter *piter)
 {
-	return nth_page(sg_page(piter->sg), piter->sg_pgoffset);
+	return __pfn_t_to_page(sg_page_iter_pfn(piter));
 }
 
 /**
diff --git a/samples/kfifo/dma-example.c b/samples/kfifo/dma-example.c
index aa243db93f01..3eeff9a56e0e 100644
--- a/samples/kfifo/dma-example.c
+++ b/samples/kfifo/dma-example.c
@@ -75,8 +75,8 @@ static int __init example_init(void)
 	for (i = 0; i < nents; i++) {
 		printk(KERN_INFO
 		"sg[%d] -> "
-		"page_link 0x%.8lx offset 0x%.8x length 0x%.8x\n",
-			i, sg[i].page_link, sg[i].offset, sg[i].length);
+		"pfn_val 0x%.8lx offset 0x%.8x length 0x%.8x\n",
+			i, sg[i].pfn.val, sg[i].offset, sg[i].length);
 
 		if (sg_is_last(&sg[i]))
 			break;
@@ -104,8 +104,8 @@ static int __init example_init(void)
 	for (i = 0; i < nents; i++) {
 		printk(KERN_INFO
 		"sg[%d] -> "
-		"page_link 0x%.8lx offset 0x%.8x length 0x%.8x\n",
-			i, sg[i].page_link, sg[i].offset, sg[i].length);
+		"pfn_val 0x%.8lx offset 0x%.8x length 0x%.8x\n",
+			i, sg[i].pfn.val, sg[i].offset, sg[i].length);
 
 		if (sg_is_last(&sg[i]))
 			break;



* Re: [PATCH v5 2/5] allow mapping page-less memremaped areas into KVA
  2015-08-13  3:01 ` [PATCH v5 2/5] allow mapping page-less memremaped areas into KVA Dan Williams
@ 2015-08-13  5:58   ` Boaz Harrosh
  2015-08-13 12:57     ` Dan Williams
  2015-08-13 14:37     ` Christoph Hellwig
  2015-08-13 17:35   ` Matthew Wilcox
  1 sibling, 2 replies; 21+ messages in thread
From: Boaz Harrosh @ 2015-08-13  5:58 UTC (permalink / raw)
  To: Dan Williams, linux-kernel
  Cc: axboe, riel, linux-nvdimm, linux-mm, mgorman, torvalds, hch

On 08/13/2015 06:01 AM, Dan Williams wrote:
> Introduce a type that encapsulates a page-frame-number that can also be
> used to encode other information.  This other information is the
> traditional "page_link" encoding in a scatterlist, but can also denote
> "device memory", i.e. a set of pfns that are not part of the kernel's
> linear mapping but are accessed via the same memory controller as ram.
> The motivation for this conversion is large-capacity persistent memory
> that, by default, does not enjoy struct page coverage (entries in
> 'memmap').
> 
> This type will be used to replace usage of 'struct page *' in cases
> where only a pfn is required, i.e. scatterlists for drivers, the dma
> mapping api, and potentially biovecs for the block layer.  The
> operations in those i/o paths that formerly required a 'struct page *'
> are converted to use __pfn_t aware equivalent helpers.
> 
> It turns out that while 'struct page' references are used broadly in the
> kernel I/O stacks the usage of 'struct page' based capabilities is very
> shallow for block-i/o.  It is only used for populating bio_vecs and
> scatterlists for the retrieval of dma addresses, and for temporary
> kernel mappings (kmap).  Aside from kmap, these usages can be trivially
> converted to operate on a pfn.
> 
> In contrast, kmap_atomic() is more problematic as it uses mm
> infrastructure, via struct page, to set up and track temporary kernel
> mappings.  It would be unfortunate if the kmap infrastructure escaped
> its 32-bit/HIGHMEM bonds and leaked into 64-bit code.  Thankfully, it
> seems all that is needed here is to convert the kmap_atomic() callers
> that want to opt in to supporting persistent memory to use a new
> kmap_atomic_pfn_t() that re-uses the existing ioremap() mapping
> established by the driver for persistent memory.
> 
> Note that, as far as conceptually understanding __pfn_t is concerned,
> 'persistent memory' is really any address range in host memory not
> covered by memmap.  Contrast this with pure iomem that is on an mmio
> mapped bus like PCI and cannot be converted to a dma_addr_t by "pfn <<
> PAGE_SHIFT".
> 
> Rather than let the kmap infrastructure leak into 64-bit code, when the
> user has enabled CONFIG_KMAP_PFN the kmap_atomic_pfn_t() implementation
> is directed to scan a list of pre-mapped persistent memory address
> ranges inserted by the pmem driver.
> 
> The __pfn_t-to-resource lookup is indeed an inefficient walk of a
> linked list, but there are two mitigating factors:
> 
> 1/ The number of persistent memory ranges is bounded by the number of
>    DIMMs which is on the order of 10s of DIMMs, not hundreds.
> 
> 2/ The lookup yields the entire range, so if it becomes inefficient to
>    do a kmap_atomic_pfn_t() one PAGE_SIZE at a time, the caller can
>    amortize a single lookup across all the kmap operations it needs to
>    perform in a given range.
> 
> [hch: various changes]
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  include/linux/kmap_pfn.h |   31 ++++++++++++
>  include/linux/mm.h       |   57 ++++++++++++++++++++++
>  mm/Kconfig               |    3 +
>  mm/Makefile              |    1 
>  mm/kmap_pfn.c            |  117 ++++++++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 209 insertions(+)
>  create mode 100644 include/linux/kmap_pfn.h
>  create mode 100644 mm/kmap_pfn.c
> 
> diff --git a/include/linux/kmap_pfn.h b/include/linux/kmap_pfn.h
> new file mode 100644
> index 000000000000..fa44971d8e95
> --- /dev/null
> +++ b/include/linux/kmap_pfn.h
> @@ -0,0 +1,31 @@
> +#ifndef _LINUX_KMAP_PFN_H
> +#define _LINUX_KMAP_PFN_H 1
> +
> +#include <linux/highmem.h>
> +
> +struct device;
> +struct resource;
> +#ifdef CONFIG_KMAP_PFN
> +extern void *kmap_atomic_pfn_t(__pfn_t pfn);
> +extern void kunmap_atomic_pfn_t(void *addr);
> +extern int devm_register_kmap_pfn_range(struct device *dev,
> +		struct resource *res, void *base);
> +#else
> +static inline void *kmap_atomic_pfn_t(__pfn_t pfn)
> +{
> +	return kmap_atomic(__pfn_t_to_page(pfn));
> +}
> +
> +static inline void kunmap_atomic_pfn_t(void *addr)
> +{
> +	__kunmap_atomic(addr);
> +}
> +
> +static inline int devm_register_kmap_pfn_range(struct device *dev,
> +		struct resource *res, void *base)
> +{
> +	return 0;
> +}
> +#endif /* CONFIG_KMAP_PFN */
> +
> +#endif /* _LINUX_KMAP_PFN_H */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 84b05ebedb2d..57ba5ca6be72 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -924,6 +924,63 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
>  }
>  
>  /*
> + * __pfn_t: encapsulates a page-frame number that is optionally backed
> + * by memmap (struct page).  This type will be used in place of a
> + * 'struct page *' instance in contexts where unmapped memory (usually
> + * persistent memory) is being referenced (scatterlists for drivers,
> + * biovecs for the block layer, etc).  Whether a __pfn_t has a struct
> + * page backing is indicated by flags in the low bits of the value;
> + */
> +typedef struct {
> +	unsigned long val;
> +} __pfn_t;
> +
> +/*
> + * PFN_SG_CHAIN - pfn is a pointer to the next scatterlist entry
> + * PFN_SG_LAST - pfn references a page and is the last scatterlist entry
> + * PFN_DEV - pfn is not covered by system memmap
> + */
> +enum {
> +	PFN_MASK = (1UL << PAGE_SHIFT) - 1,
> +	PFN_SG_CHAIN = (1UL << 0),
> +	PFN_SG_LAST = (1UL << 1),
> +#ifdef CONFIG_KMAP_PFN
> +	PFN_DEV = (1UL << 2),
> +#else
> +	PFN_DEV = 0,
> +#endif
> +};
> +
> +static inline bool __pfn_t_has_page(__pfn_t pfn)
> +{
> +	return (pfn.val & PFN_DEV) == 0;
> +}
> +
> +static inline unsigned long __pfn_t_to_pfn(__pfn_t pfn)
> +{
> +	return pfn.val >> PAGE_SHIFT;
> +}
> +
> +static inline struct page *__pfn_t_to_page(__pfn_t pfn)
> +{
> +	if (!__pfn_t_has_page(pfn))
> +		return NULL;
> +	return pfn_to_page(__pfn_t_to_pfn(pfn));
> +}
> +
> +static inline dma_addr_t __pfn_t_to_phys(__pfn_t pfn)
> +{
> +	return __pfn_to_phys(__pfn_t_to_pfn(pfn));
> +}
> +
> +static inline __pfn_t page_to_pfn_t(struct page *page)
> +{
> +	__pfn_t pfn = { .val = page_to_pfn(page) << PAGE_SHIFT, };
> +
> +	return pfn;
> +}
> +
> +/*
>   * Some inline functions in vmstat.h depend on page_zone()
>   */
>  #include <linux/vmstat.h>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index e79de2bd12cd..ed1be8ff982e 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -654,3 +654,6 @@ config DEFERRED_STRUCT_PAGE_INIT
>  	  when kswapd starts. This has a potential performance impact on
>  	  processes running early in the lifetime of the systemm until kswapd
>  	  finishes the initialisation.
> +
> +config KMAP_PFN
> +	bool
> diff --git a/mm/Makefile b/mm/Makefile
> index 98c4eaeabdcb..f7b27958ea69 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -78,3 +78,4 @@ obj-$(CONFIG_CMA)	+= cma.o
>  obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
>  obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
>  obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
> +obj-$(CONFIG_KMAP_PFN) += kmap_pfn.o
> diff --git a/mm/kmap_pfn.c b/mm/kmap_pfn.c
> new file mode 100644
> index 000000000000..2d58e167dfbc
> --- /dev/null
> +++ b/mm/kmap_pfn.c
> @@ -0,0 +1,117 @@
> +/*
> + * Copyright(c) 2015 Intel Corporation. All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of version 2 of the GNU General Public License as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + */
> +#include <linux/rcupdate.h>
> +#include <linux/rculist.h>
> +#include <linux/highmem.h>
> +#include <linux/device.h>
> +#include <linux/mutex.h>
> +#include <linux/slab.h>
> +#include <linux/mm.h>
> +
> +static LIST_HEAD(ranges);
> +static DEFINE_MUTEX(register_lock);
> +
> +struct kmap {
> +	struct list_head list;
> +	struct resource *res;
> +	struct device *dev;
> +	void *base;
> +};
> +
> +static void teardown_kmap(void *data)
> +{
> +	struct kmap *kmap = data;
> +
> +	dev_dbg(kmap->dev, "kmap unregister %pr\n", kmap->res);
> +	mutex_lock(&register_lock);
> +	list_del_rcu(&kmap->list);
> +	mutex_unlock(&register_lock);
> +	synchronize_rcu();
> +	kfree(kmap);
> +}
> +
> +int devm_register_kmap_pfn_range(struct device *dev, struct resource *res,
> +		void *base)
> +{
> +	struct kmap *kmap = kzalloc(sizeof(*kmap), GFP_KERNEL);
> +	int rc;
> +
> +	if (!kmap)
> +		return -ENOMEM;
> +
> +	INIT_LIST_HEAD(&kmap->list);
> +	kmap->res = res;
> +	kmap->base = base;
> +	kmap->dev = dev;
> +	rc = devm_add_action(dev, teardown_kmap, kmap);
> +	if (rc) {
> +		kfree(kmap);
> +		return rc;
> +	}
> +	dev_dbg(kmap->dev, "kmap register %pr\n", kmap->res);
> +
> +	mutex_lock(&register_lock);
> +	list_add_rcu(&kmap->list, &ranges);
> +	mutex_unlock(&register_lock);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(devm_register_kmap_pfn_range);
> +
> +void *kmap_atomic_pfn_t(__pfn_t pfn)
> +{
> +	struct page *page = __pfn_t_to_page(pfn);
> +	resource_size_t addr;
> +	struct kmap *kmap;
> +
> +	rcu_read_lock();
> +	if (page)
> +		return kmap_atomic(page);

Right, so even with pages I pay rcu_read_lock() for every access?

> +	addr = __pfn_t_to_phys(pfn);
> +	list_for_each_entry_rcu(kmap, &ranges, list)
> +		if (addr >= kmap->res->start && addr <= kmap->res->end)
> +			return kmap->base + addr - kmap->res->start;
> +

Good god! This loop is a real *joke*. You have just dropped memory access
performance ten-fold.

The whole point of pages and memory_model.h was to have a one-to-one
relationship between kernel-virtual vs physical vs page *.

There is already an object that holds a relationship of physical
to kernel-virtual. It is called a memory-section. Why not just
widen its definition?
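
For comparison, the kind of instantaneous translation being referred to here
is plain base arithmetic, along these lines (a sketch only; PMEM_VIRT_BASE and
PMEM_PHYS_BASE are hypothetical placeholders for a fixed, memory_model.h-style
mapping):

static inline void *pmem_phys_to_virt(resource_size_t addr)
{
	/* one subtract and one add: no locking, no list walk */
	return (void *)(PMEM_VIRT_BASE + (addr - PMEM_PHYS_BASE));
}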

If you are willing to accept this loop in a current 2015 Linux kernel,
then I have nothing further to say.

Boaz - off to mourn the death of the Linux Kernel alone in the corner ;-(

> +	/* only unlock in the error case */
> +	rcu_read_unlock();
> +	return NULL;
> +}
> +EXPORT_SYMBOL(kmap_atomic_pfn_t);
> +
> +void kunmap_atomic_pfn_t(void *addr)
> +{
> +	struct kmap *kmap;
> +	bool dev_pfn = false;
> +
> +	if (!addr)
> +		return;
> +
> +	/*
> +	 * If the original __pfn_t had an entry in the memmap (i.e.
> +	 * !PFN_DEV) then 'addr' will be outside of the registered
> +	 * ranges and we'll need to kunmap_atomic() it.
> +	 */
> +	list_for_each_entry_rcu(kmap, &ranges, list)
> +		if (addr < kmap->base + resource_size(kmap->res)
> +				&& addr >= kmap->base) {
> +			dev_pfn = true;
> +			break;
> +		}
> +
> +	if (!dev_pfn)
> +		kunmap_atomic(addr);
> +
> +	/* signal that we are done with the range */
> +	rcu_read_unlock();
> +}
> +EXPORT_SYMBOL(kunmap_atomic_pfn_t);
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 4/5] dax: fix mapping lifetime handling, convert to __pfn_t + kmap_atomic_pfn_t()
  2015-08-13  3:01 ` [PATCH v5 4/5] dax: fix mapping lifetime handling, convert to __pfn_t + kmap_atomic_pfn_t() Dan Williams
@ 2015-08-13  6:26   ` Boaz Harrosh
  2015-08-13 15:21     ` Dan Williams
  0 siblings, 1 reply; 21+ messages in thread
From: Boaz Harrosh @ 2015-08-13  6:26 UTC (permalink / raw)
  To: Dan Williams, linux-kernel
  Cc: axboe, riel, linux-nvdimm, linux-mm, mgorman, torvalds, hch

On 08/13/2015 06:01 AM, Dan Williams wrote:
> The primary way for non-page-backed page-frames to enter the system is
> via the pmem driver's ->direct_access() method.  The pfns returned by
> the top-level bdev_direct_access() may be passed to any other subsystem
> in the kernel, and those sub-systems either need to assume that the pfn
> is page backed (CONFIG_KMAP_PFN=n) or be prepared to handle the
> non-page-backed case (CONFIG_KMAP_PFN=y).  Currently the pfns returned
> by ->direct_access() are only ever used by vm_insert_mixed(), which does
> not care whether the pfn is mapped.  As we go to add more usages of
> these pfns, add the type-safety of __pfn_t.
> 
> This also simplifies the calling convention of ->direct_access() by not
> returning the virtual address in the same call.  This annotates cases
> where the kernel is directly accessing pmem outside the driver, and
> makes the valid lifetime of the reference explicit.  This property may
> be useful in the future for invalidating mappings to pmem, but for now
> it provides some protection against the "pmem disable vs still-in-use"
> race.
> 
> Note that axon_ram_direct_access and dcssblk_direct_access were
> previously making potentially incorrect assumptions about the addresses
> they passed to virt_to_phys().
> 
> [hch: various minor updates]
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  arch/powerpc/platforms/Kconfig |    1 +
>  arch/powerpc/sysdev/axonram.c  |   24 ++++++++----
>  drivers/block/brd.c            |    5 +--
>  drivers/nvdimm/Kconfig         |    1 +
>  drivers/nvdimm/pmem.c          |   24 +++++++-----
>  drivers/s390/block/Kconfig     |    1 +
>  drivers/s390/block/dcssblk.c   |   23 ++++++++++--
>  fs/Kconfig                     |    1 +
>  fs/block_dev.c                 |    4 +-
>  fs/dax.c                       |   79 +++++++++++++++++++++++++++++-----------
>  include/linux/blkdev.h         |    7 ++--
>  include/linux/mm.h             |   12 ++++++
>  12 files changed, 129 insertions(+), 53 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/Kconfig b/arch/powerpc/platforms/Kconfig
> index b7f9c408bf24..6b1c2f2e5fb4 100644
> --- a/arch/powerpc/platforms/Kconfig
> +++ b/arch/powerpc/platforms/Kconfig
> @@ -307,6 +307,7 @@ config CPM2
>  config AXON_RAM
>  	tristate "Axon DDR2 memory device driver"
>  	depends on PPC_IBM_CELL_BLADE && BLOCK
> +	select KMAP_PFN
>  	default m
>  	help
>  	  It registers one block device per Axon's DDR2 memory bank found
> diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
> index e8657d3bc588..7c5a1563c0fd 100644
> --- a/arch/powerpc/sysdev/axonram.c
> +++ b/arch/powerpc/sysdev/axonram.c
> @@ -43,6 +43,7 @@
>  #include <linux/types.h>
>  #include <linux/of_device.h>
>  #include <linux/of_platform.h>
> +#include <linux/kmap_pfn.h>
>  
>  #include <asm/page.h>
>  #include <asm/prom.h>
> @@ -141,14 +142,12 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
>   */
>  static long
>  axon_ram_direct_access(struct block_device *device, sector_t sector,
> -		       void **kaddr, unsigned long *pfn)
> +		       __pfn_t *pfn)
>  {
>  	struct axon_ram_bank *bank = device->bd_disk->private_data;
>  	loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;
>  
> -	*kaddr = (void *)(bank->ph_addr + offset);
> -	*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
> -
> +	*pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV);
>  	return bank->size - offset;
>  }
>  
> @@ -165,9 +164,13 @@ static int axon_ram_probe(struct platform_device *device)
>  {
>  	static int axon_ram_bank_id = -1;
>  	struct axon_ram_bank *bank;
> -	struct resource resource;
> +	struct resource *resource;
>  	int rc = 0;
>  
> +	resource = devm_kzalloc(&device->dev, sizeof(*resource), GFP_KERNEL);
> +	if (!resource)
> +		return -ENOMEM;
> +
>  	axon_ram_bank_id++;
>  
>  	dev_info(&device->dev, "Found memory controller on %s\n",
> @@ -184,13 +187,13 @@ static int axon_ram_probe(struct platform_device *device)
>  
>  	bank->device = device;
>  
> -	if (of_address_to_resource(device->dev.of_node, 0, &resource) != 0) {
> +	if (of_address_to_resource(device->dev.of_node, 0, resource) != 0) {
>  		dev_err(&device->dev, "Cannot access device tree\n");
>  		rc = -EFAULT;
>  		goto failed;
>  	}
>  
> -	bank->size = resource_size(&resource);
> +	bank->size = resource_size(resource);
>  
>  	if (bank->size == 0) {
>  		dev_err(&device->dev, "No DDR2 memory found for %s%d\n",
> @@ -202,7 +205,7 @@ static int axon_ram_probe(struct platform_device *device)
>  	dev_info(&device->dev, "Register DDR2 memory device %s%d with %luMB\n",
>  			AXON_RAM_DEVICE_NAME, axon_ram_bank_id, bank->size >> 20);
>  
> -	bank->ph_addr = resource.start;
> +	bank->ph_addr = resource->start;
>  	bank->io_addr = (unsigned long) ioremap_prot(
>  			bank->ph_addr, bank->size, _PAGE_NO_CACHE);
>  	if (bank->io_addr == 0) {
> @@ -211,6 +214,11 @@ static int axon_ram_probe(struct platform_device *device)
>  		goto failed;
>  	}
>  
> +	rc = devm_register_kmap_pfn_range(&device->dev, resource,
> +			(void *) bank->io_addr);
> +	if (rc)
> +		goto failed;
> +
>  	bank->disk = alloc_disk(AXON_RAM_MINORS_PER_DISK);
>  	if (bank->disk == NULL) {
>  		dev_err(&device->dev, "Cannot register disk\n");
> diff --git a/drivers/block/brd.c b/drivers/block/brd.c
> index 41528857c70d..6c4b21a4e915 100644
> --- a/drivers/block/brd.c
> +++ b/drivers/block/brd.c
> @@ -371,7 +371,7 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
>  
>  #ifdef CONFIG_BLK_DEV_RAM_DAX
>  static long brd_direct_access(struct block_device *bdev, sector_t sector,
> -			void **kaddr, unsigned long *pfn)
> +		__pfn_t *pfn)
>  {
>  	struct brd_device *brd = bdev->bd_disk->private_data;
>  	struct page *page;
> @@ -381,8 +381,7 @@ static long brd_direct_access(struct block_device *bdev, sector_t sector,
>  	page = brd_insert_page(brd, sector);
>  	if (!page)
>  		return -ENOSPC;
> -	*kaddr = page_address(page);
> -	*pfn = page_to_pfn(page);
> +	*pfn = page_to_pfn_t(page);
>  
>  	return PAGE_SIZE;
>  }
> diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
> index 72226acb5c0f..0d8c6bda7a41 100644
> --- a/drivers/nvdimm/Kconfig
> +++ b/drivers/nvdimm/Kconfig
> @@ -20,6 +20,7 @@ config BLK_DEV_PMEM
>  	tristate "PMEM: Persistent memory block device support"
>  	default LIBNVDIMM
>  	depends on HAS_IOMEM
> +	select KMAP_PFN
>  	select ND_BTT if BTT
>  	help
>  	  Memory ranges for PMEM are described by either an NFIT
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 5e019a6942ce..85d4101bb821 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -25,6 +25,8 @@
>  #include <linux/slab.h>
>  #include <linux/pmem.h>
>  #include <linux/nd.h>
> +#include <linux/mm.h>
> +#include <linux/kmap_pfn.h>
>  #include "nd.h"
>  
>  struct pmem_device {
> @@ -92,18 +94,12 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
>  }
>  
>  static long pmem_direct_access(struct block_device *bdev, sector_t sector,
> -			      void **kaddr, unsigned long *pfn)
> +		__pfn_t *pfn)
>  {
>  	struct pmem_device *pmem = bdev->bd_disk->private_data;
>  	size_t offset = sector << 9;
>  
> -	if (!pmem)
> -		return -ENODEV;
> -
> -	/* FIXME convert DAX to comprehend that this mapping has a lifetime */
> -	*kaddr = (void __force *) pmem->virt_addr + offset;
> -	*pfn = (pmem->phys_addr + offset) >> PAGE_SHIFT;
> -
> +	*pfn = phys_to_pfn_t(pmem->phys_addr + offset, PFN_DEV);
>  	return pmem->size - offset;
>  }
>  
> @@ -149,10 +145,17 @@ static void pmem_detach_disk(struct pmem_device *pmem)
>  	blk_cleanup_queue(pmem->pmem_queue);
>  }
>  
> -static int pmem_attach_disk(struct nd_namespace_common *ndns,
> +static int pmem_attach_disk(struct device *dev,
> +		struct nd_namespace_common *ndns,
>  		struct pmem_device *pmem)
>  {
>  	struct gendisk *disk;
> +	struct resource *res = &(to_nd_namespace_io(&ndns->dev)->res);
> +	int err;
> +
> +	err = devm_register_kmap_pfn_range(dev, res, pmem->virt_addr);
> +	if (err)
> +		return err;
>  
>  	pmem->pmem_queue = blk_alloc_queue(GFP_KERNEL);
>  	if (!pmem->pmem_queue)
> @@ -232,7 +235,8 @@ static int nd_pmem_probe(struct device *dev)
>  	if (nd_btt_probe(ndns, pmem) == 0)
>  		/* we'll come back as btt-pmem */
>  		return -ENXIO;
> -	return pmem_attach_disk(ndns, pmem);
> +
> +	return pmem_attach_disk(dev, ndns, pmem);
>  }
>  
>  static int nd_pmem_remove(struct device *dev)
> diff --git a/drivers/s390/block/Kconfig b/drivers/s390/block/Kconfig
> index 4a3b62326183..06c7a1c90d88 100644
> --- a/drivers/s390/block/Kconfig
> +++ b/drivers/s390/block/Kconfig
> @@ -14,6 +14,7 @@ config BLK_DEV_XPRAM
>  
>  config DCSSBLK
>  	def_tristate m
> +	select KMAP_PFN
>  	prompt "DCSSBLK support"
>  	depends on S390 && BLOCK
>  	help
> diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
> index 2f1734ba0e22..42f1546d7b03 100644
> --- a/drivers/s390/block/dcssblk.c
> +++ b/drivers/s390/block/dcssblk.c
> @@ -16,6 +16,7 @@
>  #include <linux/blkdev.h>
>  #include <linux/completion.h>
>  #include <linux/interrupt.h>
> +#include <linux/kmap_pfn.h>
>  #include <linux/platform_device.h>
>  #include <asm/extmem.h>
>  #include <asm/io.h>
> @@ -29,7 +30,7 @@ static int dcssblk_open(struct block_device *bdev, fmode_t mode);
>  static void dcssblk_release(struct gendisk *disk, fmode_t mode);
>  static void dcssblk_make_request(struct request_queue *q, struct bio *bio);
>  static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
> -				 void **kaddr, unsigned long *pfn);
> +		__pfn_t *pfn);
>  
>  static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";
>  
> @@ -520,12 +521,18 @@ static const struct attribute_group *dcssblk_dev_attr_groups[] = {
>  static ssize_t
>  dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count)
>  {
> +	struct resource *res = devm_kzalloc(dev, sizeof(*res), GFP_KERNEL);
>  	int rc, i, j, num_of_segments;
>  	struct dcssblk_dev_info *dev_info;
>  	struct segment_info *seg_info, *temp;
>  	char *local_buf;
>  	unsigned long seg_byte_size;
>  
> +	if (!res) {
> +		rc = -ENOMEM;
> +		goto out_nobuf;
> +	}
> +
>  	dev_info = NULL;
>  	seg_info = NULL;
>  	if (dev != dcssblk_root_dev) {
> @@ -652,6 +659,13 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char
>  	if (rc)
>  		goto put_dev;
>  
> +	res->start = dev_info->start;
> +	res->end = dev_info->end - 1;
> +	rc = devm_register_kmap_pfn_range(&dev_info->dev, res,
> +			(void *) dev_info->start);
> +	if (rc)
> +		goto put_dev;
> +
>  	get_device(&dev_info->dev);
>  	add_disk(dev_info->gd);
>  
> @@ -699,6 +713,8 @@ seg_list_del:
>  out:
>  	kfree(local_buf);
>  out_nobuf:
> +	if (res)
> +		devm_kfree(dev, res);
>  	return rc;
>  }
>  
> @@ -879,7 +895,7 @@ fail:
>  
>  static long
>  dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
> -			void **kaddr, unsigned long *pfn)
> +		__pfn_t *pfn)
>  {
>  	struct dcssblk_dev_info *dev_info;
>  	unsigned long offset, dev_sz;
> @@ -889,8 +905,7 @@ dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
>  		return -ENODEV;
>  	dev_sz = dev_info->end - dev_info->start;
>  	offset = secnum * 512;
> -	*kaddr = (void *) (dev_info->start + offset);
> -	*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
> +	*pfn = phys_to_pfn_t(dev_info->start + offset, PFN_DEV);
>  
>  	return dev_sz - offset;
>  }
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 011f43365d7b..bd37234e71a8 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -38,6 +38,7 @@ config FS_DAX
>  	bool "Direct Access (DAX) support"
>  	depends on MMU
>  	depends on !(ARM || MIPS || SPARC)
> +	depends on KMAP_PFN
>  	help
>  	  Direct Access (DAX) can be used on memory-backed block devices.
>  	  If the block device supports DAX and the filesystem supports DAX,
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 3a8ac7edfbf4..73fbc57b6e6d 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -441,7 +441,7 @@ EXPORT_SYMBOL_GPL(bdev_write_page);
>   * accessible at this address.
>   */
>  long bdev_direct_access(struct block_device *bdev, sector_t sector,
> -			void **addr, unsigned long *pfn, long size)
> +			__pfn_t *pfn, long size)
>  {
>  	long avail;
>  	const struct block_device_operations *ops = bdev->bd_disk->fops;
> @@ -462,7 +462,7 @@ long bdev_direct_access(struct block_device *bdev, sector_t sector,
>  	sector += get_start_sect(bdev);
>  	if (sector % (PAGE_SIZE / 512))
>  		return -EINVAL;
> -	avail = ops->direct_access(bdev, sector, addr, pfn);
> +	avail = ops->direct_access(bdev, sector, pfn);
>  	if (!avail)
>  		return -ERANGE;
>  	return min(avail, size);
> diff --git a/fs/dax.c b/fs/dax.c
> index c3e21ccfc358..94611f480091 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -26,6 +26,7 @@
>  #include <linux/sched.h>
>  #include <linux/uio.h>
>  #include <linux/vmstat.h>
> +#include <linux/kmap_pfn.h>
>  
>  int dax_clear_blocks(struct inode *inode, sector_t block, long size)
>  {
> @@ -35,13 +36,16 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
>  	might_sleep();
>  	do {
>  		void *addr;
> -		unsigned long pfn;
> +		__pfn_t pfn;
>  		long count;
>  
> -		count = bdev_direct_access(bdev, sector, &addr, &pfn, size);
> +		count = bdev_direct_access(bdev, sector, &pfn, size);
>  		if (count < 0)
>  			return count;
>  		BUG_ON(size < count);
> +		addr = kmap_atomic_pfn_t(pfn);
> +		if (!addr)
> +			return -EIO;
>  		while (count > 0) {
>  			unsigned pgsz = PAGE_SIZE - offset_in_page(addr);
>  			if (pgsz > count)
> @@ -57,17 +61,39 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
>  			sector += pgsz / 512;
>  			cond_resched();
>  		}
> +		kunmap_atomic_pfn_t(addr);
>  	} while (size);
>  
>  	return 0;
>  }
>  EXPORT_SYMBOL_GPL(dax_clear_blocks);
>  
> -static long dax_get_addr(struct buffer_head *bh, void **addr, unsigned blkbits)
> +static void *__dax_map_bh(struct buffer_head *bh, unsigned blkbits, __pfn_t *pfn)
>  {
> -	unsigned long pfn;
>  	sector_t sector = bh->b_blocknr << (blkbits - 9);
> -	return bdev_direct_access(bh->b_bdev, sector, addr, &pfn, bh->b_size);
> +	void *addr;
> +	long rc;
> +
> +	rc = bdev_direct_access(bh->b_bdev, sector, pfn, bh->b_size);
> +	if (rc < 0)
> +		return ERR_PTR(rc);
> +	addr = kmap_atomic_pfn_t(*pfn);
> +	if (!addr)
> +		return ERR_PTR(-EIO);
> +	return addr;
> +}
> +
> +static void *dax_map_bh(struct buffer_head *bh, unsigned blkbits)
> +{
> +	__pfn_t pfn;
> +
> +	return __dax_map_bh(bh, blkbits, &pfn);
> +}
> +
> +static void dax_unmap_bh(void *addr)
> +{
> +	if (!IS_ERR(addr))
> +		kunmap_atomic_pfn_t(addr);
>  }
>  
>  static void dax_new_buf(void *addr, unsigned size, unsigned first, loff_t pos,
> @@ -106,7 +132,7 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
>  	loff_t pos = start;
>  	loff_t max = start;
>  	loff_t bh_max = start;
> -	void *addr;
> +	void *addr = NULL, *kmap = ERR_PTR(-EIO);
>  	bool hole = false;
>  
>  	if (iov_iter_rw(iter) != WRITE)
> @@ -142,9 +168,13 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
>  				addr = NULL;
>  				size = bh->b_size - first;
>  			} else {
> -				retval = dax_get_addr(bh, &addr, blkbits);
> -				if (retval < 0)
> +				dax_unmap_bh(kmap);
> +				kmap = dax_map_bh(bh, blkbits);
> +				if (IS_ERR(kmap)) {
> +					retval = PTR_ERR(kmap);
>  					break;
> +				}
> +				addr = kmap;
>  				if (buffer_unwritten(bh) || buffer_new(bh))
>  					dax_new_buf(addr, retval, first, pos,
>  									end);
> @@ -168,6 +198,8 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
>  		addr += len;
>  	}
>  
> +	dax_unmap_bh(kmap);
> +
>  	return (pos == start) ? retval : pos - start;
>  }
>  
> @@ -261,11 +293,14 @@ static int copy_user_bh(struct page *to, struct buffer_head *bh,
>  			unsigned blkbits, unsigned long vaddr)
>  {
>  	void *vfrom, *vto;
> -	if (dax_get_addr(bh, &vfrom, blkbits) < 0)
> -		return -EIO;
> +
> +	vfrom = dax_map_bh(bh, blkbits);
> +	if (IS_ERR(vfrom))
> +		return PTR_ERR(vfrom);
>  	vto = kmap_atomic(to);
>  	copy_user_page(vto, vfrom, vaddr, to);
>  	kunmap_atomic(vto);
> +	dax_unmap_bh(vfrom);
>  	return 0;
>  }
>  
> @@ -273,11 +308,10 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
>  			struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
>  	struct address_space *mapping = inode->i_mapping;
> -	sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
>  	unsigned long vaddr = (unsigned long)vmf->virtual_address;
> -	void *addr;
> -	unsigned long pfn;
>  	pgoff_t size;
> +	__pfn_t pfn;
> +	void *addr;
>  	int error;
>  
>  	i_mmap_lock_read(mapping);
> @@ -295,18 +329,17 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
>  		goto out;
>  	}
>  
> -	error = bdev_direct_access(bh->b_bdev, sector, &addr, &pfn, bh->b_size);
> -	if (error < 0)
> -		goto out;
> -	if (error < PAGE_SIZE) {
> -		error = -EIO;
> +	addr = __dax_map_bh(bh, inode->i_blkbits, &pfn);
> +	if (IS_ERR(addr)) {
> +		error = PTR_ERR(addr);
>  		goto out;
>  	}
>  
>  	if (buffer_unwritten(bh) || buffer_new(bh))
>  		clear_page(addr);
> +	dax_unmap_bh(addr);
>  

Boooo. This whole set is a joke. The whole "pmem disable vs still-in-use" argument is moot
here: below you have inserted a live, used-forever pfn into a process vm without holding
a map.

The whole "pmem disable vs still-in-use" thing is a joke. The loaded FS has a reference on
the bdev and the filehandle has a reference on the FS. So what exactly is this "pmem disable"
you are talking about?

And for god's sake: I have a bdev, I call bdev_direct_access(sector), and the bdev calculated
the exact address for me (base + sector). Now I get back this __pfn_t and I need to call
kmap_atomic_pfn_t(), which does a loop to search for my range and again does base+offset?

This whole model is broken, sorry.

> -	error = vm_insert_mixed(vma, vaddr, pfn);
> +	error = vm_insert_mixed(vma, vaddr, __pfn_t_to_pfn(pfn));
>  
>   out:
>  	i_mmap_unlock_read(mapping);
> @@ -539,10 +572,12 @@ int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
>  		return err;
>  	if (buffer_written(&bh)) {
>  		void *addr;
> -		err = dax_get_addr(&bh, &addr, inode->i_blkbits);
> -		if (err < 0)
> -			return err;
> +
> +		addr = dax_map_bh(&bh, inode->i_blkbits);
> +		if (IS_ERR(addr))
> +			return PTR_ERR(addr);
>  		memset(addr + offset, 0, length);
> +		dax_unmap_bh(addr);
>  	}
>  
>  	return 0;
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index ff47d5498133..ae59778d8076 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1555,8 +1555,7 @@ struct block_device_operations {
>  	int (*rw_page)(struct block_device *, sector_t, struct page *, int rw);
>  	int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
>  	int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
> -	long (*direct_access)(struct block_device *, sector_t,
> -					void **, unsigned long *pfn);
> +	long (*direct_access)(struct block_device *, sector_t, __pfn_t *pfn);
>  	unsigned int (*check_events) (struct gendisk *disk,
>  				      unsigned int clearing);
>  	/* ->media_changed() is DEPRECATED, use ->check_events() instead */
> @@ -1574,8 +1573,8 @@ extern int __blkdev_driver_ioctl(struct block_device *, fmode_t, unsigned int,
>  extern int bdev_read_page(struct block_device *, sector_t, struct page *);
>  extern int bdev_write_page(struct block_device *, sector_t, struct page *,
>  						struct writeback_control *);
> -extern long bdev_direct_access(struct block_device *, sector_t, void **addr,
> -						unsigned long *pfn, long size);
> +extern long bdev_direct_access(struct block_device *, sector_t,
> +		__pfn_t *pfn, long size);
>  #else /* CONFIG_BLOCK */
>  
>  struct block_device;
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 57ba5ca6be72..c4683ea2fcab 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -951,6 +951,18 @@ enum {
>  #endif
>  };
>  
> +static inline __pfn_t pfn_to_pfn_t(unsigned long pfn, unsigned long flags)
> +{
> +	__pfn_t pfn_t = { .val = (pfn << PAGE_SHIFT) | (flags & PFN_MASK), };
> +
> +	return pfn_t;
> +}
> +
> +static inline __pfn_t phys_to_pfn_t(dma_addr_t addr, unsigned long flags)
> +{
> +	return pfn_to_pfn_t(addr >> PAGE_SHIFT, flags);
> +}
> +
>  static inline bool __pfn_t_has_page(__pfn_t pfn)
>  {
>  	return (pfn.val & PFN_DEV) == 0;
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 2/5] allow mapping page-less memremaped areas into KVA
  2015-08-13  5:58   ` Boaz Harrosh
@ 2015-08-13 12:57     ` Dan Williams
  2015-08-13 13:23       ` Boaz Harrosh
  2015-08-13 14:37     ` Christoph Hellwig
  1 sibling, 1 reply; 21+ messages in thread
From: Dan Williams @ 2015-08-13 12:57 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: linux-kernel, Jens Axboe, Rik van Riel,
	linux-nvdimm@lists.01.org, Linux MM, Mel Gorman, torvalds,
	Christoph Hellwig

On Wed, Aug 12, 2015 at 10:58 PM, Boaz Harrosh <boaz@plexistor.com> wrote:
> On 08/13/2015 06:01 AM, Dan Williams wrote:
[..]
>> +void *kmap_atomic_pfn_t(__pfn_t pfn)
>> +{
>> +     struct page *page = __pfn_t_to_page(pfn);
>> +     resource_size_t addr;
>> +     struct kmap *kmap;
>> +
>> +     rcu_read_lock();
>> +     if (page)
>> +             return kmap_atomic(page);
>
> Right even with pages I pay rcu_read_lock(); for every access?
>
>> +     addr = __pfn_t_to_phys(pfn);
>> +     list_for_each_entry_rcu(kmap, &ranges, list)
>> +             if (addr >= kmap->res->start && addr <= kmap->res->end)
>> +                     return kmap->base + addr - kmap->res->start;
>> +
>
> Good god! This loop is a real *joke*. You have just dropped memory access
> performance by 10 fold.
>
> The all point of pages and memory_model.h was to have a one to one
> relation-ships between Kernel-virtual vs physical vs page *
>
> There is already an object that holds a relationship of physical
> to Kernel-virtual. It is called a memory-section. Why not just
> widen its definition?
>
> If you are willing to accept this loop. In current Linux 2015 Kernel
> Then I have nothing farther to say.
>
> Boaz - go mourning for the death of the Linux Kernel alone in the corner ;-(
>

This is explicitly addressed in the changelog, repeated here:

> The __pfn_t to resource lookup is indeed an inefficient walk of a linked
> list, but there are two mitigating factors:
>
> 1/ The number of persistent memory ranges is bounded by the number of
>    DIMMs, which is on the order of tens of DIMMs, not hundreds.
>
> 2/ The lookup yields the entire range; if it becomes inefficient to do a
>    kmap_atomic_pfn_t() a PAGE_SIZE at a time, the caller can take
>    advantage of the fact that the lookup can be amortized for all kmap
>    operations it needs to perform in a given range.

DAX as-is races against pmem unbind.  A synchronization cost must be
paid somewhere to make sure the memremap() mapping is still valid.
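
To make factor 2/ concrete, a caller touching a whole extent can do the lookup
once, along the lines of the sketch below.  This is illustrative only; it
assumes the length returned by bdev_direct_access() guarantees that 'len' stays
inside a single registered range and that the returned mapping is page aligned:

static int zero_pmem_extent(__pfn_t pfn, size_t len)
{
	/* one range lookup for the whole extent... */
	void *base = kmap_atomic_pfn_t(pfn);
	size_t off;

	if (!base)
		return -EIO;
	/* ...then plain pointer arithmetic for every page in it */
	for (off = 0; off < len; off += PAGE_SIZE)
		clear_page(base + off);
	kunmap_atomic_pfn_t(base);
	return 0;
}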

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 2/5] allow mapping page-less memremaped areas into KVA
  2015-08-13 12:57     ` Dan Williams
@ 2015-08-13 13:23       ` Boaz Harrosh
  2015-08-13 14:41         ` Christoph Hellwig
  0 siblings, 1 reply; 21+ messages in thread
From: Boaz Harrosh @ 2015-08-13 13:23 UTC (permalink / raw)
  To: Dan Williams, Boaz Harrosh
  Cc: linux-kernel, Jens Axboe, Rik van Riel,
	linux-nvdimm@lists.01.org, Linux MM, Mel Gorman, torvalds,
	Christoph Hellwig

On 08/13/2015 03:57 PM, Dan Williams wrote:
<>
> This is explicitly addressed in the changelog, repeated here:
> 
>> The __pfn_t to resource lookup is indeed inefficient walking of a linked list,
>> but there are two mitigating factors:
>>
>> 1/ The number of persistent memory ranges is bounded by the number of
>>    DIMMs which is on the order of 10s of DIMMs, not hundreds.
>>

You do not get where I'm coming from. It used to be a [ptr - ONE_BASE + OTHER_BASE]
(in 64-bit); it is now a call and a loop and a search. However you look at
it, it is *not* the instantaneous address translation it is today.

I have memory, I want memory speeds. You keep thinking HD speeds, where whatever
you do will not matter.

>> 2/ The lookup yields the entire range, if it becomes inefficient to do a
>>    kmap_atomic_pfn_t() a PAGE_SIZE at a time the caller can take
>>    advantage of the fact that the lookup can be amortized for all kmap
>>    operations it needs to perform in a given range.
> 

What "given range" how can a bdev assume that the all sg-list belongs to the
same "range". In fact our code does multple-pmem devices for a long time.
What about say md-of-pmems for example, or btrfs

> DAX as is is races against pmem unbind.   A synchronization cost must
> be paid somewhere to make sure the memremap() mapping is still valid.

Sorry for being so slow, but that is what I asked: what exactly is "pmem unbind"?

Currently in my 4.1 kernel the ioremap is done at modprobe time and
released at modprobe --remove time. The --remove cannot happen with a mounted
FS, dax or not. So what exactly is "pmem unbind"? And if there is a new knob,
then make it refuse with a raised refcount.

Cheers
Boaz



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 2/5] allow mapping page-less memremaped areas into KVA
  2015-08-13  5:58   ` Boaz Harrosh
  2015-08-13 12:57     ` Dan Williams
@ 2015-08-13 14:37     ` Christoph Hellwig
  2015-08-13 14:48       ` Boaz Harrosh
  1 sibling, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2015-08-13 14:37 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Dan Williams, linux-kernel, axboe, riel, linux-nvdimm, linux-mm,
	mgorman, torvalds

Hi Boaz,

can you please fix your quoting?  I read down about 10 pages but still
couldn't find a comment from you.  For now I gave up on this mail.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 2/5] allow mapping page-less memremaped areas into KVA
  2015-08-13 13:23       ` Boaz Harrosh
@ 2015-08-13 14:41         ` Christoph Hellwig
  2015-08-13 15:01           ` Boaz Harrosh
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2015-08-13 14:41 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Dan Williams, linux-kernel, Jens Axboe, Rik van Riel,
	linux-nvdimm@lists.01.org, Linux MM, Mel Gorman, torvalds,
	Christoph Hellwig

On Thu, Aug 13, 2015 at 04:23:38PM +0300, Boaz Harrosh wrote:
> > DAX as is is races against pmem unbind.   A synchronization cost must
> > be paid somewhere to make sure the memremap() mapping is still valid.
> 
> Sorry for being so slow, is what I asked. what is exactly "pmem unbind" ?
> 
> Currently in my 4.1 Kernel the ioremap is done on modprobe time and
> released modprobe --remove time. the --remove can not happen with a mounted
> FS dax or not. So what is exactly "pmem unbind". And if there is a new knob
> then make it refuse with a raised refcount.

Surprise removal of a PCIe card which is mapped to provide non-volatile
memory for example.  Or memory hot swap.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 2/5] allow mapping page-less memremaped areas into KVA
  2015-08-13 14:37     ` Christoph Hellwig
@ 2015-08-13 14:48       ` Boaz Harrosh
  2015-08-13 15:29         ` Boaz Harrosh
  2015-08-13 17:37         ` Dave Hansen
  0 siblings, 2 replies; 21+ messages in thread
From: Boaz Harrosh @ 2015-08-13 14:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dan Williams, linux-kernel, axboe, riel, linux-nvdimm, linux-mm,
	mgorman, torvalds

On 08/13/2015 05:37 PM, Christoph Hellwig wrote:
> Hi Boaz,
> 
> can you please fix your quoting?  I read down about 10 pages but still
> couldn't find a comment from you.  For now I gave up on this mail.
> 

Sorry here:

> +void *kmap_atomic_pfn_t(__pfn_t pfn)
> +{
> +	struct page *page = __pfn_t_to_page(pfn);
> +	resource_size_t addr;
> +	struct kmap *kmap;
> +
> +	rcu_read_lock();
> +	if (page)
> +		return kmap_atomic(page);

Right, so even with pages I pay rcu_read_lock() for every access?

> +	addr = __pfn_t_to_phys(pfn);
> +	list_for_each_entry_rcu(kmap, &ranges, list)
> +		if (addr >= kmap->res->start && addr <= kmap->res->end)
> +			return kmap->base + addr - kmap->res->start;
> +

Good god! This loop is a real *joke*. You have just dropped memory access
performance ten-fold.

The whole point of pages and memory_model.h was to have a one-to-one
relationship between kernel-virtual vs physical vs page *.

There is already an object that holds a relationship of physical
to kernel-virtual. It is called a memory-section. Why not just
widen its definition?

If you are willing to accept this loop in a current 2015 Linux kernel,
then I have nothing further to say.

Boaz - off to mourn the death of the Linux Kernel alone in the corner ;-(

> +	/* only unlock in the error case */
> +	rcu_read_unlock();
> +	return NULL;
> +}
> +EXPORT_SYMBOL(kmap_atomic_pfn_t);
> +


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 2/5] allow mapping page-less memremaped areas into KVA
  2015-08-13 14:41         ` Christoph Hellwig
@ 2015-08-13 15:01           ` Boaz Harrosh
  0 siblings, 0 replies; 21+ messages in thread
From: Boaz Harrosh @ 2015-08-13 15:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dan Williams, linux-kernel, Jens Axboe, Rik van Riel,
	linux-nvdimm@lists.01.org, Linux MM, Mel Gorman, torvalds

On 08/13/2015 05:41 PM, Christoph Hellwig wrote:
> On Thu, Aug 13, 2015 at 04:23:38PM +0300, Boaz Harrosh wrote:
>>> DAX as is is races against pmem unbind.   A synchronization cost must
>>> be paid somewhere to make sure the memremap() mapping is still valid.
>>
>> Sorry for being so slow, is what I asked. what is exactly "pmem unbind" ?
>>
>> Currently in my 4.1 Kernel the ioremap is done on modprobe time and
>> released modprobe --remove time. the --remove can not happen with a mounted
>> FS dax or not. So what is exactly "pmem unbind". And if there is a new knob
>> then make it refuse with a raised refcount.
> 
> Surprise removal of a PCIe card which is mapped to provide non-volatile
> memory for example.  Or memory hot swap.
> 

Then the mapping is just there and you get garbage. Just the same as
"memory hot swap": the kernel will not let you HOT-REMOVE referenced
memory. It will just refuse. If you forcefully remove a hot-swappable memory
chip without HOT-REMOVE first, what will happen? So it is the same here.

SW-wise you refuse to HOT-REMOVE. HW-wise, BTW, the kernel will not die;
further reads will return all 1s and writes will go to the
ether.

The whole kmap thing was for highmem. That is not the case here.

Again, see my other comment at dax mmap:

- you do a pfn_map and take a pfn
- kpfn_unmap
- put the pfn into a user mmap vma
- then what happens on user access after that? Nothing, not even a page_fault.
  It will have a vm-mapping to a now non-existent physical address, that's
  it.

Thanks
Boaz


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 4/5] dax: fix mapping lifetime handling, convert to __pfn_t + kmap_atomic_pfn_t()
  2015-08-13  6:26   ` Boaz Harrosh
@ 2015-08-13 15:21     ` Dan Williams
  2015-08-13 16:34       ` Boaz Harrosh
  0 siblings, 1 reply; 21+ messages in thread
From: Dan Williams @ 2015-08-13 15:21 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: linux-kernel, Jens Axboe, Rik van Riel,
	linux-nvdimm@lists.01.org, Linux MM, Mel Gorman, torvalds,
	Christoph Hellwig

On Wed, Aug 12, 2015 at 11:26 PM, Boaz Harrosh <boaz@plexistor.com> wrote:
> Boooo. Here this all set is a joke. The all "pmem disable vs still-in-use" argument is mute
> here below you have inserted a live, used for ever, pfn into a process vm without holding
> a map.

Careful, don't confuse "unbind" with "unplug".  "Unbind" invalidates
the driver's mapping (ioremap) while "unplug" would invalidate the
pfn.  DAX is indeed broken with respect to unplug and we'll need to go
solve that separately.  I expect "unplug" support will be needed for
hot provisioning pmem to/from virtual machines.

> The all "pmem disable vs still-in-use" is a joke. The FS loaded has a reference on the bdev
> and the filehadle has a reference on the FS. So what is exactly this "pmem disable" you are
> talking about?

Hmm, that's not the same block layer I've been working with for the
past several years:

$ mount /dev/pmem0 /mnt
$ echo namespace0.0 > ../drivers/nd_pmem/unbind # succeeds

Unbind always proceeds unconditionally.  See the recent kernel summit
topic discussion around devm vs unbind [1].  While kmap_atomic_pfn_t()
does not implement revoke semantics, it at least forces re-validation
and time-bounded references.  For the unplug case we'll need to go
shoot down those DAX mappings in userspace so that they return SIGBUS
on access, or something along those lines.

[1]: http://www.spinics.net/lists/kernel/msg2032864.html

> And for god sake. I have a bdev I call bdev_direct_access(sector), the bdev calculated the
> exact address for me (base + sector). Now I get back this __pfn_t and I need to call
> kmap_atomic_pfn_t() which does a loop to search for my range and again base+offset ?
>
> This all model is broken, sorry?

I think you are confused about the lifetime of the userspace DAX
mapping vs the kernel's mapping and the frequency of calls to
kmap_atomic_pfn_t().  I'm sure you can make this loop look bad with a
micro-benchmark, but the whole point of DAX is to get the kernel out
of the I/O path, so I'm not sure this overhead shows up in any real
way in practice.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 2/5] allow mapping page-less memremaped areas into KVA
  2015-08-13 14:48       ` Boaz Harrosh
@ 2015-08-13 15:29         ` Boaz Harrosh
  2015-08-13 17:37         ` Dave Hansen
  1 sibling, 0 replies; 21+ messages in thread
From: Boaz Harrosh @ 2015-08-13 15:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dan Williams, linux-kernel, axboe, riel, linux-nvdimm, linux-mm,
	mgorman, torvalds

On 08/13/2015 05:48 PM, Boaz Harrosh wrote:
<>
> There is already an object that holds a relationship of physical
> to Kernel-virtual. It is called a memory-section. Why not just
> widen its definition?
> 

BTW: Regarding the "widen its definition"

I was thinking of two possible new models here:
[1 - A page-less memory section]
- Keep the 64-bit physical-to-kernel_virtual hard-coded relationship
- Allocate a section object, but this section object does not have any
  pages, only the header. (You need it for the pmd/pmt thing)

  Lots of things just work now if you make sure you do not go through
  a page struct. This needs no extra work; I have done this in the past.
  All you need is to do your ioremap through map_kernel_range_noflush(__va(), ....)

[2 - A small page-struct]

- Like above, but each entry in the new section object is a small, one-ulong-sized
  entry holding just flags.

 Then if !(p->flags & PAGE_SPECIAL) page = container_of(p, struct page, flags)

 This model is good because you actually have your pfn_to_page and page_to_pfn
 and need not touch the sg-list or bio. But only 8 bytes per frame instead of 64 bytes.
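
A sketch of what that lookup could look like (entirely hypothetical: PAGE_SPECIAL
does not exist today, and this relies on 'flags' being the first member of struct
page so that container_of() can recover the full descriptor for ordinary memory):

static inline struct page *special_flags_to_page(unsigned long *p)
{
	if (*p & PAGE_SPECIAL)
		return NULL;	/* device frame: only the 8-byte entry exists */
	/* ordinary memory: 'p' is &page->flags, the first member */
	return container_of(p, struct page, flags);
}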


But I still think that the best long-term model is variable-size pages,
where a page* can be 2M or 1G. Again, an extra flag and a widened section definition.
It is about time we moved to bigger pages throughout, but still kept the 4k
page-cache-dirty granularity.

Thanks
Boaz


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 4/5] dax: fix mapping lifetime handling, convert to __pfn_t + kmap_atomic_pfn_t()
  2015-08-13 15:21     ` Dan Williams
@ 2015-08-13 16:34       ` Boaz Harrosh
  2015-08-13 18:51         ` Dan Williams
  0 siblings, 1 reply; 21+ messages in thread
From: Boaz Harrosh @ 2015-08-13 16:34 UTC (permalink / raw)
  To: Dan Williams, Boaz Harrosh
  Cc: linux-kernel, Jens Axboe, Rik van Riel,
	linux-nvdimm@lists.01.org, Linux MM, Mel Gorman, torvalds,
	Christoph Hellwig

On 08/13/2015 06:21 PM, Dan Williams wrote:
> On Wed, Aug 12, 2015 at 11:26 PM, Boaz Harrosh <boaz@plexistor.com> wrote:
<>
> 
> Hmm, that's not the same block layer I've been working with for the
> past several years:
> 
> $ mount /dev/pmem0 /mnt
> $ echo namespace0.0 > ../drivers/nd_pmem/unbind # succeeds
> 
> Unbind always proceeds unconditionally.  See the recent kernel summit
> topic discussion around devm vs unbind [1].  While kmap_atomic_pfn_t()
> does not implement revoke semantics it at least forces re-validation
> and time bounded references.  For the unplug case we'll need to go
> shootdown those DAX mappings in userspace so that they return SIGBUS
> on access, or something along those lines.
> 

Then fix unbind to refuse. What is the point of unbind when it trashes
the hot path so badly and makes the code so fat? Who uses it, and what for?

This is the first I have ever heard of it, and I do use Linux a little bit.

> [1]: http://www.spinics.net/lists/kernel/msg2032864.html
> 
Hm...

OK, I hate it. I would just make sure to override and refuse unbinding with an
elevated ref count. It is not a good reason for me to trash the hotpath.

>> And for god sake. I have a bdev I call bdev_direct_access(sector), the bdev calculated the
>> exact address for me (base + sector). Now I get back this __pfn_t and I need to call
>> kmap_atomic_pfn_t() which does a loop to search for my range and again base+offset ?
>>
>> This all model is broken, sorry?
> 
> I think you are confused about the lifetime of the userspace DAX
> mapping vs the kernel's mapping and the frequency of calls to
> kmap_atomic_pfn_t().  I'm sure you can make this loop look bad with a
> micro-benchmark, but the whole point of DAX is to get the kernel out
> of the I/O path, so I'm not sure this overhead shows up in any real
> way in practice.

Sigh! It does, very much. Take 4k random writes: they will drop in half
if I do this. We've been testing with memory for a long time; every
rcu lock counts. A single atomic will drop things by 20%.

Thanks
Boaz


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 2/5] allow mapping page-less memremaped areas into KVA
  2015-08-13  3:01 ` [PATCH v5 2/5] allow mapping page-less memremaped areas into KVA Dan Williams
  2015-08-13  5:58   ` Boaz Harrosh
@ 2015-08-13 17:35   ` Matthew Wilcox
  2015-08-13 18:15     ` Dan Williams
  1 sibling, 1 reply; 21+ messages in thread
From: Matthew Wilcox @ 2015-08-13 17:35 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-kernel, axboe, riel, linux-nvdimm, linux-mm, mgorman,
	torvalds, hch

On Wed, Aug 12, 2015 at 11:01:09PM -0400, Dan Williams wrote:
> +static inline __pfn_t page_to_pfn_t(struct page *page)
> +{
> +	__pfn_t pfn = { .val = page_to_pfn(page) << PAGE_SHIFT, };
> +
> +	return pfn;
> +}

static inline __pfn_t page_to_pfn_t(struct page *page)
{
	__pfn_t __pfn;
	unsigned long pfn = page_to_pfn(page);
	BUG_ON(pfn > (-1UL >> PFN_SHIFT));
	__pfn.val = pfn << PFN_SHIFT;

	return __pfn;
}

I have a problem with PFN_SHIFT being equal to PAGE_SHIFT.  Consider a
32-bit kernel; you're asserting that no memory represented by a struct
page can have a physical address above 4GB.

You only need three bits for flags so far ... how about making PFN_SHIFT
be 6?  That supports physical addresses up to 2^38 (256GB).  That should
be enough, but hardware designers have done some strange things in the
past (I know that HP made PA-RISC hardware that can run 32-bit kernels
with memory between 64GB and 68GB, and they can't be the only strange
hardware people out there).
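
For reference, with a separate PFN_SHIFT of 6 the encoding from patch 2/5 might
become something like the sketch below (flag names as in the series; only the
shift is decoupled from PAGE_SHIFT):

enum {
	PFN_SHIFT = 6,
	PFN_MASK = (1UL << PFN_SHIFT) - 1,
	PFN_SG_CHAIN = (1UL << 0),
	PFN_SG_LAST = (1UL << 1),
	PFN_DEV = (1UL << 2),
};

static inline __pfn_t pfn_to_pfn_t(unsigned long pfn, unsigned long flags)
{
	__pfn_t pfn_t = { .val = (pfn << PFN_SHIFT) | (flags & PFN_MASK), };

	return pfn_t;
}

static inline unsigned long __pfn_t_to_pfn(__pfn_t pfn)
{
	return pfn.val >> PFN_SHIFT;
}

/* a 32-bit val then covers pfns up to 2^26, i.e. physical addresses up to 2^38 (256GB) */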


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 2/5] allow mapping page-less memremaped areas into KVA
  2015-08-13 14:48       ` Boaz Harrosh
  2015-08-13 15:29         ` Boaz Harrosh
@ 2015-08-13 17:37         ` Dave Hansen
  1 sibling, 0 replies; 21+ messages in thread
From: Dave Hansen @ 2015-08-13 17:37 UTC (permalink / raw)
  To: Boaz Harrosh, Christoph Hellwig
  Cc: Dan Williams, linux-kernel, axboe, riel, linux-nvdimm, linux-mm,
	mgorman, torvalds

On 08/13/2015 07:48 AM, Boaz Harrosh wrote:
> There is already an object that holds a relationship of physical
> to Kernel-virtual. It is called a memory-section. Why not just
> widen its definition?

Memory sections are purely there to map physical address ranges back to
metadata about them.  *Originally* for 'struct page', but widened a bit
subsequently.  But, it's *never* been connected to kernel-virtual
addresses in any way that I can think of.

So, that's a curious statement.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 2/5] allow mapping page-less memremaped areas into KVA
  2015-08-13 17:35   ` Matthew Wilcox
@ 2015-08-13 18:15     ` Dan Williams
  0 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2015-08-13 18:15 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-kernel, Jens Axboe, Rik van Riel,
	linux-nvdimm@lists.01.org, Linux MM, Mel Gorman, torvalds,
	Christoph Hellwig

On Thu, Aug 13, 2015 at 10:35 AM, Matthew Wilcox <willy@linux.intel.com> wrote:
> On Wed, Aug 12, 2015 at 11:01:09PM -0400, Dan Williams wrote:
>> +static inline __pfn_t page_to_pfn_t(struct page *page)
>> +{
>> +     __pfn_t pfn = { .val = page_to_pfn(page) << PAGE_SHIFT, };
>> +
>> +     return pfn;
>> +}
>
> static inline __pfn_t page_to_pfn_t(struct page *page)
> {
>         __pfn_t __pfn;
>         unsigned long pfn = page_to_pfn(page);
>         BUG_ON(pfn > (-1UL >> PFN_SHIFT))
>         __pfn.val = pfn << PFN_SHIFT;
>
>         return __pfn;
> }
>
> I have a problem wih PFN_SHIFT being equal to PAGE_SHIFT.  Consider a
> 32-bit kernel; you're asserting that no memory represented by a struct
> page can have a physical address above 4GB.
>
> You only need three bits for flags so far ... how about making PFN_SHIFT
> be 6?  That supports physical addresses up to 2^38 (256GB).  That should
> be enough, but hardware designers have done some strange things in the
> past (I know that HP made PA-RISC hardware that can run 32-bit kernels
> with memory between 64GB and 68GB, and they can't be the only strange
> hardware people out there).

Sounds good, especially given we only use 4-bits today.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 4/5] dax: fix mapping lifetime handling, convert to __pfn_t + kmap_atomic_pfn_t()
  2015-08-13 16:34       ` Boaz Harrosh
@ 2015-08-13 18:51         ` Dan Williams
  0 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2015-08-13 18:51 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: linux-kernel, Jens Axboe, Rik van Riel,
	linux-nvdimm@lists.01.org, Linux MM, Mel Gorman, torvalds,
	Christoph Hellwig

On Thu, Aug 13, 2015 at 9:34 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
> On 08/13/2015 06:21 PM, Dan Williams wrote:
>> On Wed, Aug 12, 2015 at 11:26 PM, Boaz Harrosh <boaz@plexistor.com> wrote:
> <>
>>
>> Hmm, that's not the same block layer I've been working with for the
>> past several years:
>>
>> $ mount /dev/pmem0 /mnt
>> $ echo namespace0.0 > ../drivers/nd_pmem/unbind # succeeds
>>
>> Unbind always proceeds unconditionally.  See the recent kernel summit
>> topic discussion around devm vs unbind [1].  While kmap_atomic_pfn_t()
>> does not implement revoke semantics it at least forces re-validation
>> and time bounded references.  For the unplug case we'll need to go
>> shootdown those DAX mappings in userspace so that they return SIGBUS
>> on access, or something along those lines.
>>
>
> Then fix unbind to refuse. What is the point of unbind when it trashes
> the hot path so badly and makes the code so fat.

What? The DAX hot path avoids the kernel entirely.

> Who uses it and what for?

The device driver core.  We simply can't hold off remove indefinitely.
If the administrator wants the device disabled we need to tear down
and revoke active mappings, or at the very least guarantee time-bounded
removal.

> First I ever heard of it and I do use Linux a little bit.
>
>> [1]: http://www.spinics.net/lists/kernel/msg2032864.html
>>
> Hm...
>
> OK I hate it. I would just make sure to override and refuse unbinding with an
> elevated ref count. Is not a good reason for me to trash the hotpath.

Again, the current usages are not in hot paths.  If it becomes part of
a hot path *and* shows up in a profile we can look to implement
something with less overhead.  Until then we should plan to honor the
lifetime as defined by ->probe() and ->remove().

In fact I proposed the same as you, but then changed my mind based on
Tejun's response [1].  So please reconsider this idea to solve the
problem by blocking ->remove().  PMEM is new and special, but not
*that* special as to justify breaking basic guarantees.

[1]: https://lkml.org/lkml/2015/7/15/731

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2015-08-13 18:51 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-08-13  3:00 [PATCH v5 0/5] introduce __pfn_t for unmapped pfn I/O and DAX lifetime Dan Williams
2015-08-13  3:01 ` [PATCH v5 1/5] mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h Dan Williams
2015-08-13  3:01 ` [PATCH v5 2/5] allow mapping page-less memremaped areas into KVA Dan Williams
2015-08-13  5:58   ` Boaz Harrosh
2015-08-13 12:57     ` Dan Williams
2015-08-13 13:23       ` Boaz Harrosh
2015-08-13 14:41         ` Christoph Hellwig
2015-08-13 15:01           ` Boaz Harrosh
2015-08-13 14:37     ` Christoph Hellwig
2015-08-13 14:48       ` Boaz Harrosh
2015-08-13 15:29         ` Boaz Harrosh
2015-08-13 17:37         ` Dave Hansen
2015-08-13 17:35   ` Matthew Wilcox
2015-08-13 18:15     ` Dan Williams
2015-08-13  3:01 ` [PATCH v5 3/5] dax: drop size parameter to ->direct_access() Dan Williams
2015-08-13  3:01 ` [PATCH v5 4/5] dax: fix mapping lifetime handling, convert to __pfn_t + kmap_atomic_pfn_t() Dan Williams
2015-08-13  6:26   ` Boaz Harrosh
2015-08-13 15:21     ` Dan Williams
2015-08-13 16:34       ` Boaz Harrosh
2015-08-13 18:51         ` Dan Williams
2015-08-13  3:01 ` [PATCH v5 5/5] scatterlist: convert to __pfn_t Dan Williams

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).