All of lore.kernel.org
* [PATCH v2 0/4] dax: require 'struct page' and other fixups
@ 2017-09-29  1:21 ` Dan Williams
  0 siblings, 0 replies; 29+ messages in thread
From: Dan Williams @ 2017-09-29  1:21 UTC (permalink / raw)
  To: akpm
  Cc: Michal Hocko, Jan Kara, linux-nvdimm, linux-mm, linux-fsdevel,
	Christoph Hellwig, Kirill A. Shutemov

Changes since v1 [1]:
* quiet bdev_dax_supported() in favor of error messages emitted by the
  caller (Jeff)
* fix leftover parentheses in vma_merge (Jeff)
* improve the changelog for "dax: stop using VM_MIXEDMAP for dax"

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-September/012645.html 

---

Prompted by a recent change that adds more protection around setting up
'vm_flags' for a dax vma [1], this series reworks the implementation to
remove the requirement to set VM_MIXEDMAP and VM_HUGEPAGE.

VM_MIXEDMAP is used by dax to indicate to mm paths like vm_normal_page()
that the page they are dealing with is not typical memory from the
linear map. The get_user_pages_fast() path, since it does not resolve
the vma, already uses {pte,pmd}_devmap() as a stand-in for VM_MIXEDMAP,
so we use that as a VM_MIXEDMAP replacement in some locations. In cases
where there is no pte to consult, we fall back to vma_is_dax() to detect
the VM_MIXEDMAP special case.

This patch series passes a run of the ndctl unit test suite and the
'mmap.sh' [2] test in particular. 'mmap.sh' tries to catch dependencies
on VM_MIXEDMAP and {pte,pmd}_devmap().

[1]: https://lkml.org/lkml/2017/9/25/638
[2]: https://github.com/pmem/ndctl/blob/master/test/mmap.sh

---

Dan Williams (4):
      dax: quiet bdev_dax_supported()
      dax: disable filesystem dax on devices that do not map pages
      dax: stop using VM_MIXEDMAP for dax
      dax: stop using VM_HUGEPAGE for dax


 drivers/dax/device.c |    1 -
 drivers/dax/super.c  |   15 +++++++++++----
 fs/ext2/file.c       |    1 -
 fs/ext4/file.c       |    1 -
 fs/xfs/xfs_file.c    |    2 --
 mm/huge_memory.c     |    8 ++++----
 mm/ksm.c             |    3 +++
 mm/madvise.c         |    2 +-
 mm/memory.c          |   20 ++++++++++++++++++--
 mm/migrate.c         |    3 ++-
 mm/mlock.c           |    3 ++-
 mm/mmap.c            |    3 ++-
 12 files changed, 43 insertions(+), 19 deletions(-)
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


* [PATCH v2 1/4] dax: quiet bdev_dax_supported()
  2017-09-29  1:21 ` Dan Williams
@ 2017-09-29  1:21   ` Dan Williams
  -1 siblings, 0 replies; 29+ messages in thread
From: Dan Williams @ 2017-09-29  1:21 UTC (permalink / raw)
  To: akpm; +Cc: linux-fsdevel, linux-mm, linux-nvdimm

Leave it to the caller to decide if bdev_dax_supported() failures are
errors worth emitting to the log.

Reported-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/super.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 30d8a5aedd23..b0cc8117eebe 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -92,21 +92,21 @@ int __bdev_dax_supported(struct super_block *sb, int blocksize)
 	long len;
 
 	if (blocksize != PAGE_SIZE) {
-		pr_err("VFS (%s): error: unsupported blocksize for dax\n",
+		pr_debug("VFS (%s): error: unsupported blocksize for dax\n",
 				sb->s_id);
 		return -EINVAL;
 	}
 
 	err = bdev_dax_pgoff(bdev, 0, PAGE_SIZE, &pgoff);
 	if (err) {
-		pr_err("VFS (%s): error: unaligned partition for dax\n",
+		pr_debug("VFS (%s): error: unaligned partition for dax\n",
 				sb->s_id);
 		return err;
 	}
 
 	dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
 	if (!dax_dev) {
-		pr_err("VFS (%s): error: device does not support dax\n",
+		pr_debug("VFS (%s): error: device does not support dax\n",
 				sb->s_id);
 		return -EOPNOTSUPP;
 	}
@@ -118,7 +118,7 @@ int __bdev_dax_supported(struct super_block *sb, int blocksize)
 	put_dax(dax_dev);
 
 	if (len < 1) {
-		pr_err("VFS (%s): error: dax access failed (%ld)\n",
+		pr_debug("VFS (%s): error: dax access failed (%ld)\n",
 				sb->s_id, len);
 		return len < 0 ? len : -EIO;
 	}


* [PATCH v2 2/4] dax: disable filesystem dax on devices that do not map pages
  2017-09-29  1:21 ` Dan Williams
@ 2017-09-29  1:21   ` Dan Williams
  -1 siblings, 0 replies; 29+ messages in thread
From: Dan Williams @ 2017-09-29  1:21 UTC (permalink / raw)
  To: akpm; +Cc: Jan Kara, linux-nvdimm, linux-mm, linux-fsdevel, Christoph Hellwig

If a dax buffer from a device that does not map pages is passed to
read(2) or write(2) as a target for direct-I/O, it triggers SIGBUS. If
gdb attempts to examine the contents of such a buffer, it triggers
SIGBUS. If fork(2) is called on a process with a dax mapping from such a
device, it triggers SIGBUS. 'struct page' is required; otherwise several
kernel code paths break in surprising ways. Disable filesystem-dax on
devices that do not map pages.

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/super.c |    7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index b0cc8117eebe..491d4859c644 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -15,6 +15,7 @@
 #include <linux/mount.h>
 #include <linux/magic.h>
 #include <linux/genhd.h>
+#include <linux/pfn_t.h>
 #include <linux/cdev.h>
 #include <linux/hash.h>
 #include <linux/slab.h>
@@ -123,6 +124,12 @@ int __bdev_dax_supported(struct super_block *sb, int blocksize)
 		return len < 0 ? len : -EIO;
 	}
 
+	if (!pfn_t_has_page(pfn)) {
+		pr_debug("VFS (%s): error: dax support not enabled\n",
+				sb->s_id);
+		return -EOPNOTSUPP;
+	}
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(__bdev_dax_supported);


* [PATCH v2 3/4] dax: stop using VM_MIXEDMAP for dax
  2017-09-29  1:21 ` Dan Williams
@ 2017-09-29  1:21   ` Dan Williams
  -1 siblings, 0 replies; 29+ messages in thread
From: Dan Williams @ 2017-09-29  1:21 UTC (permalink / raw)
  To: akpm
  Cc: Michal Hocko, Jan Kara, linux-nvdimm, linux-mm, linux-fsdevel,
	Christoph Hellwig, Kirill A. Shutemov

VM_MIXEDMAP is used by dax to indicate to mm paths like vm_normal_page()
that the page they are dealing with is not typical memory from the
linear map. The get_user_pages_fast() path, since it does not resolve
the vma, already uses {pte,pmd}_devmap() as a stand-in for VM_MIXEDMAP,
so we use that as a VM_MIXEDMAP replacement in some locations. In cases
where there is no pte to consult, we fall back to vma_is_dax() to detect
the VM_MIXEDMAP special case.

Now that we always have pages for DAX, we can stop setting VM_MIXEDMAP.
This also means we no longer need to worry about safely manipulating
vm_flags in a future where we support dynamically changing the dax mode
of a file.

Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/device.c |    2 +-
 fs/ext2/file.c       |    1 -
 fs/ext4/file.c       |    2 +-
 fs/xfs/xfs_file.c    |    2 +-
 mm/huge_memory.c     |    8 ++++----
 mm/ksm.c             |    3 +++
 mm/madvise.c         |    2 +-
 mm/memory.c          |   20 ++++++++++++++++++--
 mm/migrate.c         |    3 ++-
 mm/mlock.c           |    3 ++-
 mm/mmap.c            |    3 ++-
 11 files changed, 35 insertions(+), 14 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index e9f3b3e4bbf4..ed79d006026e 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -450,7 +450,7 @@ static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
 		return rc;
 
 	vma->vm_ops = &dax_vm_ops;
-	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+	vma->vm_flags |= VM_HUGEPAGE;
 	return 0;
 }
 
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index ff3a3636a5ca..70657e8550ed 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -125,7 +125,6 @@ static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
 
 	file_accessed(file);
 	vma->vm_ops = &ext2_dax_vm_ops;
-	vma->vm_flags |= VM_MIXEDMAP;
 	return 0;
 }
 #else
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index b1da660ac3bc..0cc9d205bd96 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -352,7 +352,7 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
 	file_accessed(file);
 	if (IS_DAX(file_inode(file))) {
 		vma->vm_ops = &ext4_dax_vm_ops;
-		vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+		vma->vm_flags |= VM_HUGEPAGE;
 	} else {
 		vma->vm_ops = &ext4_file_vm_ops;
 	}
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index ebdd0bd2b261..dece8fe937f5 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1131,7 +1131,7 @@ xfs_file_mmap(
 	file_accessed(filp);
 	vma->vm_ops = &xfs_file_vm_ops;
 	if (IS_DAX(file_inode(filp)))
-		vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+		vma->vm_flags |= VM_HUGEPAGE;
 	return 0;
 }
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 269b5df58543..c69d30e27fd9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -765,11 +765,11 @@ int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 	 * but we need to be consistent with PTEs and architectures that
 	 * can't support a 'special' bit.
 	 */
-	BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
+	BUG_ON(!((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))
+				|| pfn_t_devmap(pfn)));
 	BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
 						(VM_PFNMAP|VM_MIXEDMAP));
 	BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
-	BUG_ON(!pfn_t_devmap(pfn));
 
 	if (addr < vma->vm_start || addr >= vma->vm_end)
 		return VM_FAULT_SIGBUS;
@@ -824,11 +824,11 @@ int vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
 	 * but we need to be consistent with PTEs and architectures that
 	 * can't support a 'special' bit.
 	 */
-	BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
+	BUG_ON(!((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))
+				|| pfn_t_devmap(pfn)));
 	BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
 						(VM_PFNMAP|VM_MIXEDMAP));
 	BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
-	BUG_ON(!pfn_t_devmap(pfn));
 
 	if (addr < vma->vm_start || addr >= vma->vm_end)
 		return VM_FAULT_SIGBUS;
diff --git a/mm/ksm.c b/mm/ksm.c
index 15dd7415f7b3..787dfe4f3d44 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -2358,6 +2358,9 @@ int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
 				 VM_HUGETLB | VM_MIXEDMAP))
 			return 0;		/* just ignore the advice */
 
+		if (vma_is_dax(vma))
+			return 0;
+
 #ifdef VM_SAO
 		if (*vm_flags & VM_SAO)
 			return 0;
diff --git a/mm/madvise.c b/mm/madvise.c
index 21261ff0466f..40344d43e565 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -95,7 +95,7 @@ static long madvise_behavior(struct vm_area_struct *vma,
 		new_flags |= VM_DONTDUMP;
 		break;
 	case MADV_DODUMP:
-		if (new_flags & VM_SPECIAL) {
+		if (vma_is_dax(vma) || (new_flags & VM_SPECIAL)) {
 			error = -EINVAL;
 			goto out;
 		}
diff --git a/mm/memory.c b/mm/memory.c
index ec4e15494901..771acaf54fe6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -830,6 +830,8 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 			return vma->vm_ops->find_special_page(vma, addr);
 		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
 			return NULL;
+		if (pte_devmap(pte))
+			return NULL;
 		if (is_zero_pfn(pfn))
 			return NULL;
 
@@ -917,6 +919,8 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 		}
 	}
 
+	if (pmd_devmap(pmd))
+		return NULL;
 	if (is_zero_pfn(pfn))
 		return NULL;
 	if (unlikely(pfn > highest_memmap_pfn))
@@ -1227,7 +1231,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 * efficient than faulting.
 	 */
 	if (!(vma->vm_flags & (VM_HUGETLB | VM_PFNMAP | VM_MIXEDMAP)) &&
-			!vma->anon_vma)
+			!vma->anon_vma && !vma_is_dax(vma))
 		return 0;
 
 	if (is_vm_hugetlb_page(vma))
@@ -1896,12 +1900,24 @@ int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
 }
 EXPORT_SYMBOL(vm_insert_pfn_prot);
 
+static bool vm_mixed_ok(struct vm_area_struct *vma, pfn_t pfn)
+{
+	/* these checks mirror the abort conditions in vm_normal_page */
+	if (vma->vm_flags & VM_MIXEDMAP)
+		return true;
+	if (pfn_t_devmap(pfn))
+		return true;
+	if (is_zero_pfn(pfn_t_to_pfn(pfn)))
+		return true;
+	return false;
+}
+
 static int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 			pfn_t pfn, bool mkwrite)
 {
 	pgprot_t pgprot = vma->vm_page_prot;
 
-	BUG_ON(!(vma->vm_flags & VM_MIXEDMAP));
+	BUG_ON(!vm_mixed_ok(vma, pfn));
 
 	if (addr < vma->vm_start || addr >= vma->vm_end)
 		return -EFAULT;
diff --git a/mm/migrate.c b/mm/migrate.c
index 6954c1435833..179a84a311f6 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2927,7 +2927,8 @@ int migrate_vma(const struct migrate_vma_ops *ops,
 	/* Sanity check the arguments */
 	start &= PAGE_MASK;
 	end &= PAGE_MASK;
-	if (!vma || is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL))
+	if (!vma || is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)
 +			|| vma_is_dax(vma))
 		return -EINVAL;
 	if (start < vma->vm_start || start >= vma->vm_end)
 		return -EINVAL;
diff --git a/mm/mlock.c b/mm/mlock.c
index dfc6f1912176..4d009350893f 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -520,7 +520,8 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	vm_flags_t old_flags = vma->vm_flags;
 
 	if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
-	    is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm))
+	    is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm) ||
+	    vma_is_dax(vma))
 		/* don't set VM_LOCKED or VM_LOCKONFAULT and don't count */
 		goto out;
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 680506faceae..2f3971a051c6 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1723,7 +1723,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
 	if (vm_flags & VM_LOCKED) {
 		if (!((vm_flags & VM_SPECIAL) || is_vm_hugetlb_page(vma) ||
-					vma == get_gate_vma(current->mm)))
+					vma == get_gate_vma(current->mm) ||
+					vma_is_dax(vma)))
 			mm->locked_vm += (len >> PAGE_SHIFT);
 		else
 			vma->vm_flags &= VM_LOCKED_CLEAR_MASK;


* [PATCH v2 4/4] dax: stop using VM_HUGEPAGE for dax
  2017-09-29  1:21 ` Dan Williams
@ 2017-09-29  1:21   ` Dan Williams
  -1 siblings, 0 replies; 29+ messages in thread
From: Dan Williams @ 2017-09-29  1:21 UTC (permalink / raw)
  To: akpm; +Cc: Jan Kara, linux-nvdimm, linux-mm, linux-fsdevel, Christoph Hellwig

This flag is deprecated in favor of the vma_is_dax() check in
transparent_hugepage_enabled() added by commit baabda261424 ("mm:
always enable thp for dax mappings").

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/device.c |    1 -
 fs/ext4/file.c       |    1 -
 fs/xfs/xfs_file.c    |    2 --
 3 files changed, 4 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index ed79d006026e..74a35eb5e6d3 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -450,7 +450,6 @@ static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
 		return rc;
 
 	vma->vm_ops = &dax_vm_ops;
-	vma->vm_flags |= VM_HUGEPAGE;
 	return 0;
 }
 
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 0cc9d205bd96..a54e1b4c49f9 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -352,7 +352,6 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
 	file_accessed(file);
 	if (IS_DAX(file_inode(file))) {
 		vma->vm_ops = &ext4_dax_vm_ops;
-		vma->vm_flags |= VM_HUGEPAGE;
 	} else {
 		vma->vm_ops = &ext4_file_vm_ops;
 	}
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index dece8fe937f5..c0e0fcbe1bd3 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1130,8 +1130,6 @@ xfs_file_mmap(
 {
 	file_accessed(filp);
 	vma->vm_ops = &xfs_file_vm_ops;
-	if (IS_DAX(file_inode(filp)))
-		vma->vm_flags |= VM_HUGEPAGE;
 	return 0;
 }
 

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 29+ messages in thread


* Re: [PATCH v2 0/4] dax: require 'struct page' and other fixups
  2017-09-29  1:21 ` Dan Williams
                   ` (4 preceding siblings ...)
  (?)
@ 2017-10-01  7:57 ` Christoph Hellwig
  2017-10-01 17:58     ` Dan Williams
  -1 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2017-10-01  7:57 UTC (permalink / raw)
  To: Dan Williams
  Cc: akpm, Michal Hocko, Jan Kara, linux-nvdimm, linux-mm, Jeff Moyer,
	linux-fsdevel, Ross Zwisler, Christoph Hellwig,
	Kirill A. Shutemov

While this looks like a really nice cleanup of the code and removes
nasty race conditions, I'd like to understand the tradeoffs.

This now requires every dax device that is used with a file system
to have a struct page backing, which not only means we'd break
existing setups, but is also a sharp turn from previous policy.

Unless I misremember, it was you Intel guys that heavily pushed for
the page-less version, so I'd like to understand why you've changed
your mind.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 0/4] dax: require 'struct page' and other fixups
  2017-10-01  7:57 ` [PATCH v2 0/4] dax: require 'struct page' and other fixups Christoph Hellwig
@ 2017-10-01 17:58     ` Dan Williams
  0 siblings, 0 replies; 29+ messages in thread
From: Dan Williams @ 2017-10-01 17:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Michal Hocko, Jan Kara, linux-nvdimm, Linux MM, linux-fsdevel,
	Andrew Morton, Kirill A. Shutemov

On Sun, Oct 1, 2017 at 12:57 AM, Christoph Hellwig <hch@lst.de> wrote:
> While this looks like a really nice cleanup of the code and removes
> nasty race conditions I'd like to understand the tradeoffs.
>
> This now requires every dax device that is used with a file system
> to have a struct page backing, which means not only means we'd
> break existing setups, but also a sharp turn from previous policy.
>
> Unless I misremember it was you Intel guys that heavily pushed for
> the page-less version, so I'd like to understand why you've changed
> your mind.

Sure, here's a quick recap of the story so far of how we got here:

* In support of the page-less I/O operations envisioned by Matthew, I
introduced pfn_t as a proposal for converting the block layer and
other sub-systems to use pfns instead of pages [1]. You helped out on
that patch set with some work on the DMA api [2].

* The DMA api conversion effort came to a halt when it came time to
touch sparc paths and DaveM said [3]: "Generally speaking, I think
that all actual physical memory the kernel operates on should have a
struct page backing it."

* ZONE_DEVICE was created to solve the DMA problem, and in developing
and testing it we found plenty of proof for Dave's assertion (no fork,
no ptrace, etc.). We should have made the switch to require struct page
at that point, but I was persuaded by the argument that changing the
dax policy might break existing assumptions, and that there were larger
issues to solve at the time.

What changed recently was the discussion around what the dax mount
option means and the assertion that we can, in general, make some
policy changes on our way to removing the "experimental" designation
from filesystem-dax. It is clear that the page-less dax path remains
experimental given all the ways it fails in several kernel paths, and
there have been no patches for several months to revive the effort.
Meanwhile the page-less path continues to generate maintenance
overhead. The recent gymnastics (a new ->post_mmap file_operation) to
make sure ->vm_flags are safely manipulated when dynamically changing
the dax mode of a file were the final straw that made me pull the
trigger on this series.

In terms of what breaks by changing this policy, it should be noted
that we automatically create pages for "legacy" pmem devices, and the
default for "ndctl create-namespace" is to allocate pages. I have yet
to see a bug report where someone was surprised by fork failing or
direct-I/O causing a SIGBUS. So I think the defaults are working, and
it is unlikely that there are environments dependent on page-less
behavior.

That said, I now recall that dax also replaced xip for some setups. I
think we have a couple of options here: let embedded configurations
override the page requirement, since they can reasonably assert that
they do not care about the several broken general-purpose paths that
need pages, or perhaps follow in the footsteps of what Nicolas is doing
for cramfs, where he calls dax "overkill" [4] for his use case.

[1]: https://lwn.net/Articles/643998/
[2]: https://lkml.org/lkml/2015/8/12/86
[3]: https://lkml.org/lkml/2015/8/14/3
[4]: https://lwn.net/Articles/734995/

^ permalink raw reply	[flat|nested] 29+ messages in thread


* Re: [PATCH v2 0/4] dax: require 'struct page' and other fixups
  2017-10-01 17:58     ` Dan Williams
@ 2017-10-01 21:11       ` Dave Chinner
  -1 siblings, 0 replies; 29+ messages in thread
From: Dave Chinner @ 2017-10-01 21:11 UTC (permalink / raw)
  To: Dan Williams
  Cc: Michal Hocko, Jan Kara, linux-nvdimm, Linux MM, linux-fsdevel,
	Andrew Morton, Christoph Hellwig, Kirill A. Shutemov

On Sun, Oct 01, 2017 at 10:58:06AM -0700, Dan Williams wrote:
> On Sun, Oct 1, 2017 at 12:57 AM, Christoph Hellwig <hch@lst.de> wrote:
> > While this looks like a really nice cleanup of the code and removes
> > nasty race conditions I'd like to understand the tradeoffs.
> >
> > This now requires every dax device that is used with a file system
> > to have a struct page backing, which means not only means we'd
> > break existing setups, but also a sharp turn from previous policy.
> >
> > Unless I misremember it was you Intel guys that heavily pushed for
> > the page-less version, so I'd like to understand why you've changed
> > your mind.
> 
> Sure, here's a quick recap of the story so far of how we got here:
> 
> * In support of page-less I/O operations envisioned by Matthew I
> introduced pfn_t as a proposal for converting the block layer and
> other sub-systems to use pfns instead of pages [1]. You helped out on
> that patch set with some work on the DMA api. [2]
> 
> * The DMA api conversion effort came to a halt when it came time to
> touch sparc paths and DaveM said [3]: "Generally speaking, I think
> that all actual physical memory the kernel operates on should have a
> struct page backing it."
> 
> * ZONE_DEVICE was created to solve the DMA problem and in developing /
> testing that discovered plenty of proof for Dave's assertion (no fork,
> no ptrace, etc). We should have made the switch to require struct page
> at that point, but I was persuaded by the argument that changing the
> dax policy may break existing assumptions, and that there were larger
> issues to go solve at the time.
> 
> What changed recently was the discussions around what the dax mount
> option means and the assertion that we can, in general, make some
> policy changes on our way to removing the "experimental" designation
> from filesystem-dax. It is clear that the page-less dax path remains
> experimental with all the way it fails in several kernel paths, and
> there has been no patches for several months to revive the effort.
> Meanwhile the page-less path continues to generate maintenance
> overhead. The recent gymnastics (new ->post_mmap file_operation) to
> make sure ->vm_flags are safely manipulated when dynamically changing
> the dax mode of a file was the final straw for me to pull the trigger
> on this series.
> 
> In terms of what breaks by changing this policy it should be noted
> that we automatically create pages for "legacy" pmem devices, and the
> default for "ndctl create-namespace" is to allocate pages. I have yet
> to see a bug report where someone was surprised by fork failing or
> direct-I/O causing a SIGBUS. So, I think the defaults are working, it
> is unlikely that there are environments dependent on page-less
> behavior.

Does this imply that the hardware vendors won't have
tens of terabytes of pmem in systems in the near to medium term?
That's what we were originally told to expect by 2018-19 timeframe
(i.e. 5 years in), and that's kinda what we've been working towards.
Indeed, supporting systems with a couple of orders of magnitude more
pmem than ram was the big driver for page-less DAX mappings in the
first place. i.e. it was needed to avoid the static RAM overhead of
all the static struct pages for such large amounts of physical
memory.

If we decide that we must have struct pages for pmem, then we're
essentially throwing away the ability to support the very systems
the hardware vendors were telling us we needed to design the pmem
infrastructure for.  If that reality has changed, then I'd suggest
that we need to determine what the long term replacement for
pageless IO on large pmem systems will be before we throw what we
have away.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread


* Re: [PATCH v2 0/4] dax: require 'struct page' and other fixups
  2017-10-01 21:11       ` Dave Chinner
  (?)
@ 2017-10-01 21:22       ` Dan Williams
  2017-10-01 21:23           ` Dan Williams
  2017-10-01 21:59           ` Dave Chinner
  -1 siblings, 2 replies; 29+ messages in thread
From: Dan Williams @ 2017-10-01 21:22 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Andrew Morton, Michal Hocko, Jan Kara,
	linux-nvdimm, Linux MM, Jeff Moyer, linux-fsdevel, Ross Zwisler,
	Kirill A. Shutemov

On Sun, Oct 1, 2017 at 2:11 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Sun, Oct 01, 2017 at 10:58:06AM -0700, Dan Williams wrote:
>> On Sun, Oct 1, 2017 at 12:57 AM, Christoph Hellwig <hch@lst.de> wrote:
>> > While this looks like a really nice cleanup of the code and removes
>> > nasty race conditions I'd like to understand the tradeoffs.
>> >
>> > This now requires every dax device that is used with a file system
>> > to have a struct page backing, which means not only means we'd
>> > break existing setups, but also a sharp turn from previous policy.
>> >
>> > Unless I misremember it was you Intel guys that heavily pushed for
>> > the page-less version, so I'd like to understand why you've changed
>> > your mind.
>>
>> Sure, here's a quick recap of the story so far of how we got here:
>>
>> * In support of page-less I/O operations envisioned by Matthew I
>> introduced pfn_t as a proposal for converting the block layer and
>> other sub-systems to use pfns instead of pages [1]. You helped out on
>> that patch set with some work on the DMA api. [2]
>>
>> * The DMA api conversion effort came to a halt when it came time to
>> touch sparc paths and DaveM said [3]: "Generally speaking, I think
>> that all actual physical memory the kernel operates on should have a
>> struct page backing it."
>>
>> * ZONE_DEVICE was created to solve the DMA problem and in developing /
>> testing that discovered plenty of proof for Dave's assertion (no fork,
>> no ptrace, etc). We should have made the switch to require struct page
>> at that point, but I was persuaded by the argument that changing the
>> dax policy may break existing assumptions, and that there were larger
>> issues to go solve at the time.
>>
>> What changed recently was the discussions around what the dax mount
>> option means and the assertion that we can, in general, make some
>> policy changes on our way to removing the "experimental" designation
>> from filesystem-dax. It is clear that the page-less dax path remains
>> experimental with all the way it fails in several kernel paths, and
>> there has been no patches for several months to revive the effort.
>> Meanwhile the page-less path continues to generate maintenance
>> overhead. The recent gymnastics (new ->post_mmap file_operation) to
>> make sure ->vm_flags are safely manipulated when dynamically changing
>> the dax mode of a file was the final straw for me to pull the trigger
>> on this series.
>>
>> In terms of what breaks by changing this policy it should be noted
>> that we automatically create pages for "legacy" pmem devices, and the
>> default for "ndctl create-namespace" is to allocate pages. I have yet
>> to see a bug report where someone was surprised by fork failing or
>> direct-I/O causing a SIGBUS. So, I think the defaults are working, it
>> is unlikely that there are environments dependent on page-less
>> behavior.
>
> Does this imply that the hardware vendors won't have
> tens of terabytes of pmem in systems in the near to medium term?
> That's what we were originally told to expect by 2018-19 timeframe
> (i.e. 5 years in), and that's kinda what we've been working towards.
> Indeed, supporting systems with a couple of orders of magnitude more
> pmem than ram was the big driver for page-less DAX mappings in the
> first place. i.e. it was needed to avoid the static RAM overhead of
> all the static struct pages for such large amounts of physical
> memory.
>
> If we decide that we must have struct pages for pmem, then we're
> essentially throwing away the ability to support the very systems
> the hardware vendors were telling us we needed to design the pmem
> infrastructure for.  If that reality has changed, then I'd suggest
> that we need to determine what the long term replacement for
> pageless IO on large pmem systems will be before we throw what we
> have away.

No, we can support large pmem with the struct page capacity reserved
from pmem itself rather than RAM. A 1.5% capacity tax does not appear
to be prohibitive.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 0/4] dax: require 'struct page' and other fixups
  2017-10-01 21:22       ` Dan Williams
@ 2017-10-01 21:23           ` Dan Williams
  2017-10-01 21:59           ` Dave Chinner
  1 sibling, 0 replies; 29+ messages in thread
From: Dan Williams @ 2017-10-01 21:23 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Michal Hocko, Jan Kara, linux-nvdimm, Linux MM, linux-fsdevel,
	Andrew Morton, Christoph Hellwig, Kirill A. Shutemov

On Sun, Oct 1, 2017 at 2:22 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Sun, Oct 1, 2017 at 2:11 PM, Dave Chinner <david@fromorbit.com> wrote:
>> On Sun, Oct 01, 2017 at 10:58:06AM -0700, Dan Williams wrote:
>>> On Sun, Oct 1, 2017 at 12:57 AM, Christoph Hellwig <hch@lst.de> wrote:
>>> > While this looks like a really nice cleanup of the code and removes
>>> > nasty race conditions I'd like to understand the tradeoffs.
>>> >
>>> > This now requires every dax device that is used with a file system
>>> > to have a struct page backing, which means not only means we'd
>>> > break existing setups, but also a sharp turn from previous policy.
>>> >
>>> > Unless I misremember it was you Intel guys that heavily pushed for
>>> > the page-less version, so I'd like to understand why you've changed
>>> > your mind.
>>>
>>> Sure, here's a quick recap of the story so far of how we got here:
>>>
>>> * In support of page-less I/O operations envisioned by Matthew I
>>> introduced pfn_t as a proposal for converting the block layer and
>>> other sub-systems to use pfns instead of pages [1]. You helped out on
>>> that patch set with some work on the DMA api. [2]
>>>
>>> * The DMA api conversion effort came to a halt when it came time to
>>> touch sparc paths and DaveM said [3]: "Generally speaking, I think
>>> that all actual physical memory the kernel operates on should have a
>>> struct page backing it."
>>>
>>> * ZONE_DEVICE was created to solve the DMA problem and in developing /
>>> testing that discovered plenty of proof for Dave's assertion (no fork,
>>> no ptrace, etc). We should have made the switch to require struct page
>>> at that point, but I was persuaded by the argument that changing the
>>> dax policy may break existing assumptions, and that there were larger
>>> issues to go solve at the time.
>>>
>>> What changed recently was the discussions around what the dax mount
>>> option means and the assertion that we can, in general, make some
>>> policy changes on our way to removing the "experimental" designation
>>> from filesystem-dax. It is clear that the page-less dax path remains
>>> experimental with all the way it fails in several kernel paths, and
>>> there has been no patches for several months to revive the effort.
>>> Meanwhile the page-less path continues to generate maintenance
>>> overhead. The recent gymnastics (new ->post_mmap file_operation) to
>>> make sure ->vm_flags are safely manipulated when dynamically changing
>>> the dax mode of a file was the final straw for me to pull the trigger
>>> on this series.
>>>
>>> In terms of what breaks by changing this policy it should be noted
>>> that we automatically create pages for "legacy" pmem devices, and the
>>> default for "ndctl create-namespace" is to allocate pages. I have yet
>>> to see a bug report where someone was surprised by fork failing or
>>> direct-I/O causing a SIGBUS. So, I think the defaults are working, it
>>> is unlikely that there are environments dependent on page-less
>>> behavior.
>>
>> Does this imply that the hardware vendors won't have
>> tens of terabytes of pmem in systems in the near to medium term?
>> That's what we were originally told to expect by 2018-19 timeframe
>> (i.e. 5 years in), and that's kinda what we've been working towards.
>> Indeed, supporting systems with a couple of orders of magnitude more
>> pmem than ram was the big driver for page-less DAX mappings in the
>> first place. i.e. it was needed to avoid the static RAM overhead of
>> all the static struct pages for such large amounts of physical
>> memory.
>>
>> If we decide that we must have struct pages for pmem, then we're
>> essentially throwing away the ability to support the very systems
>> the hardware vendors were telling us we needed to design the pmem
>> infrastructure for.  If that reality has changed, then I'd suggest
>> that we need to determine what the long term replacement for
>> pageless IO on large pmem systems will be before we throw what we
>> have away.
>
> No, we can support large pmem with struct page capacity reserved from
> pmem itself rather than ram. A 1.5% capacity tax does not appear to be
> prohibitive.

Dynamic allocation of struct page is another step we can take before
deciding we need page-less I/O.

^ permalink raw reply	[flat|nested] 29+ messages in thread


* Re: [PATCH v2 0/4] dax: require 'struct page' and other fixups
  2017-10-01 21:22       ` Dan Williams
@ 2017-10-01 21:59           ` Dave Chinner
  2017-10-01 21:59           ` Dave Chinner
  1 sibling, 0 replies; 29+ messages in thread
From: Dave Chinner @ 2017-10-01 21:59 UTC (permalink / raw)
  To: Dan Williams
  Cc: Michal Hocko, Jan Kara, linux-nvdimm, Linux MM, linux-fsdevel,
	Andrew Morton, Christoph Hellwig, Kirill A. Shutemov

On Sun, Oct 01, 2017 at 02:22:08PM -0700, Dan Williams wrote:
> On Sun, Oct 1, 2017 at 2:11 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Sun, Oct 01, 2017 at 10:58:06AM -0700, Dan Williams wrote:
> >> On Sun, Oct 1, 2017 at 12:57 AM, Christoph Hellwig <hch@lst.de> wrote:
> >> > While this looks like a really nice cleanup of the code and removes
> >> > nasty race conditions I'd like to understand the tradeoffs.
> >> >
> >> > This now requires every dax device that is used with a file system
> >> > to have a struct page backing, which not only means we'd break
> >> > existing setups, but is also a sharp turn from previous policy.
> >> >
> >> > Unless I misremember it was you Intel guys that heavily pushed for
> >> > the page-less version, so I'd like to understand why you've changed
> >> > your mind.
> >>
> >> Sure, here's a quick recap of the story so far of how we got here:
> >>
> >> * In support of page-less I/O operations envisioned by Matthew I
> >> introduced pfn_t as a proposal for converting the block layer and
> >> other sub-systems to use pfns instead of pages [1]. You helped out on
> >> that patch set with some work on the DMA api. [2]
> >>
> >> * The DMA api conversion effort came to a halt when it came time to
> >> touch sparc paths and DaveM said [3]: "Generally speaking, I think
> >> that all actual physical memory the kernel operates on should have a
> >> struct page backing it."
> >>
> >> * ZONE_DEVICE was created to solve the DMA problem, and in developing /
> >> testing it we discovered plenty of proof for Dave's assertion (no fork,
> >> no ptrace, etc). We should have made the switch to require struct page
> >> at that point, but I was persuaded by the argument that changing the
> >> dax policy may break existing assumptions, and that there were larger
> >> issues to go solve at the time.
> >>
> >> What changed recently was the discussions around what the dax mount
> >> option means and the assertion that we can, in general, make some
> >> policy changes on our way to removing the "experimental" designation
> >> from filesystem-dax. It is clear that the page-less dax path remains
> >> experimental given all the ways it fails in several kernel paths, and
> >> there have been no patches for several months to revive the effort.
> >> Meanwhile the page-less path continues to generate maintenance
> >> overhead. The recent gymnastics (new ->post_mmap file_operation) to
> >> make sure ->vm_flags are safely manipulated when dynamically changing
> >> the dax mode of a file was the final straw for me to pull the trigger
> >> on this series.
> >>
> >> In terms of what breaks by changing this policy it should be noted
> >> that we automatically create pages for "legacy" pmem devices, and the
> >> default for "ndctl create-namespace" is to allocate pages. I have yet
> >> to see a bug report where someone was surprised by fork failing or
> >> direct-I/O causing a SIGBUS. So, I think the defaults are working, and
> >> it is unlikely that there are environments dependent on page-less
> >> behavior.
> >
> > Does this imply that the hardware vendors won't have
> > tens of terabytes of pmem in systems in the near to medium term?
> > That's what we were originally told to expect by 2018-19 timeframe
> > (i.e. 5 years in), and that's kinda what we've been working towards.
> > Indeed, supporting systems with a couple of orders of magnitude more
> > pmem than ram was the big driver for page-less DAX mappings in the
> > first place. i.e. it was needed to avoid the static RAM overhead of
> > all the static struct pages for such large amounts of physical
> > memory.
> >
> > If we decide that we must have struct pages for pmem, then we're
> > essentially throwing away the ability to support the very systems
> > the hardware vendors were telling us we needed to design the pmem
> > infrastructure for.  If that reality has changed, then I'd suggest
> > that we need to determine what the long term replacement for
> > pageless IO on large pmem systems will be before we throw what we
> > have away.
> 
> No, we can support large pmem with struct page capacity reserved from
> pmem itself rather than ram. A 1.5% capacity tax does not appear to be
> prohibitive.

The "capacity tax" had nothing to do with it - the major problem
with self hosting struct pages was that page locks can be hot and
contention on them will rapidly burn through write cycles on the
pmem. That's still a problem, yes?

I don't want to have to ask about all the issues one by one, so I'll
ask you to explain in one go: what has changed (both hardware and
software!) since we last discussed these problems with self hosting
that makes it a viable solution?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread


* Re: [PATCH v2 0/4] dax: require 'struct page' and other fixups
  2017-10-01 21:59           ` Dave Chinner
@ 2017-10-01 23:15             ` Dan Williams
  -1 siblings, 0 replies; 29+ messages in thread
From: Dan Williams @ 2017-10-01 23:15 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Michal Hocko, Jan Kara, linux-nvdimm, Linux MM, linux-fsdevel,
	Andrew Morton, Christoph Hellwig, Kirill A. Shutemov

On Sun, Oct 1, 2017 at 2:59 PM, Dave Chinner <david@fromorbit.com> wrote:
[..]
> The "capacity tax" had nothing to do with it - the major problem
> with self hosting struct pages was that page locks can be hot and
> contention on them will rapidly burn through write cycles on the
> pmem. That's still a problem, yes?

It was an early concern we had, but it has not borne out in practice,
and we stopped worrying about it 2 years ago when the ZONE_DEVICE
infrastructure was merged.

^ permalink raw reply	[flat|nested] 29+ messages in thread


* Re: [PATCH v2 0/4] dax: require 'struct page' and other fixups
  2017-10-01 23:15             ` Dan Williams
@ 2017-10-02 22:47               ` Andrew Morton
  -1 siblings, 0 replies; 29+ messages in thread
From: Andrew Morton @ 2017-10-02 22:47 UTC (permalink / raw)
  To: Dan Williams
  Cc: Michal Hocko, Jan Kara, linux-nvdimm, Dave Chinner, Linux MM,
	linux-fsdevel, Christoph Hellwig, Kirill A. Shutemov

On Sun, 1 Oct 2017 16:15:06 -0700 Dan Williams <dan.j.williams@intel.com> wrote:

> On Sun, Oct 1, 2017 at 2:59 PM, Dave Chinner <david@fromorbit.com> wrote:
> [..]
> > The "capacity tax" had nothing to do with it - the major problem
> > with self hosting struct pages was that page locks can be hot and
> > contention on them will rapidly burn through write cycles on the
> > pmem. That's still a problem, yes?
> 
> It was an early concern we had, but it has not borne out in practice,
> and we stopped worrying about it 2 years ago when the ZONE_DEVICE
> infrastructure was merged.

These are weighty matters and should have been covered in the
changelogs, please.  Could you send along a few paragraphs which
summarise these thoughts and I'll add them in?


^ permalink raw reply	[flat|nested] 29+ messages in thread


* Re: [PATCH v2 3/4] dax: stop using VM_MIXEDMAP for dax
  2017-09-29  1:21   ` Dan Williams
  (?)
@ 2017-10-03  8:09   ` Jan Kara
  2017-10-03 17:29       ` Dan Williams
  -1 siblings, 1 reply; 29+ messages in thread
From: Jan Kara @ 2017-10-03  8:09 UTC (permalink / raw)
  To: Dan Williams
  Cc: akpm, Michal Hocko, Jan Kara, linux-nvdimm, linux-mm, Jeff Moyer,
	linux-fsdevel, Ross Zwisler, Christoph Hellwig,
	Kirill A. Shutemov

On Thu 28-09-17 18:21:18, Dan Williams wrote:
> VM_MIXEDMAP is used by dax to tell mm paths like vm_normal_page() that
> the memory page it is dealing with is not typical memory from the linear
> map. The get_user_pages_fast() path, since it does not resolve the vma,
> is already using {pte,pmd}_devmap() as a stand-in for VM_MIXEDMAP, so we
> use that as a VM_MIXEDMAP replacement in some locations. In the cases
> where there is no pte to consult we fall back to using vma_is_dax() to
> detect the VM_MIXEDMAP special case.

Well, I somewhat dislike the vma_is_dax() checks sprinkled around. That
seems rather error-prone (easy to forget about it when adding a new check
somewhere). Can we possibly also create a helper vma_is_special() (or some
other name) which would do ((vma->vm_flags & VM_SPECIAL) || vma_is_dax(vma)
|| is_vm_hugetlb_page(vma)) and then use it in all those places?

								Honza

> diff --git a/mm/ksm.c b/mm/ksm.c
> index 15dd7415f7b3..787dfe4f3d44 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -2358,6 +2358,9 @@ int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
>  				 VM_HUGETLB | VM_MIXEDMAP))
>  			return 0;		/* just ignore the advice */
>  
> +		if (vma_is_dax(vma))
> +			return 0;
> +
>  #ifdef VM_SAO
>  		if (*vm_flags & VM_SAO)
>  			return 0;
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 21261ff0466f..40344d43e565 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -95,7 +95,7 @@ static long madvise_behavior(struct vm_area_struct *vma,
>  		new_flags |= VM_DONTDUMP;
>  		break;
>  	case MADV_DODUMP:
> -		if (new_flags & VM_SPECIAL) {
> +		if (vma_is_dax(vma) || (new_flags & VM_SPECIAL)) {
>  			error = -EINVAL;
>  			goto out;
>  		}
> diff --git a/mm/memory.c b/mm/memory.c
> index ec4e15494901..771acaf54fe6 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -830,6 +830,8 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>  			return vma->vm_ops->find_special_page(vma, addr);
>  		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
>  			return NULL;
> +		if (pte_devmap(pte))
> +			return NULL;
>  		if (is_zero_pfn(pfn))
>  			return NULL;
>  
> @@ -917,6 +919,8 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
>  		}
>  	}
>  
> +	if (pmd_devmap(pmd))
> +		return NULL;
>  	if (is_zero_pfn(pfn))
>  		return NULL;
>  	if (unlikely(pfn > highest_memmap_pfn))
> @@ -1227,7 +1231,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	 * efficient than faulting.
>  	 */
>  	if (!(vma->vm_flags & (VM_HUGETLB | VM_PFNMAP | VM_MIXEDMAP)) &&
> -			!vma->anon_vma)
> +			!vma->anon_vma && !vma_is_dax(vma))
>  		return 0;
>  
>  	if (is_vm_hugetlb_page(vma))
> @@ -1896,12 +1900,24 @@ int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
>  }
>  EXPORT_SYMBOL(vm_insert_pfn_prot);
>  
> +static bool vm_mixed_ok(struct vm_area_struct *vma, pfn_t pfn)
> +{
> +	/* these checks mirror the abort conditions in vm_normal_page */
> +	if (vma->vm_flags & VM_MIXEDMAP)
> +		return true;
> +	if (pfn_t_devmap(pfn))
> +		return true;
> +	if (is_zero_pfn(pfn_t_to_pfn(pfn)))
> +		return true;
> +	return false;
> +}
> +
>  static int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
>  			pfn_t pfn, bool mkwrite)
>  {
>  	pgprot_t pgprot = vma->vm_page_prot;
>  
> -	BUG_ON(!(vma->vm_flags & VM_MIXEDMAP));
> +	BUG_ON(!vm_mixed_ok(vma, pfn));
>  
>  	if (addr < vma->vm_start || addr >= vma->vm_end)
>  		return -EFAULT;
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 6954c1435833..179a84a311f6 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2927,7 +2927,8 @@ int migrate_vma(const struct migrate_vma_ops *ops,
>  	/* Sanity check the arguments */
>  	start &= PAGE_MASK;
>  	end &= PAGE_MASK;
> -	if (!vma || is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL))
> +	if (!vma || is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)
> +			|| vma_is_dax(vma))
>  		return -EINVAL;
>  	if (start < vma->vm_start || start >= vma->vm_end)
>  		return -EINVAL;
> diff --git a/mm/mlock.c b/mm/mlock.c
> index dfc6f1912176..4d009350893f 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -520,7 +520,8 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
>  	vm_flags_t old_flags = vma->vm_flags;
>  
>  	if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
> -	    is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm))
> +	    is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm) ||
> +	    vma_is_dax(vma))
>  		/* don't set VM_LOCKED or VM_LOCKONFAULT and don't count */
>  		goto out;
>  
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 680506faceae..2f3971a051c6 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1723,7 +1723,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  	vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
>  	if (vm_flags & VM_LOCKED) {
>  		if (!((vm_flags & VM_SPECIAL) || is_vm_hugetlb_page(vma) ||
> -					vma == get_gate_vma(current->mm)))
> +					vma == get_gate_vma(current->mm) ||
> +					vma_is_dax(vma)))
>  			mm->locked_vm += (len >> PAGE_SHIFT);
>  		else
>  			vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 4/4] dax: stop using VM_HUGEPAGE for dax
  2017-09-29  1:21   ` Dan Williams
@ 2017-10-03  8:12     ` Jan Kara
  -1 siblings, 0 replies; 29+ messages in thread
From: Jan Kara @ 2017-10-03  8:12 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Christoph Hellwig, linux-mm, linux-fsdevel, akpm

On Thu 28-09-17 18:21:23, Dan Williams wrote:
> This flag is deprecated in favor of the vma_is_dax() check in
> transparent_hugepage_enabled() added in commit baabda261424 "mm: always
> enable thp for dax mappings"
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

I like this! You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  drivers/dax/device.c |    1 -
>  fs/ext4/file.c       |    1 -
>  fs/xfs/xfs_file.c    |    2 --
>  3 files changed, 4 deletions(-)
> 
> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> index ed79d006026e..74a35eb5e6d3 100644
> --- a/drivers/dax/device.c
> +++ b/drivers/dax/device.c
> @@ -450,7 +450,6 @@ static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
>  		return rc;
>  
>  	vma->vm_ops = &dax_vm_ops;
> -	vma->vm_flags |= VM_HUGEPAGE;
>  	return 0;
>  }
>  
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 0cc9d205bd96..a54e1b4c49f9 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -352,7 +352,6 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
>  	file_accessed(file);
>  	if (IS_DAX(file_inode(file))) {
>  		vma->vm_ops = &ext4_dax_vm_ops;
> -		vma->vm_flags |= VM_HUGEPAGE;
>  	} else {
>  		vma->vm_ops = &ext4_file_vm_ops;
>  	}
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index dece8fe937f5..c0e0fcbe1bd3 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1130,8 +1130,6 @@ xfs_file_mmap(
>  {
>  	file_accessed(filp);
>  	vma->vm_ops = &xfs_file_vm_ops;
> -	if (IS_DAX(file_inode(filp)))
> -		vma->vm_flags |= VM_HUGEPAGE;
>  	return 0;
>  }
>  
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 29+ messages in thread


* Re: [PATCH v2 3/4] dax: stop using VM_MIXEDMAP for dax
  2017-10-03  8:09   ` Jan Kara
@ 2017-10-03 17:29       ` Dan Williams
  0 siblings, 0 replies; 29+ messages in thread
From: Dan Williams @ 2017-10-03 17:29 UTC (permalink / raw)
  To: Jan Kara
  Cc: Michal Hocko, linux-nvdimm, Christoph Hellwig, Linux MM,
	linux-fsdevel, Andrew Morton, Kirill A. Shutemov

On Tue, Oct 3, 2017 at 1:09 AM, Jan Kara <jack@suse.cz> wrote:
> On Thu 28-09-17 18:21:18, Dan Williams wrote:
>> VM_MIXEDMAP is used by dax to tell mm paths like vm_normal_page() that
>> the memory page it is dealing with is not typical memory from the linear
>> map. The get_user_pages_fast() path, since it does not resolve the vma,
>> is already using {pte,pmd}_devmap() as a stand-in for VM_MIXEDMAP, so we
>> use that as a VM_MIXEDMAP replacement in some locations. In the cases
>> where there is no pte to consult we fall back to using vma_is_dax() to
>> detect the VM_MIXEDMAP special case.
>
> Well, I somewhat dislike the vma_is_dax() checks sprinkled around. That
> seems rather error-prone (easy to forget about it when adding a new check
> somewhere). Can we possibly also create a helper vma_is_special() (or some
> other name) which would do ((vma->vm_flags & VM_SPECIAL) || vma_is_dax(vma)
> || is_vm_hugetlb_page(vma)) and then use it in all those places?

Yes, I can take a look at that... I shied away from it initially since
the candidate "vma_is_special()" call sites do not appear to be symmetric
in terms of all the conditions they check, but perhaps I can start small
with the ones that are common. I'll break that conversion out into a
lead-in cleanup patch.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm



end of thread, other threads:[~2017-10-03 17:29 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-29  1:21 [PATCH v2 0/4] dax: require 'struct page' and other fixups Dan Williams
2017-09-29  1:21 ` [PATCH v2 1/4] dax: quiet bdev_dax_supported() Dan Williams
2017-09-29  1:21 ` [PATCH v2 2/4] dax: disable filesystem dax on devices that do not map pages Dan Williams
2017-09-29  1:21 ` [PATCH v2 3/4] dax: stop using VM_MIXEDMAP for dax Dan Williams
2017-10-03  8:09   ` Jan Kara
2017-10-03 17:29     ` Dan Williams
2017-09-29  1:21 ` [PATCH v2 4/4] dax: stop using VM_HUGEPAGE " Dan Williams
2017-10-03  8:12   ` Jan Kara
2017-10-01  7:57 ` [PATCH v2 0/4] dax: require 'struct page' and other fixups Christoph Hellwig
2017-10-01 17:58   ` Dan Williams
2017-10-01 21:11     ` Dave Chinner
2017-10-01 21:22       ` Dan Williams
2017-10-01 21:23         ` Dan Williams
2017-10-01 21:59         ` Dave Chinner
2017-10-01 23:15           ` Dan Williams
2017-10-02 22:47             ` Andrew Morton
