* [PATCH v2 0/3] mm, dax: export dax capabilities and mapping size info to userspace
@ 2016-09-15  6:54 ` Dan Williams
  0 siblings, 0 replies; 63+ messages in thread
From: Dan Williams @ 2016-09-15  6:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrea Arcangeli, Xiao Guangrong, Arnd Bergmann, linux-nvdimm,
	Dave Hansen, david, linux-kernel, npiggin, xfs, linux-fsdevel,
	Andrew Morton, hch, Kirill A. Shutemov

In the debate about how to support persistent memory applications that
want to use hardware-platform memory-media persistence
rules/cpu-instructions rather than filesystem data integrity system
calls [1], one of the consistent requests is to move these applications
to use a device file rather than a filesystem file [2].

While there is still a desire to offer the same syscall overhead
avoidance in filesystem-dax as device-dax, there is performance
optimization work and analysis that still needs to be done.  That work
needs to address filesystem-dax performance being slower than the
typical page-cache path on top of pmem [3], and determine whether the
performance gains are worth developing new filesystem data integrity
mechanisms.

In the meantime we have device-dax and are missing a way to identify its
capabilities compared to filesystem-dax.  Critically, we want a
persistent memory transaction library, that is handed an address range
to manage, to be able to determine if it is safe to forgo calling
fsync/msync to record newly allocated blocks after a write fault.  This
question is answered by the new VM_SYNC flag.
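
As an illustration of how a library might consume this, here is a
minimal userspace sketch.  It assumes a kernel with this series applied,
so that VM_SYNC shows up in the VmFlags line of /proc/<pid>/smaps as the
"sn" mnemonic added in patch 1.  The helper names vmflags_has() and
addr_has_vmflag() are hypothetical and not part of the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Return true if a "VmFlags:" line from smaps contains the given
 * mnemonic as a whole space-separated token.
 */
static bool vmflags_has(const char *line, const char *mnemonic)
{
	static const char prefix[] = "VmFlags:";
	size_t want;
	const char *p;

	assert(line && mnemonic);
	if (strncmp(line, prefix, sizeof(prefix) - 1) != 0)
		return false;

	want = strlen(mnemonic);
	p = line + sizeof(prefix) - 1;
	while (*p) {
		size_t len;

		p += strspn(p, " \t\n");	/* skip separators */
		len = strcspn(p, " \t\n");	/* length of the next token */
		if (len == 0)
			break;
		if (len == want && strncmp(p, mnemonic, len) == 0)
			return true;
		p += len;
	}
	return false;
}

/*
 * Find the VMA covering addr in /proc/self/smaps and check its flags.
 * Returns 1 if the mnemonic is set, 0 if it is not, and -1 if the VMA
 * or its VmFlags line could not be found.
 */
static int addr_has_vmflag(const void *addr, const char *mnemonic)
{
	uintptr_t target = (uintptr_t)addr;
	bool in_vma = false;
	char line[1024];
	int ret = -1;
	FILE *f;

	f = fopen("/proc/self/smaps", "r");
	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		unsigned long start, end;

		/* VMA header lines start with "<start>-<end>" in hex */
		if (sscanf(line, "%lx-%lx", &start, &end) == 2) {
			in_vma = target >= start && target < end;
			continue;
		}
		if (in_vma && strncmp(line, "VmFlags:", 8) == 0) {
			ret = vmflags_has(line, mnemonic) ? 1 : 0;
			break;
		}
	}
	fclose(f);
	return ret;
}
```

A transaction library could call addr_has_vmflag(base, "sn") once when
it is handed a range.  Only a result of 1 would justify skipping
fsync/msync after a write fault; any other result, including a kernel
that never prints "sn", should fall back to calling msync.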

It is also important to know if the pages behind a mapping are backed by
page cache and need to be synced, or are referencing media directly.  We
have an XFS inode flag that can indicate the inode is DAX enabled, but
nothing for device-dax or other filesystems.  Yes, an application that
maps /dev/dax should assume the mapping is DAX, but it is useful to be
able to tell that from the address range directly, and to have a common
mechanism across filesystems.

Finally, while developing and debugging the filesystem-dax huge page
support it was frustrating that the only way to unit test and verify the
implementation was via debug print statements.  This series extends
mincore(2) to optionally provide an indication of the hardware mapping
size.  This is hopefully useful to other cases that want to evaluate
transparent-huge-page usage.
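
For reference, here is a minimal sketch of what the existing mincore(2)
interface already reports: one byte per page, with residency in the
least significant bit.  It does not use the extension proposed in this
series, whose interface is defined in patch 3 and is not reproduced
here.  count_resident_pages() is a hypothetical helper name.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for mincore() in glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Count how many pages of [addr, addr + len) are resident according to
 * mincore(2).  addr must be page aligned.  Returns the page count, or
 * -1 on allocation or syscall failure.
 */
static long count_resident_pages(void *addr, size_t len)
{
	long page_size = sysconf(_SC_PAGESIZE);
	unsigned char *vec;
	long resident = 0;
	size_t npages, i;

	assert(page_size > 0);
	if (len == 0)
		return 0;

	/* mincore() fills one byte per page, rounding len up */
	npages = (len + (size_t)page_size - 1) / (size_t)page_size;
	vec = malloc(npages);
	if (!vec)
		return -1;

	if (mincore(addr, len, vec) != 0) {
		free(vec);
		return -1;
	}

	/* only bit 0 of each byte is defined: 1 means resident */
	for (i = 0; i < npages; i++)
		resident += vec[i] & 1;

	free(vec);
	return resident;
}
```

Note that this result says nothing about whether a resident page is
mapped by a base page or a huge page, which is the gap the mapping-size
indication is meant to fill.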


Changes since the RFC [4]:

1/ Drop DAX indication out of mincore.  It is a vma capability not a
   per-page property and fits better as a vma flag.  Multiple people
   indicated it would be better if the new syscall published the capability
   as an extent or aggregated over a range, and this facility is already
   provided by smaps.

2/ Add VM_SYNC to explicitly disclaim a need to call fsync/msync.

3/ Drop the syscall wire-up patch since it is trivial and can be revived
   if we decide to move forward with the new mincore syscall.


[1]: https://lwn.net/Articles/676737/
[2]: https://lists.01.org/pipermail/linux-nvdimm/2016-September/006893.html
[3]: https://lists.01.org/pipermail/linux-nvdimm/2016-August/006497.html
[4]: https://lists.01.org/pipermail/linux-nvdimm/2016-September/006875.html

---

Dan Williams (3):
      mm, dax: add VM_SYNC flag for device-dax VMAs
      mm, dax: add VM_DAX flag for DAX VMAs
      mm, mincore2(): retrieve tlb-size attributes of an address range


 drivers/dax/Kconfig                    |    1 
 drivers/dax/dax.c                      |    2 
 fs/Kconfig                             |    1 
 fs/ext2/file.c                         |    2 
 fs/ext4/file.c                         |    2 
 fs/proc/task_mmu.c                     |    4 +
 fs/xfs/xfs_file.c                      |    2 
 include/linux/mm.h                     |   31 +++++++-
 include/linux/syscalls.h               |    2 
 include/uapi/asm-generic/mman-common.h |    2 
 kernel/sys_ni.c                        |    1 
 mm/mincore.c                           |  130 ++++++++++++++++++++++++--------
 12 files changed, 141 insertions(+), 39 deletions(-)

* [PATCH v2 1/3] mm, dax: add VM_SYNC flag for device-dax VMAs
  2016-09-15  6:54 ` Dan Williams
@ 2016-09-15  6:54   ` Dan Williams
  -1 siblings, 0 replies; 63+ messages in thread
From: Dan Williams @ 2016-09-15  6:54 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, david, linux-kernel, npiggin, xfs, linux-fsdevel, hch

Introduce a new vma flag to indicate the property of device-dax VMAs
that, while file-backed, do not require notification to a filesystem
agent to sync metadata after a fault.  In particular this enables
persistent memory applications to know if they can commit transactions
to media via cpu instructions alone, or need to call back into the
kernel to synchronize metadata.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/Kconfig |    1 +
 drivers/dax/dax.c   |    2 +-
 fs/proc/task_mmu.c  |    3 +++
 include/linux/mm.h  |   21 +++++++++++++++++----
 4 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index cedab7572de3..a4d99e637623 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -2,6 +2,7 @@ menuconfig DEV_DAX
 	tristate "DAX: direct access to differentiated memory"
 	default m if NVDIMM_DAX
 	depends on TRANSPARENT_HUGEPAGE
+	select ARCH_USES_HIGH_VMA_FLAGS if 64BIT
 	help
 	  Support raw access to differentiated (persistence, bandwidth,
 	  latency...) memory via an mmap(2) capable character
diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
index 29f600f2c447..88fad2519907 100644
--- a/drivers/dax/dax.c
+++ b/drivers/dax/dax.c
@@ -528,7 +528,7 @@ static int dax_dev_mmap(struct file *filp, struct vm_area_struct *vma)
 
 	kref_get(&dax_dev->kref);
 	vma->vm_ops = &dax_dev_vm_ops;
-	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE | VM_SYNC;
 	return 0;
 
 }
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f6fa99eca515..03a65ac7f222 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -675,6 +675,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 		[ilog2(VM_PKEY_BIT2)]	= "",
 		[ilog2(VM_PKEY_BIT3)]	= "",
 #endif
+#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
+		[ilog2(VM_SYNC)]	= "sn",
+#endif
 	};
 	size_t i;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ef815b9cd426..f3f6df6bb498 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -198,14 +198,17 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
 
 #ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
-#define VM_HIGH_ARCH_BIT_0	32	/* bit only usable on 64-bit architectures */
-#define VM_HIGH_ARCH_BIT_1	33	/* bit only usable on 64-bit architectures */
-#define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
-#define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
+/* bits below only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_0	32
+#define VM_HIGH_ARCH_BIT_1	33
+#define VM_HIGH_ARCH_BIT_2	34
+#define VM_HIGH_ARCH_BIT_3	35
+#define VM_HIGH_ARCH_BIT_4	36
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
+#define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #if defined(CONFIG_X86)
@@ -234,6 +237,16 @@ extern unsigned int kobjsize(const void *objp);
 # define VM_MPX		VM_ARCH_2
 #endif
 
+#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
+/*
+ * The metadata for file-backed vma does not exist or is otherwise
+ * synced before fault handler returns to userspace
+ */
+#define VM_SYNC		VM_HIGH_ARCH_4
+#else
+#define VM_SYNC		0
+#endif
+
 #ifndef VM_GROWSUP
 # define VM_GROWSUP	VM_NONE
 #endif

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v2 1/3] mm, dax: add VM_SYNC flag for device-dax VMAs
@ 2016-09-15  6:54   ` Dan Williams
  0 siblings, 0 replies; 63+ messages in thread
From: Dan Williams @ 2016-09-15  6:54 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, david, linux-kernel, npiggin, xfs, linux-fsdevel, hch

Introduce a new vma flag to indicate the property of device-dax VMAs
that, while file-backed, do not require notification to a filesystem
agent to sync metadata after a fault.  In particular this enables
persistent memory applications to know if they can commit transactions
to media via cpu instructions alone, or need to call back into the
kernel to synchronize metadata.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/Kconfig |    1 +
 drivers/dax/dax.c   |    2 +-
 fs/proc/task_mmu.c  |    3 +++
 include/linux/mm.h  |   21 +++++++++++++++++----
 4 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index cedab7572de3..a4d99e637623 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -2,6 +2,7 @@ menuconfig DEV_DAX
 	tristate "DAX: direct access to differentiated memory"
 	default m if NVDIMM_DAX
 	depends on TRANSPARENT_HUGEPAGE
+	select ARCH_USES_HIGH_VMA_FLAGS if 64BIT
 	help
 	  Support raw access to differentiated (persistence, bandwidth,
 	  latency...) memory via an mmap(2) capable character
diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
index 29f600f2c447..88fad2519907 100644
--- a/drivers/dax/dax.c
+++ b/drivers/dax/dax.c
@@ -528,7 +528,7 @@ static int dax_dev_mmap(struct file *filp, struct vm_area_struct *vma)
 
 	kref_get(&dax_dev->kref);
 	vma->vm_ops = &dax_dev_vm_ops;
-	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE | VM_SYNC;
 	return 0;
 
 }
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f6fa99eca515..03a65ac7f222 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -675,6 +675,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 		[ilog2(VM_PKEY_BIT2)]	= "",
 		[ilog2(VM_PKEY_BIT3)]	= "",
 #endif
+#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
+		[ilog2(VM_SYNC)]	= "sn",
+#endif
 	};
 	size_t i;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ef815b9cd426..f3f6df6bb498 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -198,14 +198,17 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
 
 #ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
-#define VM_HIGH_ARCH_BIT_0	32	/* bit only usable on 64-bit architectures */
-#define VM_HIGH_ARCH_BIT_1	33	/* bit only usable on 64-bit architectures */
-#define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
-#define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
+/* bits below only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_0	32
+#define VM_HIGH_ARCH_BIT_1	33
+#define VM_HIGH_ARCH_BIT_2	34
+#define VM_HIGH_ARCH_BIT_3	35
+#define VM_HIGH_ARCH_BIT_4	36
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
+#define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #if defined(CONFIG_X86)
@@ -234,6 +237,16 @@ extern unsigned int kobjsize(const void *objp);
 # define VM_MPX		VM_ARCH_2
 #endif
 
+#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
+/*
+ * The metadata for file-backed vma does not exist or is otherwise
+ * synced before fault handler returns to userspace
+ */
+#define VM_SYNC		VM_HIGH_ARCH_4
+#else
+#define VM_SYNC		0
+#endif
+
 #ifndef VM_GROWSUP
 # define VM_GROWSUP	VM_NONE
 #endif

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v2 1/3] mm, dax: add VM_SYNC flag for device-dax VMAs
@ 2016-09-15  6:54   ` Dan Williams
  0 siblings, 0 replies; 63+ messages in thread
From: Dan Williams @ 2016-09-15  6:54 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, david, linux-kernel, npiggin, xfs, linux-fsdevel, hch

Introduce a new vma flag to indicate the property of device-dax VMAs
that, while file-backed, do not require notification to a filesystem
agent to sync metadata after a fault.  In particular this enables
persistent memory applications to know if they can commit transactions
to media via cpu instructions alone, or need to call back into the
kernel to synchronize metadata.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/Kconfig |    1 +
 drivers/dax/dax.c   |    2 +-
 fs/proc/task_mmu.c  |    3 +++
 include/linux/mm.h  |   21 +++++++++++++++++----
 4 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index cedab7572de3..a4d99e637623 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -2,6 +2,7 @@ menuconfig DEV_DAX
 	tristate "DAX: direct access to differentiated memory"
 	default m if NVDIMM_DAX
 	depends on TRANSPARENT_HUGEPAGE
+	select ARCH_USES_HIGH_VMA_FLAGS if 64BIT
 	help
 	  Support raw access to differentiated (persistence, bandwidth,
 	  latency...) memory via an mmap(2) capable character
diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
index 29f600f2c447..88fad2519907 100644
--- a/drivers/dax/dax.c
+++ b/drivers/dax/dax.c
@@ -528,7 +528,7 @@ static int dax_dev_mmap(struct file *filp, struct vm_area_struct *vma)
 
 	kref_get(&dax_dev->kref);
 	vma->vm_ops = &dax_dev_vm_ops;
-	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE | VM_SYNC;
 	return 0;
 
 }
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f6fa99eca515..03a65ac7f222 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -675,6 +675,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 		[ilog2(VM_PKEY_BIT2)]	= "",
 		[ilog2(VM_PKEY_BIT3)]	= "",
 #endif
+#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
+		[ilog2(VM_SYNC)]	= "sn",
+#endif
 	};
 	size_t i;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ef815b9cd426..f3f6df6bb498 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -198,14 +198,17 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
 
 #ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
-#define VM_HIGH_ARCH_BIT_0	32	/* bit only usable on 64-bit architectures */
-#define VM_HIGH_ARCH_BIT_1	33	/* bit only usable on 64-bit architectures */
-#define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
-#define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
+/* bits below only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_0	32
+#define VM_HIGH_ARCH_BIT_1	33
+#define VM_HIGH_ARCH_BIT_2	34
+#define VM_HIGH_ARCH_BIT_3	35
+#define VM_HIGH_ARCH_BIT_4	36
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
+#define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #if defined(CONFIG_X86)
@@ -234,6 +237,16 @@ extern unsigned int kobjsize(const void *objp);
 # define VM_MPX		VM_ARCH_2
 #endif
 
+#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
+/*
+ * The metadata for file-backed vma does not exist or is otherwise
+ * synced before fault handler returns to userspace
+ */
+#define VM_SYNC		VM_HIGH_ARCH_4
+#else
+#define VM_SYNC		0
+#endif
+
 #ifndef VM_GROWSUP
 # define VM_GROWSUP	VM_NONE
 #endif

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v2 1/3] mm, dax: add VM_SYNC flag for device-dax VMAs
@ 2016-09-15  6:54   ` Dan Williams
  0 siblings, 0 replies; 63+ messages in thread
From: Dan Williams @ 2016-09-15  6:54 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-nvdimm, linux-kernel, npiggin, xfs, linux-fsdevel, hch

Introduce a new vma flag to indicate the property of device-dax VMAs
that, while file-backed, do not require notification to a filesystem
agent to sync metadata after a fault.  In particular this enables
persistent memory applications to know if they can commit transactions
to media via cpu instructions alone, or need to call back into the
kernel to synchronize metadata.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/Kconfig |    1 +
 drivers/dax/dax.c   |    2 +-
 fs/proc/task_mmu.c  |    3 +++
 include/linux/mm.h  |   21 +++++++++++++++++----
 4 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index cedab7572de3..a4d99e637623 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -2,6 +2,7 @@ menuconfig DEV_DAX
 	tristate "DAX: direct access to differentiated memory"
 	default m if NVDIMM_DAX
 	depends on TRANSPARENT_HUGEPAGE
+	select ARCH_USES_HIGH_VMA_FLAGS if 64BIT
 	help
 	  Support raw access to differentiated (persistence, bandwidth,
 	  latency...) memory via an mmap(2) capable character
diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
index 29f600f2c447..88fad2519907 100644
--- a/drivers/dax/dax.c
+++ b/drivers/dax/dax.c
@@ -528,7 +528,7 @@ static int dax_dev_mmap(struct file *filp, struct vm_area_struct *vma)
 
 	kref_get(&dax_dev->kref);
 	vma->vm_ops = &dax_dev_vm_ops;
-	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE | VM_SYNC;
 	return 0;
 
 }
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f6fa99eca515..03a65ac7f222 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -675,6 +675,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 		[ilog2(VM_PKEY_BIT2)]	= "",
 		[ilog2(VM_PKEY_BIT3)]	= "",
 #endif
+#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
+		[ilog2(VM_SYNC)]	= "sn",
+#endif
 	};
 	size_t i;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ef815b9cd426..f3f6df6bb498 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -198,14 +198,17 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
 
 #ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
-#define VM_HIGH_ARCH_BIT_0	32	/* bit only usable on 64-bit architectures */
-#define VM_HIGH_ARCH_BIT_1	33	/* bit only usable on 64-bit architectures */
-#define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
-#define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
+/* bits below only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_0	32
+#define VM_HIGH_ARCH_BIT_1	33
+#define VM_HIGH_ARCH_BIT_2	34
+#define VM_HIGH_ARCH_BIT_3	35
+#define VM_HIGH_ARCH_BIT_4	36
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
+#define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #if defined(CONFIG_X86)
@@ -234,6 +237,16 @@ extern unsigned int kobjsize(const void *objp);
 # define VM_MPX		VM_ARCH_2
 #endif
 
+#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
+/*
+ * The metadata for file-backed vma does not exist or is otherwise
+ * synced before fault handler returns to userspace
+ */
+#define VM_SYNC		VM_HIGH_ARCH_4
+#else
+#define VM_SYNC		0
+#endif
+
 #ifndef VM_GROWSUP
 # define VM_GROWSUP	VM_NONE
 #endif


* [PATCH v2 2/3] mm, dax: add VM_DAX flag for DAX VMAs
  2016-09-15  6:54 ` Dan Williams
  (?)
  (?)
@ 2016-09-15  6:54   ` Dan Williams
  -1 siblings, 0 replies; 63+ messages in thread
From: Dan Williams @ 2016-09-15  6:54 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, david, linux-kernel, npiggin, xfs, linux-fsdevel, hch

The DAX property of a VMA (page cache bypass) is currently only
detectable via the vma_is_dax() helper, which checks the S_DAX inode
flag.  That helper is only available inside the kernel, yet the property
is one that userspace applications would like to interrogate.

Yes, this new VM_DAX flag is only available on 64-bit, but the
expectation is that the capacities of persistent memory devices are too
large for 32-bit platforms.  While there is usage of DAX on 32-bit, that
usage is primarily driven by DAX's replacement of XIP.  XIP (execute in
place) is a memory-saving technique that lets embedded devices execute
code directly from the backing media, and in that usage the application
does not need to discern whether page cache is present.
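
As a rough userspace illustration of the interrogation described above,
the sketch below checks a /proc/<pid>/smaps "VmFlags:" line for the
two-letter mnemonics this series adds to show_smap_vma_flags() ("sn" for
VM_SYNC in the previous patch, "dx" for VM_DAX here).  The helper name
vmflags_has() is hypothetical and not part of the series, and the
mnemonics are only reported by a kernel that carries these patches.

```c
#include <string.h>

/*
 * Hypothetical helper (not part of this series): return 1 if an smaps
 * "VmFlags:" line contains 'flag' as a whole whitespace-separated
 * token, 0 otherwise.  Lines that are not VmFlags lines return 0.
 */
static int vmflags_has(const char *line, const char *flag)
{
	static const char prefix[] = "VmFlags:";
	size_t flen = strlen(flag);
	const char *p;

	if (strncmp(line, prefix, sizeof(prefix) - 1) != 0)
		return 0;
	p = line + sizeof(prefix) - 1;

	while (*p) {
		size_t tlen;

		p += strspn(p, " \t\n");	/* skip separators */
		tlen = strcspn(p, " \t\n");	/* length of next token */
		if (tlen > 0 && tlen == flen && strncmp(p, flag, flen) == 0)
			return 1;
		p += tlen;
	}
	return 0;
}
```

A caller would typically read smaps line by line with fgets(), track the
most recent address-range header, and apply vmflags_has(line, "dx") to
the VmFlags line that follows it.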

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/dax.c  |    2 +-
 fs/Kconfig         |    1 +
 fs/ext2/file.c     |    2 +-
 fs/ext4/file.c     |    2 +-
 fs/proc/task_mmu.c |    1 +
 fs/xfs/xfs_file.c  |    2 +-
 include/linux/mm.h |   10 ++++++++++
 7 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
index 88fad2519907..1cb4117870bd 100644
--- a/drivers/dax/dax.c
+++ b/drivers/dax/dax.c
@@ -528,7 +528,7 @@ static int dax_dev_mmap(struct file *filp, struct vm_area_struct *vma)
 
 	kref_get(&dax_dev->kref);
 	vma->vm_ops = &dax_dev_vm_ops;
-	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE | VM_SYNC;
+	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE | VM_SYNC | VM_DAX;
 	return 0;
 
 }
diff --git a/fs/Kconfig b/fs/Kconfig
index 2bc7ad775842..6d9afe4c1710 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -38,6 +38,7 @@ config FS_DAX
 	bool "Direct Access (DAX) support"
 	depends on MMU
 	depends on !(ARM || MIPS || SPARC)
+	select ARCH_USES_HIGH_VMA_FLAGS if 64BIT
 	help
 	  Direct Access (DAX) can be used on memory-backed block devices.
 	  If the block device supports DAX and the filesystem supports DAX,
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 5efeefe17abb..b9c829cf427c 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -118,7 +118,7 @@ static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
 
 	file_accessed(file);
 	vma->vm_ops = &ext2_dax_vm_ops;
-	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE | VM_DAX;
 	return 0;
 }
 #else
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 261ac3734c58..7a777f1bbde3 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -312,7 +312,7 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
 	file_accessed(file);
 	if (IS_DAX(file_inode(file))) {
 		vma->vm_ops = &ext4_dax_vm_ops;
-		vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+		vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE | VM_DAX;
 	} else {
 		vma->vm_ops = &ext4_file_vm_ops;
 	}
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 03a65ac7f222..b9b9dc059e19 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -677,6 +677,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 #endif
 #ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
 		[ilog2(VM_SYNC)]	= "sn",
+		[ilog2(VM_DAX)]		= "dx",
 #endif
 	};
 	size_t i;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e612a0233710..80ed83405683 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1644,7 +1644,7 @@ xfs_file_mmap(
 	file_accessed(filp);
 	vma->vm_ops = &xfs_file_vm_ops;
 	if (IS_DAX(file_inode(filp)))
-		vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+		vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE | VM_DAX;
 	return 0;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f3f6df6bb498..5930402596c0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -204,11 +204,13 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_2	34
 #define VM_HIGH_ARCH_BIT_3	35
 #define VM_HIGH_ARCH_BIT_4	36
+#define VM_HIGH_ARCH_BIT_5	37
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #if defined(CONFIG_X86)
@@ -243,8 +245,16 @@ extern unsigned int kobjsize(const void *objp);
  * synced before fault handler returns to userspace
  */
 #define VM_SYNC		VM_HIGH_ARCH_4
+/*
+ * Mapping is not indirected through the page-cache, accesses hit memory
+ * media directly*.
+ *
+ * (*) a filesystem may map the zero-page into holes of a file.
+ */
+#define VM_DAX		VM_HIGH_ARCH_5
 #else
 #define VM_SYNC		0
+#define VM_DAX		0
 #endif
 
 #ifndef VM_GROWSUP

* [PATCH v2 3/3] mm, mincore2(): retrieve tlb-size attributes of an address range
  2016-09-15  6:54 ` Dan Williams
  (?)
  (?)
@ 2016-09-15  6:54   ` Dan Williams
  -1 siblings, 0 replies; 63+ messages in thread
From: Dan Williams @ 2016-09-15  6:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrea Arcangeli, Xiao Guangrong, Arnd Bergmann, linux-nvdimm,
	Dave Hansen, david, linux-kernel, npiggin, xfs, linux-fsdevel,
	Andrew Morton, hch, Kirill A. Shutemov

There are cases, particularly when testing and validating a
configuration, where it is useful to know the hardware mapping geometry
of the pages in a given process address range.  Consider filesystem-dax,
where a configuration needs to take care to align partitions and block
allocations before huge page mappings might be used, or
anonymous-transparent-huge-pages, where a process is opportunistically
assigned large pages.  mincore2() allows these configurations to be
surveyed and validated.

The implementation takes advantage of the unused bits in the per-page
byte returned for each PAGE_SIZE extent of a given address range.  The
new format of each vector byte is:

(TLB_SHIFT - PAGE_SHIFT) << 1 | page_present
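
A minimal userspace sketch of decoding one such byte follows.  The mask
and shift values mirror MINCORE_ORDER_MASK and MINCORE_ORDER_SHIFT from
the mm/mincore.c hunk in this patch; the struct and helper names
(mincore_info, mincore_decode, mincore_mapping_size) and the
MINCORE_PRESENT macro are hypothetical and only for illustration.

```c
#include <assert.h>

/* Values copied from the mm/mincore.c changes in this patch. */
#define MINCORE_ORDER_MASK	0x3e
#define MINCORE_ORDER_SHIFT	1
/* Hypothetical name for bit 0, the classic mincore() residency bit. */
#define MINCORE_PRESENT		0x01

struct mincore_info {
	int present;		/* non-zero if the page is resident */
	unsigned int order;	/* TLB_SHIFT - PAGE_SHIFT */
};

/* Split one vector byte into its residency bit and mapping order. */
static struct mincore_info mincore_decode(unsigned char v)
{
	struct mincore_info info;

	info.present = v & MINCORE_PRESENT;
	info.order = (v & MINCORE_ORDER_MASK) >> MINCORE_ORDER_SHIFT;
	return info;
}

/* Hardware mapping size in bytes, given the base page shift. */
static unsigned long long mincore_mapping_size(unsigned char v,
					       unsigned int page_shift)
{
	unsigned int order = mincore_decode(v).order;

	assert(page_shift + order < 64);	/* keep the shift defined */
	return 1ULL << (page_shift + order);
}
```

With a 4KB base page (page_shift of 12), the byte 0x13 decodes to a
present page of order 9, i.e. a 2MB mapping.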

[1]: https://lkml.org/lkml/2016/9/7/61

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/syscalls.h               |    2 
 include/uapi/asm-generic/mman-common.h |    2 
 kernel/sys_ni.c                        |    1 
 mm/mincore.c                           |  130 ++++++++++++++++++++++++--------
 4 files changed, 104 insertions(+), 31 deletions(-)

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index d02239022bd0..4aa2ee7e359a 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -467,6 +467,8 @@ asmlinkage long sys_munlockall(void);
 asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
 asmlinkage long sys_mincore(unsigned long start, size_t len,
 				unsigned char __user * vec);
+asmlinkage long sys_mincore2(unsigned long start, size_t len,
+				unsigned char __user * vec, int flags);
 
 asmlinkage long sys_pivot_root(const char __user *new_root,
 				const char __user *put_old);
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 58274382a616..6c7eca1a85ca 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -72,4 +72,6 @@
 #define MAP_HUGE_SHIFT	26
 #define MAP_HUGE_MASK	0x3f
 
+#define MINCORE_ORDER	1		/* retrieve hardware mapping-size-order */
+
 #endif /* __ASM_GENERIC_MMAN_COMMON_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 2c5e3a8e00d7..e14b87834054 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -197,6 +197,7 @@ cond_syscall(sys_mlockall);
 cond_syscall(sys_munlockall);
 cond_syscall(sys_mlock2);
 cond_syscall(sys_mincore);
+cond_syscall(sys_mincore2);
 cond_syscall(sys_madvise);
 cond_syscall(sys_mremap);
 cond_syscall(sys_remap_file_pages);
diff --git a/mm/mincore.c b/mm/mincore.c
index c0b5ba965200..b0b83ef086eb 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -15,25 +15,61 @@
 #include <linux/swap.h>
 #include <linux/swapops.h>
 #include <linux/hugetlb.h>
+#include <linux/dax.h>
 
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
 
+#ifndef MINCORE_ORDER
+#define MINCORE_ORDER 0
+#endif
+
+#define MINCORE_ORDER_MASK 0x3e
+#define MINCORE_ORDER_SHIFT 1
+
+struct mincore_params {
+	unsigned char *vec;
+	int flags;
+};
+
+static void mincore_set(unsigned char *vec, struct vm_area_struct *vma, int nr,
+		int flags)
+{
+	unsigned char mincore = 1;
+
+	if (!nr) {
+		*vec = 0;
+		return;
+	}
+
+	if (flags & MINCORE_ORDER) {
+		unsigned char order = ilog2(nr);
+
+		WARN_ON((order << MINCORE_ORDER_SHIFT) & ~MINCORE_ORDER_MASK);
+		mincore |= order << MINCORE_ORDER_SHIFT;
+	}
+	memset(vec, mincore, nr);
+}
+
 static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
 			unsigned long end, struct mm_walk *walk)
 {
 #ifdef CONFIG_HUGETLB_PAGE
+	struct mincore_params *p = walk->private;
+	int nr = (end - addr) >> PAGE_SHIFT;
+	unsigned char *vec = p->vec;
 	unsigned char present;
-	unsigned char *vec = walk->private;
 
 	/*
 	 * Hugepages under user process are always in RAM and never
 	 * swapped out, but theoretically it needs to be checked.
 	 */
 	present = pte && !huge_pte_none(huge_ptep_get(pte));
-	for (; addr != end; vec++, addr += PAGE_SIZE)
-		*vec = present;
-	walk->private = vec;
+	if (!present)
+		memset(vec, 0, nr);
+	else
+		mincore_set(vec, walk->vma, nr, p->flags);
+	p->vec = vec + nr;
 #else
 	BUG();
 #endif
@@ -82,20 +118,24 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff)
 }
 
 static int __mincore_unmapped_range(unsigned long addr, unsigned long end,
-				struct vm_area_struct *vma, unsigned char *vec)
+				struct vm_area_struct *vma, unsigned char *vec,
+				int flags)
 {
 	unsigned long nr = (end - addr) >> PAGE_SHIFT;
+	unsigned char present;
 	int i;
 
 	if (vma->vm_file) {
 		pgoff_t pgoff;
 
 		pgoff = linear_page_index(vma, addr);
-		for (i = 0; i < nr; i++, pgoff++)
-			vec[i] = mincore_page(vma->vm_file->f_mapping, pgoff);
+		for (i = 0; i < nr; i++, pgoff++) {
+			present = mincore_page(vma->vm_file->f_mapping, pgoff);
+			mincore_set(vec + i, vma, present, flags);
+		}
 	} else {
 		for (i = 0; i < nr; i++)
-			vec[i] = 0;
+			mincore_set(vec + i, vma, 0, flags);
 	}
 	return nr;
 }
@@ -103,8 +143,11 @@ static int __mincore_unmapped_range(unsigned long addr, unsigned long end,
 static int mincore_unmapped_range(unsigned long addr, unsigned long end,
 				   struct mm_walk *walk)
 {
-	walk->private += __mincore_unmapped_range(addr, end,
-						  walk->vma, walk->private);
+	struct mincore_params *p = walk->private;
+	int nr = __mincore_unmapped_range(addr, end, walk->vma, p->vec,
+			p->flags);
+
+	p->vec += nr;
 	return 0;
 }
 
@@ -114,18 +157,20 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	spinlock_t *ptl;
 	struct vm_area_struct *vma = walk->vma;
 	pte_t *ptep;
-	unsigned char *vec = walk->private;
+	struct mincore_params *p = walk->private;
+	unsigned char *vec = p->vec;
 	int nr = (end - addr) >> PAGE_SHIFT;
+	int flags = p->flags;
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
-		memset(vec, 1, nr);
+		mincore_set(vec, vma, nr, flags);
 		spin_unlock(ptl);
 		goto out;
 	}
 
 	if (pmd_trans_unstable(pmd)) {
-		__mincore_unmapped_range(addr, end, vma, vec);
+		__mincore_unmapped_range(addr, end, vma, vec, flags);
 		goto out;
 	}
 
@@ -135,9 +180,9 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 
 		if (pte_none(pte))
 			__mincore_unmapped_range(addr, addr + PAGE_SIZE,
-						 vma, vec);
+						 vma, vec, flags);
 		else if (pte_present(pte))
-			*vec = 1;
+			mincore_set(vec, vma, 1, flags);
 		else { /* pte is a swap entry */
 			swp_entry_t entry = pte_to_swp_entry(pte);
 
@@ -146,14 +191,17 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 				 * migration or hwpoison entries are always
 				 * uptodate
 				 */
-				*vec = 1;
+				mincore_set(vec, vma, 1, flags);
 			} else {
 #ifdef CONFIG_SWAP
-				*vec = mincore_page(swap_address_space(entry),
-					entry.val);
+				unsigned char present;
+
+				present = mincore_page(swap_address_space(entry),
+						entry.val);
+				mincore_set(vec, vma, present, flags);
 #else
 				WARN_ON(1);
-				*vec = 1;
+				mincore_set(vec, vma, 1, flags);
 #endif
 			}
 		}
@@ -161,7 +209,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	}
 	pte_unmap_unlock(ptep - 1, ptl);
 out:
-	walk->private += nr;
+	p->vec = vec + nr;
 	cond_resched();
 	return 0;
 }
@@ -171,16 +219,21 @@ out:
  * all the arguments, we hold the mmap semaphore: we should
  * just return the amount of info we're asked for.
  */
-static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *vec)
+static long do_mincore(unsigned long addr, unsigned long pages,
+		unsigned char *vec, int flags)
 {
 	struct vm_area_struct *vma;
 	unsigned long end;
 	int err;
+	struct mincore_params p = {
+		.vec = vec,
+		.flags = flags,
+	};
 	struct mm_walk mincore_walk = {
 		.pmd_entry = mincore_pte_range,
 		.pte_hole = mincore_unmapped_range,
 		.hugetlb_entry = mincore_hugetlb,
-		.private = vec,
+		.private = &p,
 	};
 
 	vma = find_vma(current->mm, addr);
@@ -195,13 +248,18 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v
 }
 
 /*
- * The mincore(2) system call.
+ * The mincore2(2) system call.
  *
- * mincore() returns the memory residency status of the pages in the
- * current process's address space specified by [addr, addr + len).
- * The status is returned in a vector of bytes.  The least significant
- * bit of each byte is 1 if the referenced page is in memory, otherwise
- * it is zero.
+ * mincore2() returns the memory residency status of the pages in the
+ * current process's address space specified by [addr, addr + len).  The
+ * status is returned in a vector of bytes.  The least significant bit
+ * of each byte is 1 if the referenced page is in memory, otherwise it
+ * is zero.  When MINCORE_ORDER is set in 'flags', bits 1 through 5 of
+ * each vector byte additionally hold the order of the hardware mapping
+ * backing that logical page, i.e. log2 of the mapping size in units of
+ * PAGE_SIZE.  For example, a present 2MB huge-page mapping with 4KB
+ * base pages corresponds to 512 vector entries, each with the value
+ * (9 << 1) | 1 => 0x13
  *
  * Because the status of a page can change after mincore() checks it
  * but before it returns to the application, the returned vector may
@@ -218,8 +276,8 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v
  *		mapped
  *  -EAGAIN - A kernel resource was temporarily unavailable.
  */
-SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len,
-		unsigned char __user *, vec)
+SYSCALL_DEFINE4(mincore2, unsigned long, start, size_t, len,
+		unsigned char __user *, vec, int, flags)
 {
 	long retval;
 	unsigned long pages;
@@ -229,6 +287,10 @@ SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len,
 	if (start & ~PAGE_MASK)
 		return -EINVAL;
 
+	/* Check that undefined flags are zero */
+	if (flags & ~MINCORE_ORDER)
+		return -EINVAL;
+
 	/* ..and we need to be passed a valid user-space range */
 	if (!access_ok(VERIFY_READ, (void __user *) start, len))
 		return -ENOMEM;
@@ -251,7 +313,7 @@ SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len,
 		 * the temporary buffer size.
 		 */
 		down_read(&current->mm->mmap_sem);
-		retval = do_mincore(start, min(pages, PAGE_SIZE), tmp);
+		retval = do_mincore(start, min(pages, PAGE_SIZE), tmp, flags);
 		up_read(&current->mm->mmap_sem);
 
 		if (retval <= 0)
@@ -268,3 +330,9 @@ SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len,
 	free_page((unsigned long) tmp);
 	return retval;
 }
+
+SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len,
+		unsigned char __user *, vec)
+{
+	return sys_mincore2(start, len, vec, 0);
+}

* [PATCH v2 3/3] mm, mincore2(): retrieve tlb-size attributes of an address range
@ 2016-09-15  6:54   ` Dan Williams
  0 siblings, 0 replies; 63+ messages in thread
From: Dan Williams @ 2016-09-15  6:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrea Arcangeli, Xiao Guangrong, Arnd Bergmann, linux-nvdimm,
	Dave Hansen, linux-kernel, npiggin, xfs, linux-fsdevel,
	Andrew Morton, hch, Kirill A. Shutemov

There are cases, particularly when testing and validating a
configuration, where it is useful to know the hardware mapping geometry
of the pages in a given process address range.  Consider filesystem-dax,
where a configuration needs to take care to align partitions and block
allocations before huge page mappings can be used, or anonymous
transparent huge pages, where a process is opportunistically assigned
large pages.  mincore2() allows these configurations to be surveyed and
validated.
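
As a point of reference, the residency survey that plain mincore(2)
already allows can be sketched in user space as below.  This is a minimal
sketch and not part of the patch: the helper name
survey_touched_anon_pages() is hypothetical, and it calls the existing
mincore() wrapper because the diffstat of this patch touches no syscall
table, so no mincore2() syscall number is assumed.  It reports only
residency (bit 0); the mapping order is what mincore2() would add.

```c
#define _DEFAULT_SOURCE		/* for mincore() and MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map 'npages' of anonymous memory, touch every
 * page, and return how many pages mincore(2) reports as resident.
 * Returns -1 on any error.
 */
static inline long survey_touched_anon_pages(size_t npages)
{
	long page_size = sysconf(_SC_PAGESIZE);
	volatile char *touch;
	unsigned char *vec;
	long resident = 0;
	size_t len, i;
	void *addr;

	if (page_size <= 0 || npages == 0)
		return -1;
	len = npages * (size_t)page_size;

	addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (addr == MAP_FAILED)
		return -1;

	vec = malloc(npages);
	if (!vec) {
		munmap(addr, len);
		return -1;
	}

	/* Fault in every page so that each one should be resident. */
	touch = addr;
	for (i = 0; i < npages; i++)
		touch[i * (size_t)page_size] = 1;

	if (mincore(addr, len, vec) == 0) {
		/* Only bit 0 is defined by mincore(2); mask the rest. */
		for (i = 0; i < npages; i++)
			resident += vec[i] & 1;
		assert(resident <= (long)npages);
	} else {
		resident = -1;
	}

	free(vec);
	munmap(addr, len);
	return resident;
}
```

Since every page is faulted in just before the query,
survey_touched_anon_pages(8) should normally return 8 on a system that is
not under memory pressure.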

The implementation takes advantage of the unused bits in the per-page
byte returned for each PAGE_SIZE extent of a given address range.  The
new format of each vector byte is:

(TLB_SHIFT - PAGE_SHIFT) << 1 | page_present
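
The byte layout above can be unpacked in user space roughly as follows.
This is a sketch under the assumption that the format is exactly bit 0
for presence and bits 1 through 5 for the order, as the
MINCORE_ORDER_MASK (0x3e) and MINCORE_ORDER_SHIFT (1) definitions in the
diff suggest.  The VEC_* macros and vec_*() helper names are made up for
illustration and are not exported by the patch.

```c
#include <assert.h>

/* Assumed layout, mirroring the patch: bit 0 = present, bits 1-5 = order. */
#define VEC_PRESENT	0x01
#define VEC_ORDER_MASK	0x3e
#define VEC_ORDER_SHIFT	1

/* Build a vector byte; a non-present page is reported as 0. */
static inline unsigned char vec_encode(unsigned int order, int present)
{
	/* The order must fit in the five bits reserved for it. */
	assert(order <= (VEC_ORDER_MASK >> VEC_ORDER_SHIFT));
	if (!present)
		return 0;
	return (unsigned char)((order << VEC_ORDER_SHIFT) | VEC_PRESENT);
}

/* Non-zero if the page is resident. */
static inline int vec_present(unsigned char byte)
{
	return byte & VEC_PRESENT;
}

/* Mapping order, i.e. TLB_SHIFT - PAGE_SHIFT in the notation above. */
static inline unsigned int vec_order(unsigned char byte)
{
	return (byte & VEC_ORDER_MASK) >> VEC_ORDER_SHIFT;
}

/* Size in bytes of the hardware mapping, given the base page shift. */
static inline unsigned long vec_mapping_size(unsigned char byte,
					     unsigned int page_shift)
{
	return 1UL << (page_shift + vec_order(byte));
}
```

For a present 2MB mapping on 4KB base pages (page shift 12),
vec_encode(9, 1) yields 0x13, and vec_mapping_size(0x13, 12) yields
2097152.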

[1]: https://lkml.org/lkml/2016/9/7/61

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/syscalls.h               |    2 
 include/uapi/asm-generic/mman-common.h |    2 
 kernel/sys_ni.c                        |    1 
 mm/mincore.c                           |  130 ++++++++++++++++++++++++--------
 4 files changed, 104 insertions(+), 31 deletions(-)

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index d02239022bd0..4aa2ee7e359a 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -467,6 +467,8 @@ asmlinkage long sys_munlockall(void);
 asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
 asmlinkage long sys_mincore(unsigned long start, size_t len,
 				unsigned char __user * vec);
+asmlinkage long sys_mincore2(unsigned long start, size_t len,
+				unsigned char __user * vec, int flags);
 
 asmlinkage long sys_pivot_root(const char __user *new_root,
 				const char __user *put_old);
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 58274382a616..6c7eca1a85ca 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -72,4 +72,6 @@
 #define MAP_HUGE_SHIFT	26
 #define MAP_HUGE_MASK	0x3f
 
+#define MINCORE_ORDER	1		/* retrieve hardware mapping-size-order */
+
 #endif /* __ASM_GENERIC_MMAN_COMMON_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 2c5e3a8e00d7..e14b87834054 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -197,6 +197,7 @@ cond_syscall(sys_mlockall);
 cond_syscall(sys_munlockall);
 cond_syscall(sys_mlock2);
 cond_syscall(sys_mincore);
+cond_syscall(sys_mincore2);
 cond_syscall(sys_madvise);
 cond_syscall(sys_mremap);
 cond_syscall(sys_remap_file_pages);
diff --git a/mm/mincore.c b/mm/mincore.c
index c0b5ba965200..b0b83ef086eb 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -15,25 +15,61 @@
 #include <linux/swap.h>
 #include <linux/swapops.h>
 #include <linux/hugetlb.h>
+#include <linux/dax.h>
 
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
 
+#ifndef MINCORE_ORDER
+#define MINCORE_ORDER 0
+#endif
+
+#define MINCORE_ORDER_MASK 0x3e
+#define MINCORE_ORDER_SHIFT 1
+
+struct mincore_params {
+	unsigned char *vec;
+	int flags;
+};
+
+static void mincore_set(unsigned char *vec, struct vm_area_struct *vma, int nr,
+		int flags)
+{
+	unsigned char mincore = 1;
+
+	if (!nr) {
+		*vec = 0;
+		return;
+	}
+
+	if (flags & MINCORE_ORDER) {
+		unsigned char order = ilog2(nr);
+
+		WARN_ON((order << MINCORE_ORDER_SHIFT) & ~MINCORE_ORDER_MASK);
+		mincore |= order << MINCORE_ORDER_SHIFT;
+	}
+	memset(vec, mincore, nr);
+}
+
 static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
 			unsigned long end, struct mm_walk *walk)
 {
 #ifdef CONFIG_HUGETLB_PAGE
+	struct mincore_params *p = walk->private;
+	int nr = (end - addr) >> PAGE_SHIFT;
+	unsigned char *vec = p->vec;
 	unsigned char present;
-	unsigned char *vec = walk->private;
 
 	/*
 	 * Hugepages under user process are always in RAM and never
 	 * swapped out, but theoretically it needs to be checked.
 	 */
 	present = pte && !huge_pte_none(huge_ptep_get(pte));
-	for (; addr != end; vec++, addr += PAGE_SIZE)
-		*vec = present;
-	walk->private = vec;
+	if (!present)
+		memset(vec, 0, nr);
+	else
+		mincore_set(vec, walk->vma, nr, p->flags);
+	p->vec = vec + nr;
 #else
 	BUG();
 #endif
@@ -82,20 +118,24 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff)
 }
 
 static int __mincore_unmapped_range(unsigned long addr, unsigned long end,
-				struct vm_area_struct *vma, unsigned char *vec)
+				struct vm_area_struct *vma, unsigned char *vec,
+				int flags)
 {
 	unsigned long nr = (end - addr) >> PAGE_SHIFT;
+	unsigned char present;
 	int i;
 
 	if (vma->vm_file) {
 		pgoff_t pgoff;
 
 		pgoff = linear_page_index(vma, addr);
-		for (i = 0; i < nr; i++, pgoff++)
-			vec[i] = mincore_page(vma->vm_file->f_mapping, pgoff);
+		for (i = 0; i < nr; i++, pgoff++) {
+			present = mincore_page(vma->vm_file->f_mapping, pgoff);
+			mincore_set(vec + i, vma, present, flags);
+		}
 	} else {
 		for (i = 0; i < nr; i++)
-			vec[i] = 0;
+			mincore_set(vec + i, vma, 0, flags);
 	}
 	return nr;
 }
@@ -103,8 +143,11 @@ static int __mincore_unmapped_range(unsigned long addr, unsigned long end,
 static int mincore_unmapped_range(unsigned long addr, unsigned long end,
 				   struct mm_walk *walk)
 {
-	walk->private += __mincore_unmapped_range(addr, end,
-						  walk->vma, walk->private);
+	struct mincore_params *p = walk->private;
+	int nr = __mincore_unmapped_range(addr, end, walk->vma, p->vec,
+			p->flags);
+
+	p->vec += nr;
 	return 0;
 }
 
@@ -114,18 +157,20 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	spinlock_t *ptl;
 	struct vm_area_struct *vma = walk->vma;
 	pte_t *ptep;
-	unsigned char *vec = walk->private;
+	struct mincore_params *p = walk->private;
+	unsigned char *vec = p->vec;
 	int nr = (end - addr) >> PAGE_SHIFT;
+	int flags = p->flags;
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
-		memset(vec, 1, nr);
+		mincore_set(vec, vma, nr, flags);
 		spin_unlock(ptl);
 		goto out;
 	}
 
 	if (pmd_trans_unstable(pmd)) {
-		__mincore_unmapped_range(addr, end, vma, vec);
+		__mincore_unmapped_range(addr, end, vma, vec, flags);
 		goto out;
 	}
 
@@ -135,9 +180,9 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 
 		if (pte_none(pte))
 			__mincore_unmapped_range(addr, addr + PAGE_SIZE,
-						 vma, vec);
+						 vma, vec, flags);
 		else if (pte_present(pte))
-			*vec = 1;
+			mincore_set(vec, vma, 1, flags);
 		else { /* pte is a swap entry */
 			swp_entry_t entry = pte_to_swp_entry(pte);
 
@@ -146,14 +191,17 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 				 * migration or hwpoison entries are always
 				 * uptodate
 				 */
-				*vec = 1;
+				mincore_set(vec, vma, 1, flags);
 			} else {
 #ifdef CONFIG_SWAP
-				*vec = mincore_page(swap_address_space(entry),
-					entry.val);
+				unsigned char present;
+
+				present = mincore_page(swap_address_space(entry),
+						entry.val);
+				mincore_set(vec, vma, present, flags);
 #else
 				WARN_ON(1);
-				*vec = 1;
+				mincore_set(vec, vma, 1, flags);
 #endif
 			}
 		}
@@ -161,7 +209,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	}
 	pte_unmap_unlock(ptep - 1, ptl);
 out:
-	walk->private += nr;
+	p->vec = vec + nr;
 	cond_resched();
 	return 0;
 }
@@ -171,16 +219,21 @@ out:
  * all the arguments, we hold the mmap semaphore: we should
  * just return the amount of info we're asked for.
  */
-static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *vec)
+static long do_mincore(unsigned long addr, unsigned long pages,
+		unsigned char *vec, int flags)
 {
 	struct vm_area_struct *vma;
 	unsigned long end;
 	int err;
+	struct mincore_params p = {
+		.vec = vec,
+		.flags = flags,
+	};
 	struct mm_walk mincore_walk = {
 		.pmd_entry = mincore_pte_range,
 		.pte_hole = mincore_unmapped_range,
 		.hugetlb_entry = mincore_hugetlb,
-		.private = vec,
+		.private = &p,
 	};
 
 	vma = find_vma(current->mm, addr);
@@ -195,13 +248,18 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v
 }
 
 /*
- * The mincore(2) system call.
+ * The mincore2(2) system call.
  *
- * mincore() returns the memory residency status of the pages in the
- * current process's address space specified by [addr, addr + len).
- * The status is returned in a vector of bytes.  The least significant
- * bit of each byte is 1 if the referenced page is in memory, otherwise
- * it is zero.
+ * mincore2() returns the memory residency status of the pages in the
+ * current process's address space specified by [addr, addr + len).  The
+ * status is returned in a vector of bytes.  The least significant bit
+ * of each byte is 1 if the referenced page is in memory, otherwise it
+ * is zero.  When 'flags' includes MINCORE_ORDER, each byte additionally
+ * contains the order of the hardware mapping backing that page (bits 1
+ * through 5 of each vector byte), where the order is log2 of the
+ * mapping size in base pages.  For example, a present 2MB-mapped huge
+ * page with 4KB base pages would correspond to 512 vector entries with
+ * the value (9 << 1) | (1) => 0x13
  *
  * Because the status of a page can change after mincore() checks it
  * but before it returns to the application, the returned vector may
@@ -218,8 +276,8 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v
  *		mapped
  *  -EAGAIN - A kernel resource was temporarily unavailable.
  */
-SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len,
-		unsigned char __user *, vec)
+SYSCALL_DEFINE4(mincore2, unsigned long, start, size_t, len,
+		unsigned char __user *, vec, int, flags)
 {
 	long retval;
 	unsigned long pages;
@@ -229,6 +287,10 @@ SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len,
 	if (start & ~PAGE_MASK)
 		return -EINVAL;
 
+	/* Check that undefined flags are zero */
+	if (flags & ~MINCORE_ORDER)
+		return -EINVAL;
+
 	/* ..and we need to be passed a valid user-space range */
 	if (!access_ok(VERIFY_READ, (void __user *) start, len))
 		return -ENOMEM;
@@ -251,7 +313,7 @@ SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len,
 		 * the temporary buffer size.
 		 */
 		down_read(&current->mm->mmap_sem);
-		retval = do_mincore(start, min(pages, PAGE_SIZE), tmp);
+		retval = do_mincore(start, min(pages, PAGE_SIZE), tmp, flags);
 		up_read(&current->mm->mmap_sem);
 
 		if (retval <= 0)
@@ -268,3 +330,9 @@ SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len,
 	free_page((unsigned long) tmp);
 	return retval;
 }
+
+SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len,
+		unsigned char __user *, vec)
+{
+	return sys_mincore2(start, len, vec, 0);
+}
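
To make the vector encoding described in the mincore2() comment concrete, here is a minimal decoding sketch. It only interprets a byte laid out as the comment describes (bit 0 = present, bits 1 through 5 = mapping order) and does not call the proposed syscall. The macro and helper names are hypothetical and are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Layout taken from the mincore2() comment; all names are illustrative. */
#define VEC_PRESENT      0x01u  /* bit 0: page is resident */
#define VEC_ORDER_SHIFT  1      /* bits 1..5 hold the mapping order */
#define VEC_ORDER_MASK   0x1fu

/* Non-zero if the page described by this vector byte is resident. */
static int vec_present(unsigned char b)
{
	return b & VEC_PRESENT;
}

/* Mapping order, i.e. log2 of the mapping size in base pages. */
static unsigned int vec_order(unsigned char b)
{
	return (b >> VEC_ORDER_SHIFT) & VEC_ORDER_MASK;
}

/* Size in bytes of the hardware mapping backing the page. */
static size_t vec_mapping_size(unsigned char b, size_t page_size)
{
	assert(page_size > 0);
	return page_size << vec_order(b);
}
```

With 4KB base pages, the example value 0x13 from the comment decodes to a present page with order 9, which is a 2MB mapping.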

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/3] mm, dax: add VM_DAX flag for DAX VMAs
  2016-09-15  6:54   ` Dan Williams
  (?)
@ 2016-09-15  8:26     ` Christoph Hellwig
  -1 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2016-09-15  8:26 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-mm, linux-nvdimm, david, linux-kernel, npiggin, xfs,
	linux-fsdevel, hch

On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote:
> The DAX property, page cache bypass, of a VMA is only detectable via the
> vma_is_dax() helper to check the S_DAX inode flag.  However, this is
> only available internal to the kernel and is a property that userspace
> applications would like to interrogate.

They have absolutely no business knowing such an implementation detail.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/3] mm, dax: add VM_DAX flag for DAX VMAs
  2016-09-15  8:26     ` Christoph Hellwig
  (?)
  (?)
@ 2016-09-15 17:01       ` Dan Williams
  -1 siblings, 0 replies; 63+ messages in thread
From: Dan Williams @ 2016-09-15 17:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, david, linux-kernel, Nicholas Piggin,
	XFS Developers, Linux MM, linux-fsdevel

On Thu, Sep 15, 2016 at 1:26 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote:
>> The DAX property, page cache bypass, of a VMA is only detectable via the
>> vma_is_dax() helper to check the S_DAX inode flag.  However, this is
>> only available internal to the kernel and is a property that userspace
>> applications would like to interrogate.
>
> They have absolutely no business knowing such an implementation detail.

Hasn't that train already left the station with FS_XFLAG_DAX?

The other problem with hiding the DAX property is that it turns out to
not be a transparent acceleration feature.  See xfs/086 xfs/088
xfs/089 xfs/091 which fail with DAX and, as far as I understand, it is
due to the fact that DAX disallows delayed allocation behavior.

If behavior changes I think we should indicate that to userspace and
VM_DAX is certainly more useful to userspace than some of the other vm
internals we already export in those flags.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/3] mm, dax: add VM_DAX flag for DAX VMAs
  2016-09-15 17:01       ` Dan Williams
  (?)
  (?)
@ 2016-09-15 17:09         ` Darrick J. Wong
  -1 siblings, 0 replies; 63+ messages in thread
From: Darrick J. Wong @ 2016-09-15 17:09 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm@lists.01.org, linux-kernel, Nicholas Piggin,
	XFS Developers, Linux MM, linux-fsdevel, Christoph Hellwig

On Thu, Sep 15, 2016 at 10:01:03AM -0700, Dan Williams wrote:
> On Thu, Sep 15, 2016 at 1:26 AM, Christoph Hellwig <hch@lst.de> wrote:
> > On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote:
> >> The DAX property, page cache bypass, of a VMA is only detectable via the
> >> vma_is_dax() helper to check the S_DAX inode flag.  However, this is
> >> only available internal to the kernel and is a property that userspace
> >> applications would like to interrogate.
> >
> > They have absolutely no business knowing such an implementation detail.
> 
> Hasn't that train already left the station with FS_XFLAG_DAX?

Seeing as FS_IOC_FSGETXATTR is a "generic" ioctl now, why not just
implement it for all the DAX fses and block devices?  Aside from xflags,
the other fields are probably all zero for non-xfs (aside from project
quota id I guess).

(Yeah, sort of awkward, I know...)
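
For reference, a minimal userspace sketch of the query being suggested: read the inode's xflags with FS_IOC_FSGETXATTR and test FS_XFLAG_DAX. The helper name is hypothetical, the fallback define assumes the value from <linux/fs.h>, and the ioctl fails on filesystems that do not implement it.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

#ifndef FS_XFLAG_DAX
#define FS_XFLAG_DAX 0x00008000	/* assumed fallback for older headers */
#endif

/*
 * Hypothetical helper: returns 1 if FS_XFLAG_DAX is set on the inode,
 * 0 if it is clear, and -1 on error with errno set (for example when
 * the filesystem does not implement FS_IOC_FSGETXATTR).
 */
static int inode_has_dax_xflag(const char *path)
{
	struct fsxattr fsx;
	int fd, ret, saved;

	assert(path != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	ret = ioctl(fd, FS_IOC_FSGETXATTR, &fsx);
	saved = errno;		/* close() may overwrite errno */
	close(fd);
	if (ret < 0) {
		errno = saved;
		return -1;
	}

	return (fsx.fsx_xflags & FS_XFLAG_DAX) ? 1 : 0;
}
```

This reports only the flag stored on the inode. It says nothing about which address range of a mapping, if any, is actually being served without the page cache.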

--D

> The other problem with hiding the DAX property is that it turns out to
> not be a transparent acceleration feature.  See xfs/086 xfs/088
> xfs/089 xfs/091 which fail with DAX and, as far as I understand, it is
> due to the fact that DAX disallows delayed allocation behavior.
> 
> If behavior changes I think we should indicate that to userspace and
> VM_DAX is certainly more useful to userspace than some of the other vm
> internals we already export in those flags.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/3] mm, dax: add VM_DAX flag for DAX VMAs
  2016-09-15 17:09         ` Darrick J. Wong
  (?)
  (?)
@ 2016-09-15 17:44           ` Dan Williams
  -1 siblings, 0 replies; 63+ messages in thread
From: Dan Williams @ 2016-09-15 17:44 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-nvdimm@lists.01.org, linux-kernel, Nicholas Piggin,
	XFS Developers, Linux MM, linux-fsdevel, Christoph Hellwig

On Thu, Sep 15, 2016 at 10:09 AM, Darrick J. Wong
<darrick.wong@oracle.com> wrote:
> On Thu, Sep 15, 2016 at 10:01:03AM -0700, Dan Williams wrote:
>> On Thu, Sep 15, 2016 at 1:26 AM, Christoph Hellwig <hch@lst.de> wrote:
>> > On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote:
>> >> The DAX property, page cache bypass, of a VMA is only detectable via the
>> >> vma_is_dax() helper to check the S_DAX inode flag.  However, this is
>> >> only available internal to the kernel and is a property that userspace
>> >> applications would like to interrogate.
>> >
>> > They have absolutely no business knowing such an implementation detail.
>>
>> Hasn't that train already left the station with FS_XFLAG_DAX?
>
> Seeing as FS_IOC_FSGETXATTR is a "generic" ioctl now, why not just
> implement it for all the DAX fses and block devices?  Aside from xflags,
> the other fields are probably all zero for non-xfs (aside from project
> quota id I guess).
>
> (Yeah, sort of awkward, I know...)

It would solve the problem at hand, I'll take a look.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/3] mm, dax: add VM_DAX flag for DAX VMAs
  2016-09-15 17:01       ` Dan Williams
  (?)
  (?)
@ 2016-09-15 23:07         ` Dave Chinner
  -1 siblings, 0 replies; 63+ messages in thread
From: Dave Chinner @ 2016-09-15 23:07 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, linux-kernel, Nicholas Piggin, XFS Developers,
	Linux MM, linux-fsdevel, Christoph Hellwig

On Thu, Sep 15, 2016 at 10:01:03AM -0700, Dan Williams wrote:
> On Thu, Sep 15, 2016 at 1:26 AM, Christoph Hellwig <hch@lst.de> wrote:
> > On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote:
> >> The DAX property, page cache bypass, of a VMA is only detectable via the
> >> vma_is_dax() helper to check the S_DAX inode flag.  However, this is
> >> only available internal to the kernel and is a property that userspace
> >> applications would like to interrogate.
> >
> > They have absolutely no business knowing such an implementation detail.
> 
> Hasn't that train already left the station with FS_XFLAG_DAX?

No, that's an admin flag, not a runtime hint for applications. Just
because that flag is set on an inode, it does not mean that DAX is
actually in use - it will be ignored if the backing dev is not dax
capable.

> The other problem with hiding the DAX property is that it turns out to
> not be a transparent acceleration feature.  See xfs/086 xfs/088
> xfs/089 xfs/091 which fail with DAX and, as far as I understand, it is
> due to the fact that DAX disallows delayed allocation behavior.

Which is not a bug, nor is it something that app developers should
be surprised by.

i.e. Subtle differences in error reporting behaviour occur in
filesystems /all the time/. Run the test on a non-dax filesystem
with an extent size hint. It fails /exactly the same way as DAX/.
Run it with direct IO - fails the same way as DAX. Run it
with synchronous writes - it fails the same way as DAX.

IOWs, if an app can't handle the way DAX reports errors, then they
are /broken/. Delayed allocation requires checking the return value
of fsync() or close() to capture the allocation error - many more
apps get that wrong than the ones that expect the immediate errors
from write()...
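
A minimal sketch of the error handling described above, assuming a plain POSIX file write. It checks write() for immediate errors, then fsync() and close() for errors that may only surface later, for example with delayed allocation. The function name is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: write a buffer to 'path' and make sure no
 * allocation or I/O error goes unnoticed.  Returns 0 on success and
 * -1 on failure with errno set.
 */
static int write_and_sync(const char *path, const void *buf, size_t len)
{
	const char *p = buf;
	int fd, saved;

	assert(path != NULL && buf != NULL);

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;

	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			goto fail;	/* immediate error, e.g. ENOSPC */
		}
		p += n;
		len -= (size_t)n;
	}

	/* Errors deferred by delayed allocation can show up here ... */
	if (fsync(fd) < 0)
		goto fail;

	/* ... or here, so the return value of close() is checked too. */
	if (close(fd) < 0)
		return -1;

	return 0;

fail:
	saved = errno;
	close(fd);
	errno = saved;
	return -1;
}
```

A caller that handles all three failure points behaves the same whether the filesystem reports the error from write() or defers it.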

Anyway: to demonstrate that nothing is actually broken, and that
you might sometimes need to fix tests and send patches to
fstests@vger.kernel.org, this makes xfs/086 pass for me on DAX:

--- a/tests/xfs/086
+++ b/tests/xfs/086
@@ -96,7 +96,8 @@ _scratch_mount
 
 echo "+ modify files"
 for x in `seq 1 64`; do
-	$XFS_IO_PROG -f -c "pwrite -S 0x62 0 ${blksz}" "${TESTFILE}.${x}" >> $seqres.full
+	$XFS_IO_PROG -f -c "pwrite -S 0x62 0 ${blksz}" "${TESTFILE}.${x}" \
+		>> $seqres.full 2>&1
 done
 umount "${SCRATCH_MNT}"
 
Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/3] mm, dax: add VM_DAX flag for DAX VMAs
@ 2016-09-15 23:07         ` Dave Chinner
  0 siblings, 0 replies; 63+ messages in thread
From: Dave Chinner @ 2016-09-15 23:07 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Linux MM, linux-nvdimm@lists.01.org,
	linux-kernel, Nicholas Piggin, XFS Developers, linux-fsdevel

On Thu, Sep 15, 2016 at 10:01:03AM -0700, Dan Williams wrote:
> On Thu, Sep 15, 2016 at 1:26 AM, Christoph Hellwig <hch@lst.de> wrote:
> > On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote:
> >> The DAX property, page cache bypass, of a VMA is only detectable via the
> >> vma_is_dax() helper to check the S_DAX inode flag.  However, this is
> >> only available internal to the kernel and is a property that userspace
> >> applications would like to interrogate.
> >
> > They have absolutely no business knowing such an implementation detail.
> 
> Hasn't that train already left the station with FS_XFLAG_DAX?

No, that's an admin flag, not a runtime hint for applications. Just
because that flag is set on an inode, it does not mean that DAX is
actually in use - it will be ignored if the backing dev is not dax
capable.

> The other problem with hiding the DAX property is that it turns out to
> not be a transparent acceleration feature.  See xfs/086 xfs/088
> xfs/089 xfs/091 which fail with DAX and, as far as I understand, it is
> due to the fact that DAX disallows delayed allocation behavior.

Which is not a bug, nor is it something that app developers should
be surprised by.

i.e. Subtle differences in error reporting behaviour occur in
filesystems /all the time/. Run the test on a non-dax filesystem
with an extent size hint. It fails /exactly the same way as DAX/.
Run it with direct IO - fails the same way as DAX. Run it
with synchronous writes - it fails the same way as DAX.

IOWs, if an app can't handle the way DAX reports errors, then they
are /broken/. Delayed allocation requires checking the return value
of fsync() or close() to capture the allocation error - many more
apps get that wrong than the ones that expect the immediate errors
from write()...
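
A minimal userspace sketch of that point, written in Python for
brevity. The helper name write_checked() is made up for illustration
and is not part of fstests or xfs_io. os.write(), os.fsync() and
os.close() raise OSError when the underlying system call fails, so the
sketch records the first stage at which an error surfaces. Per the
paragraph above, DAX, direct IO and synchronous writes report the
error at the write stage, while delayed allocation may only report it
at fsync or close:

```python
import os


def write_checked(path, data):
    """Write data to path, checking every stage that can report an error.

    Returns None on success, or a (stage, OSError) tuple naming the first
    stage ("write", "fsync" or "close") at which an error surfaced.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    failure = None
    try:
        view = memoryview(data)
        while view:
            # Immediate error reporting (DAX, direct IO, sync writes)
            # shows up here; os.write() may also write fewer bytes than asked.
            view = view[os.write(fd, view):]
    except OSError as err:
        failure = ("write", err)
    if failure is None:
        try:
            # Deferred errors are reported once the data is flushed.
            os.fsync(fd)
        except OSError as err:
            failure = ("fsync", err)
    try:
        # close() can also report an error, so it is checked as well.
        os.close(fd)
    except OSError as err:
        if failure is None:
            failure = ("close", err)
    return failure
```

A caller that only looks at the write stage misses the deferred case,
and a caller that only looks at fsync misses the immediate one; code
that treats any non-None result as a failure handles both.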

Anyway: to demonstrate that nothing is actually broken, and that
you might sometimes need to fix tests and send patches to
fstests@vger.kernel.org, this makes xfs/086 pass for me on DAX:

--- a/tests/xfs/086
+++ b/tests/xfs/086
@@ -96,7 +96,8 @@ _scratch_mount
 
 echo "+ modify files"
 for x in `seq 1 64`; do
-	$XFS_IO_PROG -f -c "pwrite -S 0x62 0 ${blksz}" "${TESTFILE}.${x}" >> $seqres.full
+	$XFS_IO_PROG -f -c "pwrite -S 0x62 0 ${blksz}" "${TESTFILE}.${x}" \
+		>> $seqres.full 2>&1
 done
 umount "${SCRATCH_MNT}"
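
The change above only adds "2>&1", so that anything xfs_io prints on
stderr is appended to $seqres.full together with its stdout. fstests
compares a test's output against a golden output file, so a message
that escapes on stderr is presumably what turned the immediate write
error into a test failure. The effect of the redirection can be
sketched in Python, where subprocess.STDOUT plays the role of 2>&1.
The child command and the run_logged() helper are stand-ins invented
for this illustration:

```python
import subprocess
import sys

# Stand-in for the xfs_io call: one line on stdout, one on stderr.
CHILD = [
    sys.executable,
    "-c",
    "import sys; print('wrote 4096'); print('pwrite: error', file=sys.stderr)",
]


def run_logged(log_path, merge_stderr):
    """Append the child's stdout to log_path, optionally with its stderr.

    Returns the stderr text that escaped the log (empty when merged).
    """
    with open(log_path, "a") as log:
        result = subprocess.run(
            CHILD,
            stdout=log,
            # subprocess.STDOUT is the equivalent of the shell's 2>&1.
            stderr=subprocess.STDOUT if merge_stderr else subprocess.PIPE,
            text=True,
        )
    return result.stderr or ""
```

With merge_stderr=False the error line comes back to the caller, much
as it would leak into the test output; with merge_stderr=True it lands
in the log file instead.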
 
Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/3] mm, dax: add VM_DAX flag for DAX VMAs
  2016-09-15 23:07         ` Dave Chinner
  (?)
  (?)
@ 2016-09-15 23:19           ` Dan Williams
  -1 siblings, 0 replies; 63+ messages in thread
From: Dan Williams @ 2016-09-15 23:19 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-nvdimm, linux-kernel, Nicholas Piggin, XFS Developers,
	Linux MM, linux-fsdevel, Christoph Hellwig

On Thu, Sep 15, 2016 at 4:07 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Sep 15, 2016 at 10:01:03AM -0700, Dan Williams wrote:
>> On Thu, Sep 15, 2016 at 1:26 AM, Christoph Hellwig <hch@lst.de> wrote:
>> > On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote:
>> >> The DAX property, page cache bypass, of a VMA is only detectable via the
>> >> vma_is_dax() helper to check the S_DAX inode flag.  However, this is
>> >> only available internal to the kernel and is a property that userspace
>> >> applications would like to interrogate.
>> >
>> > They have absolutely no business knowing such an implementation detail.
>>
>> Hasn't that train already left the station with FS_XFLAG_DAX?
>
> No, that's an admin flag, not a runtime hint for applications. Just
> because that flag is set on an inode, it does not mean that DAX is
> actually in use - it will be ignored if the backing dev is not dax
> capable.

Ok, but then VM_DAX does not suffer from that problem.  I'm trying to
understand why VM_DAX has no business being in the smaps "VmFlags"
line, but something ambiguous to userspace like VM_MIXEDMAP does?
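
For reference, the VmFlags line is reported per mapping in
/proc/<pid>/smaps as a list of two-letter codes; "mm" is the
documented code for a mixed map area (VM_MIXEDMAP). A rough Python
sketch of pulling those codes out follows. The sample text and the
function name are made up for illustration, and no code for the
proposed VM_DAX flag is assumed because the text above does not give
one:

```python
import re

# A mapping header starts with "<start>-<end>" in hexadecimal.
HEADER = re.compile(r"^([0-9a-f]+)-([0-9a-f]+)\s")

# Hypothetical smaps excerpt for a single shared file mapping.
SAMPLE = """\
7f0000000000-7f0000200000 rw-s 00000000 103:00 12 /mnt/pmem/foo
Size:               2048 kB
Rss:                   0 kB
VmFlags: rd wr sh mr mw me ms mm
"""


def vmflags_by_mapping(smaps_text):
    """Map (start, end) address pairs to their list of VmFlags codes."""
    flags = {}
    current = None
    for line in smaps_text.splitlines():
        match = HEADER.match(line)
        if match:
            current = (int(match.group(1), 16), int(match.group(2), 16))
        elif line.startswith("VmFlags:") and current is not None:
            flags[current] = line.split()[1:]
    return flags
```

On a live system the same function could be fed the contents of
/proc/self/smaps; here vmflags_by_mapping(SAMPLE) yields one entry
whose code list contains "mm".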

>
>> The other problem with hiding the DAX property is that it turns out to
>> not be a transparent acceleration feature.  See xfs/086 xfs/088
>> xfs/089 xfs/091 which fail with DAX and, as far as I understand, it is
>> due to the fact that DAX disallows delayed allocation behavior.
>
> Which is not a bug, nor is it something that app developers should
> be surprised by.
>
> i.e. Subtle differences in error reporting behaviour occur in
> filesystems /all the time/. Run the test on a non-dax filesystem
> with an extent size hint. It fails /exactly the same way as DAX/.
> Run it with direct IO - fails the same way as DAX. Run it
> with synchronous writes - it fails the same way as DAX.
>
> IOWs, if an app can't handle the way DAX reports errors, then it is
> /broken/. Delayed allocation requires checking the return value
> of fsync() or close() to capture the allocation error - many more
> apps get that wrong than the ones that expect the immediate errors
> from write()...
>
> Anyway: to demonstrate that nothing is actually broken, and that
> you might sometimes need to fix tests and send patches to
> fstests@vger.kernel.org, this makes xfs/086 pass for me on DAX:
>
> --- a/tests/xfs/086
> +++ b/tests/xfs/086
> @@ -96,7 +96,8 @@ _scratch_mount
>
>  echo "+ modify files"
>  for x in `seq 1 64`; do
> -       $XFS_IO_PROG -f -c "pwrite -S 0x62 0 ${blksz}" "${TESTFILE}.${x}" >> $seqres.full
> +       $XFS_IO_PROG -f -c "pwrite -S 0x62 0 ${blksz}" "${TESTFILE}.${x}" \
> +               >> $seqres.full 2>&1
>  done
>  umount "${SCRATCH_MNT}"

Thanks for that!  Wasn't immediately obvious to me, and didn't get
that response when I asked on the list a while back.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/3] mm, dax: add VM_DAX flag for DAX VMAs
  2016-09-15 23:07         ` Dave Chinner
  (?)
  (?)
@ 2016-09-16  0:16           ` Dan Williams
  -1 siblings, 0 replies; 63+ messages in thread
From: Dan Williams @ 2016-09-16  0:16 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-nvdimm, linux-kernel, Nicholas Piggin, XFS Developers,
	Linux MM, linux-fsdevel, Christoph Hellwig

On Thu, Sep 15, 2016 at 4:07 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Sep 15, 2016 at 10:01:03AM -0700, Dan Williams wrote:
>> On Thu, Sep 15, 2016 at 1:26 AM, Christoph Hellwig <hch@lst.de> wrote:
>> > On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote:
>> >> The DAX property, page cache bypass, of a VMA is only detectable via the
>> >> vma_is_dax() helper to check the S_DAX inode flag.  However, this is
>> >> only available internal to the kernel and is a property that userspace
>> >> applications would like to interrogate.
>> >
>> > They have absolutely no business knowing such an implementation detail.
>>
>> Hasn't that train already left the station with FS_XFLAG_DAX?
>
> No, that's an admin flag, not a runtime hint for applications. Just
> because that flag is set on an inode, it does not mean that DAX is
> actually in use - it will be ignored if the backing dev is not dax
> capable.
>

What's the point of an admin flag if an admin can't do cat /proc/<pid
of interest>/smaps, or some other mechanism, to validate that the
setting the admin cares about is in effect?

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/3] mm, dax: add VM_DAX flag for DAX VMAs
  2016-09-16  0:16           ` Dan Williams
  (?)
  (?)
@ 2016-09-16  1:24             ` Dave Chinner
  -1 siblings, 0 replies; 63+ messages in thread
From: Dave Chinner @ 2016-09-16  1:24 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, linux-kernel, Nicholas Piggin, XFS Developers,
	Linux MM, linux-fsdevel, Christoph Hellwig

On Thu, Sep 15, 2016 at 05:16:42PM -0700, Dan Williams wrote:
> On Thu, Sep 15, 2016 at 4:07 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Thu, Sep 15, 2016 at 10:01:03AM -0700, Dan Williams wrote:
> >> On Thu, Sep 15, 2016 at 1:26 AM, Christoph Hellwig <hch@lst.de> wrote:
> >> > On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote:
> >> >> The DAX property, page cache bypass, of a VMA is only detectable via the
> >> >> vma_is_dax() helper to check the S_DAX inode flag.  However, this is
> >> >> only available internal to the kernel and is a property that userspace
> >> >> applications would like to interrogate.
> >> >
> >> > They have absolutely no business knowing such an implementation detail.
> >>
> >> Hasn't that train already left the station with FS_XFLAG_DAX?
> >
> > No, that's an admin flag, not a runtime hint for applications. Just
> > because that flag is set on an inode, it does not mean that DAX is
> > actually in use - it will be ignored if the backing dev is not dax
> > capable.
> 
> What's the point of an admin flag if an admin can't do cat /proc/<pid
> of interest>/smaps, or some other mechanism, to validate that the
> setting the admin cares about is in effect?

Sorry, I don't follow - why would you be looking at mapping file
regions in /proc to determine if some file somewhere in a filesystem
has a specific flag set on it or not?

FS_XFLAG_DAX is an inode attribute flag, not something you can
query or administrate through mmap:

I.e.
# xfs_io -c "lsattr" -c "chattr +x" -c lsattr -c "chattr -x" -c "lsattr" foo
 --------------- foo
 --------------x foo
 --------------- foo
#

What happens when that flag is set on an inode is determined by a
whole bunch of other things that are completely separate to the
management of the inode flag itself.
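
The same attribute can also be read programmatically. The sketch
below assumes the FS_IOC_FSGETXATTR ioctl and the struct fsxattr
layout from current <linux/fs.h>; the numeric values are hard-coded
as they appear on x86-64 and should be treated as assumptions on
other architectures. The helper names are hypothetical. In line with
the point above, a set bit only says that the inode carries the flag,
not that DAX is in use for any mapping of the file:

```python
import fcntl
import os
import struct

# Assumed values from <linux/fs.h>. The ioctl number is the x86-64 encoding
# of _IOR('X', 31, struct fsxattr) and differs on some architectures.
FS_IOC_FSGETXATTR = 0x801C581F
FS_XFLAG_DAX = 0x00008000

# struct fsxattr: five __u32 fields followed by 8 bytes of padding (28 bytes).
# fsx_xflags is the first field.
FSXATTR = struct.Struct("=IIIII8x")


def has_dax_xflag(raw):
    """Return True if a raw struct fsxattr has FS_XFLAG_DAX set in fsx_xflags."""
    xflags = FSXATTR.unpack(raw)[0]
    return bool(xflags & FS_XFLAG_DAX)


def inode_has_dax_xflag(path):
    """Query the inode attribute flag of path.

    Raises OSError if the filesystem does not implement the ioctl.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        raw = fcntl.ioctl(fd, FS_IOC_FSGETXATTR, bytes(FSXATTR.size))
    finally:
        os.close(fd)
    return has_dax_xflag(raw)
```

For the file in the xfs_io session above, inode_has_dax_xflag("foo")
would be expected to follow the "x" column of lsattr, assuming the
constants match the running kernel.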

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/3] mm, dax: add VM_DAX flag for DAX VMAs
  2016-09-16  1:24             ` Dave Chinner
  (?)
  (?)
@ 2016-09-16  2:04               ` Dan Williams
  -1 siblings, 0 replies; 63+ messages in thread
From: Dan Williams @ 2016-09-16  2:04 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-nvdimm, linux-kernel, Nicholas Piggin, XFS Developers,
	Linux MM, linux-fsdevel, Christoph Hellwig

On Thu, Sep 15, 2016 at 6:24 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Sep 15, 2016 at 05:16:42PM -0700, Dan Williams wrote:
>> On Thu, Sep 15, 2016 at 4:07 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Thu, Sep 15, 2016 at 10:01:03AM -0700, Dan Williams wrote:
>> >> On Thu, Sep 15, 2016 at 1:26 AM, Christoph Hellwig <hch@lst.de> wrote:
>> >> > On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote:
>> >> >> The DAX property, page cache bypass, of a VMA is only detectable via the
>> >> >> vma_is_dax() helper to check the S_DAX inode flag.  However, this is
>> >> >> only available internal to the kernel and is a property that userspace
>> >> >> applications would like to interrogate.
>> >> >
>> >> > They have absolutely no business knowing such an implementation detail.
>> >>
>> >> Hasn't that train already left the station with FS_XFLAG_DAX?
>> >
>> > No, that's an admin flag, not a runtime hint for applications. Just
>> > because that flag is set on an inode, it does not mean that DAX is
>> > actually in use - it will be ignored if the backing dev is not dax
>> > capable.
>>
>> What's the point of an admin flag if an admin can't do cat /proc/<pid
>> of interest>/smaps, or some other mechanism, to validate that the
>> setting the admin cares about is in effect?
>
> Sorry, I don't follow - why would you be looking at mapping file
> regions in /proc to determine if some file somewhere in a filesystem
> has a specific flag set on it or not?
>
> FS_XFLAG_DAX is an inode attribute flag, not something you can
> query or administrate through mmap:
>
> I.e.
> # xfs_io -c "lsattr" -c "chattr +x" -c lsattr -c "chattr -x" -c "lsattr" foo
>  --------------- foo
>  --------------x foo
>  --------------- foo
> #
>
> What happens when that flag is set on an inode is determined by a
> whole bunch of other things that are completely separate to the
> management of the inode flag itself.

Right, I understand that, but how does an admin audit those "bunch of
other things" that actually gate whether DAX ends up being used in
practice?  There's currently no way for userspace to observe that a
file with FS_XFLAG_DAX actually results in a change in mmap behavior.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/3] mm, dax: add VM_DAX flag for DAX VMAs
  2016-09-16  2:04               ` Dan Williams
  (?)
  (?)
@ 2016-09-16  3:41                 ` Dan Williams
  -1 siblings, 0 replies; 63+ messages in thread
From: Dan Williams @ 2016-09-16  3:41 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-nvdimm, linux-kernel, Nicholas Piggin, XFS Developers,
	Linux MM, linux-fsdevel, Christoph Hellwig

On Thu, Sep 15, 2016 at 7:04 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Thu, Sep 15, 2016 at 6:24 PM, Dave Chinner <david@fromorbit.com> wrote:
>> On Thu, Sep 15, 2016 at 05:16:42PM -0700, Dan Williams wrote:
>>> On Thu, Sep 15, 2016 at 4:07 PM, Dave Chinner <david@fromorbit.com> wrote:
>>> > On Thu, Sep 15, 2016 at 10:01:03AM -0700, Dan Williams wrote:
>>> >> On Thu, Sep 15, 2016 at 1:26 AM, Christoph Hellwig <hch@lst.de> wrote:
>>> >> > On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote:
>>> >> >> The DAX property, page cache bypass, of a VMA is only detectable via the
>>> >> >> vma_is_dax() helper to check the S_DAX inode flag.  However, this is
>>> >> >> only available internal to the kernel and is a property that userspace
>>> >> >> applications would like to interrogate.
>>> >> >
>>> >> > They have absolutely no business knowing such an implementation detail.
>>> >>
>>> >> Hasn't that train already left the station with FS_XFLAG_DAX?
>>> >
>>> > No, that's an admin flag, not a runtime hint for applications. Just
>>> > because that flag is set on an inode, it does not mean that DAX is
>>> > actually in use - it will be ignored if the backing dev is not dax
>>> > capable.
>>>
>>> What's the point of an admin flag if an admin can't do cat /proc/<pid
>>> of interest>/smaps, or some other mechanism, to validate that the
>>> setting the admin cares about is in effect?
>>
>> Sorry, I don't follow - why would you be looking at mapping file
>> regions in /proc to determine if some file somewhere in a filesystem
>> has a specific flag set on it or not?
>>
>> FS_XFLAG_DAX is an inode attribute flag, not something you can
>> query or administrate through mmap:
>>
>> I.e.
>> # xfs_io -c "lsattr" -c "chattr +x" -c lsattr -c "chattr -x" -c "lsattr" foo
>>  --------------- foo
>>  --------------x foo
>>  --------------- foo
>> #
>>
>> What happens when that flag is set on an inode is determined by a
>> whole bunch of other things that are completely separate to the
>> management of the inode flag itself.
>
> Right, I understand that, but how does an admin audit those "bunch of
> other things" that actually gate whether DAX ends up being used in
> practice?  There's currently no way for userspace to observe that a
> file with FS_XFLAG_DAX actually results in a change in mmap behavior.

Let me put it another way: suppose we inadvertently break DAX, causing
it to be disabled in scenarios where it should be enabled.  What is the
interface for the admin to check "I have the DAX inode flag set, but
the file this application expects to be mapped DAX is mapped with page
cache"?
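
For illustration only, the kind of userspace check being asked for
might look like the sketch below.  It scans the VmFlags line of each
mapping in /proc/<pid>/smaps for a two-letter mnemonic.  The mnemonic
a VM_DAX flag would be reported under is not stated in this message,
so it is passed in as a parameter; vmflags_has() and
smaps_count_flag() are hypothetical helper names, not existing
interfaces.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return 1 if `line` is a "VmFlags:" line from /proc/<pid>/smaps that
 * contains the mnemonic `flag` as a whole token, 0 otherwise.
 */
int vmflags_has(const char *line, const char *flag)
{
	static const char prefix[] = "VmFlags:";
	size_t flen = strlen(flag);
	const char *p;

	if (strncmp(line, prefix, sizeof(prefix) - 1) != 0)
		return 0;
	p = line + sizeof(prefix) - 1;

	while (*p) {
		size_t len;

		p += strspn(p, " \t\n");	/* skip separators */
		len = strcspn(p, " \t\n");	/* length of this mnemonic */
		if (len == 0)
			break;
		if (len == flen && strncmp(p, flag, len) == 0)
			return 1;
		p += len;
	}
	return 0;
}

/*
 * Count the mappings listed in `smaps_path` whose VmFlags line carries
 * `flag`.  Returns -1 if the file cannot be opened.
 */
int smaps_count_flag(const char *smaps_path, const char *flag)
{
	char line[1024];
	int count = 0;
	FILE *f = fopen(smaps_path, "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f))
		count += vmflags_has(line, flag);
	fclose(f);
	return count;
}
```

Calling smaps_count_flag("/proc/self/smaps", "rd") counts the readable
mappings of the current process; the same call with whatever mnemonic
a DAX flag ends up using, if one is ever exported, would give the
audit described above.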

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/3] mm, dax: add VM_DAX flag for DAX VMAs
  2016-09-16  2:04               ` Dan Williams
  (?)
  (?)
@ 2016-09-16  5:36                 ` Dave Chinner
  -1 siblings, 0 replies; 63+ messages in thread
From: Dave Chinner @ 2016-09-16  5:36 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, linux-kernel, Nicholas Piggin, XFS Developers,
	Linux MM, linux-fsdevel, Christoph Hellwig

On Thu, Sep 15, 2016 at 07:04:27PM -0700, Dan Williams wrote:
> On Thu, Sep 15, 2016 at 6:24 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Thu, Sep 15, 2016 at 05:16:42PM -0700, Dan Williams wrote:
> >> On Thu, Sep 15, 2016 at 4:07 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> > On Thu, Sep 15, 2016 at 10:01:03AM -0700, Dan Williams wrote:
> >> >> On Thu, Sep 15, 2016 at 1:26 AM, Christoph Hellwig <hch@lst.de> wrote:
> >> >> > On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote:
> >> >> >> The DAX property, page cache bypass, of a VMA is only detectable via the
> >> >> >> vma_is_dax() helper to check the S_DAX inode flag.  However, this is
> >> >> >> only available internal to the kernel and is a property that userspace
> >> >> >> applications would like to interrogate.
> >> >> >
> >> >> > They have absolutely no business knowing such an implementation detail.
> >> >>
> >> >> Hasn't that train already left the station with FS_XFLAG_DAX?
> >> >
> >> > No, that's an admin flag, not a runtime hint for applications. Just
> >> > because that flag is set on an inode, it does not mean that DAX is
> >> > actually in use - it will be ignored if the backing dev is not dax
> >> > capable.
> >>
> >> What's the point of an admin flag if an admin can't do cat /proc/<pid
> >> of interest>/smaps, or some other mechanism, to validate that the
> >> setting the admin cares about is in effect?
> >
> > Sorry, I don't follow - why would you be looking at mapping file
> > regions in /proc to determine if some file somewhere in a filesystem
> > has a specific flag set on it or not?
> >
> > FS_XFLAG_DAX is an inode attribute flag, not something you can
> > query or administrate through mmap:
> >
> > I.e.
> > # xfs_io -c "lsattr" -c "chattr +x" -c lsattr -c "chattr -x" -c "lsattr" foo
> >  --------------- foo
> >  --------------x foo
> >  --------------- foo
> > #
> >
> > What happens when that flag is set on an inode is determined by a
> > whole bunch of other things that are completely separate to the
> > management of the inode flag itself.
> 
> Right, I understand that, but how does an admin audit those "bunch of
> other things"

The filesystem mount checks all the various stuff that determines
whether DAX can be used. It logs to the console that it is "Dax
capable". Any file that then has FS_XFLAG_DAX set will result in DAX
being used. There is no other possibility when these two things are
reported.

/me points at runtime diagnostic tracepoints like
trace_xfs_file_dax_read() and notes that dax is sadly lacking in
diagnostic tracepoints.

Besides, userspace can't do anything useful with this information,
because the FS_XFLAG_DAX can be changed /at any time/ by an admin.
And the filesystem is free to remove it at any time, too, if it
needs to (e.g. file gets reflinked or snapshotted).

That's right, an inode can dynamically change from DAX to non-DAX
underneath the application, and the application /will not notice/.
That's because changing the flag will sync and invalidate the
existing mappings and the next application access will simply fault
it back in using whatever mechanism the inode is now configured
with.

Plain and simple: userspace has absolutely no fucking idea of
whether DAX is enabled or not, and whatever the kernel returns to
userspace about the DAX configuration is stale before it even got
out of the kernel....
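
As a concrete illustration of the distinction drawn above, the inode
flag itself can be read from userspace with the FS_IOC_FSGETXATTR
ioctl from <linux/fs.h>.  The sketch below is a minimal example under
that assumption; xflags_has_dax() and file_dax_xflag() are
hypothetical helper names.  It reports only whether FS_XFLAG_DAX is
set on the inode at the moment of the call.  It says nothing about
whether DAX is actually in use, and the answer may already be stale
by the time it is returned.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>	/* struct fsxattr, FS_IOC_FSGETXATTR, FS_XFLAG_DAX */

/* Return 1 if the FS_XFLAG_DAX bit is set in an fsx_xflags value. */
int xflags_has_dax(uint32_t xflags)
{
	return (xflags & FS_XFLAG_DAX) != 0;
}

/*
 * Query the inode attribute flags of `path`.
 * Returns 1 if FS_XFLAG_DAX is set, 0 if it is clear, and -1 if the
 * file cannot be opened or the filesystem does not support the ioctl.
 */
int file_dax_xflag(const char *path)
{
	struct fsxattr fsx;
	int fd, ret;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	ret = ioctl(fd, FS_IOC_FSGETXATTR, &fsx);
	close(fd);
	if (ret < 0)
		return -1;

	return xflags_has_dax(fsx.fsx_xflags);
}
```

A caller would treat the return value of file_dax_xflag("foo") as a
snapshot of the admin setting only, which matches the role of the
flag described in this message.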

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/3] mm, dax: add VM_DAX flag for DAX VMAs
  2016-09-16  5:36                 ` Dave Chinner
  (?)
  (?)
@ 2016-09-16 10:47                   ` Dan Williams
  -1 siblings, 0 replies; 63+ messages in thread
From: Dan Williams @ 2016-09-16 10:47 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-nvdimm, linux-kernel, Nicholas Piggin, XFS Developers,
	Linux MM, linux-fsdevel, Christoph Hellwig

On Thu, Sep 15, 2016 at 10:36 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Sep 15, 2016 at 07:04:27PM -0700, Dan Williams wrote:
>> On Thu, Sep 15, 2016 at 6:24 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Thu, Sep 15, 2016 at 05:16:42PM -0700, Dan Williams wrote:
>> >> On Thu, Sep 15, 2016 at 4:07 PM, Dave Chinner <david@fromorbit.com> wrote:
>> >> > On Thu, Sep 15, 2016 at 10:01:03AM -0700, Dan Williams wrote:
>> >> >> On Thu, Sep 15, 2016 at 1:26 AM, Christoph Hellwig <hch@lst.de> wrote:
>> >> >> > On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote:
>> >> >> >> The DAX property, page cache bypass, of a VMA is only detectable via the
>> >> >> >> vma_is_dax() helper to check the S_DAX inode flag.  However, this is
>> >> >> >> only available internal to the kernel and is a property that userspace
>> >> >> >> applications would like to interrogate.
>> >> >> >
>> >> >> > They have absolutely no business knowing such an implementation detail.
>> >> >>
>> >> >> Hasn't that train already left the station with FS_XFLAG_DAX?
>> >> >
>> >> > No, that's an admin flag, not a runtime hint for applications. Just
>> >> > because that flag is set on an inode, it does not mean that DAX is
>> >> > actually in use - it will be ignored if the backing dev is not dax
>> >> > capable.
>> >>
>> >> What's the point of an admin flag if an admin can't do cat /proc/<pid
>> >> of interest>/smaps, or some other mechanism, to validate that the
>> >> setting the admin cares about is in effect?
>> >
>> > Sorry, I don't follow - why would you be looking at mapping file
>> > regions in /proc to determine if some file somewhere in a filesystem
>> > has a specific flag set on it or not?
>> >
>> > FS_XFLAG_DAX is an inode attribute flag, not something you can
>> > query or administrate through mmap:
>> >
>> > I.e.
>> > # xfs_io -c "lsattr" -c "chattr +x" -c lsattr -c "chattr -x" -c "lsattr" foo
>> >  --------------- foo
>> >  --------------x foo
>> >  --------------- foo
>> > #
>> >
>> > What happens when that flag is set on an inode is determined by a
>> > whole bunch of other things that are completely separate to the
>> > management of the inode flag itself.
>>
>> Right, I understand that, but how does an admin audit those "bunch of
>> other things"?
>
> The filesystem mount checks all the various stuff that determines
> whether DAX can be used. It logs to the console that it is "Dax
> capable". Any file that then has FS_XFLAG_DAX set will result in DAX
> being used. There is no other possibility when these two things are
> reported.
>
> /me points at runtime diagnostic tracepoints like
> trace_xfs_file_dax_read() and notes that dax is sadly lacking in
> diagnostic tracepoints.
>
> Besides, userspace can't do anything useful with this information,
> because the FS_XFLAG_DAX flag can be changed /at any time/ by an admin.
> And the filesystem is free to remove it at any time, too, if it
> needs to (e.g. file gets reflinked or snapshotted).
>
> That's right, an inode can dynamically change from DAX to non-DAX
> underneath the application, and the application /will not notice/.
> That's because changing the flag will sync and invalidate the
> existing mappings and the next application access will simply fault
> it back in using whatever mechanism the inode is now configured
> with.
>
> Plain and simple: userspace has absolutely no fucking idea of
> whether DAX is enabled or not, and whatever the kernel returns to
> userspace about the DAX configuration is stale before it even got
> out of the kernel....

smaps is already known to be an ephemeral interface, but we output
useful information there nonetheless.
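
For reference, each mapping in /proc/<pid>/smaps carries a "VmFlags:" line
listing two-letter mnemonics such as "rd" and "wr". A minimal sketch of
checking one such line for a given mnemonic follows. The helper name
vmflags_line_has() is hypothetical, and the mnemonic that would stand for
DAX is not stated in this message, so it is left as a caller-supplied
parameter rather than assumed.

```c
#include <string.h>

/*
 * Hypothetical helper: returns 1 if `mnemonic` appears as a whole token
 * on a "VmFlags:" line taken from /proc/<pid>/smaps, otherwise 0.
 * Lines that do not start with "VmFlags:" always yield 0.
 */
int vmflags_line_has(const char *line, const char *mnemonic)
{
    static const char prefix[] = "VmFlags:";
    size_t mlen = strlen(mnemonic);
    const char *p;

    if (strncmp(line, prefix, sizeof(prefix) - 1) != 0)
        return 0;

    p = line + sizeof(prefix) - 1;
    while (*p) {
        const char *start;
        size_t len;

        /* Skip the whitespace that separates tokens. */
        while (*p == ' ' || *p == '\t' || *p == '\n')
            p++;

        /* Measure the next token. */
        start = p;
        while (*p && *p != ' ' && *p != '\t' && *p != '\n')
            p++;
        len = (size_t)(p - start);

        if (len > 0 && len == mlen && strncmp(start, mnemonic, len) == 0)
            return 1;
    }
    return 0;
}
```

As with anything read from smaps, the answer describes the mapping only at
the moment the file was read.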

^ permalink raw reply	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2016-09-16 10:47 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
2016-09-15  6:54 [PATCH v2 0/3] mm, dax: export dax capabilities and mapping size info to userspace Dan Williams
2016-09-15  6:54 ` [PATCH v2 1/3] mm, dax: add VM_SYNC flag for device-dax VMAs Dan Williams
2016-09-15  6:54 ` [PATCH v2 2/3] mm, dax: add VM_DAX flag for DAX VMAs Dan Williams
2016-09-15  8:26   ` Christoph Hellwig
2016-09-15 17:01     ` Dan Williams
2016-09-15 17:09       ` Darrick J. Wong
2016-09-15 17:44         ` Dan Williams
2016-09-15 23:07       ` Dave Chinner
2016-09-15 23:19         ` Dan Williams
2016-09-16  0:16         ` Dan Williams
2016-09-16  1:24           ` Dave Chinner
2016-09-16  2:04             ` Dan Williams
2016-09-16  3:41               ` Dan Williams
2016-09-16  5:36               ` Dave Chinner
2016-09-16 10:47                 ` Dan Williams
2016-09-15  6:54 ` [PATCH v2 3/3] mm, mincore2(): retrieve tlb-size attributes of an address range Dan Williams
