* [PATCH v2 0/3] mm, dax: export dax capabilities and mapping size info to userspace
@ 2016-09-15  6:54 ` Dan Williams
  0 siblings, 0 replies; 63+ messages in thread
From: Dan Williams @ 2016-09-15  6:54 UTC
  To: linux-mm
  Cc: Andrea Arcangeli, Xiao Guangrong, Arnd Bergmann, linux-nvdimm,
	Dave Hansen, david, linux-kernel, npiggin, xfs, linux-fsdevel,
	Andrew Morton, hch, Kirill A. Shutemov

In the debate about how to support persistent memory applications that
want to use hardware-platform memory-media persistence
rules/cpu-instructions rather than filesystem data integrity system
calls [1], one of the consistent requests is to move these applications
to use a device file rather than a filesystem file [2].

While there is still a desire to offer the same syscall overhead
avoidance in filesystem-dax as device-dax, there is performance
optimization work and analysis that still needs to be done. That work
needs to address filesystem-dax performance being slower than the
typical page-cache path on top of pmem [3], and to determine whether
the performance gains are worth developing new filesystem data
integrity mechanisms.

In the meantime we have device-dax and are missing a way to identify
its capabilities compared to filesystem-dax. Critically, we want a
persistent memory transaction library that is handed an address range
to manage to be able to determine if it is safe to forgo calling
fsync/msync to record newly allocated blocks after a write fault. This
question is answered by the new VM_SYNC flag.

It is also important to know if the pages behind a mapping are backed
by page cache and need to be synced, or are referencing media directly.
We have an XFS inode flag that can indicate the inode is DAX enabled,
but nothing for device-dax or other filesystems. Yes, an application
that maps /dev/dax should assume the mapping is DAX, but it is useful
to be able to tell that from the address range directly, and to have a
common mechanism across filesystems.

Finally, while developing and debugging the filesystem-dax huge page
support it was frustrating that the only way to unit test and verify
the implementation was via debug print statements. This series extends
mincore(2) to optionally provide an indication of the hardware mapping
size. This is hopefully useful to other cases that want to evaluate
transparent-huge-page usage.

Changes since the RFC [4]:

1/ Drop DAX indication out of mincore. It is a vma capability, not a
   per-page property, and fits better as a vma flag. Multiple people
   indicated it would be better if the new syscall published the
   capability as an extent or aggregated over a range, and this
   facility is already provided by smaps.

2/ Add VM_SYNC to explicitly disclaim a need to call fsync/msync.

3/ Drop the syscall wire-up patch since it is trivial and can be
   revived if we decide to move forward with the new mincore syscall.

[1]: https://lwn.net/Articles/676737/
[2]: https://lists.01.org/pipermail/linux-nvdimm/2016-September/006893.html
[3]: https://lists.01.org/pipermail/linux-nvdimm/2016-August/006497.html
[4]: https://lists.01.org/pipermail/linux-nvdimm/2016-September/006875.html

---

Dan Williams (3):
      mm, dax: add VM_SYNC flag for device-dax VMAs
      mm, dax: add VM_DAX flag for DAX VMAs
      mm, mincore2(): retrieve tlb-size attributes of an address range

 drivers/dax/Kconfig                    |    1 
 drivers/dax/dax.c                      |    2 
 fs/Kconfig                             |    1 
 fs/ext2/file.c                         |    2 
 fs/ext4/file.c                         |    2 
 fs/proc/task_mmu.c                     |    4 +
 fs/xfs/xfs_file.c                      |    2 
 include/linux/mm.h                     |   31 +++++++-
 include/linux/syscalls.h               |    2 
 include/uapi/asm-generic/mman-common.h |    2 
 kernel/sys_ni.c                        |    1 
 mm/mincore.c                           |  130 ++++++++++++++++++++++++--------
 12 files changed, 141 insertions(+), 39 deletions(-)
* [PATCH v2 1/3] mm, dax: add VM_SYNC flag for device-dax VMAs
@ 2016-09-15  6:54 ` Dan Williams
From: Dan Williams @ 2016-09-15  6:54 UTC
  To: linux-mm
  Cc: linux-nvdimm, david, linux-kernel, npiggin, xfs, linux-fsdevel, hch

Introduce a new vma flag to indicate the property of device-dax VMAs
that, while file-backed, do not require notification to a filesystem
agent to sync metadata after a fault. In particular this enables
persistent memory applications to know if they can commit transactions
to media via cpu instructions alone, or need to call back into the
kernel to synchronize metadata.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/Kconfig |    1 +
 drivers/dax/dax.c   |    2 +-
 fs/proc/task_mmu.c  |    3 +++
 include/linux/mm.h  |   21 +++++++++++++++++----
 4 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index cedab7572de3..a4d99e637623 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -2,6 +2,7 @@ menuconfig DEV_DAX
 	tristate "DAX: direct access to differentiated memory"
 	default m if NVDIMM_DAX
 	depends on TRANSPARENT_HUGEPAGE
+	select ARCH_USES_HIGH_VMA_FLAGS if 64BIT
 	help
 	  Support raw access to differentiated (persistence, bandwidth,
 	  latency...) memory via an mmap(2) capable character
diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
index 29f600f2c447..88fad2519907 100644
--- a/drivers/dax/dax.c
+++ b/drivers/dax/dax.c
@@ -528,7 +528,7 @@ static int dax_dev_mmap(struct file *filp, struct vm_area_struct *vma)

 	kref_get(&dax_dev->kref);
 	vma->vm_ops = &dax_dev_vm_ops;
-	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE | VM_SYNC;
 	return 0;
 }

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f6fa99eca515..03a65ac7f222 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -675,6 +675,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 		[ilog2(VM_PKEY_BIT2)]	= "",
 		[ilog2(VM_PKEY_BIT3)]	= "",
 #endif
+#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
+		[ilog2(VM_SYNC)]	= "sn",
+#endif
 	};
 	size_t i;

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ef815b9cd426..f3f6df6bb498 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -198,14 +198,17 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */

 #ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
-#define VM_HIGH_ARCH_BIT_0	32	/* bit only usable on 64-bit architectures */
-#define VM_HIGH_ARCH_BIT_1	33	/* bit only usable on 64-bit architectures */
-#define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
-#define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
+/* bits below only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_0	32
+#define VM_HIGH_ARCH_BIT_1	33
+#define VM_HIGH_ARCH_BIT_2	34
+#define VM_HIGH_ARCH_BIT_3	35
+#define VM_HIGH_ARCH_BIT_4	36
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
+#define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */

 #if defined(CONFIG_X86)
@@ -234,6 +237,16 @@ extern unsigned int kobjsize(const void *objp);
 # define VM_MPX		VM_ARCH_2
 #endif

+#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
+/*
+ * The metadata for file-backed vma does not exist or is otherwise
+ * synced before fault handler returns to userspace
+ */
+#define VM_SYNC VM_HIGH_ARCH_4
+#else
+#define VM_SYNC 0
+#endif
+
 #ifndef VM_GROWSUP
 # define VM_GROWSUP	VM_NONE
 #endif
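Since the patch above publishes the flag through the VmFlags line of smaps, a userspace library could test for it without any new syscall. The following is a minimal sketch, not part of the series: the helper names vmflags_has() and smaps_addr_has_flag() are hypothetical, and the "sn" mnemonic only exists on a kernel carrying this proposed patch.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return 1 if a "VmFlags:" line from /proc/<pid>/smaps contains the given
 * two-letter mnemonic as a whole token, 0 otherwise (including when the
 * line is not a VmFlags line at all).
 */
static int vmflags_has(const char *line, const char *mnemonic)
{
	static const char prefix[] = "VmFlags:";
	char buf[256];
	char *tok;

	if (strncmp(line, prefix, sizeof(prefix) - 1) != 0)
		return 0;

	/* strtok() modifies its input, so tokenize a bounded private copy. */
	snprintf(buf, sizeof(buf), "%s", line + sizeof(prefix) - 1);

	for (tok = strtok(buf, " \t\n"); tok; tok = strtok(NULL, " \t\n"))
		if (strcmp(tok, mnemonic) == 0)
			return 1;
	return 0;
}

/*
 * Scan an smaps stream for the VMA containing addr and report whether its
 * VmFlags line carries the mnemonic. Returns 1 or 0, or -1 if no VmFlags
 * line was found for a VMA covering addr.
 */
static int smaps_addr_has_flag(FILE *smaps, unsigned long addr,
			       const char *mnemonic)
{
	char line[512];
	unsigned long start, end;
	int in_vma = 0;

	while (fgets(line, sizeof(line), smaps)) {
		/* VMA header lines begin with "start-end" in hex. */
		if (sscanf(line, "%lx-%lx", &start, &end) == 2) {
			in_vma = (addr >= start && addr < end);
			continue;
		}
		if (in_vma && strncmp(line, "VmFlags:", 8) == 0)
			return vmflags_has(line, mnemonic);
	}
	return -1;
}
```

A caller would typically fopen("/proc/self/smaps", "r") and pass the address returned by mmap(2), cast to unsigned long, together with "sn". On kernels without the patch the mnemonic is simply never present, so the check degrades to "assume fsync/msync is required".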
* [PATCH v2 2/3] mm, dax: add VM_DAX flag for DAX VMAs
  2016-09-15  6:54 ` Dan Williams
  (?)
  (?)
@ 2016-09-15  6:54 ` Dan Williams
  -1 siblings, 0 replies; 63+ messages in thread
From: Dan Williams @ 2016-09-15  6:54 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, david, linux-kernel, npiggin, xfs, linux-fsdevel, hch

The DAX property, page cache bypass, of a VMA is only detectable via the
vma_is_dax() helper to check the S_DAX inode flag.  However, this is
only available internal to the kernel and is a property that userspace
applications would like to interrogate.

Yes, this new VM_DAX flag is only available on 64-bit, but the
expectation is that the capacities of persistent memory devices are too
large for 32-bit platforms.  While there is usage of DAX on 32-bit, that
usage is primarily driven by DAX's replacement of XIP.  XIP is a memory
saving technique for embedded devices to execute out of DAX, but in that
usage the application does not need to discern if page cache is present
or not.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/dax.c  |    2 +-
 fs/Kconfig         |    1 +
 fs/ext2/file.c     |    2 +-
 fs/ext4/file.c     |    2 +-
 fs/proc/task_mmu.c |    1 +
 fs/xfs/xfs_file.c  |    2 +-
 include/linux/mm.h |   10 ++++++++++
 7 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
index 88fad2519907..1cb4117870bd 100644
--- a/drivers/dax/dax.c
+++ b/drivers/dax/dax.c
@@ -528,7 +528,7 @@ static int dax_dev_mmap(struct file *filp, struct vm_area_struct *vma)
 	kref_get(&dax_dev->kref);
 	vma->vm_ops = &dax_dev_vm_ops;
-	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE | VM_SYNC;
+	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE | VM_SYNC | VM_DAX;
 	return 0;
 }
diff --git a/fs/Kconfig b/fs/Kconfig
index 2bc7ad775842..6d9afe4c1710 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -38,6 +38,7 @@ config FS_DAX
 	bool "Direct Access (DAX) support"
 	depends on MMU
 	depends on !(ARM || MIPS || SPARC)
+	select ARCH_USES_HIGH_VMA_FLAGS if 64BIT
 	help
 	  Direct Access (DAX) can be used on memory-backed block devices.
 	  If the block device supports DAX and the filesystem supports DAX,
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 5efeefe17abb..b9c829cf427c 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -118,7 +118,7 @@ static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
 	file_accessed(file);
 	vma->vm_ops = &ext2_dax_vm_ops;
-	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE | VM_DAX;
 	return 0;
 }
 #else
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 261ac3734c58..7a777f1bbde3 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -312,7 +312,7 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
 	file_accessed(file);
 	if (IS_DAX(file_inode(file))) {
 		vma->vm_ops = &ext4_dax_vm_ops;
-		vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+		vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE | VM_DAX;
 	} else {
 		vma->vm_ops = &ext4_file_vm_ops;
 	}
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 03a65ac7f222..b9b9dc059e19 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -677,6 +677,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 #endif
 #ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
 		[ilog2(VM_SYNC)] = "sn",
+		[ilog2(VM_DAX)] = "dx",
 #endif
 	};
 	size_t i;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e612a0233710..80ed83405683 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1644,7 +1644,7 @@ xfs_file_mmap(
 	file_accessed(filp);
 	vma->vm_ops = &xfs_file_vm_ops;
 	if (IS_DAX(file_inode(filp)))
-		vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+		vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE | VM_DAX;
 	return 0;
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f3f6df6bb498..5930402596c0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -204,11 +204,13 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_2 34
 #define VM_HIGH_ARCH_BIT_3 35
 #define VM_HIGH_ARCH_BIT_4 36
+#define VM_HIGH_ARCH_BIT_5 37
 #define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5 BIT(VM_HIGH_ARCH_BIT_5)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 #if defined(CONFIG_X86)
@@ -243,8 +245,16 @@ extern unsigned int kobjsize(const void *objp);
  * synced before fault handler returns to userspace
  */
 #define VM_SYNC VM_HIGH_ARCH_4
+/*
+ * Mapping is not indirected through the page-cache, accesses hit memory
+ * media directly*.
+ *
+ * (*) a filesystem may map the zero-page into holes of a file.
+ */
+#define VM_DAX VM_HIGH_ARCH_5
 #else
 #define VM_SYNC 0
+#define VM_DAX 0
 #endif
 #ifndef VM_GROWSUP
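The task_mmu.c hunk is what makes the new flags visible to userspace: each VMA's "VmFlags:" line in /proc/<pid>/smaps would gain the two-letter mnemonics "sn" and "dx". A minimal sketch of how a library might look them up for an address it has been handed follows. It assumes a kernel carrying this series (an unpatched kernel never prints these mnemonics), and the helper names vmflags_has() and mapping_has_flag() are hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Return true if the mnemonic `flag` (e.g. "dx" or "sn") appears as a
 * whole whitespace-separated token in a "VmFlags:" line from smaps.
 */
static bool vmflags_has(const char *line, const char *flag)
{
	static const char prefix[] = "VmFlags:";
	size_t flen = strlen(flag);
	const char *p;

	if (flen == 0 || strncmp(line, prefix, sizeof(prefix) - 1) != 0)
		return false;

	p = line + sizeof(prefix) - 1;
	while (*p) {
		size_t tlen;

		/* Skip the whitespace that separates mnemonics. */
		p += strspn(p, " \t\n");
		tlen = strcspn(p, " \t\n");
		if (tlen == flen && memcmp(p, flag, flen) == 0)
			return true;
		p += tlen;
	}
	return false;
}

/*
 * Scan smaps-formatted text and report whether the VMA containing `addr`
 * carries `flag`: 1 if set, 0 if the VMA's VmFlags line lacks it, and -1
 * if no VmFlags line was found for a VMA covering `addr`.
 * Lines longer than the buffer are not handled specially in this sketch.
 */
static int mapping_has_flag(FILE *smaps, unsigned long addr, const char *flag)
{
	char line[512];
	unsigned long start, end;
	bool in_vma = false;

	while (fgets(line, sizeof(line), smaps)) {
		/* A VMA header starts with "start-end" in hexadecimal. */
		if (sscanf(line, "%lx-%lx", &start, &end) == 2)
			in_vma = (addr >= start && addr < end);
		else if (in_vma && strncmp(line, "VmFlags:", 8) == 0)
			return vmflags_has(line, flag) ? 1 : 0;
	}
	return -1;
}
```

A caller would open /proc/self/smaps with fopen() and pass the address of its mapping. Treating -1 the same as 0 (that is, keep calling fsync/msync) is the conservative choice.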
* Re: [PATCH v2 2/3] mm, dax: add VM_DAX flag for DAX VMAs
  2016-09-15  6:54 ` Dan Williams
  (?)
@ 2016-09-15  8:26 ` Christoph Hellwig
  -1 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2016-09-15  8:26 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-mm, linux-nvdimm, david, linux-kernel, npiggin, xfs, linux-fsdevel, hch

On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote:
> The DAX property, page cache bypass, of a VMA is only detectable via the
> vma_is_dax() helper to check the S_DAX inode flag.  However, this is
> only available internal to the kernel and is a property that userspace
> applications would like to interrogate.

They have absolutely no business knowing such an implementation detail.
* Re: [PATCH v2 2/3] mm, dax: add VM_DAX flag for DAX VMAs
  2016-09-15  8:26 ` Christoph Hellwig
  (?)
  (?)
@ 2016-09-15 17:01 ` Dan Williams
  -1 siblings, 0 replies; 63+ messages in thread
From: Dan Williams @ 2016-09-15 17:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, david, linux-kernel, Nicholas Piggin, XFS Developers, Linux MM, linux-fsdevel

On Thu, Sep 15, 2016 at 1:26 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote:
>> The DAX property, page cache bypass, of a VMA is only detectable via the
>> vma_is_dax() helper to check the S_DAX inode flag.  However, this is
>> only available internal to the kernel and is a property that userspace
>> applications would like to interrogate.
>
> They have absolutely no business knowing such an implementation detail.

Hasn't that train already left the station with FS_XFLAG_DAX?

The other problem with hiding the DAX property is that it turns out to
not be a transparent acceleration feature.  See xfs/086 xfs/088
xfs/089 xfs/091 which fail with DAX and, as far as I understand, it is
due to the fact that DAX disallows delayed allocation behavior.

If behavior changes I think we should indicate that to userspace and
VM_DAX is certainly more useful to userspace than some of the other vm
internals we already export in those flags.
* Re: [PATCH v2 2/3] mm, dax: add VM_DAX flag for DAX VMAs
  2016-09-15 17:01 ` Dan Williams
  (?)
  (?)
@ 2016-09-15 17:09 ` Darrick J. Wong
  -1 siblings, 0 replies; 63+ messages in thread
From: Darrick J. Wong @ 2016-09-15 17:09 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm@lists.01.org, linux-kernel, Nicholas Piggin, XFS Developers, Linux MM, linux-fsdevel, Christoph Hellwig

On Thu, Sep 15, 2016 at 10:01:03AM -0700, Dan Williams wrote:
> On Thu, Sep 15, 2016 at 1:26 AM, Christoph Hellwig <hch@lst.de> wrote:
> > On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote:
> >> The DAX property, page cache bypass, of a VMA is only detectable via the
> >> vma_is_dax() helper to check the S_DAX inode flag.  However, this is
> >> only available internal to the kernel and is a property that userspace
> >> applications would like to interrogate.
> >
> > They have absolutely no business knowing such an implementation detail.
>
> Hasn't that train already left the station with FS_XFLAG_DAX?

Seeing as FS_IOC_FSGETXATTR is a "generic" ioctl now, why not just
implement it for all the DAX fses and block devices?  Aside from xflags,
the other fields are probably all zero for non-xfs (aside from project
quota id I guess).

(Yeah, sort of awkward, I know...)

--D

> The other problem with hiding the DAX property is that it turns out to
> not be a transparent acceleration feature.  See xfs/086 xfs/088
> xfs/089 xfs/091 which fail with DAX and, as far as I understand, it is
> due to the fact that DAX disallows delayed allocation behavior.
>
> If behavior changes I think we should indicate that to userspace and
> VM_DAX is certainly more useful to userspace than some of the other vm
> internals we already export in those flags.
* Re: [PATCH v2 2/3] mm, dax: add VM_DAX flag for DAX VMAs @ 2016-09-15 17:09 ` Darrick J. Wong 0 siblings, 0 replies; 63+ messages in thread From: Darrick J. Wong @ 2016-09-15 17:09 UTC (permalink / raw) To: Dan Williams Cc: linux-nvdimm@lists.01.org, linux-kernel, Nicholas Piggin, XFS Developers, Linux MM, linux-fsdevel, Christoph Hellwig On Thu, Sep 15, 2016 at 10:01:03AM -0700, Dan Williams wrote: > On Thu, Sep 15, 2016 at 1:26 AM, Christoph Hellwig <hch@lst.de> wrote: > > On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote: > >> The DAX property, page cache bypass, of a VMA is only detectable via the > >> vma_is_dax() helper to check the S_DAX inode flag. However, this is > >> only available internal to the kernel and is a property that userspace > >> applications would like to interrogate. > > > > They have absolutely no business knowing such an implementation detail. > > Hasn't that train already left the station with FS_XFLAG_DAX? Seeing as FS_IOC_FSGETXATTR is a "generic" ioctl now, why not just implement it for all the DAX fses and block devices? Aside from xflags, the other fields are probably all zero for non-xfs (aside from project quota id I guess). (Yeah, sort of awkward, I know...) --D > The other problem with hiding the DAX property is that it turns out to > not be a transparent acceleration feature. See xfs/086 xfs/088 > xfs/089 xfs/091 which fail with DAX and, as far as I understand, it is > due to the fact that DAX disallows delayed allocation behavior. > > If behavior changes I think we should indicate that to userspace and > VM_DAX is certainly more useful to userspace than some of the other vm > internals we already export in those flags. 
> > _______________________________________________ > xfs mailing list > xfs@oss.sgi.com > http://oss.sgi.com/mailman/listinfo/xfs _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v2 2/3] mm, dax: add VM_DAX flag for DAX VMAs
@ 2016-09-15 17:09 ` Darrick J. Wong
From: Darrick J. Wong @ 2016-09-15 17:09 UTC
To: Dan Williams
Cc: Christoph Hellwig, linux-nvdimm@lists.01.org, linux-kernel, Nicholas Piggin, XFS Developers, Linux MM, linux-fsdevel

On Thu, Sep 15, 2016 at 10:01:03AM -0700, Dan Williams wrote:
> On Thu, Sep 15, 2016 at 1:26 AM, Christoph Hellwig <hch@lst.de> wrote:
> > On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote:
> >> The DAX property, page cache bypass, of a VMA is only detectable via the
> >> vma_is_dax() helper to check the S_DAX inode flag. However, this is
> >> only available internal to the kernel and is a property that userspace
> >> applications would like to interrogate.
> >
> > They have absolutely no business knowing such an implementation detail.
>
> Hasn't that train already left the station with FS_XFLAG_DAX?

Seeing as FS_IOC_FSGETXATTR is a "generic" ioctl now, why not just
implement it for all the DAX fses and block devices? Aside from xflags,
the other fields are probably all zero for non-xfs (aside from project
quota id I guess).

(Yeah, sort of awkward, I know...)

--D

> The other problem with hiding the DAX property is that it turns out to
> not be a transparent acceleration feature. See xfs/086 xfs/088
> xfs/089 xfs/091 which fail with DAX and, as far as I understand, it is
> due to the fact that DAX disallows delayed allocation behavior.
>
> If behavior changes I think we should indicate that to userspace and
> VM_DAX is certainly more useful to userspace than some of the other vm
> internals we already export in those flags.
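For illustration only, the userspace side of the suggestion above might look like the following sketch. It assumes Linux, the generic ioctl number encoding used on x86 and arm (other architectures encode the direction bits differently), and the 28-byte struct fsxattr layout from linux/fs.h. The helper name inode_dax_flag is hypothetical. It reports only whether FS_XFLAG_DAX is set on the inode, and returns None where the ioctl is not implemented.

```python
import fcntl
import os
import struct

# struct fsxattr from <linux/fs.h>: five __u32 fields (fsx_xflags,
# fsx_extsize, fsx_nextents, fsx_projid, fsx_cowextsize) followed by
# 8 bytes of padding, 28 bytes in total.
FSXATTR_FORMAT = "=5I8x"
FSXATTR_SIZE = struct.calcsize(FSXATTR_FORMAT)

# _IOR('X', 31, struct fsxattr) with the generic ioctl encoding.
# Assumption: x86/arm style layout; not valid on e.g. powerpc or mips.
_IOC_READ = 2
FS_IOC_FSGETXATTR = (
    (_IOC_READ << 30) | (FSXATTR_SIZE << 16) | (ord("X") << 8) | 31
)

FS_XFLAG_DAX = 0x00008000  # inode flag: use DAX for IO


def inode_dax_flag(path):
    """Return True/False for the inode's FS_XFLAG_DAX bit, or None if
    the filesystem or device does not support FS_IOC_FSGETXATTR."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Passing an immutable bytes buffer makes ioctl() return the
        # filled-in structure as a new bytes object.
        buf = fcntl.ioctl(fd, FS_IOC_FSGETXATTR, bytes(FSXATTR_SIZE))
    except OSError:
        return None
    finally:
        os.close(fd)
    xflags = struct.unpack(FSXATTR_FORMAT, buf)[0]
    return bool(xflags & FS_XFLAG_DAX)
```

A caller would use it as inode_dax_flag("/mnt/pmem/file"), where the path is a placeholder. A True result says the flag is set on the inode, nothing more.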
* Re: [PATCH v2 2/3] mm, dax: add VM_DAX flag for DAX VMAs
  2016-09-15 17:09 ` Darrick J. Wong
@ 2016-09-15 17:44 ` Dan Williams
From: Dan Williams @ 2016-09-15 17:44 UTC
To: Darrick J. Wong
Cc: linux-nvdimm@lists.01.org, linux-kernel, Nicholas Piggin, XFS Developers, Linux MM, linux-fsdevel, Christoph Hellwig

On Thu, Sep 15, 2016 at 10:09 AM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> On Thu, Sep 15, 2016 at 10:01:03AM -0700, Dan Williams wrote:
>> On Thu, Sep 15, 2016 at 1:26 AM, Christoph Hellwig <hch@lst.de> wrote:
>> > On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote:
>> >> The DAX property, page cache bypass, of a VMA is only detectable via the
>> >> vma_is_dax() helper to check the S_DAX inode flag. However, this is
>> >> only available internal to the kernel and is a property that userspace
>> >> applications would like to interrogate.
>> >
>> > They have absolutely no business knowing such an implementation detail.
>>
>> Hasn't that train already left the station with FS_XFLAG_DAX?
>
> Seeing as FS_IOC_FSGETXATTR is a "generic" ioctl now, why not just
> implement it for all the DAX fses and block devices? Aside from xflags,
> the other fields are probably all zero for non-xfs (aside from project
> quota id I guess).
>
> (Yeah, sort of awkward, I know...)

It would solve the problem at hand, I'll take a look.
* Re: [PATCH v2 2/3] mm, dax: add VM_DAX flag for DAX VMAs
  2016-09-15 17:01 ` Dan Williams
@ 2016-09-15 23:07 ` Dave Chinner
From: Dave Chinner @ 2016-09-15 23:07 UTC
To: Dan Williams
Cc: linux-nvdimm, linux-kernel, Nicholas Piggin, XFS Developers, Linux MM, linux-fsdevel, Christoph Hellwig

On Thu, Sep 15, 2016 at 10:01:03AM -0700, Dan Williams wrote:
> On Thu, Sep 15, 2016 at 1:26 AM, Christoph Hellwig <hch@lst.de> wrote:
> > On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote:
> >> The DAX property, page cache bypass, of a VMA is only detectable via the
> >> vma_is_dax() helper to check the S_DAX inode flag. However, this is
> >> only available internal to the kernel and is a property that userspace
> >> applications would like to interrogate.
> >
> > They have absolutely no business knowing such an implementation detail.
>
> Hasn't that train already left the station with FS_XFLAG_DAX?

No, that's an admin flag, not a runtime hint for applications. Just
because that flag is set on an inode, it does not mean that DAX is
actually in use - it will be ignored if the backing dev is not dax
capable.

> The other problem with hiding the DAX property is that it turns out to
> not be a transparent acceleration feature. See xfs/086 xfs/088
> xfs/089 xfs/091 which fail with DAX and, as far as I understand, it is
> due to the fact that DAX disallows delayed allocation behavior.

Which is not a bug, nor is it something that app developers should
be surprised by.

i.e. Subtle differences in error reporting behaviour occur in
filesystems /all the time/. Run the test on a non-dax filesystem
with an extent size hint. It fails /exactly the same way as DAX/.
Run it with direct IO - fails the same way as DAX. Run it
with synchronous writes - it fails the same way as DAX.

IOWs, if an app can't handle the way DAX reports errors, then they
are /broken/. Delayed allocation requires checking the return value
of fsync() or close() to capture the allocation error - many more
apps get that wrong than the ones that expect the immediate errors
from write()...

Anyway: to demonstrate that nothing is actually broken, and that
you might sometimes need to fix tests and send patches to
fstests@vger.kernel.org, this makes xfs/086 pass for me on DAX:

--- a/tests/xfs/086
+++ b/tests/xfs/086
@@ -96,7 +96,8 @@ _scratch_mount

 echo "+ modify files"
 for x in `seq 1 64`; do
-	$XFS_IO_PROG -f -c "pwrite -S 0x62 0 ${blksz}" "${TESTFILE}.${x}" >> $seqres.full
+	$XFS_IO_PROG -f -c "pwrite -S 0x62 0 ${blksz}" "${TESTFILE}.${x}" \
+		>> $seqres.full 2>&1
 done
 umount "${SCRATCH_MNT}"

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
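The error-handling point above can be sketched in userspace as follows. This is a minimal illustration, not code from the thread: the helper name write_and_sync is hypothetical, and it simply treats a failure at any of write(), fsync() or close() as a failed write, so the caller behaves the same whether allocation errors are reported immediately or deferred.

```python
import os


def write_and_sync(path, data):
    """Write data to path; raise OSError if write(), fsync() or close()
    reports a failure. Returns the number of bytes written."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        remaining = memoryview(data)
        total = 0
        while remaining:
            # When allocation happens at write time (the DAX, direct IO
            # and synchronous write cases named above), an allocation
            # error is raised here.
            written = os.write(fd, remaining)
            remaining = remaining[written:]
            total += written
        # With delayed allocation the error may only surface when the
        # data is flushed, so the fsync() result matters as well.
        os.fsync(fd)
    except OSError:
        # Release the descriptor, but report the original failure.
        try:
            os.close(fd)
        except OSError:
            pass
        raise
    # close() can report a deferred error too; let it propagate.
    os.close(fd)
    return total
```

The design choice is to never assume which call reports the error: the caller only sees success once all three stages have succeeded.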
* Re: [PATCH v2 2/3] mm, dax: add VM_DAX flag for DAX VMAs
  2016-09-15 23:07 ` Dave Chinner
@ 2016-09-15 23:19 ` Dan Williams
From: Dan Williams @ 2016-09-15 23:19 UTC
To: Dave Chinner
Cc: linux-nvdimm, linux-kernel, Nicholas Piggin, XFS Developers, Linux MM, linux-fsdevel, Christoph Hellwig

On Thu, Sep 15, 2016 at 4:07 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Sep 15, 2016 at 10:01:03AM -0700, Dan Williams wrote:
>> On Thu, Sep 15, 2016 at 1:26 AM, Christoph Hellwig <hch@lst.de> wrote:
>> > On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote:
>> >> The DAX property, page cache bypass, of a VMA is only detectable via the
>> >> vma_is_dax() helper to check the S_DAX inode flag. However, this is
>> >> only available internal to the kernel and is a property that userspace
>> >> applications would like to interrogate.
>> >
>> > They have absolutely no business knowing such an implementation detail.
>>
>> Hasn't that train already left the station with FS_XFLAG_DAX?
>
> No, that's an admin flag, not a runtime hint for applications. Just
> because that flag is set on an inode, it does not mean that DAX is
> actually in use - it will be ignored if the backing dev is not dax
> capable.

Ok, but then VM_DAX does not suffer from that problem. I'm trying to
understand why VM_DAX has no business being in the smaps "VmFlags"
line, but something ambiguous to userspace like VM_MIXEDMAP does?

>
>> The other problem with hiding the DAX property is that it turns out to
>> not be a transparent acceleration feature. See xfs/086 xfs/088
>> xfs/089 xfs/091 which fail with DAX and, as far as I understand, it is
>> due to the fact that DAX disallows delayed allocation behavior.
>
> Which is not a bug, nor is it something that app developers should
> be surprised by.
>
> i.e. Subtle differences in error reporting behaviour occur in
> filesystems /all the time/. Run the test on a non-dax filesystem
> with an extent size hint. It fails /exactly the same way as DAX/.
> Run it with direct IO - fails the same way as DAX. Run it
> with synchronous writes - it fails the same way as DAX.
>
> IOWs, if an app can't handle the way DAX reports errors, then they
> are /broken/. Delayed allocation requires checking the return value
> of fsync() or close() to capture the allocation error - many more
> apps get that wrong than the ones that expect the immediate errors
> from write()...
>
> Anyway: to demonstrate that nothing is actually broken, and that
> you might sometimes need to fix tests and send patches to
> fstests@vger.kernel.org, this makes xfs/086 pass for me on DAX:
>
> --- a/tests/xfs/086
> +++ b/tests/xfs/086
> @@ -96,7 +96,8 @@ _scratch_mount
>
>  echo "+ modify files"
>  for x in `seq 1 64`; do
> -	$XFS_IO_PROG -f -c "pwrite -S 0x62 0 ${blksz}" "${TESTFILE}.${x}" >> $seqres.full
> +	$XFS_IO_PROG -f -c "pwrite -S 0x62 0 ${blksz}" "${TESTFILE}.${x}" \
> +		>> $seqres.full 2>&1
>  done
>  umount "${SCRATCH_MNT}"

Thanks for that! Wasn't immediately obvious to me, and didn't get
that response when I asked on the list a while back.
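For readers unfamiliar with the smaps "VmFlags" line mentioned above, a rough sketch of how userspace could read it for a given address follows. The function names are hypothetical and the parsing assumes the usual /proc/<pid>/smaps layout (a header line starting with a hex address range, followed by "Name: value" fields). Among the existing two-letter mnemonics, "mm" is the one the kernel prints for VM_MIXEDMAP; no mnemonic for the proposed VM_DAX flag is assumed here.

```python
import re

# A mapping header looks like "00400000-0048a000 r-xp 00000000 fd:03 ...".
_MAPPING_HEADER = re.compile(r"^([0-9a-f]+)-([0-9a-f]+)\s")


def vm_flags_by_mapping(smaps_text):
    """Map (start, end) address ranges to their list of VmFlags mnemonics."""
    flags = {}
    current = None
    for line in smaps_text.splitlines():
        match = _MAPPING_HEADER.match(line)
        if match:
            current = (int(match.group(1), 16), int(match.group(2), 16))
        elif line.startswith("VmFlags:") and current is not None:
            flags[current] = line.split()[1:]
    return flags


def flags_for_address(smaps_text, address):
    """Return the VmFlags of the mapping containing address, or None."""
    for (start, end), flags in vm_flags_by_mapping(smaps_text).items():
        if start <= address < end:
            return flags
    return None
```

A process would typically feed it the contents of /proc/self/smaps and test for a mnemonic, for example "mm" in flags_for_address(text, addr).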
* Re: [PATCH v2 2/3] mm, dax: add VM_DAX flag for DAX VMAs
  2016-09-15 23:07 ` Dave Chinner
  (?)
  (?)
@ 2016-09-16 0:16 ` Dan Williams
  -1 siblings, 0 replies; 63+ messages in thread
From: Dan Williams @ 2016-09-16 0:16 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-nvdimm, linux-kernel, Nicholas Piggin, XFS Developers, Linux MM, linux-fsdevel, Christoph Hellwig

On Thu, Sep 15, 2016 at 4:07 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Sep 15, 2016 at 10:01:03AM -0700, Dan Williams wrote:
>> On Thu, Sep 15, 2016 at 1:26 AM, Christoph Hellwig <hch@lst.de> wrote:
>> > On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote:
>> >> The DAX property, page cache bypass, of a VMA is only detectable via the
>> >> vma_is_dax() helper to check the S_DAX inode flag. However, this is
>> >> only available internal to the kernel and is a property that userspace
>> >> applications would like to interrogate.
>> >
>> > They have absolutely no business knowing such an implementation detail.
>>
>> Hasn't that train already left the station with FS_XFLAG_DAX?
>
> No, that's an admin flag, not a runtime hint for applications. Just
> because that flag is set on an inode, it does not mean that DAX is
> actually in use - it will be ignored if the backing dev is not dax
> capable.
>

What's the point of an admin flag if an admin can't do cat /proc/<pid
of interest>/smaps, or some other mechanism, to validate that the
setting the admin cares about is in effect?

^ permalink raw reply [flat|nested] 63+ messages in thread
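For readers unfamiliar with the smaps "VmFlags" line under discussion, the following C sketch shows how a tool could look up the flags of the VMA that covers a given address. It is only an illustration under stated assumptions: vma_has_flag() is a hypothetical helper, and because the thread does not state which two-letter code the proposed VM_DAX flag would use, the usage note sticks to "mm", the code documented for VM_MIXEDMAP.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: scan an smaps-format file (for example
 * /proc/<pid>/smaps) for the VMA containing addr and test whether its
 * "VmFlags:" line lists the two-letter code `mnemonic`.
 * Returns 1 if present, 0 if absent, -1 on error or if no VMA covers addr.
 */
int vma_has_flag(const char *smaps_path, unsigned long addr,
		 const char *mnemonic)
{
	char line[512];
	int in_vma = 0;
	int result = -1;
	FILE *f = fopen(smaps_path, "r");

	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		unsigned long start, end;

		/* VMA header lines start with "start-end" in hex. */
		if (sscanf(line, "%lx-%lx", &start, &end) == 2) {
			in_vma = (addr >= start && addr < end);
			continue;
		}

		if (in_vma && strncmp(line, "VmFlags:", 8) == 0) {
			/* Flags are space-separated two-letter codes. */
			char *tok = strtok(line + 8, " \n");

			result = 0;
			while (tok) {
				if (strcmp(tok, mnemonic) == 0) {
					result = 1;
					break;
				}
				tok = strtok(NULL, " \n");
			}
			break;
		}
	}

	fclose(f);
	return result;
}
```

For example, vma_has_flag("/proc/self/smaps", (unsigned long)ptr, "mm") would report whether the mapping containing ptr is marked as a mixed map area; a DAX indication exported the same way could be queried with the same helper once its code is known.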
* Re: [PATCH v2 2/3] mm, dax: add VM_DAX flag for DAX VMAs
  2016-09-16 0:16 ` Dan Williams
  (?)
  (?)
@ 2016-09-16 1:24 ` Dave Chinner
  -1 siblings, 0 replies; 63+ messages in thread
From: Dave Chinner @ 2016-09-16 1:24 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, linux-kernel, Nicholas Piggin, XFS Developers, Linux MM, linux-fsdevel, Christoph Hellwig

On Thu, Sep 15, 2016 at 05:16:42PM -0700, Dan Williams wrote:
> On Thu, Sep 15, 2016 at 4:07 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Thu, Sep 15, 2016 at 10:01:03AM -0700, Dan Williams wrote:
> >> On Thu, Sep 15, 2016 at 1:26 AM, Christoph Hellwig <hch@lst.de> wrote:
> >> > On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote:
> >> >> The DAX property, page cache bypass, of a VMA is only detectable via the
> >> >> vma_is_dax() helper to check the S_DAX inode flag. However, this is
> >> >> only available internal to the kernel and is a property that userspace
> >> >> applications would like to interrogate.
> >> >
> >> > They have absolutely no business knowing such an implementation detail.
> >>
> >> Hasn't that train already left the station with FS_XFLAG_DAX?
> >
> > No, that's an admin flag, not a runtime hint for applications. Just
> > because that flag is set on an inode, it does not mean that DAX is
> > actually in use - it will be ignored if the backing dev is not dax
> > capable.
>
> What's the point of an admin flag if an admin can't do cat /proc/<pid
> of interest>/smaps, or some other mechanism, to validate that the
> setting the admin cares about is in effect?

Sorry, I don't follow - why would you be looking at mapping file
regions in /proc to determine if some file somewhere in a filesystem
has a specific flag set on it or not?

FS_XFLAG_DAX is an inode attribute flag, not something you can
query or administrate through mmap:

I.e.

# xfs_io -c "lsattr" -c "chattr +x" -c lsattr -c "chattr -x" -c "lsattr" foo
--------------- foo
--------------x foo
--------------- foo
#

What happens when that flag is set on an inode is determined by a
whole bunch of other things that are completely separate to the
management of the inode flag itself.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply [flat|nested] 63+ messages in thread
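The same inode attribute that xfs_io lists above can also be read programmatically. The following is a minimal C sketch, assuming kernel headers recent enough to define struct fsxattr and FS_XFLAG_DAX in <linux/fs.h>; inode_dax_flag() is a hypothetical helper name. As the message above stresses, it reports only whether the flag is set on the inode, not whether DAX is actually in effect for any mapping.

```c
#include <fcntl.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Hypothetical helper: return 1 if FS_XFLAG_DAX is set on the inode at
 * path, 0 if it is clear, and -1 if the file cannot be opened or the
 * filesystem does not support FS_IOC_FSGETXATTR.
 */
int inode_dax_flag(const char *path)
{
	struct fsxattr fsx;
	int ret;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;

	/* Fetch the extended inode attributes, including fsx_xflags. */
	ret = ioctl(fd, FS_IOC_FSGETXATTR, &fsx);
	close(fd);
	if (ret < 0)
		return -1;

	return (fsx.fsx_xflags & FS_XFLAG_DAX) ? 1 : 0;
}
```

Calling inode_dax_flag("foo") before and after "chattr +x" in the xfs_io session above would be expected to move from 0 to 1 on a filesystem that supports the flag.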
* Re: [PATCH v2 2/3] mm, dax: add VM_DAX flag for DAX VMAs
  2016-09-16 1:24 ` Dave Chinner
  (?)
  (?)
@ 2016-09-16 2:04 ` Dan Williams
  -1 siblings, 0 replies; 63+ messages in thread
From: Dan Williams @ 2016-09-16 2:04 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-nvdimm, linux-kernel, Nicholas Piggin, XFS Developers, Linux MM, linux-fsdevel, Christoph Hellwig

On Thu, Sep 15, 2016 at 6:24 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Sep 15, 2016 at 05:16:42PM -0700, Dan Williams wrote:
>> On Thu, Sep 15, 2016 at 4:07 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Thu, Sep 15, 2016 at 10:01:03AM -0700, Dan Williams wrote:
>> >> On Thu, Sep 15, 2016 at 1:26 AM, Christoph Hellwig <hch@lst.de> wrote:
>> >> > On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote:
>> >> >> The DAX property, page cache bypass, of a VMA is only detectable via the
>> >> >> vma_is_dax() helper to check the S_DAX inode flag. However, this is
>> >> >> only available internal to the kernel and is a property that userspace
>> >> >> applications would like to interrogate.
>> >> >
>> >> > They have absolutely no business knowing such an implementation detail.
>> >>
>> >> Hasn't that train already left the station with FS_XFLAG_DAX?
>> >
>> > No, that's an admin flag, not a runtime hint for applications. Just
>> > because that flag is set on an inode, it does not mean that DAX is
>> > actually in use - it will be ignored if the backing dev is not dax
>> > capable.
>>
>> What's the point of an admin flag if an admin can't do cat /proc/<pid
>> of interest>/smaps, or some other mechanism, to validate that the
>> setting the admin cares about is in effect?
>
> Sorry, I don't follow - why would you be looking at mapping file
> regions in /proc to determine if some file somewhere in a filesystem
> has a specific flag set on it or not?
>
> FS_XFLAG_DAX is an inode attribute flag, not something you can
> query or administrate through mmap:
>
> I.e.
>
> # xfs_io -c "lsattr" -c "chattr +x" -c lsattr -c "chattr -x" -c "lsattr" foo
> --------------- foo
> --------------x foo
> --------------- foo
> #
>
> What happens when that flag is set on an inode is determined by a
> whole bunch of other things that are completely separate to the
> management of the inode flag itself.

Right, I understand that, but how does an admin audit those "bunch of
other things" that actually gate whether DAX ends up being used in
practice? There's currently no way for userspace to observe that a
file with FS_XFLAG_DAX actually results in a change in mmap behavior.

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v2 2/3] mm, dax: add VM_DAX flag for DAX VMAs
  2016-09-16 2:04 ` Dan Williams
  (?)
  (?)
@ 2016-09-16 3:41 ` Dan Williams
  -1 siblings, 0 replies; 63+ messages in thread
From: Dan Williams @ 2016-09-16 3:41 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-nvdimm, linux-kernel, Nicholas Piggin, XFS Developers, Linux MM, linux-fsdevel, Christoph Hellwig

On Thu, Sep 15, 2016 at 7:04 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Thu, Sep 15, 2016 at 6:24 PM, Dave Chinner <david@fromorbit.com> wrote:
>> On Thu, Sep 15, 2016 at 05:16:42PM -0700, Dan Williams wrote:
>>> On Thu, Sep 15, 2016 at 4:07 PM, Dave Chinner <david@fromorbit.com> wrote:
>>> > On Thu, Sep 15, 2016 at 10:01:03AM -0700, Dan Williams wrote:
>>> >> On Thu, Sep 15, 2016 at 1:26 AM, Christoph Hellwig <hch@lst.de> wrote:
>>> >> > On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote:
>>> >> >> The DAX property, page cache bypass, of a VMA is only detectable via the
>>> >> >> vma_is_dax() helper to check the S_DAX inode flag. However, this is
>>> >> >> only available internal to the kernel and is a property that userspace
>>> >> >> applications would like to interrogate.
>>> >> >
>>> >> > They have absolutely no business knowing such an implementation detail.
>>> >>
>>> >> Hasn't that train already left the station with FS_XFLAG_DAX?
>>> >
>>> > No, that's an admin flag, not a runtime hint for applications. Just
>>> > because that flag is set on an inode, it does not mean that DAX is
>>> > actually in use - it will be ignored if the backing dev is not dax
>>> > capable.
>>>
>>> What's the point of an admin flag if an admin can't do cat /proc/<pid
>>> of interest>/smaps, or some other mechanism, to validate that the
>>> setting the admin cares about is in effect?
>>
>> Sorry, I don't follow - why would you be looking at mapping file
>> regions in /proc to determine if some file somewhere in a filesystem
>> has a specific flag set on it or not?
>>
>> FS_XFLAG_DAX is an inode attribute flag, not something you can
>> query or administrate through mmap:
>>
>> I.e.
>>
>> # xfs_io -c "lsattr" -c "chattr +x" -c lsattr -c "chattr -x" -c "lsattr" foo
>> --------------- foo
>> --------------x foo
>> --------------- foo
>> #
>>
>> What happens when that flag is set on an inode is determined by a
>> whole bunch of other things that are completely separate to the
>> management of the inode flag itself.
>
> Right, I understand that, but how does an admin audit those "bunch of
> other things" that actually gate whether DAX ends up being used in
> practice? There's currently no way for userspace to observe that a
> file with FS_XFLAG_DAX actually results in a change in mmap behavior.

Let me put it another way: if we inadvertently break DAX, causing it
to be disabled in scenarios where it should be enabled, what is the
interface for the admin to check "I have the DAX inode flag set, but
the file this application expects to be mapped DAX is mapped with
page cache"?

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v2 2/3] mm, dax: add VM_DAX flag for DAX VMAs @ 2016-09-16 3:41 ` Dan Williams 0 siblings, 0 replies; 63+ messages in thread From: Dan Williams @ 2016-09-16 3:41 UTC (permalink / raw) To: Dave Chinner Cc: linux-nvdimm, linux-kernel, Nicholas Piggin, XFS Developers, Linux MM, linux-fsdevel, Christoph Hellwig On Thu, Sep 15, 2016 at 7:04 PM, Dan Williams <dan.j.williams@intel.com> wrote: > On Thu, Sep 15, 2016 at 6:24 PM, Dave Chinner <david@fromorbit.com> wrote: >> On Thu, Sep 15, 2016 at 05:16:42PM -0700, Dan Williams wrote: >>> On Thu, Sep 15, 2016 at 4:07 PM, Dave Chinner <david@fromorbit.com> wrote: >>> > On Thu, Sep 15, 2016 at 10:01:03AM -0700, Dan Williams wrote: >>> >> On Thu, Sep 15, 2016 at 1:26 AM, Christoph Hellwig <hch@lst.de> wrote: >>> >> > On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote: >>> >> >> The DAX property, page cache bypass, of a VMA is only detectable via the >>> >> >> vma_is_dax() helper to check the S_DAX inode flag. However, this is >>> >> >> only available internal to the kernel and is a property that userspace >>> >> >> applications would like to interrogate. >>> >> > >>> >> > They have absolutely no business knowing such an implementation detail. >>> >> >>> >> Hasn't that train already left the station with FS_XFLAG_DAX? >>> > >>> > No, that's an admin flag, not a runtime hint for applications. Just >>> > because that flag is set on an inode, it does not mean that DAX is >>> > actually in use - it will be ignored if the backing dev is not dax >>> > capable. >>> >>> What's the point of an admin flag if an admin can't do cat /proc/<pid >>> of interest>/smaps, or some other mechanism, to validate that the >>> setting the admin cares about is in effect? >> >> Sorry, I don't follow - why would you be looking at mapping file >> regions in /proc to determine if some file somewhere in a filesystem >> has a specific flag set on it or not? 
>> >> FS_XFLAG_DAX is an inode attribute flag, not something you can >> query or administrate through mmap: >> >> I.e. >> # xfs_io -c "lsattr" -c "chattr +x" -c lsattr -c "chattr -x" -c "lsattr" foo >> --------------- foo >> --------------x foo >> --------------- foo >> # >> >> What happens when that flag is set on an inode is determined by a >> whole bunch of other things that are completely separate to the >> management of the inode flag itself. > > Right, I understand that, but how does an admin audit those "bunch of > other things" that actually gate whether DAX ends up being used in > practice? There's currently no way for userspace to observe that a > file with FS_XFLAG_DAX actually results in a change in mmap behavior. Let me put it another way, if we inadvertently break DAX causing it to be disabled in scenarios when it should be enabled. What is the interface for the admin to check "I have the DAX inode flag set, but the file this application expects to be mapped DAX is mapped with page cache"? _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v2 2/3] mm, dax: add VM_DAX flag for DAX VMAs
From: Dave Chinner @ 2016-09-16 5:36 UTC
To: Dan Williams
Cc: linux-nvdimm, linux-kernel, Nicholas Piggin, XFS Developers, Linux MM, linux-fsdevel, Christoph Hellwig

On Thu, Sep 15, 2016 at 07:04:27PM -0700, Dan Williams wrote:
> On Thu, Sep 15, 2016 at 6:24 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Thu, Sep 15, 2016 at 05:16:42PM -0700, Dan Williams wrote:
> >> On Thu, Sep 15, 2016 at 4:07 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> > On Thu, Sep 15, 2016 at 10:01:03AM -0700, Dan Williams wrote:
> >> >> On Thu, Sep 15, 2016 at 1:26 AM, Christoph Hellwig <hch@lst.de> wrote:
> >> >> > On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote:
> >> >> >> The DAX property, page cache bypass, of a VMA is only detectable via the
> >> >> >> vma_is_dax() helper to check the S_DAX inode flag. However, this is
> >> >> >> only available internal to the kernel and is a property that userspace
> >> >> >> applications would like to interrogate.
> >> >> >
> >> >> > They have absolutely no business knowing such an implementation detail.
> >> >>
> >> >> Hasn't that train already left the station with FS_XFLAG_DAX?
> >> >
> >> > No, that's an admin flag, not a runtime hint for applications. Just
> >> > because that flag is set on an inode, it does not mean that DAX is
> >> > actually in use - it will be ignored if the backing dev is not dax
> >> > capable.
> >>
> >> What's the point of an admin flag if an admin can't do cat /proc/<pid
> >> of interest>/smaps, or some other mechanism, to validate that the
> >> setting the admin cares about is in effect?
> >
> > Sorry, I don't follow - why would you be looking at mapping file
> > regions in /proc to determine if some file somewhere in a filesystem
> > has a specific flag set on it or not?
> >
> > FS_XFLAG_DAX is an inode attribute flag, not something you can
> > query or administrate through mmap:
> >
> > I.e.
> > # xfs_io -c "lsattr" -c "chattr +x" -c lsattr -c "chattr -x" -c "lsattr" foo
> > --------------- foo
> > --------------x foo
> > --------------- foo
> > #
> >
> > What happens when that flag is set on an inode is determined by a
> > whole bunch of other things that are completely separate to the
> > management of the inode flag itself.
>
> Right, I understand that, but how does an admin audit those "bunch of
> other things"

The filesystem mount checks all the various stuff that determines
whether DAX can be used. It logs to the console that it is "Dax
capable". Any file that then has FS_XFLAG_DAX set will result in DAX
being used. There is no other possibility when these two things are
reported.

/me points at runtime diagnostic tracepoints like
trace_xfs_file_dax_read() and notes that dax is sadly lacking in
diagnostic tracepoints.

Besides, userspace can't do anything useful with this information,
because FS_XFLAG_DAX can be changed /at any time/ by an admin.
And the filesystem is free to remove it at any time, too, if it
needs to (e.g. file gets reflinked or snapshotted).

That's right, an inode can dynamically change from DAX to non-DAX
underneath the application, and the application /will not notice/.
That's because changing the flag will sync and invalidate the
existing mappings and the next application access will simply fault
it back in using whatever mechanism the inode is now configured
with.

Plain and simple: userspace has absolutely no fucking idea of
whether DAX is enabled or not, and whatever the kernel returns to
userspace about the DAX configuration is stale before it even gets
out of the kernel....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
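
[Editor's illustration] The xfs_io lsattr/chattr commands quoted above read and change the inode's extended flags. As a rough sketch of what a userspace query of that flag could look like, the Python below decodes FS_XFLAG_DAX from a struct fsxattr returned by the FS_IOC_FSGETXATTR ioctl. The constants are taken from <linux/fs.h>; the ioctl number shown is the x86/arm64 encoding and is an assumption on other architectures. The helper names has_dax_xflag() and query_dax_xflag() are hypothetical. As the message above stresses, this reports only the inode flag, not whether DAX is actually in use, and the answer can be stale as soon as it is returned.

```python
import fcntl
import struct

# Values from the Linux UAPI header <linux/fs.h>. The ioctl number is the
# x86/arm64 encoding of _IOR('X', 31, struct fsxattr); other architectures
# may encode it differently (assumption, check your headers).
FS_IOC_FSGETXATTR = 0x801C581F
FS_XFLAG_DAX = 0x00008000

# struct fsxattr: five __u32 fields followed by 8 bytes of padding (28 bytes).
FSXATTR = struct.Struct("=5I8x")


def has_dax_xflag(raw: bytes) -> bool:
    """Return True if a packed struct fsxattr has FS_XFLAG_DAX set."""
    xflags = FSXATTR.unpack(raw)[0]  # fsx_xflags is the first field
    return bool(xflags & FS_XFLAG_DAX)


def query_dax_xflag(path: str) -> bool:
    """Ask the filesystem for the inode's xflags (Linux only).

    Raises OSError if the filesystem does not implement the ioctl.
    """
    buf = bytearray(FSXATTR.size)
    with open(path, "rb") as f:
        fcntl.ioctl(f.fileno(), FS_IOC_FSGETXATTR, buf)  # fills buf in place
    return has_dax_xflag(bytes(buf))
```

Calling query_dax_xflag("foo") on a file like the one in the xfs_io example would be expected to mirror the "x" column of lsattr, nothing more.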
* Re: [PATCH v2 2/3] mm, dax: add VM_DAX flag for DAX VMAs
From: Dan Williams @ 2016-09-16 10:47 UTC
To: Dave Chinner
Cc: linux-nvdimm, linux-kernel, Nicholas Piggin, XFS Developers, Linux MM, linux-fsdevel, Christoph Hellwig

On Thu, Sep 15, 2016 at 10:36 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Sep 15, 2016 at 07:04:27PM -0700, Dan Williams wrote:
>> On Thu, Sep 15, 2016 at 6:24 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Thu, Sep 15, 2016 at 05:16:42PM -0700, Dan Williams wrote:
>> >> On Thu, Sep 15, 2016 at 4:07 PM, Dave Chinner <david@fromorbit.com> wrote:
>> >> > On Thu, Sep 15, 2016 at 10:01:03AM -0700, Dan Williams wrote:
>> >> >> On Thu, Sep 15, 2016 at 1:26 AM, Christoph Hellwig <hch@lst.de> wrote:
>> >> >> > On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote:
>> >> >> >> The DAX property, page cache bypass, of a VMA is only detectable via the
>> >> >> >> vma_is_dax() helper to check the S_DAX inode flag. However, this is
>> >> >> >> only available internal to the kernel and is a property that userspace
>> >> >> >> applications would like to interrogate.
>> >> >> >
>> >> >> > They have absolutely no business knowing such an implementation detail.
>> >> >>
>> >> >> Hasn't that train already left the station with FS_XFLAG_DAX?
>> >> >
>> >> > No, that's an admin flag, not a runtime hint for applications. Just
>> >> > because that flag is set on an inode, it does not mean that DAX is
>> >> > actually in use - it will be ignored if the backing dev is not dax
>> >> > capable.
>> >>
>> >> What's the point of an admin flag if an admin can't do cat /proc/<pid
>> >> of interest>/smaps, or some other mechanism, to validate that the
>> >> setting the admin cares about is in effect?
>> >
>> > Sorry, I don't follow - why would you be looking at mapping file
>> > regions in /proc to determine if some file somewhere in a filesystem
>> > has a specific flag set on it or not?
>> >
>> > FS_XFLAG_DAX is an inode attribute flag, not something you can
>> > query or administrate through mmap:
>> >
>> > I.e.
>> > # xfs_io -c "lsattr" -c "chattr +x" -c lsattr -c "chattr -x" -c "lsattr" foo
>> > --------------- foo
>> > --------------x foo
>> > --------------- foo
>> > #
>> >
>> > What happens when that flag is set on an inode is determined by a
>> > whole bunch of other things that are completely separate to the
>> > management of the inode flag itself.
>>
>> Right, I understand that, but how does an admin audit those "bunch of
>> other things"
>
> Filesystem mounts checks all the various stuff that determines
> whether DAX can be used. It logs to the console that it is "Dax
> capable". Any file that then has FS_XFLAG_DAX set will result in DAX
> being used. There is no other possibility when these two things are
> reported.
>
> /me points at runtime diagnostic tracepoints like
> trace_xfs_file_dax_read() and notes that dax is sadly lacking in
> diagnostic tracepoints.
>
> Besides, userspace can't do anything useful with this information,
> because the FS_XFLAG_DAX can be changed /at any time/ by an admin.
> And the filesystem is free to remove it at any time, too, if it
> needs to (e.g. file gets reflinked or snapshotted).
>
> That's right, an inode can dynamically change from DAX to non-DAX
> underneath the application, and the application /will not notice/.
> That's because changing the flag will sync and invalidate the
> existing mappings and the next application access will simply fault
> it back in using whatever mechanism the inode is now configured
> with.
>
> Plain and simple: userspace has absolutely no fucking idea of
> whether DAX is enabled or not, and whatever the kernel returns to
> userspace above the DAX configuration is stale before it even got
> out of the kernel....

smaps is already known to be an ephemeral interface, but we output
useful information there nonetheless.
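
[Editor's illustration] To make the smaps side of the argument concrete, here is a minimal sketch of how an admin tool might read the VmFlags line for the mapping that covers a given address. It assumes the usual /proc/<pid>/smaps layout (a "start-end perms ..." header line followed by "Name: value" fields); the function name vmflags_for_address() is hypothetical. Which mnemonic a VM_DAX bit would be reported under is not stated in this part of the thread, so the sketch just returns whatever flags are present. As noted above, the result is a snapshot and may already be out of date.

```python
def vmflags_for_address(smaps_text: str, addr: int) -> list:
    """Return the VmFlags mnemonics of the mapping that contains addr.

    smaps_text is the content of /proc/<pid>/smaps; an empty list is
    returned when no mapping covers addr or no VmFlags line is present.
    """
    in_range = False
    for line in smaps_text.splitlines():
        first = line.split(" ", 1)[0]
        if "-" in first and not first.endswith(":"):
            # Candidate mapping header: "<start>-<end> perms offset dev inode path"
            start, _, end = first.partition("-")
            try:
                in_range = int(start, 16) <= addr < int(end, 16)
            except ValueError:
                in_range = False  # not a mapping header after all
        elif in_range and line.startswith("VmFlags:"):
            return line.split()[1:]
    return []
```

A caller would typically pass open("/proc/self/smaps").read() and an address inside the mapping of interest, then test for the mnemonic it cares about in the returned list.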
* Re: [PATCH v2 2/3] mm, dax: add VM_DAX flag for DAX VMAs @ 2016-09-16 10:47 ` Dan Williams 0 siblings, 0 replies; 63+ messages in thread From: Dan Williams @ 2016-09-16 10:47 UTC (permalink / raw) To: Dave Chinner Cc: linux-nvdimm, linux-kernel, Nicholas Piggin, XFS Developers, Linux MM, linux-fsdevel, Christoph Hellwig On Thu, Sep 15, 2016 at 10:36 PM, Dave Chinner <david@fromorbit.com> wrote: > On Thu, Sep 15, 2016 at 07:04:27PM -0700, Dan Williams wrote: >> On Thu, Sep 15, 2016 at 6:24 PM, Dave Chinner <david@fromorbit.com> wrote: >> > On Thu, Sep 15, 2016 at 05:16:42PM -0700, Dan Williams wrote: >> >> On Thu, Sep 15, 2016 at 4:07 PM, Dave Chinner <david@fromorbit.com> wrote: >> >> > On Thu, Sep 15, 2016 at 10:01:03AM -0700, Dan Williams wrote: >> >> >> On Thu, Sep 15, 2016 at 1:26 AM, Christoph Hellwig <hch@lst.de> wrote: >> >> >> > On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote: >> >> >> >> The DAX property, page cache bypass, of a VMA is only detectable via the >> >> >> >> vma_is_dax() helper to check the S_DAX inode flag. However, this is >> >> >> >> only available internal to the kernel and is a property that userspace >> >> >> >> applications would like to interrogate. >> >> >> > >> >> >> > They have absolutely no business knowing such an implementation detail. >> >> >> >> >> >> Hasn't that train already left the station with FS_XFLAG_DAX? >> >> > >> >> > No, that's an admin flag, not a runtime hint for applications. Just >> >> > because that flag is set on an inode, it does not mean that DAX is >> >> > actually in use - it will be ignored if the backing dev is not dax >> >> > capable. >> >> >> >> What's the point of an admin flag if an admin can't do cat /proc/<pid >> >> of interest>/smaps, or some other mechanism, to validate that the >> >> setting the admin cares about is in effect? 
>> > >> > Sorry, I don't follow - why would you be looking at mapping file >> > regions in /proc to determine if some file somewhere in a filesystem >> > has a specific flag set on it or not? >> > >> > FS_XFLAG_DAX is an inode attribute flag, not something you can >> > query or administrate through mmap: >> > >> > I.e. >> > # xfs_io -c "lsattr" -c "chattr +x" -c lsattr -c "chattr -x" -c "lsattr" foo >> > --------------- foo >> > --------------x foo >> > --------------- foo >> > # >> > >> > What happens when that flag is set on an inode is determined by a >> > whole bunch of other things that are completely separate to the >> > management of the inode flag itself. >> >> Right, I understand that, but how does an admin audit those "bunch of >> other things" > > Filesystem mounts checks all the various stuff that determines > whether DAX can be used. It logs to the console that it is "Dax > capable". Any file that then has FS_XFLAG_DAX set will result in DAX > being used. There is no other possibility when these two things are > reported. > > /me points at runtime diagnostic tracepoints like > trace_xfs_file_dax_read() and notes that dax is sadly lacking in > diagnostic tracepoints. > > Besides, userspace can't do anything useful with this information, > because the FS_XFLAG_DAX can be changed /at any time/ by an admin. > And the filesystem is free to remove it at any time, too, if it > needs to (e.g. file gets reflinked or snapshotted). > > That's right, an inode can dynamically change from DAX to non-DAX > underneath the application, and the application /will not notice/. > That's because changing the flag will sync and invalidate the > existing mappings and the next application access will simply fault > it back in using whatever mechanism the inode is now configured > with. 
> > Plain and simple: userspace has absolutely no fucking idea of > whether DAX is enabled or not, and whatever the kernel returns to > userspace above the DAX configuration is stale before it even got > out of the kernel.... smaps is already known to be an ephemeral interface, but we output useful information there nonetheless. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v2 2/3] mm, dax: add VM_DAX flag for DAX VMAs
From: Dan Williams @ 2016-09-16 10:47 UTC
To: Dave Chinner
Cc: Christoph Hellwig, Linux MM, linux-nvdimm, linux-kernel,
	Nicholas Piggin, XFS Developers, linux-fsdevel

On Thu, Sep 15, 2016 at 10:36 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Sep 15, 2016 at 07:04:27PM -0700, Dan Williams wrote:
>> On Thu, Sep 15, 2016 at 6:24 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Thu, Sep 15, 2016 at 05:16:42PM -0700, Dan Williams wrote:
>> >> On Thu, Sep 15, 2016 at 4:07 PM, Dave Chinner <david@fromorbit.com> wrote:
>> >> > On Thu, Sep 15, 2016 at 10:01:03AM -0700, Dan Williams wrote:
>> >> >> On Thu, Sep 15, 2016 at 1:26 AM, Christoph Hellwig <hch@lst.de> wrote:
>> >> >> > On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote:
>> >> >> >> The DAX property, page cache bypass, of a VMA is only detectable via the
>> >> >> >> vma_is_dax() helper to check the S_DAX inode flag.  However, this is
>> >> >> >> only available internal to the kernel and is a property that userspace
>> >> >> >> applications would like to interrogate.
>> >> >> >
>> >> >> > They have absolutely no business knowing such an implementation detail.
>> >> >>
>> >> >> Hasn't that train already left the station with FS_XFLAG_DAX?
>> >> >
>> >> > No, that's an admin flag, not a runtime hint for applications. Just
>> >> > because that flag is set on an inode, it does not mean that DAX is
>> >> > actually in use - it will be ignored if the backing dev is not dax
>> >> > capable.
>> >>
>> >> What's the point of an admin flag if an admin can't do cat /proc/<pid
>> >> of interest>/smaps, or some other mechanism, to validate that the
>> >> setting the admin cares about is in effect?
>> >
>> > Sorry, I don't follow - why would you be looking at mapping file
>> > regions in /proc to determine if some file somewhere in a filesystem
>> > has a specific flag set on it or not?
>> >
>> > FS_XFLAG_DAX is an inode attribute flag, not something you can
>> > query or administrate through mmap:
>> >
>> > I.e.
>> > # xfs_io -c "lsattr" -c "chattr +x" -c lsattr -c "chattr -x" -c "lsattr" foo
>> > --------------- foo
>> > --------------x foo
>> > --------------- foo
>> > #
>> >
>> > What happens when that flag is set on an inode is determined by a
>> > whole bunch of other things that are completely separate to the
>> > management of the inode flag itself.
>>
>> Right, I understand that, but how does an admin audit those "bunch of
>> other things"
>
> Filesystem mounts checks all the various stuff that determines
> whether DAX can be used. It logs to the console that it is "Dax
> capable". Any file that then has FS_XFLAG_DAX set will result in DAX
> being used. There is no other possibility when these two things are
> reported.
>
> /me points at runtime diagnostic tracepoints like
> trace_xfs_file_dax_read() and notes that dax is sadly lacking in
> diagnostic tracepoints.
>
> Besides, userspace can't do anything useful with this information,
> because the FS_XFLAG_DAX can be changed /at any time/ by an admin.
> And the filesystem is free to remove it at any time, too, if it
> needs to (e.g. file gets reflinked or snapshotted).
>
> That's right, an inode can dynamically change from DAX to non-DAX
> underneath the application, and the application /will not notice/.
> That's because changing the flag will sync and invalidate the
> existing mappings and the next application access will simply fault
> it back in using whatever mechanism the inode is now configured
> with.
>
> Plain and simple: userspace has absolutely no fucking idea of
> whether DAX is enabled or not, and whatever the kernel returns to
> userspace about the DAX configuration is stale before it even got
> out of the kernel....

smaps is already known to be an ephemeral interface, but we output
useful information there nonetheless.
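The xfs_io session quoted above toggles FS_XFLAG_DAX from the command
line. A program can read the same inode flag with the FS_IOC_FSGETXATTR
ioctl. The sketch below is only a minimal illustration: the helper name
file_has_dax_xflag() is hypothetical and not something proposed in this
thread. As the discussion stresses, the result says only whether the
flag is set on the inode, not whether DAX is actually in effect for any
mapping, and it can be stale by the time the caller looks at it.

```c
#include <fcntl.h>
#include <linux/fs.h>	/* struct fsxattr, FS_IOC_FSGETXATTR, FS_XFLAG_DAX */
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Hypothetical helper (an assumption for illustration, not from the thread).
 * Returns 1 if FS_XFLAG_DAX is set on the inode, 0 if it is clear, and -1
 * if the file cannot be opened or the filesystem does not support the ioctl.
 */
int file_has_dax_xflag(const char *path)
{
	struct fsxattr fsx;
	int fd, ret;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* Fetch the extended inode attributes for this file. */
	ret = ioctl(fd, FS_IOC_FSGETXATTR, &fsx);
	close(fd);
	if (ret < 0)
		return -1;

	return (fsx.fsx_xflags & FS_XFLAG_DAX) ? 1 : 0;
}
```

A return value of 1 corresponds to the "--------------x" line in the
xfs_io output; it is an admin-visible attribute, not a runtime
guarantee.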
* [PATCH v2 3/3] mm, mincore2(): retrieve tlb-size attributes of an address range
From: Dan Williams @ 2016-09-15  6:54 UTC
To: linux-mm
Cc: Andrea Arcangeli, Xiao Guangrong, Arnd Bergmann, linux-nvdimm,
	Dave Hansen, david, linux-kernel, npiggin, xfs, linux-fsdevel,
	Andrew Morton, hch, Kirill A. Shutemov

There are cases, particularly when testing and validating a
configuration, where it is useful to know the hardware mapping geometry
of the pages in a given process address range.  Consider filesystem-dax
where a configuration needs to take care to align partitions and block
allocations before huge page mappings might be used, or
anonymous-transparent-huge-pages where a process is opportunistically
assigned large pages.  mincore2() allows these configurations to be
surveyed and validated.

The implementation takes advantage of the unused bits in the per-page
byte returned for each PAGE_SIZE extent of a given address range.  The
new format of each vector byte is:

(TLB_SHIFT - PAGE_SHIFT) << 1 | page_present

[1]: https://lkml.org/lkml/2016/9/7/61

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/syscalls.h               |    2 
 include/uapi/asm-generic/mman-common.h |    2 
 kernel/sys_ni.c                        |    1 
 mm/mincore.c                           |  130 ++++++++++++++++++++++++--------
 4 files changed, 104 insertions(+), 31 deletions(-)

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index d02239022bd0..4aa2ee7e359a 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -467,6 +467,8 @@ asmlinkage long sys_munlockall(void);
 asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
 asmlinkage long sys_mincore(unsigned long start, size_t len,
 				unsigned char __user * vec);
+asmlinkage long sys_mincore2(unsigned long start, size_t len,
+				unsigned char __user * vec, int flags);
 
 asmlinkage long sys_pivot_root(const char __user *new_root,
 				const char __user *put_old);
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 58274382a616..6c7eca1a85ca 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -72,4 +72,6 @@
 #define MAP_HUGE_SHIFT	26
 #define MAP_HUGE_MASK	0x3f
 
+#define MINCORE_ORDER	1 /* retrieve hardware mapping-size-order */
+
 #endif /* __ASM_GENERIC_MMAN_COMMON_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 2c5e3a8e00d7..e14b87834054 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -197,6 +197,7 @@ cond_syscall(sys_mlockall);
 cond_syscall(sys_munlockall);
 cond_syscall(sys_mlock2);
 cond_syscall(sys_mincore);
+cond_syscall(sys_mincore2);
 cond_syscall(sys_madvise);
 cond_syscall(sys_mremap);
 cond_syscall(sys_remap_file_pages);
diff --git a/mm/mincore.c b/mm/mincore.c
index c0b5ba965200..b0b83ef086eb 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -15,25 +15,61 @@
 #include <linux/swap.h>
 #include <linux/swapops.h>
 #include <linux/hugetlb.h>
+#include <linux/dax.h>
 
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
 
+#ifndef MINCORE_ORDER
+#define MINCORE_ORDER 0
+#endif
+
+#define MINCORE_ORDER_MASK 0x3e
+#define MINCORE_ORDER_SHIFT 1
+
+struct mincore_params {
+	unsigned char *vec;
+	int flags;
+};
+
+static void mincore_set(unsigned char *vec, struct vm_area_struct *vma, int nr,
+		int flags)
+{
+	unsigned char mincore = 1;
+
+	if (!nr) {
+		*vec = 0;
+		return;
+	}
+
+	if (flags & MINCORE_ORDER) {
+		unsigned char order = ilog2(nr);
+
+		WARN_ON((order << MINCORE_ORDER_SHIFT) & ~MINCORE_ORDER_MASK);
+		mincore |= order << MINCORE_ORDER_SHIFT;
+	}
+	memset(vec, mincore, nr);
+}
+
 static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
 			unsigned long end, struct mm_walk *walk)
 {
 #ifdef CONFIG_HUGETLB_PAGE
+	struct mincore_params *p = walk->private;
+	int nr = (end - addr) >> PAGE_SHIFT;
+	unsigned char *vec = p->vec;
 	unsigned char present;
-	unsigned char *vec = walk->private;
 
 	/*
 	 * Hugepages under user process are always in RAM and never
 	 * swapped out, but theoretically it needs to be checked.
 	 */
 	present = pte && !huge_pte_none(huge_ptep_get(pte));
-	for (; addr != end; vec++, addr += PAGE_SIZE)
-		*vec = present;
-	walk->private = vec;
+	if (!present)
+		memset(vec, 0, nr);
+	else
+		mincore_set(vec, walk->vma, nr, p->flags);
+	p->vec = vec + nr;
 #else
 	BUG();
 #endif
@@ -82,20 +118,24 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff)
 }
 
 static int __mincore_unmapped_range(unsigned long addr, unsigned long end,
-				struct vm_area_struct *vma, unsigned char *vec)
+				struct vm_area_struct *vma, unsigned char *vec,
+				int flags)
 {
 	unsigned long nr = (end - addr) >> PAGE_SHIFT;
+	unsigned char present;
 	int i;
 
 	if (vma->vm_file) {
 		pgoff_t pgoff;
 
 		pgoff = linear_page_index(vma, addr);
-		for (i = 0; i < nr; i++, pgoff++)
-			vec[i] = mincore_page(vma->vm_file->f_mapping, pgoff);
+		for (i = 0; i < nr; i++, pgoff++) {
+			present = mincore_page(vma->vm_file->f_mapping, pgoff);
+			mincore_set(vec + i, vma, present, flags);
+		}
 	} else {
 		for (i = 0; i < nr; i++)
-			vec[i] = 0;
+			mincore_set(vec + i, vma, 0, flags);
 	}
 	return nr;
 }
@@ -103,8 +143,11 @@ static int __mincore_unmapped_range(unsigned long addr, unsigned long end,
 static int mincore_unmapped_range(unsigned long addr, unsigned long end,
 				   struct mm_walk *walk)
 {
-	walk->private += __mincore_unmapped_range(addr, end,
-						  walk->vma, walk->private);
+	struct mincore_params *p = walk->private;
+	int nr = __mincore_unmapped_range(addr, end, walk->vma, p->vec,
+			p->flags);
+
+	p->vec += nr;
 	return 0;
 }
 
@@ -114,18 +157,20 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	spinlock_t *ptl;
 	struct vm_area_struct *vma = walk->vma;
 	pte_t *ptep;
-	unsigned char *vec = walk->private;
+	struct mincore_params *p = walk->private;
+	unsigned char *vec = p->vec;
 	int nr = (end - addr) >> PAGE_SHIFT;
+	int flags = p->flags;
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
-		memset(vec, 1, nr);
+		mincore_set(vec, vma, nr, flags);
 		spin_unlock(ptl);
 		goto out;
 	}
 
 	if (pmd_trans_unstable(pmd)) {
-		__mincore_unmapped_range(addr, end, vma, vec);
+		__mincore_unmapped_range(addr, end, vma, vec, flags);
 		goto out;
 	}
 
@@ -135,9 +180,9 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 
 		if (pte_none(pte))
 			__mincore_unmapped_range(addr, addr + PAGE_SIZE,
-						 vma, vec);
+						 vma, vec, flags);
 		else if (pte_present(pte))
-			*vec = 1;
+			mincore_set(vec, vma, 1, flags);
 		else { /* pte is a swap entry */
 			swp_entry_t entry = pte_to_swp_entry(pte);
 
@@ -146,14 +191,17 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 				 * migration or hwpoison entries are always
 				 * uptodate
 				 */
-				*vec = 1;
+				mincore_set(vec, vma, 1, flags);
 			} else {
 #ifdef CONFIG_SWAP
-				*vec = mincore_page(swap_address_space(entry),
-					entry.val);
+				unsigned char present;
+
+				present = mincore_page(swap_address_space(entry),
+						entry.val);
+				mincore_set(vec, vma, present, flags);
 #else
 				WARN_ON(1);
-				*vec = 1;
+				mincore_set(vec, vma, 1, flags);
 #endif
 			}
 		}
@@ -161,7 +209,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	}
 	pte_unmap_unlock(ptep - 1, ptl);
 out:
-	walk->private += nr;
+	p->vec = vec + nr;
 	cond_resched();
 	return 0;
 }
@@ -171,16 +219,21 @@ out:
  * all the arguments, we hold the mmap semaphore: we should
  * just return the amount of info we're asked for.
  */
-static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *vec)
+static long do_mincore(unsigned long addr, unsigned long pages,
+		unsigned char *vec, int flags)
 {
 	struct vm_area_struct *vma;
 	unsigned long end;
 	int err;
+	struct mincore_params p = {
+		.vec = vec,
+		.flags = flags,
+	};
 	struct mm_walk mincore_walk = {
 		.pmd_entry = mincore_pte_range,
 		.pte_hole = mincore_unmapped_range,
 		.hugetlb_entry = mincore_hugetlb,
-		.private = vec,
+		.private = &p,
 	};
 
 	vma = find_vma(current->mm, addr);
@@ -195,13 +248,18 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v
 }
 
 /*
- * The mincore(2) system call.
+ * The mincore2(2) system call.
  *
- * mincore() returns the memory residency status of the pages in the
- * current process's address space specified by [addr, addr + len).
- * The status is returned in a vector of bytes.  The least significant
- * bit of each byte is 1 if the referenced page is in memory, otherwise
- * it is zero.
+ * mincore2() returns the memory residency status of the pages in the
+ * current process's address space specified by [addr, addr + len).  The
+ * status is returned in a vector of bytes.  The least significant bit
+ * of each byte is 1 if the referenced page is in memory, otherwise it
+ * is zero.  When 'flags' is non-zero each byte additionally contains an
+ * indication of the hardware mapping size of each page (bits 1 through
+ * 5 of each vector byte).  Where the order relates to the hardware
+ * mapping size backing the given logical-page.  For example, a present
+ * 2MB-mapped-huge-page would correspond to 512 vector entries with the
+ * value (9 << 1) | (1) => 0x13
  *
  * Because the status of a page can change after mincore() checks it
  * but before it returns to the application, the returned vector may
@@ -218,8 +276,8 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v
  *		mapped
  *  -EAGAIN - A kernel resource was temporarily unavailable.
  */
-SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len,
-		unsigned char __user *, vec)
+SYSCALL_DEFINE4(mincore2, unsigned long, start, size_t, len,
+		unsigned char __user *, vec, int, flags)
 {
 	long retval;
 	unsigned long pages;
@@ -229,6 +287,10 @@ SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len,
 	if (start & ~PAGE_MASK)
 		return -EINVAL;
 
+	/* Check that undefined flags are zero */
+	if (flags & ~MINCORE_ORDER)
+		return -EINVAL;
+
 	/* ..and we need to be passed a valid user-space range */
 	if (!access_ok(VERIFY_READ, (void __user *) start, len))
 		return -ENOMEM;
@@ -251,7 +313,7 @@ SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len,
 		 * the temporary buffer size.
 		 */
 		down_read(&current->mm->mmap_sem);
-		retval = do_mincore(start, min(pages, PAGE_SIZE), tmp);
+		retval = do_mincore(start, min(pages, PAGE_SIZE), tmp, flags);
 		up_read(&current->mm->mmap_sem);
 
 		if (retval <= 0)
@@ -268,3 +330,9 @@ SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len,
 	free_page((unsigned long) tmp);
 	return retval;
 }
+
+SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len,
+		unsigned char __user *, vec)
+{
+	return sys_mincore2(start, len, vec, 0);
+}
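The vector byte format described in the patch, with page_present in
bit 0 and the mapping order in bits 1 through 5, could be decoded in
userspace along the following lines. This is a sketch that assumes the
format is merged as posted. The mincore2_*() helper names are
hypothetical, and the two MINCORE_ORDER_* constants are copied from the
patch because it keeps them private to mm/mincore.c rather than
exporting them.

```c
#include <stdbool.h>
#include <stddef.h>

/* Copied from the patch; not exported to userspace by it. */
#define MINCORE_ORDER_MASK	0x3e
#define MINCORE_ORDER_SHIFT	1

/* Bit 0: the page is resident. (Hypothetical helper.) */
bool mincore2_present(unsigned char v)
{
	return v & 1;
}

/*
 * Bits 1-5: log2 of the number of base pages covered by the hardware
 * mapping, i.e. TLB_SHIFT - PAGE_SHIFT. (Hypothetical helper.)
 */
unsigned int mincore2_order(unsigned char v)
{
	return (v & MINCORE_ORDER_MASK) >> MINCORE_ORDER_SHIFT;
}

/* Size in bytes of the hardware mapping, given the base page size. */
size_t mincore2_mapping_size(unsigned char v, size_t page_size)
{
	return page_size << mincore2_order(v);
}
```

With a 4096-byte base page, the 0x13 value from the patch comment
decodes to present, order 9, which is a 2MB mapping.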
For example, a present + * 2MB-mapped-huge-page would correspond to 512 vector entries with the + * value (9 << 1) | (1) => 0x13 * * Because the status of a page can change after mincore() checks it * but before it returns to the application, the returned vector may @@ -218,8 +276,8 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v * mapped * -EAGAIN - A kernel resource was temporarily unavailable. */ -SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len, - unsigned char __user *, vec) +SYSCALL_DEFINE4(mincore2, unsigned long, start, size_t, len, + unsigned char __user *, vec, int, flags) { long retval; unsigned long pages; @@ -229,6 +287,10 @@ SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len, if (start & ~PAGE_MASK) return -EINVAL; + /* Check that undefined flags are zero */ + if (flags & ~MINCORE_ORDER) + return -EINVAL; + /* ..and we need to be passed a valid user-space range */ if (!access_ok(VERIFY_READ, (void __user *) start, len)) return -ENOMEM; @@ -251,7 +313,7 @@ SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len, * the temporary buffer size. */ down_read(¤t->mm->mmap_sem); - retval = do_mincore(start, min(pages, PAGE_SIZE), tmp); + retval = do_mincore(start, min(pages, PAGE_SIZE), tmp, flags); up_read(¤t->mm->mmap_sem); if (retval <= 0) @@ -268,3 +330,9 @@ SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len, free_page((unsigned long) tmp); return retval; } + +SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len, + unsigned char __user *, vec) +{ + return sys_mincore2(start, len, vec, 0); +} -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 63+ messages in thread
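For readers who want to consume the vector format described in the mincore2() comment above, the following is a minimal user-space sketch of decoding a single vector byte. It assumes the layout stated in the patch (bit 0 is the present bit, bits 1 through 5 hold the mapping order relative to the base page size). The helper names vec_present(), vec_order() and vec_mapping_bytes() are hypothetical and are not part of the patch; the page shift is passed in by the caller rather than assumed.

```c
#include <stddef.h>

/* Assumed to mirror the patch: bit 0 = present, bits 1-5 = mapping order. */
#define VEC_PRESENT_BIT   0x01
#define VEC_ORDER_MASK    0x3e
#define VEC_ORDER_SHIFT   1

/* Hypothetical helper: non-zero if the page is reported resident. */
static int vec_present(unsigned char v)
{
	return v & VEC_PRESENT_BIT;
}

/* Hypothetical helper: mapping order, i.e. log2(pages per hardware mapping). */
static unsigned int vec_order(unsigned char v)
{
	return (v & VEC_ORDER_MASK) >> VEC_ORDER_SHIFT;
}

/* Hypothetical helper: size in bytes of the hardware mapping behind the page. */
static unsigned long vec_mapping_bytes(unsigned char v, unsigned int page_shift)
{
	return 1UL << (vec_order(v) + page_shift);
}

/* Hypothetical helper: count the entries in a vector that report a given order. */
static size_t vec_count_order(const unsigned char *vec, size_t n,
			      unsigned int order)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (vec_present(vec[i]) && vec_order(vec[i]) == order)
			count++;
	return count;
}
```

With a page shift of 12, the byte 0x13 from the comment decodes to present, order 9, and a 2MB mapping. A byte returned with flags == 0 carries only the present bit, so vec_order() reports 0 for it.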
Thread overview: 63+ messages (newest: 2016-09-16 10:47 UTC)

2016-09-15  6:54 [PATCH v2 0/3] mm, dax: export dax capabilities and mapping size info to userspace Dan Williams
2016-09-15  6:54 ` [PATCH v2 1/3] mm, dax: add VM_SYNC flag for device-dax VMAs Dan Williams
2016-09-15  6:54 ` [PATCH v2 2/3] mm, dax: add VM_DAX flag for DAX VMAs Dan Williams
2016-09-15  8:26   ` Christoph Hellwig
2016-09-15 17:01   ` Dan Williams
2016-09-15 17:09   ` Darrick J. Wong
2016-09-15 17:44   ` Dan Williams
2016-09-15 23:07   ` Dave Chinner
2016-09-15 23:19   ` Dan Williams
2016-09-16  0:16   ` Dan Williams
2016-09-16  1:24   ` Dave Chinner
2016-09-16  2:04   ` Dan Williams
2016-09-16  3:41   ` Dan Williams
2016-09-16  5:36   ` Dave Chinner
2016-09-16 10:47   ` Dan Williams
2016-09-15  6:54 ` [PATCH v2 3/3] mm, mincore2(): retrieve tlb-size attributes of an address range Dan Williams